patches/series

#
# base tree: 2.6.24.7
#
#
# fixes from mainline
#
futex-fix-fault-damage.patch
futex-remove-warn-on.patch
x86-64-fix-copy-user.patch
mm-fix-race-in-cow-logic.patch
hrtimer-20080427.patch
hrtimer-deadlock-fix.patch
hrtimer-infinite-loop-fix.patch
hrtimer-dont-migrate-raisesoftirq.patch
linux-2.6.24-pollfix.patch
CVE-2008-1615-linux-2.6-paranoid-iret.patch
CVE-2007-6694-ppc-chrs-null-fix.patch
CVE-2008-1673-ans1_sanity_check_on_BER_decoding.patch
CVE-2008-2136-missing_kfree_skb_on_pskb_may_pull.patch
CVE-2008-2148-simplify_sched_fair.patch
CVE-2007-6282-2.6.24.1_esp_iv_bug.patch
CVE-2008-2148-fix_utimensat_permissions_check.patch
CVE-2008-2372-reinstate_ZERO_PAGE_optimization_in_get_user_pages_and_fix_XIP.patch
CVE-2008-2372-fix_ZERO_PAGE_breakage_with_vmware.patch
fix_inotify_user_coalescing-bz453990.patch
sctp-fix_sctp_addr_overflow.patch
x86_64-ia32_syscall_restart_fix.patch
x86_64-ptrace_sign_extend_orig_rax_to_64bits.patch
x86_fix_vsyscall_wreckage.patch
#
# m68knommu upstream patches necessary for -rt support
#
m68knommu-upstream-patches.patch
#
# RT Balancing code
#
# Taken from sched-devel.git
#
0001-sched-count-of-queued-RT-tasks.patch
0002-sched-track-highest-prio-task-queued.patch
0003-sched-add-RT-task-pushing.patch
0004-sched-add-rt-overload-tracking.patch
0005-sched-pull-RT-tasks-from-overloaded-runqueues.patch
0006-sched-push-RT-tasks-from-overloaded-CPUs.patch
0007-sched-disable-standard-balancer-for-RT-tasks.patch
0008-sched-add-RT-balance-cpu-weight.patch
0009-sched-clean-up-this_rq-use-in-kernel-sched_rt.c.patch
0010-sched-de-SCHED_OTHER-ize-the-RT-path.patch
0011-sched-break-out-search-for-RT-tasks.patch
0012-sched-RT-balancing-include-current-CPU.patch
0013-sched-pre-route-RT-tasks-on-wakeup.patch
0014-sched-optimize-RT-affinity.patch
0015-sched-wake-balance-fixes.patch
0016-sched-RT-balance-avoid-overloading.patch
0017-sched-break-out-early-if-RT-task-cannot-be-migrated.patch
0018-sched-RT-balance-optimize.patch
0019-sched-RT-balance-optimize-cpu-search.patch
0020-sched-RT-balance-on-new-task.patch
0021-sched-clean-up-pick_next_highest_task_rt.patch
0022-sched-clean-up-find_lock_lowest_rq.patch
#
0023-sched-clean-up-overlong-line-in-kernel-sched_debug.patch
0024-sched-clean-up-kernel-sched_rt.c.patch
0025-sched-remove-rt_overload.patch
0026-sched-remove-leftover-debugging.patch
0027-sched-clean-up-pull_rt_task.patch
0028-sched-clean-up-schedule_balance_rt.patch
0029-sched-add-sched-domain-roots.patch
0030-sched-update-root-domain-spans-upon-departure.patch
0031-Subject-SCHED-Only-balance-our-RT-tasks-within-ou.patch
0032-sched-fix-sched_rt.c-join-leave_domain.patch
#
0033-sched-remove-unused-JIFFIES_TO_NS-macro.patch
0034-sched-style-cleanup-2.patch
0035-sched-add-credits-for-RT-balancing-improvements.patch
#
0036-sched-reactivate-fork-balancing.patch
0037-sched-whitespace-cleanups-in-topology.h.patch
0038-sched-no-need-for-affine-wakeup-balancing-in.patch
0039-sched-get-rid-of-new_cpu-in-try_to_wake_up.patch
#
0040-sched-remove-do_div-from-__sched_slice.patch
0041-sched-RT-balance-replace-hooks-with-pre-post-sched.patch
0042-sched-RT-balance-add-new-methods-to-sched_class.patch
0043-sched-RT-balance-only-adjust-overload-state-when-c.patch
0044-sched-remove-some-old-cpuset-logic.patch
# Last patch of the RT balancing code (not yet in sched-devel)
sched-use-a-2d-bitmap-search-prio-cpu.patch
remove-unused-var-warning.patch
# FTRACE tracing
markers-upstream.patch
ftrace-upstream.patch
ftrace-disable-daemon.patch
ftrace-safe-traversal-hlist.patch
ftrace-update-cnt-stat-fix.patch
ftrace-function-record-nop.patch
# 01-ftrace.patch - dynamic-tick-rcu patch
#02-ftrace.patch
#03-ftrace.patch
#04-ftrace.patch
#05-ftrace.patch
#06-ftrace.patch
#07-ftrace.patch
#08-ftrace.patch
#09-ftrace.patch
#10-ftrace.patch
#11-ftrace.patch
#12-ftrace.patch
#13-ftrace.patch
#14-ftrace.patch
#15-ftrace.patch
#16-ftrace.patch
#17-ftrace.patch
#18-ftrace.patch
#ftrace-nop-calls.patch
#ftrace-move-memory-management-to-generic.patch
#ftrace-direct-calls.patch
#ftrace-filter-functions.patch
#ftrace-alloc-pages.patch
#ftrace-debug-use-preempt-disable-notrace.patch
#ftrace-irqsoff-smp-processor-id-fix.patch
#ftrace-lockdep-notrace-annotations.patch
#ftrace-dont-use-raw-irq-save.patch
#ftrace-max-update-fixes.patch
#ftrace-latest-updates.patch
#ftrace-add-sched-cmdline-record-to-function-trace.patch
#ftrace-unlock-mutex-in-output.patch
#ftrace-remove-max-printks.patch
#ftrace-flip-fix.patch
# ftrace RT extensions
tracer-add-event-markers.patch
#tracer-use-sched-clock.patch
tracer-event-trace.patch
trace-histograms.patch
trace_hist-divzero.patch
event-tracer-syscall-x86_64.patch
event-tracer-syscall-i386.patch
trace-events-handle-syscalls.patch
preempt-trace.patch
# MCOUNT tracing
#mcount-add-basic-support-for-gcc-profiler-instrum.patch
#mcount-annotate-generic-code.patch
#mcount-add-x86_64-notrace-annotations.patch
#mcount-add-x86-vdso-notrace-annotations.patch
#mcount-nmi-notrace-annotations.patch
#rt-time-starvation-fix.patch
#initialize-clocksource-to-jiffies.patch
#get-monotonic-cycles.patch
#mcount-add-time-notrace-annotations.patch
#mcount-preempt-notrace.patch
#mcount-function-tracer.patch
#add-trace-hooks-to-sched.patch
#parse-out-task-state-to-char-string.patch
#trace-add-cmdline-switch.patch
#trace-generic-cmdline.patch
#trace-sched-hooks.patch
#add-markers-to-wakeup.patch
#mcount-trace-wakeup-latency.patch
#mcount-tracer-latency-trace-irqs-off.patch
#mcount-trace-latency-trace-preempt-off.patch
#tracer-add-event-markers.patch
#tracer-event-trace.patch
#event-tracer-syscall-x86_64.patch
#trace-events-handle-syscalls.patch
#event-tracer-syscall-i386.patch
# ARM trace hook
trace-add-event-markers-arm.patch
# PPC MCOUNT updates
ppc-rename-xmon-mcount.patch
#ppc-add-mcount.patch
#ppc-mcount-dummy-functions.patch
#ppc-mark-notrace-mainline.patch
#ppc-add-ppc32-mcount.patch
#ppc-select-mcount.patch
powerpc-add-ftrace.patch
powerpc-ftrace-cleanups.patch
powerpc-remove-ip-converted.patch
powerpc-ftrace-store-mcount.patch
powerpc-ftrace-stop-on-oops.patch
# Extra notrace additions
# mcount-preemptcount-notrace-annotations.patch
#mcount-fault-notrace-annotations.patch
#mcount-irqs-notrace-annotations.patch
#mcount-rcu-notrace-annotations.patch
# m68knommu ftrace
ftrace-m68knommu-add-FTRACE-support.patch
ftrace-m68knommu-generic-stacktrace-function.patch
# KVM - RT fixes
kvm-fix-preemption-bug.patch
kvm-lapic-migrate-latency-fix.patch
kvm-make-less-noise.patch
kvm-preempt-rt-resched-delayed.patch
sched-enable-irqs-in-preempt-in-notifier-call.patch
#
# ARM clock events & co
#
ep93xx-timer-accuracy.patch
ep93xx-clockevents.patch
ep93xx-clockevents-fix.patch
# CHECKME
arm-leds-timer.patch
#
# Check what's in mainline / mm or might be
# upstream material.
#
spinlock-trylock-cleanup-sungem.patch
x86_64-tsc-sync-irqflags-fix.patch
neptune-no-at-keyboard.patch
rtmutex-debug.h-cleanup.patch
netpoll-8139too-fix.patch
kprobes-preempt-fix.patch
replace-bugon-by-warn-on.patch
# Suspend / resume fixups
i386-mark-atomic-irq-ops-raw.patch
msi-suspend-resume-workaround.patch
floppy-resume-fix.patch
#
# assorted fixlets from -mm:
#
# Check if they are really in -mm or should be submitted
#
hrtimers-overrun-api.patch
mm-fix-latency.patch
ioapic-fix-too-fast-clocks.patch
fix-acpi-build-weirdness.patch
write-try-lock-irqsave.patch
move-native-irq.patch
dont-unmask-io_apic.patch
#
# misc build beautification patches:
#
gcc-warnings-shut-up.patch
#
# Various fixlets
#
#
# Debugging patches
#
apic-dumpstack.patch
netfilter-more-debugging.patch
#
# Latency tracer
#
# We are using the new tracer, I've put a '# x '
# in front of all the patches that I needed to
# remove to do so.
#
nmi-profiling-base.patch
#redo-regparm-option.patch
# x latency-tracing.patch
# x latency-tracing-remove-trace-array.patch
#latency-tracer-disable-across-trace-cmdline.patch
#latency-tracing-i386-paravirt-fastcall.patch
# x latency-tracing-i386.patch
# x latency-tracing-x86_64.patch
latency-tracing-ppc.patch
# x latency-tracer-printk-fix.patch
latency-tracing-arm.patch
# x latency-tracing-exclude-printk.patch
#latency-tracing-prctl-api-hack.patch
#ftrace-eventtrace-fixup.patch
# x latency-tracing-raw-spinlock-hack.patch
# x latency-tracer-one-off-fix.patch
# x smaller-trace.patch
# x trace-name-plus.patch
# x trace-with-caller-addr.patch
# x trace-sti-mwait.patch
# x latency-tracer-optimize-a-bit.patch
# x idle-stop-critical-timing.patch
arm-latency-tracer-support.patch
# x latency-tracer-variable-threshold.patch
# Needs to be rewritten to trigger on the procfs variable !
# x reset-latency-histogram.patch
# tracing
# x undo-latency-tracing-raw-spinlock-hack.patch
random-driver-latency-fix.patch
latency-measurement-drivers.patch
latency-measurement-drivers-fix.patch
# x latency-tracing-use-now.patch
# x preempt_max_latency-in-all-modes.patch
# x latency-hist-add-resetting-for-all-timing-options.patch
# x latency-trace-sysctl-config-fix.patch
# x latency-trace-convert-back-to-ms.patch
# x latency-trace-fix.patch
# x trace-cpuidle.patch
#
# lockdep queue:
#
lockdep-show-held-locks.patch
lockdep-lock_set_subclass.patch
lockdep-prettify.patch
lockdep-more-entries.patch
#
# Revert loopback bh assumption patch
#
loopback-revert.patch
#
# hrtimer
#
# x hrtimer-trace.patch
#
# PPC gtod and highres support
# ** upstream as of 2.6.24-rc2 **
#ppc-gtod-support.patch
#ppc-gtod-support-fix.patch
#ppc-a-2.patch
#ppc-fix-clocksource-timebase-shift.patch
#ppc-remove-broken-vsyscall.patch
#ppc-read-persistent-clock.patch
ppc-gtod-notrace-fix.patch
#ppc-clockevents.patch
#ppc-clockevents-fix.patch
#ppc-highres-dyntick.patch
#
# -rt queue:
#
#inet_hash_bits.patch
#inet-hash-bits-ipv6-fix.patch
#
# RCU preempt patches from Paul:
#
# The old patches
#rcu-1.patch
#rcu-2.patch
#rcu-3.patch
#rcu-4.patch
#### New Experimental Preempt RCU implementation ####
rcu-new-1.patch
rcu-new-2.patch
rcu-new-3.patch
rcu-new-4.patch
rcu-new-5.patch
#rcu-new-6.patch # keep commented out
rcu-new-7.patch
#rcu-new-8.patch # keep commented out
rcu-new-9.patch
# Paul's and Steve's patches
rcu-new-10.patch
rcu-fix-rcu-preempt.patch
rcu-dynticks-update.patch
### New implementation ends here ###
# new rcu implementation shouldn't need these.
#rcu-preempt-fix-nmi-watchdog.patch
#rcu-preempt-fix-rcu-torture.patch
rcu-hrt-fixups.patch
#dynticks-rcu-rt-fixlet.patch
#rcu-tasklet-softirq.patch
#rcu-classic-fixup.patch
#rcu-warn-underflow.patch
#
# ARM preparatory patches
#
arm-cmpxchg.patch
arm-fix-atomic-cmpxchg.patch
arm-cmpxchg-support-armv6.patch
arm-futex-atomic-cmpxchg.patch
arm-preempt-config.patch
#
# m68knommu
#
m68knommu-add-cmpxchg-in-default-fashion.patch
m68knommu-make-cmpxchg-RT-safe.patch
m68knommu-add-read_barrier_depends-and-irqs_disab.patch
#
# IRQ threading
#
preempt-softirqs-core.patch
preempt-irqs-core.patch
preempt-irqs-softirq-in-hardirq.patch
preempt-irqs-direct-debug-keyboard.patch
preempt-irqs-timer.patch
preempt-irqs-hrtimer.patch
preempt-irqs-i386.patch
preempt-irqs-i386-ioapic-mask-quirk.patch
preempt-irqs-mips.patch
preempt-irqs-x86-64.patch
preempt-irqs-x86-64-ioapic-mask-quirk.patch
preempt-irqs-arm.patch
preempt-irqs-arm-fix-oprofile.patch
preempt-irqs-ppc.patch
preempt-irqs-ppc-ack-irq-fixups.patch
preempt-irqs-ppc-fix-b5.patch
preempt-irqs-ppc-fix-b6.patch
preempt-irqs-ppc-celleb-beatic-eoi.patch
preempt-irqs-ppc-fix-more-fasteoi.patch
preempt-irqs-ppc-preempt-schedule-irq-entry-fix.patch
preempt-irqs-m68knommu-make-timer-interrupt-non-threaded.patch
preempt-irqs-Kconfig.patch
#
# Real real time stuff :)
#
rt-apis.patch
rt-slab-new.patch
rt-page_alloc.patch
#
# rt-mutexes
#
rt-mutex-preempt-debugging.patch
rt-mutex-irq-flags-checking.patch
rt-mutex-trivial-tcp-preempt-fix.patch
rt-mutex-trivial-route-cast-fix.patch
rt-mutex-delayed-resched.patch
rt-mutex-core.patch
rt-mutex-trylock-export.patch
rt-mutex-spinlock-might-sleep.patch
rt-mutex-i386.patch
rt-mutex-mips.patch
rt-mutex-ppc.patch
rt-mutex-ppc-fix-a5.patch
rt-mutex-x86-64.patch
rt-mutex-arm.patch
rt-mutex-arm-fix.patch
rt-mutex-m68knommu-add-compat_semaphore.patch
rt-mutex-m68knommu-consider-TIF_NEED_RESCHED_DELAYED-on-resc.patch
rt-mutex-drop-generic-TIF_NEED_RESCHED_DELAYED.patch
rt-mutex-compat-semaphores.patch
#
# Per-CPU locking assumption cleanups:
#
percpu-locked-mm.patch
percpu-locked-netfilter.patch
percpu-locked-netfilter2.patch
percpu-locked-powerpc-fixups.patch
percpu-locked-powerpc-fixups-a6.patch
#
# Various preempt fixups
#
net-core-preempt-fix.patch
bh-uptodate-lock.patch
bh-state-lock.patch
jbd_assertions_smp_only.patch
#
# Tasklet redesign
#
tasklet-redesign.patch
tasklet-busy-loop-hack.patch
tasklet-fix-preemption-race.patch
tasklet-more-fixes.patch
#
# Disable irq poll on -rt
#
disable-irqpoll.patch
#
# Inaccurate -rt stats (should be replaced by CFS)
#
kstat-add-rt-stats.patch
# Misc
preempt-realtime-warn-and-bug-on.patch
#
# Posix-cpu-timers in a thread
#
cputimer-thread-rt_A0.patch
cputimer-thread-rt-fix.patch
posix-cpu-timers-fix.patch
#
# Various broken drivers
#
vortex-fix.patch
serial-locking-rt-cleanup.patch
fix-emac-locking-2.6.16.patch
#
# Serial optimizing
#
serial-slow-machines.patch
#
# Realtime patches
#
# ARM:
preempt-realtime-arm.patch
preempt-realtime-arm-rawlock-in-mmu_context-h.patch
arm-trace-preempt-idle.patch
preempt-realtime-arm-bagde4.patch
preempt-realtime-arm-footbridge.patch
preempt-realtime-arm-integrator.patch
preempt-realtime-arm-ixp4xx.patch
preempt-realtime-arm-pxa.patch
preempt-realtime-arm-shark.patch
# MIPS: needs splitting
preempt-realtime-mips.patch
#mips-gtod_clocksource.patch
# X86_64: needs splitting
preempt-realtime-x86_64.patch
# IA64: needs splitting
preempt-realtime-ia64.patch
# PPC: needs cleanup
preempt-realtime-ppc-need-resched-delayed.patch
preempt-realtime-ppc-more-resched-fixups.patch
preempt-realtime-powerpc.patch
preempt-realtime-powerpc-update.patch
preempt-realtime-powerpc-a7.patch
preempt-realtime-powerpc-b2.patch
preempt-realtime-powerpc-b3.patch
preempt-realtime-powerpc-b4.patch
preempt-realtime-powerpc-add-raw-relax-macros.patch
preempt-realtime-powerpc-tlb-batching.patch
preempt-realtime-powerpc-celleb-raw-spinlocks.patch
preempt-realtime-powerpc-missing-raw-spinlocks.patch
# SuperH: needs splitting
preempt-realtime-sh.patch
# i386
preempt-realtime-i386.patch
remove-check-pgt-cache-calls.patch
preempt-irqs-i386-idle-poll-loop-fix.patch
#
# Core patch
#
# Note this is a convenience split up it is not supposed to compile
# step by step.
# Needs some care, but it is way easier to handle than
# the previous touch all in one patch
#
preempt-realtime-ftrace.patch
preempt-realtime-ftrace-disable-ftraced.patch
preempt-realtime-sched.patch
preempt-realtime-mmdrop-delayed.patch
preempt-realtime-sched-i386.patch
preempt-realtime-prevent-idle-boosting.patch
# preempt-realtime-cfs-accounting-fix.patch
schedule-tail-balance-disable-irqs.patch
preempt-realtime-sched-cpupri.patch
preempt-realtime-core.patch
preempt-realtime-fs-block.patch
preempt-realtime-acpi.patch
preempt-realtime-ipc.patch
preempt-realtime-sound.patch
preempt-realtime-mm.patch
preempt-realtime-init-show-enabled-debugs.patch
preempt-realtime-compile-fixes.patch
preempt-realtime-console.patch
preempt-realtime-debug-sysctl.patch
preempt-realtime-ide.patch
preempt-realtime-input.patch
preempt-realtime-irqs.patch
preempt-realtime-net-drivers.patch
#preempt-realtime-netconsole.patch
preempt-realtime-printk.patch
preempt-realtime-profiling.patch
preempt-realtime-rawlocks.patch
preempt-realtime-rcu.patch
preempt-realtime-timer.patch
kstat-fix-spurious-system-load-spikes-in-proc-loadavgrt.patch
preempt-realtime-usb.patch
preempt-realtime-warn-and-bug-on-fix.patch
#
# Various -rt fixups
#
preempt-realtime-supress-cpulock-warning.patch
preempt-realtime-supress-nohz-softirq-warning.patch
preempt-realtime-net.patch
preempt-realtime-net-softirq-fixups.patch
preempt-realtime-loopback.patch
#preempt-realtime-8139too-rt-irq-flags-fix.patch
preempt-realtime-mellanox-driver-fix.patch
#
# Utility patches (not for upstream inclusion):
#
preempt-realtime-supress-rtc-printk.patch
hrtimer-no-printk.patch
nmi-profiling.patch
panic-dont-stop-box.patch
nmi-watchdog-disable.patch
#
# soft watchdog queue:
#
#softlockup-fix.patch
softlockup-add-irq-regs-h.patch
#softlockup-better-printout.patch
#softlockup-cleanups.patch
#softlockup-use-cpu-clock.patch
#
# Not yet reviewed
#
gtod-optimize.patch
# RCU
rcu-various-fixups.patch
#
# Futex updates
#
futex-performance-hack.patch
futex-performance-hack-sysctl-fix.patch
#
# Pete's file locking scalability changes:
#
s_files-schedule_on_each_cpu_wq.patch
## Missing patch -- SDR
## See http://programming.kicks-ass.net/kernel-patches/schedule_on_cpu.patch
# schedule_on_cpu.patch
s_files-pipe-fix.patch
#
# Pete's file locking scalability changes:
#
lockdep_lock_set_subclass_fix.patch
qrcu.patch
lock_list.patch
percpu_list.patch
s_files.patch
fix-circular-locking-deadlock.patch
#
# START of Pete's ccur-pagecache queue
#
#
# lockless pagecache
#
#2.6.21-rc6-lockless1-prep-find_lock_page.patch
#2.6.21-rc6-lockless2-radix-tree-use-indirect-bit.patch
2.6.21-rc6-lockless3-radix-tree-gang-slot-lookups.patch
#2.6.21-rc6-lockless4-__add_to_swap_cache-stuff.patch
2.6.21-rc6-lockless5-lockless-probe.patch
2.6.21-rc6-lockless6-speculative-get-page.patch
2.6.21-rc6-lockless7-lockless-pagecache-lookups.patch
2.6.21-rc6-lockless8-spinlock-tree_lock.patch
#
# concurrent (write side) page cache
#
radix-tree-concurrent.patch
mapping_nrpages.patch
lock_page_ref.patch
mm-concurrent-pagecache.patch
radix-tree-optimistic.patch
radix-tree-optimistic-hist.patch
radix-concurrent-lockdep.patch
#radix-tree-path-compression.patch
#
# -rt bits
#
mm-concurrent-pagecache-rt.patch
#
# END of Pete's ccur-pagecache queue
#
#
# kmap atomic fixes
#
kmap-atomic-prepare.patch
pagefault-disable-cleanup.patch
nommu-fix-build.patch
kmap-atomic-i386-fix.patch
#
# Not yet reviewed
#
select-error-leak-fix.patch
fix-emergency-reboot.patch
timer-freq-tweaks.patch
#
# Highmem modifications
#
highmem-revert-mainline.patch
highmem_rewrite.patch
highmem-redo-mainline.patch
rt-kmap-scale-fix.patch
#
# Debug patches:
#
pause-on-oops-head-tail.patch
i386-nmi-watchdog-show-regs.patch
x86-64-traps-move-held-locks-output.patch
#
# x86-64 vsyscall modifications
#
x86-64-tscless-vgettimeofday.patch
#vsyscall-fixadder-pa.patch
#
# Timekeeping fixups
#
# x rt-time-starvation-fix.patch
# x rt-time-starvation-fix-update.patch
#
# RT-Java testing stuff
#
Add-dev-rmem-device-driver-for-real-time-JVM-testing.patch
Allocate-RTSJ-memory-for-TCK-conformance-test.patch
#
# Softirq modifications
#
new-softirq-code.patch
softirq-per-cpu-assumptions-fixes.patch
fix-migrating-softirq.patch
only-run-softirqs-from-irq-thread-when-irq-affinity-is-set.patch
fix-softirq-checks-for-non-rt-preempt-hardirq.patch
smp-processor-id-fixups.patch
#
# Weird crap unearthed by -rt which needs to be investigated
#
irda-fix.patch
nf_conntrack-weird-crash-fix.patch
nf_conntrack-fix-smp-processor-id.patch
#
# Needs proper fix
#
print-might-sleep-hack.patch
lockdep-rt-mutex.patch
lockstat-rt-hooks.patch
lockstat_bounce_rt.patch
#
# KVM:
#
#kvm-rt.patch
#
# Add RT to uname and apply the version
#
RT_utsname.patch
#
# not yet backmerged tail patches:
#
preempt-rt-no-slub.patch
paravirt-function-pointer-fix.patch
quicklist-release-before-free-page.patch
quicklist-release-before-free-page-fix.patch
disable-lpptest-on-nonlinux.patch
sched-rt-stats.patch
mitigate-resched-flood.patch
genirq-soft-resend.patch
rcu-preempt-hotplug-hackaround.patch
relay-fix.patch
schedule_on_each_cpu-enhance.patch
schedule_on_each_cpu-enhance-rt.patch
lockdep-rt-recursion-limit-fix.patch
cond_resched_softirq-WARN-fix.patch
irq-mask-fix.patch
# stuff Ingo put into version.patch
export-schedule-on-each-cpu.patch
# Tony Breeds POWERPC patches
powerpc-rearrange-thread-flags-to-work-with-andi-instruction.patch
powerpc-count_active_rt_tasks-is-undefined-for-non-preempt-rt.patch
powerpc-match-__rw_yield-function-declaration-to-prototype.patch
#powerpc-flags-as-passed-to-spin-x-irqsave-should-be-unsigned-long.patch
powerpc-flush_tlb_pending-is-no-more.patch
fix-alternate_node_alloc.patch
fix-compilation-for-non-RT-in-timer.patch
hack-convert-i_alloc_sem-for-direct_io-craziness.patch
dont-let-rt-rw_semaphores-do-non_owner-locks.patch
rt-s_files-kill-a-union.patch
loadavg_fixes_weird_loads.patch
# HPET patches
watchdog_use_timer_and_hpet_on_x86_64.patch
pmtmr-override.patch
call_rcu_bh-rename-of-call_rcu.patch
introduce-pick-function-macro.patch
replace-PICK_OP-with-PICK_FUNCTION.patch
fix-PICK_FUNCTION-spin_trylock_irq.patch
seqlocks-use-PICK_FUNCTION.patch
fork-desched_thread-comment-rework.patch
# x stop-critical-timing-in-idle.patch
# rt-wakeup-fix.patch
disable-ist-x86_64.patch
rcu-trace-fix-free.patch
rcu-preempt-fix-bad-dyntick-accounting.patch
rcu-preempt-boost-sdr.patch
rcu-preempt-boost-default.patch
rcu-preempt-boost-fix.patch
rcu-torture-preempt-update.patch
rcupreempt-boost-early-init.patch
plist-debug.patch
seq-irqsave.patch
numa-slab-freeing.patch
# Peter's patches
#
# workqueue PI
#
rt_mutex_setprio.patch
rt-list-mods.patch
rt-plist-mods.patch
rt-workqeue-prio.patch
rt-workqueue-barrier.patch
rt-wq-barrier-fix.patch
rt-delayed-prio.patch
sched_prio.patch
# x critical-timing-kconfig.patch
lock-init-plist-fix.patch
ntfs-local-irq-save-nort.patch
dont-disable-preemption-without-IST.patch
#rt-powerpc-workarounds.patch
irq-flags-unsigned-long.patch
filemap-dont-bug-non-atomic.patch
fix-bug-on-in-filemap.patch
rt-sched-groups.patch
send-nmi-all-preempt-disable.patch
printk-dont-bug-on-sched.patch
user-no-irq-disable.patch
drain-all-local-pages-via-sched.patch
local_irq_save_nort-in-swap.patch
# x latency-tracer-arch-low-address.patch
proportions-raw-locks.patch
arm-compile-fix.patch
no-warning-for-irqs-disabled-in-local-bh-enable.patch
page-alloc-use-real-time-pcp-locking-for-page-draining.patch
#power-fixes-for-kernbench.patch
handle-pending-in-simple-irq.patch
# AT91 patches
use-edge-triggered-irq-handler-instead-of-simple-irq.patch
# x latency-tracer-dont-panic-on-failed-bootmem-alloc.patch
dev-queue-xmit-preempt-fix.patch
dynamically-update-root-domain-span-online-maps.patch
ppc-hacks-to-allow-rt-to-run-kernbench.patch
ppc64-non-smp-compile-fix-per-cpu.patch
rcu-preempt-trace-markers-1.patch
rcu-preempt-trace-markers-2.patch
# x time-accumulate-offset-fix.patch
kernel-bug-after-entering-something-from-login.patch
ppc-make-tlb-batch-64-only.patch
ppc-chpr-set-rtc-lock.patch
disable-run-softirq-from-hardirq-completely.patch
hack-fix-rt-migration.patch
mips-remove-conlicting-rtc-lock-declaration.patch
mips-remove-finish-arch-switch.patch
mips-change-raw-spinlock-type.patch
ppc32-latency-compile-hack-fixes.patch
mips-remove-duplicate-kconfig.patch
ppc32_notrace_init_functions.patch
apic-level-smp-affinity.patch
timer-warning-fix.patch
printk-in-atomic.patch
root-domain-kfree-in-atomic.patch
rt-balance-check-rq.patch
printk-in-atomic-hack-fix.patch
slab-irq-nopreempt-fix.patch
sysctl-compile-fix.patch
kthread-cpus-allowed-init.patch
ppc-tlbflush-preempt.patch
swap-spinlock-fix.patch
remove-spinlock-define.patch
migrate-dying.patch
#added to 2.6.24.7-rt5
nmi-watchdog-fix-1.patch
nmi-watchdog-fix-2.patch
nmi-watchdog-fix-3.patch
nmi-watchdog-fix-4.patch
rt-avoid-deadlock-in-swap.patch
rt-shorten-softirq-thread-names.patch
# This patch breaks rt-migrate-test
#sched-rt-push-only-new.patch
time-gcc-linker-error.patch
trace-fix-hist-name-spellings.patch
cache_pci_find_capability.patch
rt-move-update-wall-time-back-to-do-timer.patch
rtmutex-lateral-steal.patch
rtmutex-rearrange.patch
rtmutex-remove-xchg.patch
adaptive-spinlock-lite-v2.patch
# RW Locks multiple readers
rwsems-mulitple-readers.patch
rwlocks-lateral-steal.patch
rwlocks-multiple-readers.patch
multi-reader-account.patch
multi-reader-limit.patch
multi-reader-lock-account.patch
multi-reader-pi.patch
rwlocks-default-nr-readers-nr-cpus.patch
rwlock-typecast-cmpxchg.patch
rwlock-implement-downgrade-write.patch
sched-nr-migrate-lower-default-preempt-rt.patch
arm-fix-compile-error-trace-exit-idle.patch
# Peter's fair load_balance break out patches
sched-wake_up_idle_cpu-rt.patch
sched_load_balance_flags.patch
sched_load_balance_lockbreak.patch
sched-load_balance-iterator.patch
sched-load_balance-stop.patch
sched-load_balance-is_runnable.patch
# some ftrace fix ups
ftrace-trace-sched.patch
lockdep-avoid-fork-waring.patch
ftrace-dont-trace-markers.patch
ftrace-record-comm-on-ctrl.patch
ftrace-print-missing-cmdline.patch
# Peter's lockstat updates
lockstat-fix-contention-points.patch
lockstat-output.patch
# Luis's gtod updates
fix_vdso_gtod_vsyscall64_2.patch
rwlocks-fix-no-preempt-rt.patch
git-ignore-module-markers.patch
git-ignore-script-lpp.patch
adaptive-optimize-rt-lock-wakeup.patch
adaptive-task-oncpu.patch
adaptive-adjust-pi-wakeup.patch
adapt-remove-extra-try-to-lock.patch
adaptive-earlybreak-on-steal.patch
x86-disable-spinlock-preempt.patch
x86-fifo-ticket-spinlocks.patch
realtime-preempt-warn-about-tracing.patch
#sched-double-lock-balance-enable-irqs.patch
x86-delay-enable-preempt-tglx.patch
ftrace-compile-fixes.patch
ftrace-fix-header.patch
#latency-hist-divide-by-zero.patch
rcupreempt-trace-marker-update.patch
#trace_hist-latediv.patch -p0
marker-upstream-example.patch
nmi-show-regs-fix.patch
sched-fix-rt-task-wakeup.patch
sched-fix-sched-fair-wakeup.patch
trace_hist-latediv.patch
rwlock-prio-fix.patch
rwlock-fixes.patch
event-trace-hrtimer-trace.patch
rwlock-torture.patch
ftrace-wakeup-rawspinlock.patch
radix-tree-lockdep-plus1.patch
sched-cpupri-hotplug-support.patch
sched-cpupri-priocount.patch
ftrace-hotplug-fix.patch
rwlock-pi-lock-reader.patch
fix-adaptive-hack.patch
rwlock-slowunlock-mutex-fix.patch
rwlock-slowunlock-mutex-fix2.patch
rt-mutex-use-inline.patch
rt-mutex-namespace.patch
rtmutex-debug-fix.patch
rwlock-protect-reader_lock_count.patch
ftrace-stop-trace-on-crash.patch
rwlock-torture-no-rt.patch
fix-config-debug-rt-mutex-lock-underflow-warnings.patch
cpu-hotplug-vs-slab.patch
cpu-hotplug-vs-page-alloc.patch
cpu-hotplug-cpu-up-vs-preempt-rt.patch
rcu-backport-rcu-cpu-hotplug-support.patch
cpu-hotplug-cpu-down-vs-preempt-rt.patch
re-cpu-hotplug-cpu-down-vs-preempt-rt.patch
rt-rwlock-conservative-locking.patch
ftrace-call-function-pointer.patch
idle-fix.diff
cpu-hotplug-cpu-down-vs-preempt-rt_fix.patch
fix_misplaced_mb.patch
fix_sys_sched_rr_get_interval_slice_for_SCHED_FIFO_tasks.patch
ftrace-preempt-trace-check.patch
fix_SCHED_FIFO_spec_violation.patch
ppc64-fix-preempt-unsafe-paths-accessing-per_cpu-variables.patch
bz235099-idle-load-fix.patch
raw-spinlocks-for-nmi-print.patch
fix-a-previously-reverted-fix.patch
powerpc-xics-move-the-call-to-irq-radix-revmap-from-xics-startup-to-xics-host-map.patch
powerpc-make-the-irq-reverse-mapping-radix-tree-lockless.patch
trace-do-not-wakeup-when-irqs-disabled.patch
acpi-fix-enter-c1.patch
hotplug-smp-boot-fix.patch
cpu-hotplug-fix-fix-fix.patch
sched-fix-dequeued-race.patch
x86-64-fix-compile.patch
trace-ktime-scalar.patch
nfs-stats-miss-preemption.patch
version.patch

patches/futex-fix-fault-damage.patch

Subject: futex-fix-fault-damage.patch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sat, 21 Jun 2008 09:09:44 +0200

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |   93 ++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 20 deletions(-)

Index: linux-2.6.24.7/kernel/futex.c
===================================================================
--- linux-2.6.24.7.orig/kernel/futex.c
+++ linux-2.6.24.7/kernel/futex.c
@@ -1107,21 +1107,64 @@ static void unqueue_me_pi(struct futex_q
  * private futexes.
  */
 static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
-				struct task_struct *newowner)
+				struct task_struct *newowner,
+				struct rw_semaphore *fshared)
 {
 	u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
+	struct task_struct *oldowner = pi_state->owner;
 	u32 uval, curval, newval;
-	int ret;
+	int ret, attempt = 0;
 
 	/* Owner died? */
+	if (!pi_state->owner)
+		newtid |= FUTEX_OWNER_DIED;
+
+	/*
+	 * We are here either because we stole the rtmutex from the
+	 * pending owner or we are the pending owner which failed to
+	 * get the rtmutex. We have to replace the pending owner TID
+	 * in the user space variable. This must be atomic as we have
+	 * preserve the owner died bit here.
+	 *
+	 * Note: We write the user space value _before_ changing the
+	 * pi_state because we can fault here. Imagine swapped out
+	 * pages or a fork, which was running right before we acquired
+	 * mmap_sem, that marked all the anonymous memory readonly for
+	 * cow.
+	 *
+	 * Modifying pi_state _before_ the user space value would
+	 * leave the pi_state in an inconsistent state when we fault
+	 * here, because we need to drop the hash bucket lock to
+	 * handle the fault. This might be observed in the PID check
+	 * in lookup_pi_state.
+	 */
+retry:
+	if (get_futex_value_locked(&uval, uaddr))
+		goto handle_fault;
+
+	while (1) {
+		newval = (uval & FUTEX_OWNER_DIED) | newtid;
+
+		curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+
+		if (curval == -EFAULT)
+			goto handle_fault;
+		if (curval == uval)
+			break;
+		uval = curval;
+	}
+
+	/*
+	 * We fixed up user space. Now we need to fix the pi_state
+	 * itself.
+	 */
 	if (pi_state->owner != NULL) {
 		spin_lock_irq(&pi_state->owner->pi_lock);
 		WARN_ON(list_empty(&pi_state->list));
 		list_del_init(&pi_state->list);
 		spin_unlock_irq(&pi_state->owner->pi_lock);
-	} else
-		newtid |= FUTEX_OWNER_DIED;
+	}
 
 	pi_state->owner = newowner;
 
@@ -1129,26 +1172,35 @@ static int fixup_pi_state_owner(u32 __us
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &newowner->pi_state_list);
 	spin_unlock_irq(&newowner->pi_lock);
+	return 0;
 
 	/*
-	 * We own it, so we have to replace the pending owner
-	 * TID. This must be atomic as we have preserve the
-	 * owner died bit here.
+	 * To handle the page fault we need to drop the hash bucket
+	 * lock here. That gives the other task (either the pending
+	 * owner itself or the task which stole the rtmutex) the
+	 * chance to try the fixup of the pi_state. So once we are
+	 * back from handling the fault we need to check the pi_state
+	 * after reacquiring the hash bucket lock and before trying to
+	 * do another fixup. When the fixup has been done already we
+	 * simply return.
 	 */
-	ret = get_futex_value_locked(&uval, uaddr);
+handle_fault:
+	spin_unlock(q->lock_ptr);
 
-	while (!ret) {
-		newval = (uval & FUTEX_OWNER_DIED) | newtid;
+	ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
 
-		curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
+	spin_lock(q->lock_ptr);
 
-		if (curval == -EFAULT)
-			ret = -EFAULT;
-		if (curval == uval)
-			break;
-		uval = curval;
-	}
-	return ret;
+	/*
+	 * Check if someone else fixed it for us:
+	 */
+	if (pi_state->owner != oldowner)
+		return 0;
+
+	if (ret)
+		return ret;
+
+	goto retry;
 }
 
 /*
@@ -1505,7 +1557,7 @@ static int futex_lock_pi(u32 __user *uad
 		 * that case:
 		 */
 		if (q.pi_state->owner != curr)
-			ret = fixup_pi_state_owner(uaddr, &q, curr);
+			ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
 	} else {
 		/*
 		 * Catch the rare case, where the lock was released
@@ -1537,7 +1589,8 @@ static int futex_lock_pi(u32 __user *uad
 			int res;
 
 			owner = rt_mutex_owner(&q.pi_state->pi_mutex);
-			res = fixup_pi_state_owner(uaddr, &q, owner);
+			res = fixup_pi_state_owner(uaddr, &q, owner,
+						   fshared);
 
 			WARN_ON(rt_mutex_owner(&q.pi_state->pi_mutex) !=
 				owner);

patches/futex-remove-warn-on.patch

Subject: futex-unlock-rtmutex-on-fault.patch
From: Thomas Gleixner <tglx@linutronix.de>
Date: Sat, 21 Jun 2008 14:14:08 +0200

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/futex.c |    3 ---
 1 file changed, 3 deletions(-)

Index: linux-2.6.24.7/kernel/futex.c
===================================================================
--- linux-2.6.24.7.orig/kernel/futex.c
+++ linux-2.6.24.7/kernel/futex.c
@@ -1592,9 +1592,6 @@ static int futex_lock_pi(u32 __user *uad
 			res = fixup_pi_state_owner(uaddr, &q, owner,
 						   fshared);
 
-			WARN_ON(rt_mutex_owner(&q.pi_state->pi_mutex) !=
-				owner);
-
 			/* propagate -EFAULT, if the fixup failed */
 			if (res)
 				ret = res;

patches/x86-64-fix-copy-user.patch

commit 42a886af728c089df8da1b0017b0e7e6c81b5335
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Jun 17 17:47:50 2008 -0700

    x86-64: Fix "bytes left to copy" return value for copy_from_user()

    Most users by far do not care about the exact return value (they only
    really care about whether the copy succeeded in its entirety or not),
    but a few special core routines actually care deeply about exactly how many
    bytes were copied from user space. And the unrolled versions of the
    x86-64 user copy routines would sometimes report that it had copied
    more bytes than it actually had.

    Very few uses actually have partial copies to begin with, but to make
    this bug even harder to trigger, most x86 CPU's use the "rep string"
    instructions for normal user copies, and that version didn't have this
    issue.

    To make it even harder to hit, the one user of this that really cared
    about the return value (and used the uncached version of the copy that
    doesn't use the "rep string" instructions) was the generic write
    routine, which pre-populated its source, once more hiding the problem
    by avoiding the exception case that triggers the bug.

    In other words, very special thanks to Bron Gondwana who not only
    triggered this, but created a test-program to show it, and bisected
    the behavior down to commit 08291429cfa6258c4cd95d8833beb40f828b194e
    ("mm: fix pagecache write deadlocks") which changed the access pattern
    just enough that you can now trigger it with 'writev()' with multiple
    iovec's.

    That commit itself was not the cause of the bug, it just allowed all
    the stars to align just right that you could trigger the problem.

    [ Side note: this is just the minimal fix to make the copy routines
      (with __copy_from_user_inatomic_nocache as the particular version
      that was involved in showing this) have the right return values.

      We really should improve on the exceptional case further - to make
      the copy do a byte-accurate copy up to the exact page limit that
      causes it to fail. As it is, the callers have to do extra work to
      handle the limit case gracefully. ]

    Reported-by: Bron Gondwana <brong@fastmail.fm>
    Cc: Nick Piggin <npiggin@suse.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Andi Kleen <andi@firstfloor.org>
    Cc: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

---
 arch/x86/lib/copy_user_64.S         |   25 +++++++++++--------------
 arch/x86/lib/copy_user_nocache_64.S |   25 +++++++++++--------------
 2 files changed, 22 insertions(+), 28 deletions(-)

Index: linux-2.6.24.7/arch/x86/lib/copy_user_64.S
===================================================================
--- linux-2.6.24.7.orig/arch/x86/lib/copy_user_64.S
+++ linux-2.6.24.7/arch/x86/lib/copy_user_64.S
@@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled)
 	/* table sorted by exception address */
 	.section __ex_table,"a"
 	.align 8
-	.quad .Ls1,.Ls1e
-	.quad .Ls2,.Ls2e
-	.quad .Ls3,.Ls3e
-	.quad .Ls4,.Ls4e
-	.quad .Ld1,.Ls1e
+	.quad .Ls1,.Ls1e	/* Ls1-Ls4 have copied zero bytes */
+	.quad .Ls2,.Ls1e
+	.quad .Ls3,.Ls1e
+	.quad .Ls4,.Ls1e
+	.quad .Ld1,.Ls1e	/* Ld1-Ld4 have copied 0-24 bytes */
 	.quad .Ld2,.Ls2e
 	.quad .Ld3,.Ls3e
 	.quad .Ld4,.Ls4e
-	.quad .Ls5,.Ls5e
-	.quad .Ls6,.Ls6e
-	.quad .Ls7,.Ls7e
-	.quad .Ls8,.Ls8e
-	.quad .Ld5,.Ls5e
+	.quad .Ls5,.Ls5e	/* Ls5-Ls8 have copied 32 bytes */
+	.quad .Ls6,.Ls5e
+	.quad .Ls7,.Ls5e
+	.quad .Ls8,.Ls5e
+	.quad .Ld5,.Ls5e	/* Ld5-Ld8 have copied 32-56 bytes */
 	.quad .Ld6,.Ls6e
 	.quad .Ld7,.Ls7e
 	.quad .Ld8,.Ls8e
@@ -244,11 +244,8 @@ ENTRY(copy_user_generic_unrolled)
 	.quad .Le5,.Le_zero
 	.previous
 
-	/* compute 64-offset for main loop. 8 bytes accuracy with error on the
-	   pessimistic side. this is gross. it would be better to fix the
-	   interface. */
 	/* eax: zero, ebx: 64 */
-.Ls1e:	addl $8,%eax
+.Ls1e:	addl $8,%eax	/* eax is bytes left uncopied within the loop (Ls1e: 64 .. Ls8e: 8) */
 .Ls2e:	addl $8,%eax
 .Ls3e:	addl $8,%eax
 .Ls4e:	addl $8,%eax
Index: linux-2.6.24.7/arch/x86/lib/copy_user_nocache_64.S
===================================================================
--- linux-2.6.24.7.orig/arch/x86/lib/copy_user_nocache_64.S
+++ linux-2.6.24.7/arch/x86/lib/copy_user_nocache_64.S
@@ -145,19 +145,19 @@ ENTRY(__copy_user_nocache)
 	/* table sorted by exception address */
 	.section __ex_table,"a"
 	.align 8
-	.quad .Ls1,.Ls1e
-	.quad .Ls2,.Ls2e
-	.quad .Ls3,.Ls3e
-	.quad .Ls4,.Ls4e
-	.quad .Ld1,.Ls1e
+	.quad .Ls1,.Ls1e	/* .Ls[1-4] - 0 bytes copied */
+	.quad .Ls2,.Ls1e
+	.quad .Ls3,.Ls1e
+	.quad .Ls4,.Ls1e
+	.quad .Ld1,.Ls1e	/* .Ld[1-4] - 0..24 bytes coped */
 	.quad .Ld2,.Ls2e
 	.quad .Ld3,.Ls3e
 	.quad .Ld4,.Ls4e
-	.quad .Ls5,.Ls5e
-	.quad .Ls6,.Ls6e
-	.quad .Ls7,.Ls7e
-	.quad .Ls8,.Ls8e
-	.quad .Ld5,.Ls5e
+	.quad .Ls5,.Ls5e	/* .Ls[5-8] - 32 bytes copied */
+	.quad .Ls6,.Ls5e
+	.quad .Ls7,.Ls5e
+	.quad .Ls8,.Ls5e
+	.quad .Ld5,.Ls5e	/* .Ld[5-8] - 32..56 bytes copied */
 	.quad .Ld6,.Ls6e
 	.quad .Ld7,.Ls7e
 	.quad .Ld8,.Ls8e
@@ -172,11 +172,8 @@ ENTRY(__copy_user_nocache)
 	.quad .Le5,.Le_zero
 	.previous
 
-	/* compute 64-offset for main loop. 8 bytes accuracy with error on the
-	   pessimistic side. this is gross. it would be better to fix the
-	   interface. */
 	/* eax: zero, ebx: 64 */
-.Ls1e:	addl $8,%eax
+.Ls1e:	addl $8,%eax	/* eax: bytes left uncopied: Ls1e: 64 .. Ls8e: 8 */
 .Ls2e:	addl $8,%eax
 .Ls3e:	addl $8,%eax
 .Ls4e:	addl $8,%eax

patches/mm-fix-race-in-cow-logic.patch

From: Nick Piggin <npiggin@suse.de>
Subject: mm: fix race in the COW logic

There is a race in the COW logic. It contains a shortcut to avoid the
COW and reuse the page if we have the sole reference on the page,
however it is possible to have two racing do_wp_page()ers with one
causing the other to mistakenly believe it is safe to take the shortcut
when it is not. This could lead to data corruption.

Process 1 and process2 each have a wp pte of the same anon page (ie.
one forked the other). The page's mapcount is 2. Then they both attempt
to write to it around the same time...

 proc1				proc2 thr1			proc2 thr2
 CPU0				CPU1				CPU3
 do_wp_page()			do_wp_page()
				trylock_page()
				can_share_swap_page()
				load page mapcount (==2)
				reuse = 0
				pte unlock
				copy page to new_page
				pte lock
				page_remove_rmap(page);
 trylock_page()
 can_share_swap_page()
 load page mapcount (==1)
 reuse = 1
 ptep_set_access_flags (allow W)

 write private key into page
								read from page
				ptep_clear_flush()
				set_pte_at(pte of new_page)

Fix this by moving the page_remove_rmap of the old page after the pte
clear and flush.
Potentially the entire branch could be moved down here, but in order to
stay consistent, I won't (should probably move all the *_mm_counter
stuff with one patch).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: williams@redhat.com
Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
---
 mm/memory.c |   27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

Index: linux-2.6.24.7/mm/memory.c
===================================================================
--- linux-2.6.24.7.orig/mm/memory.c
+++ linux-2.6.24.7/mm/memory.c
@@ -1639,7 +1639,6 @@ gotten:
 	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
-			page_remove_rmap(old_page, vma);
 			if (!PageAnon(old_page)) {
 				dec_mm_counter(mm, file_rss);
 				inc_mm_counter(mm, anon_rss);
@@ -1661,6 +1660,32 @@ gotten:
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
 
+		if (old_page) {
+			/*
+			 * Only after switching the pte to the new page may
+			 * we remove the mapcount here. Otherwise another
+			 * process may come and find the rmap count decremented
+			 * before the pte is switched to the new page, and
+			 * "reuse" the old page writing into it while our pte
+			 * here still points into it and can be read by other
+			 * threads.
+			 *
+			 * The critical issue is to order this
+			 * page_remove_rmap with the ptp_clear_flush above.
+			 * Those stores are ordered by (if nothing else,)
+			 * the barrier present in the atomic_add_negative
+			 * in page_remove_rmap.
+			 *
+			 * Then the TLB flush in ptep_clear_flush ensures that
+			 * no process can access the old page before the
+			 * decremented mapcount is visible. And the old page
+			 * cannot be reused until after the decremented
+			 * mapcount is visible. So transitively, TLBs to
+			 * old page will be flushed before it can be reused.
+			 */
+			page_remove_rmap(old_page, vma);
+		}
+
 		/* Free the old page.. */
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;

patches/hrtimer-20080427.patch

Subject: [GTI pull] hrtimer fixes
From: Thomas Gleixner <tglx@linutronix.de>

Linus,

please pull hrtimer fixes from:

  ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt.git master

This fixes a long standing hrtimer reprogramming bug.

Thanks,

	tglx

---
 kernel/hrtimer.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6.24.7/kernel/hrtimer.c
===================================================================
--- linux-2.6.24.7.orig/kernel/hrtimer.c
+++ linux-2.6.24.7/kernel/hrtimer.c
@@ -1172,8 +1172,19 @@ static void run_hrtimer_softirq(struct s
 			 * If the timer was rearmed on another CPU, reprogram
 			 * the event device.
 			 */
-			if (timer->base->first == &timer->node)
-				hrtimer_reprogram(timer, timer->base);
+			struct hrtimer_clock_base *base = timer->base;
+
+			if (base->first == &timer->node &&
+			    hrtimer_reprogram(timer, base)) {
+				/*
+				 * Timer is expired. Thus move it from tree to
+				 * pending list again.
+				 */
+				__remove_hrtimer(timer, base,
+						 HRTIMER_STATE_PENDING, 0);
+				list_add_tail(&timer->cb_entry,
+					      &base->cpu_base->cb_pending);
+			}
 		}
 	}
 	spin_unlock_irq(&cpu_base->lock);

patches/hrtimer-deadlock-fix.patch

Subject: [GTI pull] hrtimer fixes
From: Thomas Gleixner <tglx@linutronix.de>

Linus,

please pull hrtimer fixes from:

  ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt.git master

Fix for a potential deadlock which was introduced with the scheduler
hrtimer changes in .25.

Thanks,

	tglx

---
 kernel/hrtimer.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6.24.7/kernel/hrtimer.c
===================================================================
--- linux-2.6.24.7.orig/kernel/hrtimer.c
+++ linux-2.6.24.7/kernel/hrtimer.c
@@ -593,7 +593,6 @@ static inline int hrtimer_enqueue_reprog
 		list_add_tail(&timer->cb_entry,
 			      &base->cpu_base->cb_pending);
 		timer->state = HRTIMER_STATE_PENDING;
-		raise_softirq(HRTIMER_SOFTIRQ);
 		return 1;
 	default:
 		BUG();
@@ -636,6 +635,11 @@ static int hrtimer_switch_to_hres(void)
 	return 1;
 }
 
+static inline void hrtimer_raise_softirq(void)
+{
+	raise_softirq(HRTIMER_SOFTIRQ);
+}
+
 #else
 
 static inline int hrtimer_hres_active(void) { return 0; }
@@ -651,6 +655,7 @@ static inline int hrtimer_cb_pending(str
 static inline void hrtimer_remove_cb_pending(struct hrtimer *timer) { }
 static inline void hrtimer_init_hres(struct hrtimer_cpu_base *base) { }
 static inline void hrtimer_init_timer_hres(struct hrtimer *timer) { }
+static inline void hrtimer_raise_softirq(void) { }
 
 #endif /* CONFIG_HIGH_RES_TIMERS */
 
@@ -852,6 +857,7 @@ hrtimer_start(struct hrtimer *timer, kti
 	struct hrtimer_clock_base *base, *new_base;
 	unsigned long flags;
 	int ret;
+	int raise;
 
 	base = lock_hrtimer_base(timer, &flags);
 
@@ -885,8 +891,18 @@ hrtimer_start(struct hrtimer *timer, kti
 	enqueue_hrtimer(timer, new_base,
 			new_base->cpu_base == &__get_cpu_var(hrtimer_bases));
 
+	/*
+	 * The timer may be expired and moved to the cb_pending
+	 * list. We can not raise the softirq with base lock held due
+	 * to a possible deadlock with runqueue lock.
+	 */
+	raise = timer->state == HRTIMER_STATE_PENDING;
+
 	unlock_hrtimer_base(timer, &flags);
 
+	if (raise)
+		hrtimer_raise_softirq();
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(hrtimer_start);

patches/hrtimer-infinite-loop-fix.patch

Subject: [RHEL5.2 PATCH] CVE-2007-6712 kernel: infinite loop in highres timers (kernel hang)
From: Michal Schmidt <mschmidt@redhat.com>

Description
===========
(from Thomas Gleixner's patch description:)

hrtimer_forward() does not check for the possible overflow of
timer->expires. This can happen on 64 bit machines with large interval
values and results currently in an endless loop in the softirq because
the expiry value becomes negative and therefore the timer is expired all
the time.

Check for this condition and set the expiry value to the max. expiry
time in the future. The fix should be applied to stable kernel series as
well.

Upstream status
===============
Upstream commit 13788ccc41ceea5893f9c747c59bc0b28f2416c2
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Mar 16 13:38:20 2007 -0800

    [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()

Testing
=======
Scratch build in Brew:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1269316

A reproducer is attached to the BZ. I tested it successfully on an
x86_64 system.

Please ACK.
Michal

---
 kernel/hrtimer.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: linux-2.6.24.7/kernel/hrtimer.c
===================================================================
--- linux-2.6.24.7.orig/kernel/hrtimer.c
+++ linux-2.6.24.7/kernel/hrtimer.c
@@ -717,6 +717,12 @@ hrtimer_forward(struct hrtimer *timer, k
 		orun++;
 	}
 	timer->expires = ktime_add_safe(timer->expires, interval);
+	/*
+	 * Make sure, that the result did not wrap with a very large
+	 * interval.
+	 */
+	if (timer->expires.tv64 < 0)
+		timer->expires = ktime_set(KTIME_SEC_MAX, 0);
 
 	return orun;
 }

patches/hrtimer-dont-migrate-raisesoftirq.patch

Subject: hrtimer: prevent migration for raising CPU
From: Steven Rostedt <srostedt@redhat.com>

Due to a possible deadlock, the waking of the softirq was pushed outside
of the hrtimer base locks. Unfortunately this allows the task to migrate
after setting up the softirq and raising it. Since softirqs run a queue
that is per-cpu we may raise the softirq on the wrong CPU and this will
keep the queued softirq task from running.

To solve this issue, this patch disables preemption around the releasing
of the hrtimer lock and raising of the softirq.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/hrtimer.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6.24.7/kernel/hrtimer.c
===================================================================
--- linux-2.6.24.7.orig/kernel/hrtimer.c
+++ linux-2.6.24.7/kernel/hrtimer.c
@@ -904,10 +904,18 @@ hrtimer_start(struct hrtimer *timer, kti
 	 */
 	raise = timer->state == HRTIMER_STATE_PENDING;
 
+	/*
+	 * We use preempt_disable to prevent this task from migrating after
+	 * setting up the softirq and raising it. Otherwise, if me migrate
+	 * we will raise the softirq on the wrong CPU.
+	 */
+	preempt_disable();
+
 	unlock_hrtimer_base(timer, &flags);
 
 	if (raise)
 		hrtimer_raise_softirq();
 
+	preempt_enable();
 	return ret;
 }

patches/linux-2.6.24-pollfix.patch

Make sys_poll() wait at least timeout ms
From: Karsten Wiese <fzu@wemgehoertderstaat.de>

schedule_timeout(jiffies) waits for at least jiffies - 1. Add 1 jiffie
to the timeout_jiffies calculated in sys_poll() to wait at least
timeout_msecs, like poll() manpage says.
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
---
 fs/select.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24.7/fs/select.c
===================================================================
--- linux-2.6.24.7.orig/fs/select.c
+++ linux-2.6.24.7/fs/select.c
@@ -739,7 +739,7 @@ asmlinkage long sys_poll(struct pollfd _
 			timeout_jiffies = -1;
 		else
 #endif
-			timeout_jiffies = msecs_to_jiffies(timeout_msecs);
+			timeout_jiffies = msecs_to_jiffies(timeout_msecs) + 1;
 	} else {
 		/* Infinite (< 0) or no (0) timeout */
 		timeout_jiffies = timeout_msecs;

patches/CVE-2008-1615-linux-2.6-paranoid-iret.patch

From: Clark Williams <williams@redhat.com>

---
 arch/x86/kernel/entry_64.S |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24.7/arch/x86/kernel/entry_64.S
===================================================================
--- linux-2.6.24.7.orig/arch/x86/kernel/entry_64.S
+++ linux-2.6.24.7/arch/x86/kernel/entry_64.S
@@ -779,7 +779,7 @@ paranoid_swapgs\trace:
 	swapgs
 paranoid_restore\trace:
 	RESTORE_ALL 8
-	iretq
+	jmp iret_label
 paranoid_userspace\trace:
 	GET_THREAD_INFO(%rcx)
 	movl threadinfo_flags(%rcx),%ebx

patches/CVE-2007-6694-ppc-chrs-null-fix.patch

From: Clark Williams <williams@redhat.com>

---
 arch/powerpc/platforms/chrp/setup.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.24.7/arch/powerpc/platforms/chrp/setup.c
===================================================================
--- linux-2.6.24.7.orig/arch/powerpc/platforms/chrp/setup.c
+++ linux-2.6.24.7/arch/powerpc/platforms/chrp/setup.c
@@ -115,7 +115,7 @@ void chrp_show_cpuinfo(struct seq_file *
 		seq_printf(m, "machine\t\t: CHRP %s\n", model);
 
 	/* longtrail (goldengate) stuff */
-	if (!strncmp(model, "IBM,LongTrail", 13)) {
+	if (model && !strncmp(model, "IBM,LongTrail", 13)) {
 		/* VLSI VAS96011/12 `Golden Gate 2' */
 		/* Memory banks */
sdramen = (in_le32(gg2_pci_config_base + GG2_PCI_DRAM_CTRL) @@ -203,15 +203,20 @@ static void __init sio_fixup_irq(const c static void __init sio_init(void) { struct device_node *root; + const char *model; - if ((root = of_find_node_by_path("/")) && - !strncmp(of_get_property(root, "model", NULL), - "IBM,LongTrail", 13)) { + root = of_find_node_by_path("/"); + if (!root) + return; + + model = of_get_property(root, "model", NULL); + if (model && !strncmp(model,"IBM,LongTrail", 13)) { /* logical device 0 (KBC/Keyboard) */ sio_fixup_irq("keyboard", 0, 1, 2); /* select logical device 1 (KBC/Mouse) */ sio_fixup_irq("mouse", 1, 12, 2); } + of_node_put(root); } ������������������patches/CVE-2008-1673-ans1_sanity_check_on_BER_decoding.patch���������������������������������������0000664�0000764�0000764�00000006206�11041673274�022275� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: asn1: additional sanity checking during BER decoding From: Chris Wright <chrisw@sous-sol.org> X-Git-Tag: v2.6.26-rc6~100 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=ddb2c43594f22843e9f3153da151deaba1a834c5 asn1: additional sanity checking during BER decoding - Don't trust a length which is greater than the working buffer. An invalid length could cause overflow when calculating buffer size for decoding oid. - An oid length of zero is invalid and allows for an off-by-one error when decoding oid because the first subid actually encodes first 2 subids. - A primitive encoding may not have an indefinite length. Thanks to Wei Wang from McAfee for report. 
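The first of those checks can be shown in isolation; the ber_ctx structure below is a simplified stand-in for the kernel's asn1_ctx, not the actual decoder:

/* Simplified stand-in for the kernel's asn1_ctx; illustration only. */
#include <stddef.h>
#include <stdio.h>

struct ber_ctx {
	const unsigned char *pointer;	/* current decode position */
	const unsigned char *end;	/* one past the last valid byte */
};

/* Reject any decoded length that does not fit in the remaining buffer. */
static int ber_len_ok(const struct ber_ctx *ctx, size_t len)
{
	return len <= (size_t)(ctx->end - ctx->pointer);
}

int main(void)
{
	unsigned char buf[16] = { 0 };
	struct ber_ctx ctx = { buf, buf + sizeof(buf) };

	printf("len 8  with 16 bytes left: %s\n", ber_len_ok(&ctx, 8) ? "ok" : "reject");
	printf("len 64 with 16 bytes left: %s\n", ber_len_ok(&ctx, 64) ? "ok" : "reject");
	return 0;
}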
Cc: Steven French <sfrench@us.ibm.com> Cc: stable@kernel.org Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/cifs/asn1.c | 14 ++++++++++++++ net/ipv4/netfilter/nf_nat_snmp_basic.c | 14 ++++++++++++++ 2 files changed, 28 insertions(+) Index: linux-2.6.24.7/fs/cifs/asn1.c =================================================================== --- linux-2.6.24.7.orig/fs/cifs/asn1.c +++ linux-2.6.24.7/fs/cifs/asn1.c @@ -186,6 +186,11 @@ asn1_length_decode(struct asn1_ctx *ctx, } } } + + /* don't trust len bigger than ctx buffer */ + if (*len > ctx->end - ctx->pointer) + return 0; + return 1; } @@ -203,6 +208,10 @@ asn1_header_decode(struct asn1_ctx *ctx, if (!asn1_length_decode(ctx, &def, &len)) return 0; + /* primitive shall be definite, indefinite shall be constructed */ + if (*con == ASN1_PRI && !def) + return 0; + if (def) *eoc = ctx->pointer + len; else @@ -389,6 +398,11 @@ asn1_oid_decode(struct asn1_ctx *ctx, unsigned long *optr; size = eoc - ctx->pointer + 1; + + /* first subid actually encodes first two subids */ + if (size < 2 || size > ULONG_MAX/sizeof(unsigned long)) + return 0; + *oid = kmalloc(size * sizeof(unsigned long), GFP_ATOMIC); if (*oid == NULL) return 0; Index: linux-2.6.24.7/net/ipv4/netfilter/nf_nat_snmp_basic.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/netfilter/nf_nat_snmp_basic.c +++ linux-2.6.24.7/net/ipv4/netfilter/nf_nat_snmp_basic.c @@ -231,6 +231,11 @@ static unsigned char asn1_length_decode( } } } + + /* don't trust len bigger than ctx buffer */ + if (*len > ctx->end - ctx->pointer) + return 0; + return 1; } @@ -249,6 +254,10 @@ static unsigned char asn1_header_decode( if (!asn1_length_decode(ctx, &def, &len)) return 0; + /* primitive shall be definite, indefinite shall be constructed */ + if (*con == ASN1_PRI && !def) + return 0; + if (def) *eoc = ctx->pointer + len; else @@ -433,6 +442,11 @@ static unsigned char asn1_oid_decode(str unsigned long *optr; size = eoc - ctx->pointer + 1; + + /* first subid actually encodes first two subids */ + if (size < 2 || size > ULONG_MAX/sizeof(unsigned long)) + return 0; + *oid = kmalloc(size * sizeof(unsigned long), GFP_ATOMIC); if (*oid == NULL) { if (net_ratelimit()) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/CVE-2008-2136-missing_kfree_skb_on_pskb_may_pull.patch��������������������������������������0000664�0000764�0000764�00000001616�11041673273�022762� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sit: Add missing kfree_skb() on pskb_may_pull() failure. From: David S. 
Miller <davem@davemloft.net> X-Git-Tag: v2.6.26-rc2~19^2 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=36ca34cc3b8335eb1fe8bd9a1d0a2592980c3f02 sit: Add missing kfree_skb() on pskb_may_pull() failure. Noticed by Paul Marks <paul@pmarks.net>. Signed-off-by: David S. Miller <davem@davemloft.net> --- net/ipv6/sit.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/net/ipv6/sit.c =================================================================== --- linux-2.6.24.7.orig/net/ipv6/sit.c +++ linux-2.6.24.7/net/ipv6/sit.c @@ -395,9 +395,9 @@ static int ipip6_rcv(struct sk_buff *skb } icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); - kfree_skb(skb); read_unlock(&ipip6_lock); out: + kfree_skb(skb); return 0; } ������������������������������������������������������������������������������������������������������������������patches/CVE-2008-2148-simplify_sched_fair.patch�����������������������������������������������������0000664�0000764�0000764�00000002611�11041673273�017663� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: simplify sched_slice() From: Ingo Molnar <mingo@elte.hu> X-Git-Tag: v2.6.25-rc6~5 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=6a6029b8cefe0ca7e82f27f3904dbedba3de4e06 sched: simplify sched_slice() Use the existing calc_delta_mine() calculation for sched_slice(). This saves a divide and simplifies the code because we share it with the other /cfs_rq->load users. 
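Roughly speaking, helpers of the calc_delta_mine() kind multiply by a precomputed fixed-point inverse of the load weight instead of dividing on every call; the sketch below shows that idea with illustrative values, not the scheduler's real weight tables:

/* Illustrative constants only; not the scheduler's actual code. */
#include <stdio.h>
#include <stdint.h>

#define WMULT_SHIFT	32

struct load_weight {
	unsigned long weight;
	uint64_t inv_weight;		/* ~2^32 / weight, computed once */
};

static void set_load_weight(struct load_weight *lw, unsigned long w)
{
	lw->weight = w;
	lw->inv_weight = (((uint64_t)1 << WMULT_SHIFT) + w / 2) / w;
}

/* delta * se_weight / lw->weight, with no per-call division */
static uint64_t calc_delta_approx(uint64_t delta, unsigned long se_weight,
				  const struct load_weight *lw)
{
	return (delta * se_weight * lw->inv_weight) >> WMULT_SHIFT;
}

int main(void)
{
	struct load_weight lw;
	uint64_t period = 20000000;	/* 20 ms scheduling period, in ns */

	set_load_weight(&lw, 3072);	/* e.g. three nice-0 entities */
	printf("exact slice : %llu ns\n",
	       (unsigned long long)(period * 1024 / lw.weight));
	printf("approx slice: %llu ns\n",
	       (unsigned long long)calc_delta_approx(period, 1024, &lw));
	return 0;
}

A small rounding difference against the exact division is expected from the truncated inverse.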
It also improves code size: text data bss dec hex filename 42659 2740 144 45543 b1e7 sched.o.before 42093 2740 144 44977 afb1 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched_fair.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -263,12 +263,8 @@ static u64 __sched_period(unsigned long */ static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) { - u64 slice = __sched_period(cfs_rq->nr_running); - - slice *= se->load.weight; - do_div(slice, cfs_rq->load.weight); - - return slice; + return calc_delta_mine(__sched_period(cfs_rq->nr_running), + se->load.weight, &cfs_rq->load); } /* �����������������������������������������������������������������������������������������������������������������������patches/CVE-2007-6282-2.6.24.1_esp_iv_bug.patch�����������������������������������������������������0000664�0000764�0000764�00000001746�11041673273�016764� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� From: Clark Williams <williams@redhat.com> --- net/ipv4/esp4.c | 2 +- net/ipv6/esp6.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/net/ipv4/esp4.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/esp4.c +++ linux-2.6.24.7/net/ipv4/esp4.c @@ -165,7 +165,7 @@ static int esp_input(struct xfrm_state * int padlen; int err; - if (!pskb_may_pull(skb, sizeof(*esph))) + if (!pskb_may_pull(skb, sizeof(*esph) + esp->conf.ivlen)) goto out; if (elen <= 0 || (elen & (blksize-1))) Index: linux-2.6.24.7/net/ipv6/esp6.c =================================================================== --- linux-2.6.24.7.orig/net/ipv6/esp6.c +++ linux-2.6.24.7/net/ipv6/esp6.c @@ -155,7 +155,7 @@ static int esp6_input(struct xfrm_state int nfrags; int ret = 0; - if (!pskb_may_pull(skb, sizeof(*esph))) { + if (!pskb_may_pull(skb, sizeof(*esph) + esp->conf.ivlen)) { ret = -EINVAL; goto out; } ��������������������������patches/CVE-2008-2148-fix_utimensat_permissions_check.patch�����������������������������������������0000664�0000764�0000764�00000004755�11041673273�022342� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: vfs: fix permission checking in sys_utimensat From: Miklos Szeredi <mszeredi@suse.cz> X-Git-Tag: v2.6.25.3~15 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.25.y.git;a=commitdiff_plain;h=f9dfda1ad0637a89a64d001cf81478bd8d9b6306 vfs: fix permission checking in sys_utimensat commit: 02c6be615f1fcd37ac5ed93a3ad6692ad8991cd9 upstream If utimensat() is called with both times set to UTIME_NOW or one of them to UTIME_NOW and the other to UTIME_OMIT, then it will update the file time without any 
permission checking. I don't think this can be used for anything other than a local DoS, but could be quite bewildering at that (e.g. "Why was that large source tree rebuilt when I didn't modify anything???") This affects all kernels from 2.6.22, when the utimensat() syscall was introduced. Fix by doing the same permission checking as for the "times == NULL" case. Thanks to Michael Kerrisk, whose utimensat-non-conformances-and-fixes.patch in -mm also fixes this (and breaks other stuff), only he didn't realize the security implications of this bug. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> --- fs/utimes.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/fs/utimes.c =================================================================== --- linux-2.6.24.7.orig/fs/utimes.c +++ linux-2.6.24.7/fs/utimes.c @@ -38,9 +38,14 @@ asmlinkage long sys_utime(char __user *f #endif +static bool nsec_special(long nsec) +{ + return nsec == UTIME_OMIT || nsec == UTIME_NOW; +} + static bool nsec_valid(long nsec) { - if (nsec == UTIME_OMIT || nsec == UTIME_NOW) + if (nsec_special(nsec)) return true; return nsec >= 0 && nsec <= 999999999; @@ -114,7 +119,15 @@ long do_utimes(int dfd, char __user *fil newattrs.ia_mtime.tv_nsec = times[1].tv_nsec; newattrs.ia_valid |= ATTR_MTIME_SET; } - } else { + } + + /* + * If times is NULL or both times are either UTIME_OMIT or + * UTIME_NOW, then need to check permissions, because + * inode_change_ok() won't do it. + */ + if (!times || (nsec_special(times[0].tv_nsec) && + nsec_special(times[1].tv_nsec))) { error = -EACCES; if (IS_IMMUTABLE(inode)) goto dput_and_out; �������������������patches/CVE-2008-2372-reinstate_ZERO_PAGE_optimization_in_get_user_pages_and_fix_XIP.patch����������0000664�0000764�0000764�00000011341�11041673273�030070� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP From: Linus Torvalds <torvalds@linux-foundation.org> X-Git-Tag: v2.6.26-rc7~12 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=89f5b7da2a6bad2e84670422ab8192382a5aeb9f Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit 557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed the ZERO_PAGE from the VM mappings, any users of get_user_pages() will generally now populate the VM with real empty pages needlessly. We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but since fault handling no longer uses ZERO_PAGE for new anonymous pages, we now need to handle that special case in follow_page() instead. In particular, the removal of ZERO_PAGE effectively removed the core file writing optimization where we would skip writing pages that had not been populated at all, and increased memory pressure a lot by allocating all those useless newly zeroed pages. 
This reinstates the optimization by making the unmapped PTE case the same as for a non-existent page table, which already did this correctly. While at it, this also fixes the XIP case for follow_page(), where the caller could not differentiate between the case of a page that simply could not be used (because it had no "struct page" associated with it) and a page that just wasn't mapped. We do that by simply returning an error pointer for pages that could not be turned into a "struct page *". The error is arbitrarily picked to be EFAULT, since that was what get_user_pages() already used for the equivalent IO-mapped page case. [ Also removed an impossible test for pte_offset_map_lock() failing: that's not how that function works ] Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- arch/powerpc/kernel/vdso.c | 2 +- mm/memory.c | 17 +++++++++++++---- mm/migrate.c | 10 ++++++++++ 3 files changed, 24 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/vdso.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/vdso.c +++ linux-2.6.24.7/arch/powerpc/kernel/vdso.c @@ -141,7 +141,7 @@ static void dump_one_vdso_page(struct pa printk("kpg: %p (c:%d,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT), page_count(pg), pg->flags); - if (upg/* && pg != upg*/) { + if (upg && !IS_ERR(upg) /* && pg != upg*/) { printk(" upg: %p (c:%d,f:%08lx)", __va(page_to_pfn(upg) << PAGE_SHIFT), page_count(upg), Index: linux-2.6.24.7/mm/memory.c =================================================================== --- linux-2.6.24.7.orig/mm/memory.c +++ linux-2.6.24.7/mm/memory.c @@ -934,17 +934,15 @@ struct page *follow_page(struct vm_area_ } ptep = pte_offset_map_lock(mm, pmd, address, &ptl); - if (!ptep) - goto out; pte = *ptep; if (!pte_present(pte)) - goto unlock; + goto no_page; if ((flags & FOLL_WRITE) && !pte_write(pte)) goto unlock; page = vm_normal_page(vma, address, pte); if (unlikely(!page)) - goto unlock; + goto bad_page; if (flags & FOLL_GET) get_page(page); @@ -959,6 +957,15 @@ unlock: out: return page; +bad_page: + pte_unmap_unlock(ptep, ptl); + return ERR_PTR(-EFAULT); + +no_page: + pte_unmap_unlock(ptep, ptl); + if (!pte_none(pte)) + return page; + /* Fall through to ZERO_PAGE handling */ no_page_table: /* * When core dumping an enormous anonymous area that nobody @@ -1095,6 +1102,8 @@ int get_user_pages(struct task_struct *t cond_resched(); } + if (IS_ERR(page)) + return i ? 
i : PTR_ERR(page); if (pages) { pages[i] = page; Index: linux-2.6.24.7/mm/migrate.c =================================================================== --- linux-2.6.24.7.orig/mm/migrate.c +++ linux-2.6.24.7/mm/migrate.c @@ -823,6 +823,11 @@ static int do_move_pages(struct mm_struc goto set_status; page = follow_page(vma, pp->addr, FOLL_GET); + + err = PTR_ERR(page); + if (IS_ERR(page)) + goto set_status; + err = -ENOENT; if (!page) goto set_status; @@ -886,6 +891,11 @@ static int do_pages_stat(struct mm_struc goto set_status; page = follow_page(vma, pm->addr, 0); + + err = PTR_ERR(page); + if (IS_ERR(page)) + goto set_status; + err = -ENOENT; /* Use PageReserved to check for zero page */ if (!page || PageReserved(page)) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/CVE-2008-2372-fix_ZERO_PAGE_breakage_with_vmware.patch��������������������������������������0000664�0000764�0000764�00000005322�11041673272�022336� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Fix ZERO_PAGE breakage with vmware From: Linus Torvalds <torvalds@linux-foundation.org> X-Git-Tag: v2.6.26-rc8~16 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=672ca28e300c17bf8d792a2a7a8631193e580c74 Fix ZERO_PAGE breakage with vmware Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP") broke vmware, as reported by Jeff Chua: "This broke vmware 6.0.4. Jun 22 14:53:03.845: vmx| NOT_IMPLEMENTED /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774" and the reason seems to be that there's an old bug in how we handle do FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only triggered if the whole page table was missing, nobody had apparently hit it before. The recent changes to 'follow_page()' made the FOLL_ANON logic trigger not just for whole missing page tables, but for individual pages as well, and exposed this problem. This fixes it by making the test for when FOLL_ANON is used more careful, and also makes the code easier to read and understand by moving the logic to a separate inline function. Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/memory.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/mm/memory.c =================================================================== --- linux-2.6.24.7.orig/mm/memory.c +++ linux-2.6.24.7/mm/memory.c @@ -980,6 +980,26 @@ no_page_table: return page; } +/* Can we do the FOLL_ANON optimization? */ +static inline int use_zero_page(struct vm_area_struct *vma) +{ + /* + * We don't want to optimize FOLL_ANON for make_pages_present() + * when it tries to page in a VM_LOCKED region. As to VM_SHARED, + * we want to get the page from the page tables to make sure + * that we serialize and update with any other user of that + * mapping. 
+ */ + if (vma->vm_flags & (VM_LOCKED | VM_SHARED)) + return 0; + /* + * And if we have a fault or a nopfn routine, it's not an + * anonymous region. + */ + return !vma->vm_ops || + (!vma->vm_ops->fault && !vma->vm_ops->nopfn); +} + int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int write, int force, struct page **pages, struct vm_area_struct **vmas) @@ -1054,9 +1074,7 @@ int get_user_pages(struct task_struct *t foll_flags = FOLL_TOUCH; if (pages) foll_flags |= FOLL_GET; - if (!write && !(vma->vm_flags & VM_LOCKED) && - (!vma->vm_ops || (!vma->vm_ops->nopage && - !vma->vm_ops->fault))) + if (!write && use_zero_page(vma)) foll_flags |= FOLL_ANON; do { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix_inotify_user_coalescing-bz453990.patch��������������������������������������������������0000664�0000764�0000764�00000004112�11041673272�021142� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: A potential bug in inotify_user.c From: Yan Zheng <yanzheng@21cn.com> X-Git-Tag: v2.6.25-rc1~775 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=1c17d18e3775485bf1e0ce79575eb637a94494a2;hp=19c561a60ffe52df88dd63de0bff480ca094efe4 A potential bug in inotify_user.c Following comment is at fs/inotify_user.c:287 /* coalescing: drop this event if it is a dupe of the previous */ I think the previous event in the comment should be the last event in the link list. But inotify_dev_get_event return the first event in the list. In addition, it doesn't check whether the list is empty Signed-off-by: Yan Zheng<yanzheng@21cn.com> Acked-by: Robert Love <rlove@rlove.org> Cc: John McCutchan <ttb@tentacle.dhs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/inotify_user.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/fs/inotify_user.c =================================================================== --- linux-2.6.24.7.orig/fs/inotify_user.c +++ linux-2.6.24.7/fs/inotify_user.c @@ -248,6 +248,19 @@ inotify_dev_get_event(struct inotify_dev } /* + * inotify_dev_get_last_event - return the last event in the given dev's queue + * + * Caller must hold dev->ev_mutex. 
+ */ +static inline struct inotify_kernel_event * +inotify_dev_get_last_event(struct inotify_device *dev) +{ + if (list_empty(&dev->events)) + return NULL; + return list_entry(dev->events.prev, struct inotify_kernel_event, list); +} + +/* * inotify_dev_queue_event - event handler registered with core inotify, adds * a new event to the given device * @@ -273,7 +286,7 @@ static void inotify_dev_queue_event(stru put_inotify_watch(w); /* final put */ /* coalescing: drop this event if it is a dupe of the previous */ - last = inotify_dev_get_event(dev); + last = inotify_dev_get_last_event(dev); if (last && last->event.mask == mask && last->event.wd == wd && last->event.cookie == cookie) { const char *lastname = last->name; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sctp-fix_sctp_addr_overflow.patch�����������������������������������������������������������0000664�0000764�0000764�00000002040�11041673272�017760� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������sctp: Make sure N * sizeof(union sctp_addr) does not overflow. From: Clark Williams <williams@redhat.com> As noticed by Gabriel Campana, the kmalloc() length arg passed in by sctp_getsockopt_local_addrs_old() can overflow if ->addr_num is large enough. Therefore, enforce an appropriate limit. Signed-off-by: David S. Miller <davem@davemloft.net> --- net/sctp/socket.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/net/sctp/socket.c =================================================================== --- linux-2.6.24.7.orig/net/sctp/socket.c +++ linux-2.6.24.7/net/sctp/socket.c @@ -4391,7 +4391,9 @@ static int sctp_getsockopt_local_addrs_o if (copy_from_user(&getaddrs, optval, len)) return -EFAULT; - if (getaddrs.addr_num <= 0) return -EINVAL; + if (getaddrs.addr_num <= 0 || + getaddrs.addr_num >= (INT_MAX / sizeof(union sctp_addr))) + return -EINVAL; /* * For UDP-style sockets, id specifies the association to query. 
* If the id field is set to the value '0' then the locally bound ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86_64-ia32_syscall_restart_fix.patch�������������������������������������������������������0000664�0000764�0000764�00000007717�11041673272�020133� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Bugzilla: From: Clark Williams <williams@redhat.com> https://bugzilla.redhat.com/show_bug.cgi?id=434998 Description: The code to restart syscalls after signals depends on checking for a negative orig_rax, and for particular negative -ERESTART* values in rax. These fields are 64 bits and for a 32-bit task they get zero-extended, therefore they are never negative and syscall restart behavior is lost. Solution: Doing sign-extension where it matters. For orig_rax, the only time the value should be -1 but winds up as 0x0ffffffff is via a 32-bit ptrace call. So the patch changes ptrace to sign-extend the 32-bit orig_eax value when it's stored; it doesn't change the checks on orig_rax, though it uses the new current_syscall() inline to better document the subtle importance of the used of signedness there. The rax value is stored a lot of ways and it seems hard to get them all sign-extended at their origins. So for that, we use the current_syscall_ret() to sign-extend it only for 32-bit tasks at the time of the -ERESTART* comparisons. Upstream status: commit 40f0933d51f4cba26a5c009a26bb230f4514c1b6 Test status: Built on all arch and tested on x86_64 using the reproducer provided in bugzilla. Brew build: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1365391 Regards, Jerome --- arch/x86/ia32/ptrace32.c | 10 +++++++++- arch/x86/kernel/signal_64.c | 38 +++++++++++++++++++++++++++++++++----- 2 files changed, 42 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/arch/x86/ia32/ptrace32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/ia32/ptrace32.c +++ linux-2.6.24.7/arch/x86/ia32/ptrace32.c @@ -75,10 +75,18 @@ static int putreg32(struct task_struct * R32(esi, rsi); R32(ebp, rbp); R32(eax, rax); - R32(orig_eax, orig_rax); R32(eip, rip); R32(esp, rsp); + case offsetof(struct user_regs_struct32, orig_eax): { + /* + * Sign-extend the value so that orig_eax = -1 + * causes (long)orig_rax < 0 tests to fire correctly. 
+ */ + stack[offsetof(struct pt_regs, orig_rax)/8] = (long) (s32) val; + break; + } + case offsetof(struct user32, regs.eflags): { __u64 *flags = &stack[offsetof(struct pt_regs, eflags)/8]; val &= FLAG_MASK; Index: linux-2.6.24.7/arch/x86/kernel/signal_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/signal_64.c +++ linux-2.6.24.7/arch/x86/kernel/signal_64.c @@ -311,6 +311,35 @@ give_sigsegv: } /* + * Return -1L or the syscall number that @regs is executing. + */ +static long current_syscall(struct pt_regs *regs) +{ + /* + * We always sign-extend a -1 value being set here, + * so this is always either -1L or a syscall number. + */ + return regs->orig_rax; +} + +/* + * Return a value that is -EFOO if the system call in @regs->orig_rax + * returned an error. This only works for @regs from @current. + */ +static long current_syscall_ret(struct pt_regs *regs) +{ +#ifdef CONFIG_IA32_EMULATION + if (test_thread_flag(TIF_IA32)) + /* + * Sign-extend the value so (int)-EFOO becomes (long)-EFOO + * and will match correctly in comparisons. + */ + return (int) regs->rax; +#endif + return regs->rax; +} + +/* * OK, we're invoking a handler */ @@ -327,9 +356,9 @@ handle_signal(unsigned long sig, siginfo #endif /* Are we from a system call? */ - if ((long)regs->orig_rax >= 0) { + if (current_syscall(regs) >= 0) { /* If so, check system call restarting.. */ - switch (regs->rax) { + switch (current_syscall_ret(regs)) { case -ERESTART_RESTARTBLOCK: case -ERESTARTNOHAND: regs->rax = -EINTR; @@ -430,10 +459,9 @@ static void do_signal(struct pt_regs *re } /* Did we come from a system call? */ - if ((long)regs->orig_rax >= 0) { + if (current_syscall(regs) >= 0) { /* Restart the system call - no handlers present */ - long res = regs->rax; - switch (res) { + switch (current_syscall_ret(regs)) { case -ERESTARTNOHAND: case -ERESTARTSYS: case -ERESTARTNOINTR: �������������������������������������������������patches/x86_64-ptrace_sign_extend_orig_rax_to_64bits.patch������������������������������������������0000664�0000764�0000764�00000003071�11041673271�022651� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Bugzilla: From: Clark Williams <williams@redhat.com> https://bugzilla.redhat.com/show_bug.cgi?id=437882 Description: GDB testsuite failure for x86_64 debugger running i386 debuggee. GDB sets orig_rax to 0x00000000ffffffff which is not recognized by kernel as -1. That bug is revealed by the fix of 434998. It could not happen before. Solution: Make ptrace always sign-extend orig_rax to 64 bits Upstream status: commit 84c6f6046c5a2189160a8f0dca8b90427bf690ea Test status: Built on all arch, tested on x86_64 using the reproducer provided on bugzilla. 
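The effect of the (long) (s32) cast used here, and by the related orig_rax handling in the previous patch, can be seen in a few lines of user-space C, assuming an LP64, two's-complement target such as x86_64:

/* Assumes 64-bit long and two's complement, as on x86_64. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t reg32 = 0xffffffffu;	/* 32-bit image of orig_eax = -1 */

	long zero_extended = (long)(uint64_t)reg32;   /* 4294967295, never < 0 */
	long sign_extended = (long)(int32_t)reg32;    /* -1, restart checks fire */

	printf("zero-extended: %ld  (negative? %s)\n",
	       zero_extended, zero_extended < 0 ? "yes" : "no");
	printf("sign-extended: %ld  (negative? %s)\n",
	       sign_extended, sign_extended < 0 ? "yes" : "no");
	return 0;
}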
Brew build (also includes patch for bz434998): http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1365281 Regards, Jerome --- arch/x86/kernel/ptrace_64.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/ptrace_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/ptrace_64.c +++ linux-2.6.24.7/arch/x86/kernel/ptrace_64.c @@ -267,6 +267,16 @@ static int putreg(struct task_struct *ch return -EIO; child->thread.gs = value; return 0; + case offsetof(struct user_regs_struct, orig_rax): + /* + * Orig_rax is really just a flag with small positive + * and negative values, so make sure to always + * sign-extend it from 32 bits so that it works + * correctly regardless of whether we come from a + * 32-bit environment or not. + */ + value = (long) (s32) value; + break; case offsetof(struct user_regs_struct, eflags): value &= FLAG_MASK; tmp = get_stack_long(child, EFL_OFFSET); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86_fix_vsyscall_wreckage.patch�������������������������������������������������������������0000664�0000764�0000764�00000007716�11041664162�017355� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������# Commit: ce28b9864b853803320c3f1d8de1b81aa4120b14 From: Clark Williams <williams@redhat.com> # Parent: d4afe414189b098d56bcd24280c018aa2ac9a990 # Author: Thomas Gleixner <[EMAIL PROTECTED]> # AuthorDate: Wed Feb 20 23:57:30 2008 +0100 # Committer: Ingo Molnar <[EMAIL PROTECTED]> # CommitDate: Tue Feb 26 12:55:57 2008 +0100 x86: fix vsyscall wreckage based on a report from Arne Georg Gleditsch about user-space apps misbehaving after toggling /proc/sys/kernel/vsyscall64, a review of the code revealed that the "NOP patching" done there is fundamentally unsafe for a number of reasons: 1) the patching code runs without synchronizing other CPUs 2) it inserts NOPs even if there is no clock source which provides vread 3) when the clock source changes to one without vread we run in exactly the same problem as in #2 4) if nobody toggles the proc entry from 1 to 0 and to 1 again, then the syscall is not patched out as a result it is possible to break user-space via this patching. The only safe thing for now is to remove the patching. This code was broken since v2.6.21. 
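With the instruction patching removed, enabling or disabling vsyscall64 reduces to a data check made at call time (vsyscall_gtod_data.sysctl_enabled in this code); below is a user-space sketch of that flag-gated pattern with made-up names, not the kernel's vsyscall page itself:

/* Made-up names; sketch of a flag-gated fast path, illustration only. */
#include <stdio.h>

static volatile int sysctl_enabled = 1;	/* stands in for the vsyscall sysctl flag */

static long fallback_syscall_time(void)
{
	return 1000;			/* stands in for the real syscall path */
}

static long vsyscall_time(void)
{
	if (!sysctl_enabled)
		return fallback_syscall_time();
	return 42;			/* stands in for the fast vread path */
}

int main(void)
{
	printf("enabled : %ld\n", vsyscall_time());
	sysctl_enabled = 0;
	printf("disabled: %ld\n", vsyscall_time());
	return 0;
}

Toggling a flag that is read on every call needs no cross-CPU code synchronization, which is what made the removed NOP patching unsafe.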
Reported-by: Arne Georg Gleditsch <[EMAIL PROTECTED]> Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]> --- arch/x86/kernel/vsyscall_64.c | 52 ++---------------------------------------- 1 file changed, 3 insertions(+), 49 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c @@ -44,11 +44,6 @@ #define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) #define __syscall_clobber "r11","rcx","memory" -#define __pa_vsymbol(x) \ - ({unsigned long v; \ - extern char __vsyscall_0; \ - asm("" : "=r" (v) : "0" (x)); \ - ((v - VSYSCALL_START) + __pa_symbol(&__vsyscall_0)); }) /* * vsyscall_gtod_data contains data that is : @@ -102,7 +97,7 @@ static __always_inline void do_get_tz(st static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz) { int ret; - asm volatile("vsysc2: syscall" + asm volatile("syscall" : "=a" (ret) : "0" (__NR_gettimeofday),"D" (tv),"S" (tz) : __syscall_clobber ); @@ -112,7 +107,7 @@ static __always_inline int gettimeofday( static __always_inline long time_syscall(long *t) { long secs; - asm volatile("vsysc1: syscall" + asm volatile("syscall" : "=a" (secs) : "0" (__NR_time),"D" (t) : __syscall_clobber); return secs; @@ -227,50 +222,10 @@ long __vsyscall(3) venosys_1(void) } #ifdef CONFIG_SYSCTL - -#define SYSCALL 0x050f -#define NOP2 0x9090 - -/* - * NOP out syscall in vsyscall page when not needed. - */ -static int vsyscall_sysctl_change(ctl_table *ctl, int write, struct file * filp, - void __user *buffer, size_t *lenp, loff_t *ppos) -{ - extern u16 vsysc1, vsysc2; - u16 __iomem *map1; - u16 __iomem *map2; - int ret = proc_dointvec(ctl, write, filp, buffer, lenp, ppos); - if (!write) - return ret; - /* gcc has some trouble with __va(__pa()), so just do it this - way. */ - map1 = ioremap(__pa_vsymbol(&vsysc1), 2); - if (!map1) - return -ENOMEM; - map2 = ioremap(__pa_vsymbol(&vsysc2), 2); - if (!map2) { - ret = -ENOMEM; - goto out; - } - if (!vsyscall_gtod_data.sysctl_enabled) { - writew(SYSCALL, map1); - writew(SYSCALL, map2); - } else { - writew(NOP2, map1); - writew(NOP2, map2); - } - iounmap(map2); -out: - iounmap(map1); - return ret; -} - static ctl_table kernel_table2[] = { { .procname = "vsyscall64", .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = vsyscall_sysctl_change }, + .mode = 0644 }, {} }; @@ -279,7 +234,6 @@ static ctl_table kernel_root_table2[] = .child = kernel_table2 }, {} }; - #endif /* Assume __initcall executes before all user space. 
Hopefully kmod ��������������������������������������������������patches/m68knommu-upstream-patches.patch������������������������������������������������������������0000664�0000764�0000764�00001016307�11043075272�017413� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Sebastian Siewior <bigeasy@linutronix.de> Date: Sun, 27 Apr 2008 14:28:21 +0200 Subject: m68knommu: upstream pending patches That's a conglomerate of Greg Ungerers and Sebastian Siewiors m68knommu patches which are not in .24 but on the way upstream (partly included in .25/.26) Compressed-into-one-patch-by: tglx --- arch/m68knommu/Kconfig | 24 arch/m68knommu/Makefile | 31 arch/m68knommu/kernel/asm-offsets.c | 1 arch/m68knommu/kernel/irq.c | 11 arch/m68knommu/kernel/setup.c | 2 arch/m68knommu/kernel/time.c | 109 - arch/m68knommu/kernel/traps.c | 112 + arch/m68knommu/kernel/vmlinux.lds.S | 90 - arch/m68knommu/platform/5206/config.c | 76 - arch/m68knommu/platform/5206e/config.c | 80 - arch/m68knommu/platform/520x/config.c | 101 + arch/m68knommu/platform/523x/config.c | 138 +- arch/m68knommu/platform/5249/config.c | 77 - arch/m68knommu/platform/5272/config.c | 89 - arch/m68knommu/platform/527x/config.c | 87 + arch/m68knommu/platform/528x/config.c | 355 +++++ arch/m68knommu/platform/5307/Makefile | 14 arch/m68knommu/platform/5307/config.c | 88 - arch/m68knommu/platform/5307/entry.S | 235 --- arch/m68knommu/platform/5307/head.S | 222 --- arch/m68knommu/platform/5307/pit.c | 97 - arch/m68knommu/platform/5307/timers.c | 155 -- arch/m68knommu/platform/5307/vectors.c | 105 - arch/m68knommu/platform/532x/config.c | 179 +- arch/m68knommu/platform/532x/spi-mcf532x.c | 176 ++ arch/m68knommu/platform/532x/usb-mcf532x.c | 171 ++ arch/m68knommu/platform/5407/config.c | 83 - arch/m68knommu/platform/68328/ints.c | 2 arch/m68knommu/platform/68328/timers.c | 56 arch/m68knommu/platform/68360/config.c | 5 arch/m68knommu/platform/coldfire/Makefile | 32 arch/m68knommu/platform/coldfire/dma.c | 39 arch/m68knommu/platform/coldfire/dma_timer.c | 84 + arch/m68knommu/platform/coldfire/entry.S | 241 +++ arch/m68knommu/platform/coldfire/head.S | 222 +++ arch/m68knommu/platform/coldfire/irq_chip.c | 110 + arch/m68knommu/platform/coldfire/pit.c | 180 ++ arch/m68knommu/platform/coldfire/timers.c | 182 ++ arch/m68knommu/platform/coldfire/vectors.c | 105 + drivers/net/fec.c | 1812 +++++++++++++-------------- drivers/serial/68328serial.c | 2 drivers/serial/mcf.c | 22 drivers/serial/mcfserial.c | 121 - fs/nfs/file.c | 4 include/asm-generic/vmlinux.lds.h | 40 include/asm-m68knommu/bitops.h | 30 include/asm-m68knommu/byteorder.h | 16 include/asm-m68knommu/cacheflush.h | 2 include/asm-m68knommu/commproc.h | 19 include/asm-m68knommu/dma.h | 3 include/asm-m68knommu/m523xsim.h | 147 ++ include/asm-m68knommu/m528xsim.h | 63 include/asm-m68knommu/m532xsim.h | 86 - include/asm-m68knommu/mcfcache.h | 2 include/asm-m68knommu/mcfuart.h | 3 mm/nommu.c | 10 mm/page_alloc.c | 8 57 files changed, 4172 insertions(+), 2384 deletions(-) Index: linux-2.6.24.7/arch/m68knommu/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/Kconfig +++ linux-2.6.24.7/arch/m68knommu/Kconfig @@ -53,10 +53,22 @@ config 
GENERIC_CALIBRATE_DELAY bool default y +config GENERIC_TIME + bool + default y + +config GENERIC_CMOS_UPDATE + bool + default y + config TIME_LOW_RES bool default y +config GENERIC_CLOCKEVENTS + bool + default n + config NO_IOPORT def_bool y @@ -100,11 +112,14 @@ config M5206e config M520x bool "MCF520x" + select GENERIC_CLOCKEVENTS help Freescale Coldfire 5207/5208 processor support. config M523x bool "MCF523x" + select GENERIC_CLOCKEVENTS + select GENERIC_HARDIRQS_NO__DO_IRQ help Freescale Coldfire 5230/1/2/4/5 processor support @@ -130,6 +145,7 @@ config M5275 config M528x bool "MCF528x" + select GENERIC_CLOCKEVENTS help Motorola ColdFire 5280/5282 processor support. @@ -153,6 +169,7 @@ endchoice config M527x bool depends on (M5271 || M5275) + select GENERIC_CLOCKEVENTS default y config COLDFIRE @@ -658,6 +675,13 @@ config ROMKERNEL endchoice +config GENERIC_HARDIRQS_NO__DO_IRQ + bool "Force generic IRQ implementation" + +source "kernel/time/Kconfig" +if COLDFIRE +source "kernel/Kconfig.preempt" +endif source "mm/Kconfig" endmenu Index: linux-2.6.24.7/arch/m68knommu/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/Makefile +++ linux-2.6.24.7/arch/m68knommu/Makefile @@ -61,17 +61,17 @@ MODEL := $(model-y) # for the selected cpu. ONLY need to define this for the non-base member # of the family. # -cpuclass-$(CONFIG_M5206) := 5307 -cpuclass-$(CONFIG_M5206e) := 5307 -cpuclass-$(CONFIG_M520x) := 5307 -cpuclass-$(CONFIG_M523x) := 5307 -cpuclass-$(CONFIG_M5249) := 5307 -cpuclass-$(CONFIG_M527x) := 5307 -cpuclass-$(CONFIG_M5272) := 5307 -cpuclass-$(CONFIG_M528x) := 5307 -cpuclass-$(CONFIG_M5307) := 5307 -cpuclass-$(CONFIG_M532x) := 5307 -cpuclass-$(CONFIG_M5407) := 5307 +cpuclass-$(CONFIG_M5206) := coldfire +cpuclass-$(CONFIG_M5206e) := coldfire +cpuclass-$(CONFIG_M520x) := coldfire +cpuclass-$(CONFIG_M523x) := coldfire +cpuclass-$(CONFIG_M5249) := coldfire +cpuclass-$(CONFIG_M527x) := coldfire +cpuclass-$(CONFIG_M5272) := coldfire +cpuclass-$(CONFIG_M528x) := coldfire +cpuclass-$(CONFIG_M5307) := coldfire +cpuclass-$(CONFIG_M532x) := coldfire +cpuclass-$(CONFIG_M5407) := coldfire cpuclass-$(CONFIG_M68328) := 68328 cpuclass-$(CONFIG_M68EZ328) := 68328 cpuclass-$(CONFIG_M68VZ328) := 68328 @@ -90,13 +90,14 @@ export PLATFORM BOARD MODEL CPUCLASS cflags-$(CONFIG_M5206) := -m5200 cflags-$(CONFIG_M5206e) := -m5200 cflags-$(CONFIG_M520x) := -m5307 -cflags-$(CONFIG_M523x) := -m5307 +cflags-$(CONFIG_M523x) := $(call cc-option,-mcpu=523x,-m5307) cflags-$(CONFIG_M5249) := -m5200 -cflags-$(CONFIG_M527x) := -m5307 +cflags-$(CONFIG_M5271) := $(call cc-option,-mcpu=5271,-m5307) cflags-$(CONFIG_M5272) := -m5307 -cflags-$(CONFIG_M528x) := -m5307 +cflags-$(CONFIG_M5275) := $(call cc-option,-mcpu=5275,-m5307) +cflags-$(CONFIG_M528x) := $(call cc-option,-m528x,-m5307) cflags-$(CONFIG_M5307) := -m5307 -cflags-$(CONFIG_M532x) := -m5307 +cflags-$(CONFIG_M532x) := $(call cc-option,-mcpu=532x,-m5307) cflags-$(CONFIG_M5407) := -m5200 cflags-$(CONFIG_M68328) := -m68000 cflags-$(CONFIG_M68EZ328) := -m68000 Index: linux-2.6.24.7/arch/m68knommu/kernel/asm-offsets.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/asm-offsets.c +++ linux-2.6.24.7/arch/m68knommu/kernel/asm-offsets.c @@ -91,6 +91,7 @@ int main(void) DEFINE(TI_TASK, offsetof(struct thread_info, task)); DEFINE(TI_EXECDOMAIN, offsetof(struct thread_info, exec_domain)); DEFINE(TI_FLAGS, offsetof(struct thread_info, flags)); + 
DEFINE(TI_PREEMPTCOUNT, offsetof(struct thread_info, preempt_count)); DEFINE(TI_CPU, offsetof(struct thread_info, cpu)); return 0; Index: linux-2.6.24.7/arch/m68knommu/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/irq.c +++ linux-2.6.24.7/arch/m68knommu/kernel/irq.c @@ -23,7 +23,7 @@ asmlinkage void do_IRQ(int irq, struct p struct pt_regs *oldregs = set_irq_regs(regs); irq_enter(); - __do_IRQ(irq); + generic_handle_irq(irq); irq_exit(); set_irq_regs(oldregs); @@ -34,12 +34,16 @@ void ack_bad_irq(unsigned int irq) printk(KERN_ERR "IRQ: unexpected irq=%d\n", irq); } +#ifndef CONFIG_M523x static struct irq_chip m_irq_chip = { .name = "M68K-INTC", .enable = enable_vector, .disable = disable_vector, .ack = ack_vector, }; +#else +void coldfire_init_irq_chip(void); +#endif void __init init_IRQ(void) { @@ -47,12 +51,16 @@ void __init init_IRQ(void) init_vectors(); +#ifndef CONFIG_M523x for (irq = 0; (irq < NR_IRQS); irq++) { irq_desc[irq].status = IRQ_DISABLED; irq_desc[irq].action = NULL; irq_desc[irq].depth = 1; irq_desc[irq].chip = &m_irq_chip; } +#else + coldfire_init_irq_chip(); +#endif } int show_interrupts(struct seq_file *p, void *v) @@ -79,4 +87,3 @@ int show_interrupts(struct seq_file *p, return 0; } - Index: linux-2.6.24.7/arch/m68knommu/kernel/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/setup.c +++ linux-2.6.24.7/arch/m68knommu/kernel/setup.c @@ -165,7 +165,7 @@ void __init setup_arch(char **cmdline_p) printk(KERN_INFO "DragonEngine II board support by Georges Menie\n"); #endif #ifdef CONFIG_M5235EVB - printk(KERN_INFO "Motorola M5235EVB support (C)2005 Syn-tech Systems, Inc. (Jate Sujjavanich)"); + printk(KERN_INFO "Motorola M5235EVB support (C)2005 Syn-tech Systems, Inc. (Jate Sujjavanich)\n"); #endif #ifdef DEBUG Index: linux-2.6.24.7/arch/m68knommu/kernel/time.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/time.c +++ linux-2.6.24.7/arch/m68knommu/kernel/time.c @@ -22,7 +22,6 @@ #include <linux/timex.h> #include <asm/machdep.h> -#include <asm/io.h> #include <asm/irq_regs.h> #define TICK_SIZE (tick_nsec / 1000) @@ -34,14 +33,13 @@ static inline int set_rtc_mmss(unsigned return -1; } +#ifndef CONFIG_GENERIC_CLOCKEVENTS /* * timer_interrupt() needs to keep up the real-time clock, * as well as call the "do_timer()" routine every clocktick */ irqreturn_t arch_timer_interrupt(int irq, void *dummy) { - /* last time the cmos clock got updated */ - static long last_rtc_update=0; write_seqlock(&xtime_lock); @@ -52,49 +50,12 @@ irqreturn_t arch_timer_interrupt(int irq if (current->pid) profile_tick(CPU_PROFILING); - /* - * If we have an externally synchronized Linux clock, then update - * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be - * called as close as possible to 500 ms before the new second starts. 
- */ - if (ntp_synced() && - xtime.tv_sec > last_rtc_update + 660 && - (xtime.tv_nsec / 1000) >= 500000 - ((unsigned) TICK_SIZE) / 2 && - (xtime.tv_nsec / 1000) <= 500000 + ((unsigned) TICK_SIZE) / 2) { - if (set_rtc_mmss(xtime.tv_sec) == 0) - last_rtc_update = xtime.tv_sec; - else - last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */ - } -#ifdef CONFIG_HEARTBEAT - /* use power LED as a heartbeat instead -- much more useful - for debugging -- based on the version for PReP by Cort */ - /* acts like an actual heart beat -- ie thump-thump-pause... */ - if (mach_heartbeat) { - static unsigned cnt = 0, period = 0, dist = 0; - - if (cnt == 0 || cnt == dist) - mach_heartbeat( 1 ); - else if (cnt == 7 || cnt == dist+7) - mach_heartbeat( 0 ); - - if (++cnt > period) { - cnt = 0; - /* The hyperbolic function below modifies the heartbeat period - * length in dependency of the current (5min) load. It goes - * through the points f(0)=126, f(1)=86, f(5)=51, - * f(inf)->30. */ - period = ((672<<FSHIFT)/(5*avenrun[0]+(7<<FSHIFT))) + 30; - dist = period / 4; - } - } -#endif /* CONFIG_HEARTBEAT */ - write_sequnlock(&xtime_lock); return(IRQ_HANDLED); } +#endif -void time_init(void) +static unsigned long read_rtc_mmss(void) { unsigned int year, mon, day, hour, min, sec; @@ -105,67 +66,21 @@ void time_init(void) if ((year += 1900) < 1970) year += 100; - xtime.tv_sec = mktime(year, mon, day, hour, min, sec); - xtime.tv_nsec = 0; - wall_to_monotonic.tv_sec = -xtime.tv_sec; - hw_timer_init(); + return mktime(year, mon, day, hour, min, sec);; } -/* - * This version of gettimeofday has near microsecond resolution. - */ -void do_gettimeofday(struct timeval *tv) +unsigned long read_persistent_clock(void) { - unsigned long flags; - unsigned long seq; - unsigned long usec, sec; - - do { - seq = read_seqbegin_irqsave(&xtime_lock, flags); - usec = hw_timer_offset(); - sec = xtime.tv_sec; - usec += (xtime.tv_nsec / 1000); - } while (read_seqretry_irqrestore(&xtime_lock, seq, flags)); - - while (usec >= 1000000) { - usec -= 1000000; - sec++; - } - - tv->tv_sec = sec; - tv->tv_usec = usec; + return read_rtc_mmss(); } -EXPORT_SYMBOL(do_gettimeofday); - -int do_settimeofday(struct timespec *tv) +int update_persistent_clock(struct timespec now) { - time_t wtm_sec, sec = tv->tv_sec; - long wtm_nsec, nsec = tv->tv_nsec; - - if ((unsigned long)tv->tv_nsec >= NSEC_PER_SEC) - return -EINVAL; - - write_seqlock_irq(&xtime_lock); - /* - * This is revolting. We need to set the xtime.tv_usec - * correctly. However, the value in this location is - * is value at the last tick. - * Discover what correction gettimeofday - * would have done, and then undo it! 
- */ - nsec -= (hw_timer_offset() * 1000); - - wtm_sec = wall_to_monotonic.tv_sec + (xtime.tv_sec - sec); - wtm_nsec = wall_to_monotonic.tv_nsec + (xtime.tv_nsec - nsec); - - set_normalized_timespec(&xtime, sec, nsec); - set_normalized_timespec(&wall_to_monotonic, wtm_sec, wtm_nsec); + return set_rtc_mmss(now.tv_sec); +} - ntp_clear(); - write_sequnlock_irq(&xtime_lock); - clock_was_set(); - return 0; +void time_init(void) +{ + hw_timer_init(); } -EXPORT_SYMBOL(do_settimeofday); Index: linux-2.6.24.7/arch/m68knommu/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/traps.c +++ linux-2.6.24.7/arch/m68knommu/kernel/traps.c @@ -28,6 +28,7 @@ #include <linux/linkage.h> #include <linux/init.h> #include <linux/ptrace.h> +#include <linux/kallsyms.h> #include <asm/setup.h> #include <asm/fpu.h> @@ -102,56 +103,79 @@ asmlinkage void buserr_c(struct frame *f force_sig(SIGSEGV, current); } +static void print_this_address(unsigned long addr, int i) +{ +#ifdef CONFIG_KALLSYMS + printk(KERN_EMERG " [%08lx] ", addr); + print_symbol(KERN_CONT "%s\n", addr); +#else + if (i % 5) + printk(KERN_CONT " [%08lx] ", addr); + else + printk(KERN_CONT "\n" KERN_EMERG " [%08lx] ", addr); + i++; +#endif +} int kstack_depth_to_print = 48; -void show_stack(struct task_struct *task, unsigned long *stack) +static void __show_stack(struct task_struct *task, unsigned long *stack) { unsigned long *endstack, addr; - extern char _start, _etext; +#ifdef CONFIG_FRAME_POINTER + unsigned long *last_stack; +#endif int i; - if (!stack) { - if (task) - stack = (unsigned long *)task->thread.ksp; - else - stack = (unsigned long *)&stack; - } + if (!stack) + stack = (unsigned long *)task->thread.ksp; addr = (unsigned long) stack; endstack = (unsigned long *) PAGE_ALIGN(addr); printk(KERN_EMERG "Stack from %08lx:", (unsigned long)stack); for (i = 0; i < kstack_depth_to_print; i++) { - if (stack + 1 > endstack) + if (stack + 1 + i > endstack) break; if (i % 8 == 0) printk("\n" KERN_EMERG " "); - printk(" %08lx", *stack++); + printk(" %08lx", *(stack + i)); } printk("\n"); - - printk(KERN_EMERG "Call Trace:"); i = 0; - while (stack + 1 <= endstack) { + +#ifdef CONFIG_FRAME_POINTER + printk(KERN_EMERG "Call Trace:\n"); + + last_stack = stack - 1; + while (stack <= endstack && stack > last_stack) { + + addr = *(stack + 1); + print_this_address(addr, i); + i++; + + last_stack = stack; + stack = (unsigned long *)*stack; + } + printk("\n"); +#else + printk(KERN_EMERG "Call Trace with CONFIG_FRAME_POINTER disabled:\n"); + while (stack <= endstack) { addr = *stack++; /* - * If the address is either in the text segment of the - * kernel, or in the region which contains vmalloc'ed - * memory, it *may* be the address of a calling - * routine; if so, print it so that someone tracing - * down the cause of the crash will be able to figure - * out the call path that was taken. + * If the address is either in the text segment of the kernel, + * or in a region which is occupied by a module then it *may* + * be the address of a calling routine; if so, print it so that + * someone tracing down the cause of the crash will be able to + * figure out the call path that was taken. 
*/ - if (((addr >= (unsigned long) &_start) && - (addr <= (unsigned long) &_etext))) { - if (i % 4 == 0) - printk("\n" KERN_EMERG " "); - printk(" [<%08lx>]", addr); + if (__kernel_text_address(addr)) { + print_this_address(addr, i); i++; } } - printk("\n"); + printk(KERN_CONT "\n"); +#endif } void bad_super_trap(struct frame *fp) @@ -298,19 +322,47 @@ asmlinkage void set_esp0(unsigned long s current->thread.esp0 = ssp; } - /* * The architecture-independent backtrace generator */ void dump_stack(void) { - unsigned long stack; - - show_stack(current, &stack); + /* + * We need frame pointers for this little trick, which works as follows: + * + * +------------+ 0x00 + * | Next SP | -> 0x0c + * +------------+ 0x04 + * | Caller | + * +------------+ 0x08 + * | Local vars | -> our stack var + * +------------+ 0x0c + * | Next SP | -> 0x18, that is what we pass to show_stack() + * +------------+ 0x10 + * | Caller | + * +------------+ 0x14 + * | Local vars | + * +------------+ 0x18 + * | ... | + * +------------+ + */ + + unsigned long *stack; + + stack = (unsigned long *)&stack; + stack++; + __show_stack(current, stack); } - EXPORT_SYMBOL(dump_stack); +void show_stack(struct task_struct *task, unsigned long *stack) +{ + if (!stack && !task) + dump_stack(); + else + __show_stack(task, stack); +} + #ifdef CONFIG_M68KFPU_EMU asmlinkage void fpemu_signal(int signal, int code, void *addr) { Index: linux-2.6.24.7/arch/m68knommu/kernel/vmlinux.lds.S =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/vmlinux.lds.S +++ linux-2.6.24.7/arch/m68knommu/kernel/vmlinux.lds.S @@ -7,6 +7,8 @@ * run kernels. */ +#define OUTPUT_DATA_SECTION > DATA + #include <asm-generic/vmlinux.lds.h> #if defined(CONFIG_RAMKERNEL) @@ -34,7 +36,6 @@ #define DATA_ADDR #endif - OUTPUT_ARCH(m68k) ENTRY(_start) @@ -64,81 +65,32 @@ SECTIONS { _stext = . ; TEXT_TEXT SCHED_TEXT + LOCK_TEXT *(.text.lock) - . = ALIGN(16); /* Exception table */ - __start___ex_table = .; - *(__ex_table) - __stop___ex_table = .; - - *(.rodata) *(.rodata.*) - *(__vermagic) /* Kernel version magic */ - *(.rodata1) - *(.rodata.str1.1) - - /* Kernel symbol table: Normal symbols */ - . = ALIGN(4); - __start___ksymtab = .; - *(__ksymtab) - __stop___ksymtab = .; - - /* Kernel symbol table: GPL-only symbols */ - __start___ksymtab_gpl = .; - *(__ksymtab_gpl) - __stop___ksymtab_gpl = .; - - /* Kernel symbol table: Normal unused symbols */ - __start___ksymtab_unused = .; - *(__ksymtab_unused) - __stop___ksymtab_unused = .; - - /* Kernel symbol table: GPL-only unused symbols */ - __start___ksymtab_unused_gpl = .; - *(__ksymtab_unused_gpl) - __stop___ksymtab_unused_gpl = .; - - /* Kernel symbol table: GPL-future symbols */ - __start___ksymtab_gpl_future = .; - *(__ksymtab_gpl_future) - __stop___ksymtab_gpl_future = .; - - /* Kernel symbol table: Normal symbols */ - __start___kcrctab = .; - *(__kcrctab) - __stop___kcrctab = .; - - /* Kernel symbol table: GPL-only symbols */ - __start___kcrctab_gpl = .; - *(__kcrctab_gpl) - __stop___kcrctab_gpl = .; - - /* Kernel symbol table: GPL-future symbols */ - __start___kcrctab_gpl_future = .; - *(__kcrctab_gpl_future) - __stop___kcrctab_gpl_future = .; - - /* Kernel symbol table: strings */ - *(__ksymtab_strings) - - /* Built-in module parameters */ - . = ALIGN(4) ; - __start___param = .; - *(__param) - __stop___param = .; - . = ALIGN(4) ; - _etext = . ; } > TEXT + _etext = . ; + + RODATA + .data DATA_ADDR : { . = ALIGN(4); _sdata = . ; DATA_DATA + . 
= ALIGN(16); /* Exception table */ + __start___ex_table = .; + *(__ex_table) + __stop___ex_table = .; . = ALIGN(8192) ; *(.data.init_task) _edata = . ; } > DATA + BUG_TABLE + PERCPU(4096) + .init : { . = ALIGN(4096); __init_begin = .; @@ -169,12 +121,6 @@ SECTIONS { __init_end = .; } > INIT - /DISCARD/ : { - *(.exit.text) - *(.exit.data) - *(.exitcall.exit) - } - .bss : { . = ALIGN(4); _sbss = . ; @@ -184,5 +130,11 @@ SECTIONS { _ebss = . ; } > BSS -} + _end = . ; + /DISCARD/ : { + *(.exit.text) + *(.exit.data) + *(.exitcall.exit) + } +} Index: linux-2.6.24.7/arch/m68knommu/platform/5206/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5206/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5206/config.c @@ -13,12 +13,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> -#include <asm/mcftimer.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> /***************************************************************************/ @@ -26,15 +25,51 @@ void coldfire_reset(void); /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, - MCF_MBAR + MCFDMA_BASE1, +static struct mcf_platform_uart m5206_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + }, + { }, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device m5206_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5206_uart_platform, +}; + +static struct platform_device *m5206_devices[] __initdata = { + &m5206_uart, +}; + +/***************************************************************************/ + +static void __init m5206_uart_init_line(int line, int irq) +{ + if (line == 0) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI1, MCF_MBAR + MCFSIM_UART1ICR); + writeb(irq, MCFUART_BASE1 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART1); + } else if (line == 1) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI2, MCF_MBAR + MCFSIM_UART2ICR); + writeb(irq, MCFUART_BASE2 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART2); + } +} + +static void __init m5206_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5206_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5206_uart_init_line(line, m5206_uart_platform[line].irq); +} /***************************************************************************/ @@ -74,24 +109,21 @@ void mcf_settimericr(unsigned int timer, /***************************************************************************/ -int mcf_timerirqpending(int timer) +void __init config_BSP(char *commandp, int size) { - unsigned int imr = 0; - - switch (timer) { - case 1: imr = MCFSIM_IMR_TIMER1; break; - case 2: imr = MCFSIM_IMR_TIMER2; break; - default: break; - } - return (mcf_getipr() & imr); + mcf_setimr(MCFSIM_IMR_MASKALL); + mach_reset = coldfire_reset; } /***************************************************************************/ -void config_BSP(char *commandp, int size) +static int __init init_BSP(void) { - mcf_setimr(MCFSIM_IMR_MASKALL); - mach_reset = coldfire_reset; + m5206_uarts_init(); + platform_add_devices(m5206_devices, ARRAY_SIZE(m5206_devices)); + return 0; } 
+arch_initcall(init_BSP); + /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5206e/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5206e/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5206e/config.c @@ -10,8 +10,9 @@ #include <linux/kernel.h> #include <linux/param.h> +#include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> @@ -23,15 +24,51 @@ void coldfire_reset(void); /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, - MCF_MBAR + MCFDMA_BASE1, +static struct mcf_platform_uart m5206_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + }, + { }, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device m5206_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5206_uart_platform, +}; + +static struct platform_device *m5206_devices[] __initdata = { + &m5206_uart, +}; + +/***************************************************************************/ + +static void __init m5206_uart_init_line(int line, int irq) +{ + if (line == 0) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI1, MCF_MBAR + MCFSIM_UART1ICR); + writeb(irq, MCFUART_BASE1 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART1); + } else if (line == 1) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI2, MCF_MBAR + MCFSIM_UART2ICR); + writeb(irq, MCFUART_BASE2 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART2); + } +} + +static void __init m5206_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5206_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5206_uart_init_line(line, m5206_uart_platform[line].irq); +} /***************************************************************************/ @@ -71,21 +108,7 @@ void mcf_settimericr(unsigned int timer, /***************************************************************************/ -int mcf_timerirqpending(int timer) -{ - unsigned int imr = 0; - - switch (timer) { - case 1: imr = MCFSIM_IMR_TIMER1; break; - case 2: imr = MCFSIM_IMR_TIMER2; break; - default: break; - } - return (mcf_getipr() & imr); -} - -/***************************************************************************/ - -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { mcf_setimr(MCFSIM_IMR_MASKALL); @@ -99,3 +122,14 @@ void config_BSP(char *commandp, int size } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m5206_uarts_init(); + platform_add_devices(m5206_devices, ARRAY_SIZE(m5206_devices)); + return 0; +} + +arch_initcall(init_BSP); + +/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/520x/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/520x/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/520x/config.c @@ -5,7 +5,7 @@ * * Copyright (C) 2005, Freescale (www.freescale.com) * Copyright (C) 2005, Intec Automation (mike@steroidmicros.com) - * Copyright (C) 1999-2003, Greg 
Ungerer (gerg@snapgear.com) + * Copyright (C) 1999-2007, Greg Ungerer (gerg@snapgear.com) * Copyright (C) 2001-2003, SnapGear Inc. (www.snapgear.com) */ @@ -13,21 +13,93 @@ #include <linux/kernel.h> #include <linux/param.h> +#include <linux/init.h> #include <linux/interrupt.h> +#include <linux/io.h> #include <asm/machdep.h> -#include <asm/dma.h> +#include <asm/coldfire.h> +#include <asm/mcfsim.h> +#include <asm/mcfuart.h> /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS]; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +void coldfire_reset(void); /***************************************************************************/ -void coldfire_reset(void); +static struct mcf_platform_uart m520x_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = MCFINT_VECBASE + MCFINT_UART0, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = MCFINT_VECBASE + MCFINT_UART1, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE3, + .irq = MCFINT_VECBASE + MCFINT_UART2, + }, + { }, +}; + +static struct platform_device m520x_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m520x_uart_platform, +}; + +static struct platform_device *m520x_devices[] __initdata = { + &m520x_uart, +}; + +/***************************************************************************/ + +#define INTC0 (MCF_MBAR + MCFICM_INTC0) + +static void __init m520x_uart_init_line(int line, int irq) +{ + u32 imr; + u16 par; + u8 par2; + + writeb(0x03, INTC0 + MCFINTC_ICR0 + MCFINT_UART0 + line); + + imr = readl(INTC0 + MCFINTC_IMRL); + imr &= ~((1 << (irq - MCFINT_VECBASE)) | 1); + writel(imr, INTC0 + MCFINTC_IMRL); + + switch (line) { + case 0: + par = readw(MCF_IPSBAR + MCF_GPIO_PAR_UART); + par |= MCF_GPIO_PAR_UART_PAR_UTXD0 | + MCF_GPIO_PAR_UART_PAR_URXD0; + writew(par, MCF_IPSBAR + MCF_GPIO_PAR_UART); + break; + case 1: + par = readw(MCF_IPSBAR + MCF_GPIO_PAR_UART); + par |= MCF_GPIO_PAR_UART_PAR_UTXD1 | + MCF_GPIO_PAR_UART_PAR_URXD1; + writew(par, MCF_IPSBAR + MCF_GPIO_PAR_UART); + break; + case 2: + par2 = readb(MCF_IPSBAR + MCF_GPIO_PAR_FECI2C); + par2 &= ~0x0F; + par2 |= MCF_GPIO_PAR_FECI2C_PAR_SCL_UTXD2 | + MCF_GPIO_PAR_FECI2C_PAR_SDA_URXD2; + writeb(par2, MCF_IPSBAR + MCF_GPIO_PAR_FECI2C); + break; + } +} + +static void __init m520x_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m520x_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m520x_uart_init_line(line, m520x_uart_platform[line].irq); +} /***************************************************************************/ @@ -42,9 +114,20 @@ void mcf_autovector(unsigned int vec) /***************************************************************************/ -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { mach_reset = coldfire_reset; + m520x_uarts_init(); +} + +/***************************************************************************/ + +static int __init init_BSP(void) +{ + platform_add_devices(m520x_devices, ARRAY_SIZE(m520x_devices)); + return 0; } +arch_initcall(init_BSP); + /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/523x/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/523x/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/523x/config.c @@ -16,11 +16,15 @@ #include <linux/param.h> #include 
<linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> + +#ifdef CONFIG_MTD +#include <linux/mtd/physmap.h> +#endif /***************************************************************************/ @@ -28,14 +32,58 @@ void coldfire_reset(void); /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, +static struct mcf_platform_uart m523x_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = MCFINT_VECBASE + MCFINT_UART0, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = MCFINT_VECBASE + MCFINT_UART0 + 1, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE3, + .irq = MCFINT_VECBASE + MCFINT_UART0 + 2, + }, + { }, +}; + +static struct platform_device m523x_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m523x_uart_platform, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device *m523x_devices[] __initdata = { + &m523x_uart, +}; + +/***************************************************************************/ + +#define INTC0 (MCF_MBAR + MCFICM_INTC0) + +static void __init m523x_uart_init_line(int line, int irq) +{ + u32 imr; + + if ((line < 0) || (line > 2)) + return; + + writeb(0x30+line, (INTC0 + MCFINTC_ICR0 + MCFINT_UART0 + line)); + + imr = readl(INTC0 + MCFINTC_IMRL); + imr &= ~((1 << (irq - MCFINT_VECBASE)) | 1); + writel(imr, INTC0 + MCFINTC_IMRL); +} + +static void __init m523x_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m523x_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m523x_uart_init_line(line, m523x_uart_platform[line].irq); +} /***************************************************************************/ @@ -49,15 +97,85 @@ void mcf_disableall(void) void mcf_autovector(unsigned int vec) { - /* Everything is auto-vectored on the 5272 */ + /* Everything is auto-vectored on the 523x */ } /***************************************************************************/ -void config_BSP(char *commandp, int size) +#if defined(CONFIG_SAVANT) + +/* + * Do special config for SAVANT BSP + */ +static void __init config_savantBSP(char *commandP, int size) +{ + /* setup BOOTPARAM_STRING */ + strncpy(commandP, "root=/dev/mtdblock1 ro rootfstype=romfs", size); + /* Look at Chatter DIP Switch, if CS3 is enabled */ + { + uint32_t *csmr3 = (uint32_t *) (MCF_IPSBAR + MCF523x_CSMR3); + uint32_t *csar3 = (uint32_t *) (MCF_IPSBAR + MCF523x_CSAR3); + uint16_t *dipsP = (uint16_t *) *csar3; + uint16_t dipSetOff = *dipsP & 0x0100; // switch #1 + uint16_t *btnPressP = (uint16_t *)(*csar3 + 0x10); + uint16_t shortButtonPress = *btnPressP & 0x8000; + if (*csmr3 & 1) { + /* CS3 enabled */ + if (!dipSetOff && shortButtonPress) { + /* switch on, so be quiet */ + strncat(commandP, " console=", size-strlen(commandP)-1); + } + } + } + commandP[size-1] = 0; + + /* Set on-chip peripheral space to user mode */ + { + uint8_t *gpacr = (uint8_t *) (MCF_IPSBAR + MCF523x_GPACR); + uint8_t *pacr1 = (uint8_t *) (MCF_IPSBAR + MCF523x_PACR1); + uint8_t *pacr4 = (uint8_t *) (MCF_IPSBAR + MCF523x_PACR4); + uint8_t *pacr7 = (uint8_t *) (MCF_IPSBAR + MCF523x_PACR7); + uint8_t *pacr8 = (uint8_t *) (MCF_IPSBAR + MCF523x_PACR8); + *gpacr = 0x04; + *pacr1 = 0x40; /* EIM required for Chip Select access */ + *pacr4 = 0x40; /* I2C */ 
+ *pacr7 = 0x44; /* INTC0 & 1 handy for debug */ + *pacr8 = 0x40; /* FEC MAC */ + } + +#ifdef CONFIG_MTD + /* all board spins cannot access flash from linux unless we change the map here */ + { + uint32_t *csar0 = (uint32_t *) (MCF_IPSBAR + MCF523x_CSAR0); + uint32_t start = *csar0; + uint32_t size = 0xffffFFFF - start + 1; + physmap_configure(start, size, CONFIG_MTD_PHYSMAP_BANKWIDTH, NULL); + } +#endif +} + +#endif /* CONFIG_SAVANT */ + +/***************************************************************************/ + +void __init config_BSP(char *commandp, int size) { mcf_disableall(); +#if defined(CONFIG_SAVANT) + config_savantBSP(commandp, size); +#endif /* CONFIG_SAVANT */ mach_reset = coldfire_reset; + m523x_uarts_init(); +} + +/***************************************************************************/ + +static int __init init_BSP(void) +{ + platform_add_devices(m523x_devices, ARRAY_SIZE(m523x_devices)); + return 0; } +arch_initcall(init_BSP); + /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5249/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5249/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5249/config.c @@ -12,11 +12,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> /***************************************************************************/ @@ -24,17 +24,51 @@ void coldfire_reset(void); /***************************************************************************/ -/* - * DMA channel base address table. 
- */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, - MCF_MBAR + MCFDMA_BASE1, - MCF_MBAR + MCFDMA_BASE2, - MCF_MBAR + MCFDMA_BASE3, +static struct mcf_platform_uart m5249_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + } +}; + +static struct platform_device m5249_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5249_uart_platform, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device *m5249_devices[] __initdata = { + &m5249_uart, +}; + +/***************************************************************************/ + +static void __init m5249_uart_init_line(int line, int irq) +{ + if (line == 0) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI1, MCF_MBAR + MCFSIM_UART1ICR); + writeb(irq, MCFUART_BASE1 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART1); + } else if (line == 1) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI2, MCF_MBAR + MCFSIM_UART2ICR); + writeb(irq, MCFUART_BASE2 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART2); + } +} + +static void __init m5249_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5249_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5249_uart_init_line(line, m5249_uart_platform[line].irq); +} + /***************************************************************************/ @@ -71,24 +105,21 @@ void mcf_settimericr(unsigned int timer, /***************************************************************************/ -int mcf_timerirqpending(int timer) +void __init config_BSP(char *commandp, int size) { - unsigned int imr = 0; - - switch (timer) { - case 1: imr = MCFSIM_IMR_TIMER1; break; - case 2: imr = MCFSIM_IMR_TIMER2; break; - default: break; - } - return (mcf_getipr() & imr); + mcf_setimr(MCFSIM_IMR_MASKALL); + mach_reset = coldfire_reset; } /***************************************************************************/ -void config_BSP(char *commandp, int size) +static int __init init_BSP(void) { - mcf_setimr(MCFSIM_IMR_MASKALL); - mach_reset = coldfire_reset; + m5249_uarts_init(); + platform_add_devices(m5249_devices, ARRAY_SIZE(m5249_devices)); + return 0; } +arch_initcall(init_BSP); + /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5272/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5272/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5272/config.c @@ -13,11 +13,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> /***************************************************************************/ @@ -37,14 +37,57 @@ unsigned char ledbank = 0xff; /***************************************************************************/ -/* - * DMA channel base address table. 
- */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, +static struct mcf_platform_uart m5272_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + }, + { }, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device m5272_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5272_uart_platform, +}; + +static struct platform_device *m5272_devices[] __initdata = { + &m5272_uart, +}; + +/***************************************************************************/ + +static void __init m5272_uart_init_line(int line, int irq) +{ + u32 v; + + if ((line >= 0) && (line < 2)) { + v = (line) ? 0x0e000000 : 0xe0000000; + writel(v, MCF_MBAR + MCFSIM_ICR2); + + /* Enable the output lines for the serial ports */ + v = readl(MCF_MBAR + MCFSIM_PBCNT); + v = (v & ~0x000000ff) | 0x00000055; + writel(v, MCF_MBAR + MCFSIM_PBCNT); + + v = readl(MCF_MBAR + MCFSIM_PDCNT); + v = (v & ~0x000003fc) | 0x000002a8; + writel(v, MCF_MBAR + MCFSIM_PDCNT); + } +} + +static void __init m5272_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5272_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5272_uart_init_line(line, m5272_uart_platform[line].irq); +} /***************************************************************************/ @@ -80,20 +123,7 @@ void mcf_settimericr(int timer, int leve /***************************************************************************/ -int mcf_timerirqpending(int timer) -{ - volatile unsigned long *icrp; - - if ((timer >= 1 ) && (timer <= 4)) { - icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR1); - return (*icrp & (0x8 << ((4 - timer) * 4))); - } - return 0; -} - -/***************************************************************************/ - -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { #if defined (CONFIG_MOD5272) volatile unsigned char *pivrp; @@ -109,10 +139,6 @@ void config_BSP(char *commandp, int size /* Copy command line from FLASH to local buffer... */ memcpy(commandp, (char *) 0xf0004000, size); commandp[size-1] = 0; -#elif defined(CONFIG_MTD_KeyTechnology) - /* Copy command line from FLASH to local buffer... */ - memcpy(commandp, (char *) 0xffe06000, size); - commandp[size-1] = 0; #elif defined(CONFIG_CANCam) /* Copy command line from FLASH to local buffer... 
*/ memcpy(commandp, (char *) 0xf0010000, size); @@ -125,3 +151,14 @@ void config_BSP(char *commandp, int size } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m5272_uarts_init(); + platform_add_devices(m5272_devices, ARRAY_SIZE(m5272_devices)); + return 0; +} + +arch_initcall(init_BSP); + +/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/527x/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/527x/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/527x/config.c @@ -16,11 +16,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> /***************************************************************************/ @@ -28,14 +28,72 @@ void coldfire_reset(void); /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, +static struct mcf_platform_uart m527x_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = MCFINT_VECBASE + MCFINT_UART0, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = MCFINT_VECBASE + MCFINT_UART1, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE3, + .irq = MCFINT_VECBASE + MCFINT_UART2, + }, + { }, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device m527x_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m527x_uart_platform, +}; + +static struct platform_device *m527x_devices[] __initdata = { + &m527x_uart, +}; + +/***************************************************************************/ + +#define INTC0 (MCF_MBAR + MCFICM_INTC0) + +static void __init m527x_uart_init_line(int line, int irq) +{ + u16 sepmask; + u32 imr; + + if ((line < 0) || (line > 2)) + return; + + /* level 6, line based priority */ + writeb(0x30+line, INTC0 + MCFINTC_ICR0 + MCFINT_UART0 + line); + + imr = readl(INTC0 + MCFINTC_IMRL); + imr &= ~((1 << (irq - MCFINT_VECBASE)) | 1); + writel(imr, INTC0 + MCFINTC_IMRL); + + /* + * External Pin Mask Setting & Enable External Pin for Interface + */ + sepmask = readw(MCF_IPSBAR + MCF_GPIO_PAR_UART); + if (line == 0) + sepmask |= UART0_ENABLE_MASK; + else if (line == 1) + sepmask |= UART1_ENABLE_MASK; + else if (line == 2) + sepmask |= UART2_ENABLE_MASK; + writew(sepmask, MCF_IPSBAR + MCF_GPIO_PAR_UART); +} + +static void __init m527x_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m527x_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m527x_uart_init_line(line, m527x_uart_platform[line].irq); +} /***************************************************************************/ @@ -54,10 +112,21 @@ void mcf_autovector(unsigned int vec) /***************************************************************************/ -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { mcf_disableall(); mach_reset = coldfire_reset; } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m527x_uarts_init(); + platform_add_devices(m527x_devices, ARRAY_SIZE(m527x_devices)); + return 0; +} + +arch_initcall(init_BSP); + 
+/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/528x/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/528x/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/528x/config.c @@ -16,26 +16,314 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/platform_device.h> +#include <linux/spi/spi.h> +#include <linux/spi/flash.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> +#include <asm/mcfqspi.h> + +#ifdef CONFIG_MTD_PARTITIONS +#include <linux/mtd/partitions.h> +#endif /***************************************************************************/ void coldfire_reset(void); +void coldfire_qspi_cs_control(u8 cs, u8 command); + +/***************************************************************************/ + +#if defined(CONFIG_SPI) + +#if defined(CONFIG_WILDFIRE) +#define SPI_NUM_CHIPSELECTS 0x02 +#define SPI_PAR_VAL 0x07 // Enable DIN, DOUT, CLK +#define SPI_CS_MASK 0x18 + +#define FLASH_BLOCKSIZE (1024*64) +#define FLASH_NUMBLOCKS 16 +#define FLASH_TYPE "m25p80" + +#define M25P80_CS 0 +#define MMC_CS 1 + +#ifdef CONFIG_MTD_PARTITIONS +static struct mtd_partition stm25p_partitions[] = { + /* sflash */ + [0] = { + .name = "stm25p80", + .offset = 0x00000000, + .size = FLASH_BLOCKSIZE * FLASH_NUMBLOCKS, + .mask_flags = 0 + } +}; + +#endif + +#elif defined(CONFIG_WILDFIREMOD) + +#define SPI_NUM_CHIPSELECTS 0x08 +#define SPI_PAR_VAL 0x07 // Enable DIN, DOUT, CLK +#define SPI_CS_MASK 0x78 + +#define FLASH_BLOCKSIZE (1024*64) +#define FLASH_NUMBLOCKS 64 +#define FLASH_TYPE "m25p32" +/* Reserve 1M for the kernel parition */ +#define FLASH_KERNEL_SIZE (1024 * 1024) + +#define M25P80_CS 5 +#define MMC_CS 6 + +#ifdef CONFIG_MTD_PARTITIONS +static struct mtd_partition stm25p_partitions[] = { + /* sflash */ + [0] = { + .name = "kernel", + .offset = FLASH_BLOCKSIZE * FLASH_NUMBLOCKS - FLASH_KERNEL_SIZE, + .size = FLASH_KERNEL_SIZE, + .mask_flags = 0 + }, + [1] = { + .name = "image", + .offset = 0x00000000, + .size = FLASH_BLOCKSIZE * FLASH_NUMBLOCKS - FLASH_KERNEL_SIZE, + .mask_flags = 0 + }, + [2] = { + .name = "all", + .offset = 0x00000000, + .size = FLASH_BLOCKSIZE * FLASH_NUMBLOCKS, + .mask_flags = 0 + } +}; +#endif + +#else +#define SPI_NUM_CHIPSELECTS 0x04 +#define SPI_PAR_VAL 0x7F // Enable DIN, DOUT, CLK, CS0 - CS4 +#endif + +#ifdef MMC_CS +static struct coldfire_spi_chip flash_chip_info = { + .mode = SPI_MODE_0, + .bits_per_word = 16, + .del_cs_to_clk = 17, + .del_after_trans = 1, + .void_write_data = 0 +}; + +static struct coldfire_spi_chip mmc_chip_info = { + .mode = SPI_MODE_0, + .bits_per_word = 16, + .del_cs_to_clk = 17, + .del_after_trans = 1, + .void_write_data = 0xFFFF +}; +#endif + +#ifdef M25P80_CS +static struct flash_platform_data stm25p80_platform_data = { + .name = "ST M25P80 SPI Flash chip", +#ifdef CONFIG_MTD_PARTITIONS + .parts = stm25p_partitions, + .nr_parts = sizeof(stm25p_partitions) / sizeof(*stm25p_partitions), +#endif + .type = FLASH_TYPE +}; +#endif + +static struct spi_board_info spi_board_info[] __initdata = { +#ifdef M25P80_CS + { + .modalias = "m25p80", + .max_speed_hz = 16000000, + .bus_num = 1, + .chip_select = M25P80_CS, + .platform_data = &stm25p80_platform_data, + .controller_data = &flash_chip_info + }, +#endif +#ifdef MMC_CS + { + .modalias = 
"mmc_spi", + .max_speed_hz = 16000000, + .bus_num = 1, + .chip_select = MMC_CS, + .controller_data = &mmc_chip_info + } +#endif +}; + +static struct coldfire_spi_master coldfire_master_info = { + .bus_num = 1, + .num_chipselect = SPI_NUM_CHIPSELECTS, + .irq_source = MCF5282_QSPI_IRQ_SOURCE, + .irq_vector = MCF5282_QSPI_IRQ_VECTOR, + .irq_mask = ((0x01 << MCF5282_QSPI_IRQ_SOURCE) | 0x01), + .irq_lp = 0x2B, // Level 5 and Priority 3 + .par_val = SPI_PAR_VAL, + .cs_control = coldfire_qspi_cs_control, +}; + +static struct resource coldfire_spi_resources[] = { + [0] = { + .name = "qspi-par", + .start = MCF5282_QSPI_PAR, + .end = MCF5282_QSPI_PAR, + .flags = IORESOURCE_MEM + }, + + [1] = { + .name = "qspi-module", + .start = MCF5282_QSPI_QMR, + .end = MCF5282_QSPI_QMR + 0x18, + .flags = IORESOURCE_MEM + }, + + [2] = { + .name = "qspi-int-level", + .start = MCF5282_INTC0 + MCFINTC_ICR0 + MCF5282_QSPI_IRQ_SOURCE, + .end = MCF5282_INTC0 + MCFINTC_ICR0 + MCF5282_QSPI_IRQ_SOURCE, + .flags = IORESOURCE_MEM + }, + + [3] = { + .name = "qspi-int-mask", + .start = MCF5282_INTC0 + MCFINTC_IMRL, + .end = MCF5282_INTC0 + MCFINTC_IMRL, + .flags = IORESOURCE_MEM + } +}; + +static struct platform_device coldfire_spi = { + .name = "spi_coldfire", + .id = -1, + .resource = coldfire_spi_resources, + .num_resources = ARRAY_SIZE(coldfire_spi_resources), + .dev = { + .platform_data = &coldfire_master_info, + } +}; + +void coldfire_qspi_cs_control(u8 cs, u8 command) +{ + u8 cs_bit = ((0x01 << cs) << 3) & SPI_CS_MASK; + +#if defined(CONFIG_WILDFIRE) + u8 cs_mask = ~(((0x01 << cs) << 3) & SPI_CS_MASK); +#endif +#if defined(CONFIG_WILDFIREMOD) + u8 cs_mask = (cs << 3) & SPI_CS_MASK; +#endif + + /* + * Don't do anything if the chip select is not + * one of the port qs pins. + */ + if (command & QSPI_CS_INIT) { +#if defined(CONFIG_WILDFIRE) + MCF5282_GPIO_DDRQS |= cs_bit; + MCF5282_GPIO_PQSPAR &= ~cs_bit; +#endif + +#if defined(CONFIG_WILDFIREMOD) + MCF5282_GPIO_DDRQS |= SPI_CS_MASK; + MCF5282_GPIO_PQSPAR &= ~SPI_CS_MASK; +#endif + } + + if (command & QSPI_CS_ASSERT) { + MCF5282_GPIO_PORTQS &= ~SPI_CS_MASK; + MCF5282_GPIO_PORTQS |= cs_mask; + } else if (command & QSPI_CS_DROP) { + MCF5282_GPIO_PORTQS |= SPI_CS_MASK; + } +} + +static int __init spi_dev_init(void) +{ + int retval; + + retval = platform_device_register(&coldfire_spi); + if (retval < 0) + return retval; + + if (ARRAY_SIZE(spi_board_info)) + retval = spi_register_board_info(spi_board_info, ARRAY_SIZE(spi_board_info)); + + return retval; +} + +#endif /* CONFIG_SPI */ /***************************************************************************/ -/* - * DMA channel base address table. 
- */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, +static struct mcf_platform_uart m528x_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = MCFINT_VECBASE + MCFINT_UART0, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = MCFINT_VECBASE + MCFINT_UART0 + 1, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE3, + .irq = MCFINT_VECBASE + MCFINT_UART0 + 2, + }, + { }, +}; + +static struct platform_device m528x_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m528x_uart_platform, +}; + +static struct platform_device *m528x_devices[] __initdata = { + &m528x_uart, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +/***************************************************************************/ + +#define INTC0 (MCF_MBAR + MCFICM_INTC0) + +static void __init m528x_uart_init_line(int line, int irq) +{ + u8 port; + u32 imr; + + if ((line < 0) || (line > 2)) + return; + + /* level 6, line based priority */ + writeb(0x30+line, INTC0 + MCFINTC_ICR0 + MCFINT_UART0 + line); + + imr = readl(INTC0 + MCFINTC_IMRL); + imr &= ~((1 << (irq - MCFINT_VECBASE)) | 1); + writel(imr, INTC0 + MCFINTC_IMRL); + + /* make sure PUAPAR is set for UART0 and UART1 */ + if (line < 2) { + port = readb(MCF_MBAR + MCF5282_GPIO_PUAPAR); + port |= (0x03 << (line * 2)); + writeb(port, MCF_MBAR + MCF5282_GPIO_PUAPAR); + } +} + +static void __init m528x_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m528x_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m528x_uart_init_line(line, m528x_uart_platform[line].irq); +} /***************************************************************************/ @@ -54,10 +342,57 @@ void mcf_autovector(unsigned int vec) /***************************************************************************/ -void config_BSP(char *commandp, int size) +#ifdef CONFIG_WILDFIRE +void wildfire_halt (void) +{ + writeb(0, 0x30000007); + writeb(0x2, 0x30000007); +} +#endif + +#ifdef CONFIG_WILDFIREMOD +void wildfiremod_halt (void) +{ + printk("WildFireMod hibernating...\n"); + + /* Set portE.5 to Digital IO */ + MCF5282_GPIO_PEPAR &= ~(1 << (5 * 2)); + + /* Make portE.5 an output */ + MCF5282_GPIO_DDRE |= (1 << 5); + + /* Now toggle portE.5 from low to high */ + MCF5282_GPIO_PORTE &= ~(1 << 5); + MCF5282_GPIO_PORTE |= (1 << 5); + + printk("Failed to hibernate. 
Halting!\n"); +} +#endif + +void __init config_BSP(char *commandp, int size) { mcf_disableall(); - mach_reset = coldfire_reset; + +#ifdef CONFIG_WILDFIRE + mach_halt = wildfire_halt; +#endif +#ifdef CONFIG_WILDFIREMOD + mach_halt = wildfiremod_halt; +#endif +} + +/***************************************************************************/ + +static int __init init_BSP(void) +{ + m528x_uarts_init(); +#ifdef CONFIG_SPI + spi_dev_init(); +#endif + platform_add_devices(m528x_devices, ARRAY_SIZE(m528x_devices)); + return 0; } +arch_initcall(init_BSP); + /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5307/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/Makefile +++ linux-2.6.24.7/arch/m68knommu/platform/5307/Makefile @@ -16,17 +16,5 @@ ifdef CONFIG_FULLDEBUG EXTRA_AFLAGS += -DDEBUGGER_COMPATIBLE_CACHE=1 endif -obj-$(CONFIG_COLDFIRE) += entry.o vectors.o -obj-$(CONFIG_M5206) += timers.o -obj-$(CONFIG_M5206e) += timers.o -obj-$(CONFIG_M520x) += pit.o -obj-$(CONFIG_M523x) += pit.o -obj-$(CONFIG_M5249) += timers.o -obj-$(CONFIG_M527x) += pit.o -obj-$(CONFIG_M5272) += timers.o -obj-$(CONFIG_M5307) += config.o timers.o -obj-$(CONFIG_M532x) += timers.o -obj-$(CONFIG_M528x) += pit.o -obj-$(CONFIG_M5407) += timers.o +obj-y += config.o -extra-y := head.o Index: linux-2.6.24.7/arch/m68knommu/platform/5307/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5307/config.c @@ -13,11 +13,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> #include <asm/mcfwdebug.h> /***************************************************************************/ @@ -38,17 +38,51 @@ unsigned char ledbank = 0xff; /***************************************************************************/ -/* - * DMA channel base address table. 
- */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, - MCF_MBAR + MCFDMA_BASE1, - MCF_MBAR + MCFDMA_BASE2, - MCF_MBAR + MCFDMA_BASE3, +static struct mcf_platform_uart m5307_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + }, + { }, +}; + +static struct platform_device m5307_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5307_uart_platform, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device *m5307_devices[] __initdata = { + &m5307_uart, +}; + +/***************************************************************************/ + +static void __init m5307_uart_init_line(int line, int irq) +{ + if (line == 0) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI1, MCF_MBAR + MCFSIM_UART1ICR); + writeb(irq, MCFUART_BASE1 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART1); + } else if (line == 1) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI2, MCF_MBAR + MCFSIM_UART2ICR); + writeb(irq, MCFUART_BASE2 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART2); + } +} + +static void __init m5307_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5307_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5307_uart_init_line(line, m5307_uart_platform[line].irq); +} /***************************************************************************/ @@ -85,27 +119,12 @@ void mcf_settimericr(unsigned int timer, /***************************************************************************/ -int mcf_timerirqpending(int timer) -{ - unsigned int imr = 0; - - switch (timer) { - case 1: imr = MCFSIM_IMR_TIMER1; break; - case 2: imr = MCFSIM_IMR_TIMER2; break; - default: break; - } - return (mcf_getipr() & imr); -} - -/***************************************************************************/ - -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { mcf_setimr(MCFSIM_IMR_MASKALL); #if defined(CONFIG_NETtel) || defined(CONFIG_eLIA) || \ - defined(CONFIG_DISKtel) || defined(CONFIG_SECUREEDGEMP3) || \ - defined(CONFIG_CLEOPATRA) + defined(CONFIG_SECUREEDGEMP3) || defined(CONFIG_CLEOPATRA) /* Copy command line from FLASH to local buffer... */ memcpy(commandp, (char *) 0xf0004000, size); commandp[size-1] = 0; @@ -117,7 +136,7 @@ void config_BSP(char *commandp, int size mach_reset = coldfire_reset; -#ifdef MCF_BDM_DISABLE +#ifdef CONFIG_BDM_DISABLE /* * Disable the BDM clocking. This also turns off most of the rest of * the BDM device. This is good for EMC reasons. This option is not @@ -128,3 +147,14 @@ void config_BSP(char *commandp, int size } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m5307_uarts_init(); + platform_add_devices(m5307_devices, ARRAY_SIZE(m5307_devices)); + return 0; +} + +arch_initcall(init_BSP); + +/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5307/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/entry.S +++ /dev/null @@ -1,235 +0,0 @@ -/* - * linux/arch/m68knommu/platform/5307/entry.S - * - * Copyright (C) 1999-2007, Greg Ungerer (gerg@snapgear.com) - * Copyright (C) 1998 D. Jeff Dionne <jeff@lineo.ca>, - * Kenneth Albanowski <kjahds@kjahds.com>, - * Copyright (C) 2000 Lineo Inc. 
(www.lineo.com) - * Copyright (C) 2004-2006 Macq Electronique SA. (www.macqel.com) - * - * Based on: - * - * linux/arch/m68k/kernel/entry.S - * - * Copyright (C) 1991, 1992 Linus Torvalds - * - * This file is subject to the terms and conditions of the GNU General Public - * License. See the file README.legal in the main directory of this archive - * for more details. - * - * Linux/m68k support by Hamish Macdonald - * - * 68060 fixes by Jesper Skov - * ColdFire support by Greg Ungerer (gerg@snapgear.com) - * 5307 fixes by David W. Miller - * linux 2.4 support David McCullough <davidm@snapgear.com> - * Bug, speed and maintainability fixes by Philippe De Muyter <phdm@macqel.be> - */ - -#include <linux/sys.h> -#include <linux/linkage.h> -#include <asm/unistd.h> -#include <asm/thread_info.h> -#include <asm/errno.h> -#include <asm/setup.h> -#include <asm/segment.h> -#include <asm/asm-offsets.h> -#include <asm/entry.h> - -.bss - -sw_ksp: -.long 0 - -sw_usp: -.long 0 - -.text - -.globl system_call -.globl resume -.globl ret_from_exception -.globl ret_from_signal -.globl sys_call_table -.globl ret_from_interrupt -.globl inthandler -.globl fasthandler - -enosys: - mov.l #sys_ni_syscall,%d3 - bra 1f - -ENTRY(system_call) - SAVE_ALL - move #0x2000,%sr /* enable intrs again */ - - cmpl #NR_syscalls,%d0 - jcc enosys - lea sys_call_table,%a0 - lsll #2,%d0 /* movel %a0@(%d0:l:4),%d3 */ - movel %a0@(%d0),%d3 - jeq enosys - -1: - movel %sp,%d2 /* get thread_info pointer */ - andl #-THREAD_SIZE,%d2 /* at start of kernel stack */ - movel %d2,%a0 - movel %a0@,%a1 /* save top of frame */ - movel %sp,%a1@(TASK_THREAD+THREAD_ESP0) - btst #(TIF_SYSCALL_TRACE%8),%a0@(TI_FLAGS+(31-TIF_SYSCALL_TRACE)/8) - bnes 1f - - movel %d3,%a0 - jbsr %a0@ - movel %d0,%sp@(PT_D0) /* save the return value */ - jra ret_from_exception -1: - movel #-ENOSYS,%d2 /* strace needs -ENOSYS in PT_D0 */ - movel %d2,PT_D0(%sp) /* on syscall entry */ - subql #4,%sp - SAVE_SWITCH_STACK - jbsr syscall_trace - RESTORE_SWITCH_STACK - addql #4,%sp - movel %d3,%a0 - jbsr %a0@ - movel %d0,%sp@(PT_D0) /* save the return value */ - subql #4,%sp /* dummy return address */ - SAVE_SWITCH_STACK - jbsr syscall_trace - -ret_from_signal: - RESTORE_SWITCH_STACK - addql #4,%sp - -ret_from_exception: - btst #5,%sp@(PT_SR) /* check if returning to kernel */ - jeq Luser_return /* if so, skip resched, signals */ - -Lkernel_return: - moveml %sp@,%d1-%d5/%a0-%a2 - lea %sp@(32),%sp /* space for 8 regs */ - movel %sp@+,%d0 - addql #4,%sp /* orig d0 */ - addl %sp@+,%sp /* stk adj */ - rte - -Luser_return: - movel %sp,%d1 /* get thread_info pointer */ - andl #-THREAD_SIZE,%d1 /* at base of kernel stack */ - movel %d1,%a0 - movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ - andl #_TIF_WORK_MASK,%d1 - jne Lwork_to_do /* still work to do */ - -Lreturn: - move #0x2700,%sr /* disable intrs */ - movel sw_usp,%a0 /* get usp */ - movel %sp@(PT_PC),%a0@- /* copy exception program counter */ - movel %sp@(PT_FORMATVEC),%a0@-/* copy exception format/vector/sr */ - moveml %sp@,%d1-%d5/%a0-%a2 - lea %sp@(32),%sp /* space for 8 regs */ - movel %sp@+,%d0 - addql #4,%sp /* orig d0 */ - addl %sp@+,%sp /* stk adj */ - addql #8,%sp /* remove exception */ - movel %sp,sw_ksp /* save ksp */ - subql #8,sw_usp /* set exception */ - movel sw_usp,%sp /* restore usp */ - rte - -Lwork_to_do: - movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ - btst #TIF_NEED_RESCHED,%d1 - jne reschedule - - /* GERG: do we need something here for TRACEing?? 
*/ - -Lsignal_return: - subql #4,%sp /* dummy return address */ - SAVE_SWITCH_STACK - pea %sp@(SWITCH_STACK_SIZE) - clrl %sp@- - jsr do_signal - addql #8,%sp - RESTORE_SWITCH_STACK - addql #4,%sp - jmp Lreturn - -/* - * This is the generic interrupt handler (for all hardware interrupt - * sources). Calls upto high level code to do all the work. - */ -ENTRY(inthandler) - SAVE_ALL - moveq #-1,%d0 - movel %d0,%sp@(PT_ORIG_D0) - - movew %sp@(PT_FORMATVEC),%d0 /* put exception # in d0 */ - andl #0x03fc,%d0 /* mask out vector only */ - - movel %sp,%sp@- /* push regs arg */ - lsrl #2,%d0 /* calculate real vector # */ - movel %d0,%sp@- /* push vector number */ - jbsr do_IRQ /* call high level irq handler */ - lea %sp@(8),%sp /* pop args off stack */ - - bra ret_from_interrupt /* this was fallthrough */ - -/* - * This is the fast interrupt handler (for certain hardware interrupt - * sources). Unlike the normal interrupt handler it just uses the - * current stack (doesn't care if it is user or kernel). It also - * doesn't bother doing the bottom half handlers. - */ -ENTRY(fasthandler) - SAVE_LOCAL - - movew %sp@(PT_FORMATVEC),%d0 - andl #0x03fc,%d0 /* mask out vector only */ - - movel %sp,%sp@- /* push regs arg */ - lsrl #2,%d0 /* calculate real vector # */ - movel %d0,%sp@- /* push vector number */ - jbsr do_IRQ /* call high level irq handler */ - lea %sp@(8),%sp /* pop args off stack */ - - RESTORE_LOCAL - -ENTRY(ret_from_interrupt) - jeq 2f -1: - RESTORE_ALL -2: - moveb %sp@(PT_SR),%d0 - andl #0x7,%d0 - jhi 1b - - /* check if we need to do software interrupts */ - movel irq_stat+CPUSTAT_SOFTIRQ_PENDING,%d0 - jeq ret_from_exception - - pea ret_from_exception - jmp do_softirq - -/* - * Beware - when entering resume, prev (the current task) is - * in a0, next (the new task) is in a1,so don't change these - * registers until their contents are no longer needed. - * This is always called in supervisor mode, so don't bother to save - * and restore sr; user's process sr is actually in the stack. - */ -ENTRY(resume) - movel %a0, %d1 /* get prev thread in d1 */ - - movel sw_usp,%d0 /* save usp */ - movel %d0,%a0@(TASK_THREAD+THREAD_USP) - - SAVE_SWITCH_STACK - movel %sp,%a0@(TASK_THREAD+THREAD_KSP) /* save kernel stack pointer */ - movel %a1@(TASK_THREAD+THREAD_KSP),%sp /* restore new thread stack */ - RESTORE_SWITCH_STACK - - movel %a1@(TASK_THREAD+THREAD_USP),%a0 /* restore thread user stack */ - movel %a0, sw_usp - rts Index: linux-2.6.24.7/arch/m68knommu/platform/5307/head.S =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/head.S +++ /dev/null @@ -1,222 +0,0 @@ -/*****************************************************************************/ - -/* - * head.S -- common startup code for ColdFire CPUs. - * - * (C) Copyright 1999-2006, Greg Ungerer <gerg@snapgear.com>. - */ - -/*****************************************************************************/ - -#include <linux/sys.h> -#include <linux/linkage.h> -#include <asm/asm-offsets.h> -#include <asm/coldfire.h> -#include <asm/mcfcache.h> -#include <asm/mcfsim.h> - -/*****************************************************************************/ - -/* - * If we don't have a fixed memory size, then lets build in code - * to auto detect the DRAM size. Obviously this is the prefered - * method, and should work for most boards. It won't work for those - * that do not have their RAM starting at address 0, and it only - * works on SDRAM (not boards fitted with SRAM). 
- */ -#if CONFIG_RAMSIZE != 0 -.macro GET_MEM_SIZE - movel #CONFIG_RAMSIZE,%d0 /* hard coded memory size */ -.endm - -#elif defined(CONFIG_M5206) || defined(CONFIG_M5206e) || \ - defined(CONFIG_M5249) || defined(CONFIG_M527x) || \ - defined(CONFIG_M528x) || defined(CONFIG_M5307) || \ - defined(CONFIG_M5407) -/* - * Not all these devices have exactly the same DRAM controller, - * but the DCMR register is virtually identical - give or take - * a couple of bits. The only exception is the 5272 devices, their - * DRAM controller is quite different. - */ -.macro GET_MEM_SIZE - movel MCF_MBAR+MCFSIM_DMR0,%d0 /* get mask for 1st bank */ - btst #0,%d0 /* check if region enabled */ - beq 1f - andl #0xfffc0000,%d0 - beq 1f - addl #0x00040000,%d0 /* convert mask to size */ -1: - movel MCF_MBAR+MCFSIM_DMR1,%d1 /* get mask for 2nd bank */ - btst #0,%d1 /* check if region enabled */ - beq 2f - andl #0xfffc0000, %d1 - beq 2f - addl #0x00040000,%d1 - addl %d1,%d0 /* total mem size in d0 */ -2: -.endm - -#elif defined(CONFIG_M5272) -.macro GET_MEM_SIZE - movel MCF_MBAR+MCFSIM_CSOR7,%d0 /* get SDRAM address mask */ - andil #0xfffff000,%d0 /* mask out chip select options */ - negl %d0 /* negate bits */ -.endm - -#elif defined(CONFIG_M520x) -.macro GET_MEM_SIZE - clrl %d0 - movel MCF_MBAR+MCFSIM_SDCS0, %d2 /* Get SDRAM chip select 0 config */ - andl #0x1f, %d2 /* Get only the chip select size */ - beq 3f /* Check if it is enabled */ - addql #1, %d2 /* Form exponent */ - moveql #1, %d0 - lsll %d2, %d0 /* 2 ^ exponent */ -3: - movel MCF_MBAR+MCFSIM_SDCS1, %d2 /* Get SDRAM chip select 1 config */ - andl #0x1f, %d2 /* Get only the chip select size */ - beq 4f /* Check if it is enabled */ - addql #1, %d2 /* Form exponent */ - moveql #1, %d1 - lsll %d2, %d1 /* 2 ^ exponent */ - addl %d1, %d0 /* Total size of SDRAM in d0 */ -4: -.endm - -#else -#error "ERROR: I don't know how to probe your boards memory size?" -#endif - -/*****************************************************************************/ - -/* - * Boards and platforms can do specific early hardware setup if - * they need to. Most don't need this, define away if not required. - */ -#ifndef PLATFORM_SETUP -#define PLATFORM_SETUP -#endif - -/*****************************************************************************/ - -.global _start -.global _rambase -.global _ramvec -.global _ramstart -.global _ramend - -/*****************************************************************************/ - -.data - -/* - * During startup we store away the RAM setup. These are not in the - * bss, since their values are determined and written before the bss - * has been cleared. - */ -_rambase: -.long 0 -_ramvec: -.long 0 -_ramstart: -.long 0 -_ramend: -.long 0 - -/*****************************************************************************/ - -.text - -/* - * This is the codes first entry point. This is where it all - * begins... - */ - -_start: - nop /* filler */ - movew #0x2700, %sr /* no interrupts */ - - /* - * Do any platform or board specific setup now. Most boards - * don't need anything. Those exceptions are define this in - * their board specific includes. - */ - PLATFORM_SETUP - - /* - * Create basic memory configuration. Set VBR accordingly, - * and size memory. 
- */ - movel #CONFIG_VECTORBASE,%a7 - movec %a7,%VBR /* set vectors addr */ - movel %a7,_ramvec - - movel #CONFIG_RAMBASE,%a7 /* mark the base of RAM */ - movel %a7,_rambase - - GET_MEM_SIZE /* macro code determines size */ - addl %a7,%d0 - movel %d0,_ramend /* set end ram addr */ - - /* - * Now that we know what the memory is, lets enable cache - * and get things moving. This is Coldfire CPU specific. - */ - CACHE_ENABLE /* enable CPU cache */ - - -#ifdef CONFIG_ROMFS_FS - /* - * Move ROM filesystem above bss :-) - */ - lea _sbss,%a0 /* get start of bss */ - lea _ebss,%a1 /* set up destination */ - movel %a0,%a2 /* copy of bss start */ - - movel 8(%a0),%d0 /* get size of ROMFS */ - addql #8,%d0 /* allow for rounding */ - andl #0xfffffffc, %d0 /* whole words */ - - addl %d0,%a0 /* copy from end */ - addl %d0,%a1 /* copy from end */ - movel %a1,_ramstart /* set start of ram */ - -_copy_romfs: - movel -(%a0),%d0 /* copy dword */ - movel %d0,-(%a1) - cmpl %a0,%a2 /* check if at end */ - bne _copy_romfs - -#else /* CONFIG_ROMFS_FS */ - lea _ebss,%a1 - movel %a1,_ramstart -#endif /* CONFIG_ROMFS_FS */ - - - /* - * Zero out the bss region. - */ - lea _sbss,%a0 /* get start of bss */ - lea _ebss,%a1 /* get end of bss */ - clrl %d0 /* set value */ -_clear_bss: - movel %d0,(%a0)+ /* clear each word */ - cmpl %a0,%a1 /* check if at end */ - bne _clear_bss - - /* - * Load the current task pointer and stack. - */ - lea init_thread_union,%a0 - lea THREAD_SIZE(%a0),%sp - - /* - * Assember start up done, start code proper. - */ - jsr start_kernel /* start Linux kernel */ - -_exit: - jmp _exit /* should never get here */ - -/*****************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5307/pit.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/pit.c +++ /dev/null @@ -1,97 +0,0 @@ -/***************************************************************************/ - -/* - * pit.c -- Freescale ColdFire PIT timer. Currently this type of - * hardware timer only exists in the Freescale ColdFire - * 5270/5271, 5282 and other CPUs. - * - * Copyright (C) 1999-2007, Greg Ungerer (gerg@snapgear.com) - * Copyright (C) 2001-2004, SnapGear Inc. (www.snapgear.com) - */ - -/***************************************************************************/ - -#include <linux/kernel.h> -#include <linux/sched.h> -#include <linux/param.h> -#include <linux/init.h> -#include <linux/interrupt.h> -#include <linux/irq.h> -#include <asm/machdep.h> -#include <asm/io.h> -#include <asm/coldfire.h> -#include <asm/mcfpit.h> -#include <asm/mcfsim.h> - -/***************************************************************************/ - -/* - * By default use timer1 as the system clock timer. 
- */ -#define TA(a) (MCF_IPSBAR + MCFPIT_BASE1 + (a)) - -/***************************************************************************/ - -static irqreturn_t hw_tick(int irq, void *dummy) -{ - unsigned short pcsr; - - /* Reset the ColdFire timer */ - pcsr = __raw_readw(TA(MCFPIT_PCSR)); - __raw_writew(pcsr | MCFPIT_PCSR_PIF, TA(MCFPIT_PCSR)); - - return arch_timer_interrupt(irq, dummy); -} - -/***************************************************************************/ - -static struct irqaction coldfire_pit_irq = { - .name = "timer", - .flags = IRQF_DISABLED | IRQF_TIMER, - .handler = hw_tick, -}; - -void hw_timer_init(void) -{ - volatile unsigned char *icrp; - volatile unsigned long *imrp; - - setup_irq(MCFINT_VECBASE + MCFINT_PIT1, &coldfire_pit_irq); - - icrp = (volatile unsigned char *) (MCF_IPSBAR + MCFICM_INTC0 + - MCFINTC_ICR0 + MCFINT_PIT1); - *icrp = ICR_INTRCONF; - - imrp = (volatile unsigned long *) (MCF_IPSBAR + MCFICM_INTC0 + MCFPIT_IMR); - *imrp &= ~MCFPIT_IMR_IBIT; - - /* Set up PIT timer 1 as poll clock */ - __raw_writew(MCFPIT_PCSR_DISABLE, TA(MCFPIT_PCSR)); - __raw_writew(((MCF_CLK / 2) / 64) / HZ, TA(MCFPIT_PMR)); - __raw_writew(MCFPIT_PCSR_EN | MCFPIT_PCSR_PIE | MCFPIT_PCSR_OVW | - MCFPIT_PCSR_RLD | MCFPIT_PCSR_CLK64, TA(MCFPIT_PCSR)); -} - -/***************************************************************************/ - -unsigned long hw_timer_offset(void) -{ - volatile unsigned long *ipr; - unsigned long pmr, pcntr, offset; - - ipr = (volatile unsigned long *) (MCF_IPSBAR + MCFICM_INTC0 + MCFPIT_IMR); - - pmr = __raw_readw(TA(MCFPIT_PMR)); - pcntr = __raw_readw(TA(MCFPIT_PCNTR)); - - /* - * If we are still in the first half of the upcount and a - * timer interrupt is pending, then add on a ticks worth of time. - */ - offset = ((pmr - pcntr) * (1000000 / HZ)) / pmr; - if ((offset < (1000000 / HZ / 2)) && (*ipr & MCFPIT_IMR_IBIT)) - offset += 1000000 / HZ; - return offset; -} - -/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5307/timers.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/timers.c +++ /dev/null @@ -1,155 +0,0 @@ -/***************************************************************************/ - -/* - * timers.c -- generic ColdFire hardware timer support. - * - * Copyright (C) 1999-2007, Greg Ungerer (gerg@snapgear.com) - */ - -/***************************************************************************/ - -#include <linux/kernel.h> -#include <linux/init.h> -#include <linux/sched.h> -#include <linux/interrupt.h> -#include <linux/irq.h> -#include <asm/io.h> -#include <asm/traps.h> -#include <asm/machdep.h> -#include <asm/coldfire.h> -#include <asm/mcftimer.h> -#include <asm/mcfsim.h> - -/***************************************************************************/ - -/* - * By default use timer1 as the system clock timer. - */ -#define TA(a) (MCF_MBAR + MCFTIMER_BASE1 + (a)) - -/* - * Default the timer and vector to use for ColdFire. Some ColdFire - * CPU's and some boards may want different. Their sub-architecture - * startup code (in config.c) can change these if they want. - */ -unsigned int mcf_timervector = 29; -unsigned int mcf_profilevector = 31; -unsigned int mcf_timerlevel = 5; - -/* - * These provide the underlying interrupt vector support. - * Unfortunately it is a little different on each ColdFire. 
- */ -extern void mcf_settimericr(int timer, int level); -extern int mcf_timerirqpending(int timer); - -#if defined(CONFIG_M532x) -#define __raw_readtrr __raw_readl -#define __raw_writetrr __raw_writel -#else -#define __raw_readtrr __raw_readw -#define __raw_writetrr __raw_writew -#endif - -/***************************************************************************/ - -static irqreturn_t hw_tick(int irq, void *dummy) -{ - /* Reset the ColdFire timer */ - __raw_writeb(MCFTIMER_TER_CAP | MCFTIMER_TER_REF, TA(MCFTIMER_TER)); - - return arch_timer_interrupt(irq, dummy); -} - -/***************************************************************************/ - -static struct irqaction coldfire_timer_irq = { - .name = "timer", - .flags = IRQF_DISABLED | IRQF_TIMER, - .handler = hw_tick, -}; - -/***************************************************************************/ - -static int ticks_per_intr; - -void hw_timer_init(void) -{ - setup_irq(mcf_timervector, &coldfire_timer_irq); - - __raw_writew(MCFTIMER_TMR_DISABLE, TA(MCFTIMER_TMR)); - ticks_per_intr = (MCF_BUSCLK / 16) / HZ; - __raw_writetrr(ticks_per_intr - 1, TA(MCFTIMER_TRR)); - __raw_writew(MCFTIMER_TMR_ENORI | MCFTIMER_TMR_CLK16 | - MCFTIMER_TMR_RESTART | MCFTIMER_TMR_ENABLE, TA(MCFTIMER_TMR)); - - mcf_settimericr(1, mcf_timerlevel); - -#ifdef CONFIG_HIGHPROFILE - coldfire_profile_init(); -#endif -} - -/***************************************************************************/ - -unsigned long hw_timer_offset(void) -{ - unsigned long tcn, offset; - - tcn = __raw_readw(TA(MCFTIMER_TCN)); - offset = ((tcn + 1) * (1000000 / HZ)) / ticks_per_intr; - - /* Check if we just wrapped the counters and maybe missed a tick */ - if ((offset < (1000000 / HZ / 2)) && mcf_timerirqpending(1)) - offset += 1000000 / HZ; - return offset; -} - -/***************************************************************************/ -#ifdef CONFIG_HIGHPROFILE -/***************************************************************************/ - -/* - * By default use timer2 as the profiler clock timer. - */ -#define PA(a) (MCF_MBAR + MCFTIMER_BASE2 + (a)) - -/* - * Choose a reasonably fast profile timer. Make it an odd value to - * try and get good coverage of kernel operations. - */ -#define PROFILEHZ 1013 - -/* - * Use the other timer to provide high accuracy profiling info. 
- */ -irqreturn_t coldfire_profile_tick(int irq, void *dummy) -{ - /* Reset ColdFire timer2 */ - __raw_writeb(MCFTIMER_TER_CAP | MCFTIMER_TER_REF, PA(MCFTIMER_TER)); - if (current->pid) - profile_tick(CPU_PROFILING, regs); - return IRQ_HANDLED; -} - -/***************************************************************************/ - -void coldfire_profile_init(void) -{ - printk(KERN_INFO "PROFILE: lodging TIMER2 @ %dHz as profile timer\n", PROFILEHZ); - - /* Set up TIMER 2 as high speed profile clock */ - __raw_writew(MCFTIMER_TMR_DISABLE, PA(MCFTIMER_TMR)); - - __raw_writetrr(((MCF_CLK / 16) / PROFILEHZ), PA(MCFTIMER_TRR)); - __raw_writew(MCFTIMER_TMR_ENORI | MCFTIMER_TMR_CLK16 | - MCFTIMER_TMR_RESTART | MCFTIMER_TMR_ENABLE, PA(MCFTIMER_TMR)); - - request_irq(mcf_profilevector, coldfire_profile_tick, - (IRQF_DISABLED | IRQ_FLG_FAST), "profile timer", NULL); - mcf_settimericr(2, 7); -} - -/***************************************************************************/ -#endif /* CONFIG_HIGHPROFILE */ -/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/5307/vectors.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5307/vectors.c +++ /dev/null @@ -1,105 +0,0 @@ -/***************************************************************************/ - -/* - * linux/arch/m68knommu/platform/5307/vectors.c - * - * Copyright (C) 1999-2007, Greg Ungerer <gerg@snapgear.com> - */ - -/***************************************************************************/ - -#include <linux/kernel.h> -#include <linux/init.h> -#include <linux/irq.h> -#include <asm/traps.h> -#include <asm/machdep.h> -#include <asm/coldfire.h> -#include <asm/mcfsim.h> -#include <asm/mcfdma.h> -#include <asm/mcfwdebug.h> - -/***************************************************************************/ - -#ifdef TRAP_DBG_INTERRUPT - -asmlinkage void dbginterrupt_c(struct frame *fp) -{ - extern void dump(struct pt_regs *fp); - printk(KERN_DEBUG "%s(%d): BUS ERROR TRAP\n", __FILE__, __LINE__); - dump((struct pt_regs *) fp); - asm("halt"); -} - -#endif - -/***************************************************************************/ - -extern e_vector *_ramvec; - -void set_evector(int vecnum, void (*handler)(void)) -{ - if (vecnum >= 0 && vecnum <= 255) - _ramvec[vecnum] = handler; -} - -/***************************************************************************/ - -/* Assembler routines */ -asmlinkage void buserr(void); -asmlinkage void trap(void); -asmlinkage void system_call(void); -asmlinkage void inthandler(void); - -void __init init_vectors(void) -{ - int i; - - /* - * There is a common trap handler and common interrupt - * handler that handle almost every vector. We treat - * the system call and bus error special, they get their - * own first level handlers. 
- */ - for (i = 3; (i <= 23); i++) - _ramvec[i] = trap; - for (i = 33; (i <= 63); i++) - _ramvec[i] = trap; - for (i = 24; (i <= 31); i++) - _ramvec[i] = inthandler; - for (i = 64; (i < 255); i++) - _ramvec[i] = inthandler; - _ramvec[255] = 0; - - _ramvec[2] = buserr; - _ramvec[32] = system_call; - -#ifdef TRAP_DBG_INTERRUPT - _ramvec[12] = dbginterrupt; -#endif -} - -/***************************************************************************/ - -void enable_vector(unsigned int irq) -{ - /* Currently no action on ColdFire */ -} - -void disable_vector(unsigned int irq) -{ - /* Currently no action on ColdFire */ -} - -void ack_vector(unsigned int irq) -{ - /* Currently no action on ColdFire */ -} - -/***************************************************************************/ - -void coldfire_reset(void) -{ - HARD_RESET_NOW(); -} - -/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/532x/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/532x/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/532x/config.c @@ -21,10 +21,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> +#include <asm/mcfuart.h> #include <asm/mcfdma.h> #include <asm/mcfwdebug.h> @@ -38,11 +39,75 @@ extern unsigned int mcf_timerlevel; /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +int sys_clk_khz = 0; +int sys_clk_mhz = 0; + +void wtm_init(void); +void scm_init(void); +void gpio_init(void); +void fbcs_init(void); +void sdramc_init(void); +int clock_pll (int fsys, int flags); +int clock_limp (int); +int clock_exit_limp (void); +int get_sys_clock (void); + +/***************************************************************************/ + +static struct mcf_platform_uart m532x_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = MCFINT_VECBASE + MCFINT_UART0, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = MCFINT_VECBASE + MCFINT_UART1, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE3, + .irq = MCFINT_VECBASE + MCFINT_UART2, + }, + { }, +}; + +static struct platform_device m532x_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m532x_uart_platform, +}; + +static struct platform_device *m532x_devices[] __initdata = { + &m532x_uart, +}; + +/***************************************************************************/ + +static void __init m532x_uart_init_line(int line, int irq) +{ + if (line == 0) { + MCF_INTC0_ICR26 = 0x3; + MCF_INTC0_CIMR = 26; + /* GPIO initialization */ + MCF_GPIO_PAR_UART |= 0x000F; + } else if (line == 1) { + MCF_INTC0_ICR27 = 0x3; + MCF_INTC0_CIMR = 27; + /* GPIO initialization */ + MCF_GPIO_PAR_UART |= 0x0FF0; + } else if (line == 2) { + MCF_INTC0_ICR28 = 0x3; + MCF_INTC0_CIMR = 28; + } +} + +static void __init m532x_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m532x_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m532x_uart_init_line(line, m532x_uart_platform[line].irq); +} /***************************************************************************/ @@ -66,22 +131,11 @@ void mcf_settimericr(unsigned int timer, 
/***************************************************************************/ -int mcf_timerirqpending(int timer) +void __init config_BSP(char *commandp, int size) { - unsigned int imr = 0; - - switch (timer) { - case 1: imr = 0x1; break; - case 2: imr = 0x2; break; - default: break; - } - return (mcf_getiprh() & imr); -} - -/***************************************************************************/ + sys_clk_khz = get_sys_clock(); + sys_clk_mhz = sys_clk_khz/1000; -void config_BSP(char *commandp, int size) -{ mcf_setimr(MCFSIM_IMR_MASKALL); #if !defined(CONFIG_BOOTPARAM) @@ -99,7 +153,7 @@ void config_BSP(char *commandp, int size mcf_profilevector = 64+33; mach_reset = coldfire_reset; -#ifdef MCF_BDM_DISABLE +#ifdef CONFIG_BDM_DISABLE /* * Disable the BDM clocking. This also turns off most of the rest of * the BDM device. This is good for EMC reasons. This option is not @@ -110,6 +164,17 @@ void config_BSP(char *commandp, int size } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m532x_uarts_init(); + platform_add_devices(m532x_devices, ARRAY_SIZE(m532x_devices)); + return 0; +} + +arch_initcall(init_BSP); + +/***************************************************************************/ /* Board initialization */ /********************************************************************/ @@ -152,24 +217,9 @@ void config_BSP(char *commandp, int size #define NAND_FLASH_ADDRESS (0xD0000000) -int sys_clk_khz = 0; -int sys_clk_mhz = 0; - -void wtm_init(void); -void scm_init(void); -void gpio_init(void); -void fbcs_init(void); -void sdramc_init(void); -int clock_pll (int fsys, int flags); -int clock_limp (int); -int clock_exit_limp (void); -int get_sys_clock (void); asmlinkage void __init sysinit(void) { - sys_clk_khz = clock_pll(0, 0); - sys_clk_mhz = sys_clk_khz/1000; - wtm_init(); scm_init(); gpio_init(); @@ -207,25 +257,61 @@ void scm_init(void) void fbcs_init(void) { +#if defined(CONFIG_COBRA5329) + /* The COBRA5329 by senTec needs this settings */ + + /* + * We need to give the LCD enough bandwidth + */ + + MCF_XBS_PRS1 = MCF_XBS_PRIO_LCD(MCF_PRIO_LVL_1) + | MCF_XBS_PRIO_CORE(MCF_PRIO_LVL_2) + | MCF_XBS_PRIO_FEC(MCF_PRIO_LVL_3) + | MCF_XBS_PRIO_USBHOST(MCF_PRIO_LVL_4) + | MCF_XBS_PRIO_EDMA(MCF_PRIO_LVL_5) + | MCF_XBS_PRIO_USBOTG(MCF_PRIO_LVL_6) + | MCF_XBS_PRIO_FACTTEST(MCF_PRIO_LVL_7); + + /* Boot Flash connected to FBCS0 */ + MCF_FBCS0_CSAR = FLASH_ADDRESS; + MCF_FBCS0_CSCR = (MCF_FBCS_CSCR_PS_16 + | MCF_FBCS_CSCR_BEM + | MCF_FBCS_CSCR_AA + | MCF_FBCS_CSCR_WS(8)); + + MCF_FBCS0_CSMR = (MCF_FBCS_CSMR_BAM_1G + | MCF_FBCS_CSMR_V); + + /* Fix bug #10 in the errata */ + MCF_FBCS1_CSAR = 0xC0000000; + MCF_FBCS1_CSCR = (MCF_FBCS_CSCR_PS_16 + | MCF_FBCS_CSCR_BEM + | MCF_FBCS_CSCR_AA + | MCF_FBCS_CSCR_WS(8)); + + MCF_FBCS1_CSMR = (0x30000000 + | MCF_FBCS_CSMR_V + | MCF_FBCS_CSMR_WP ); +#else MCF_GPIO_PAR_CS = 0x0000003E; /* Latch chip select */ MCF_FBCS1_CSAR = 0x10080000; - MCF_FBCS1_CSCR = 0x002A3780; + MCF_FBCS1_CSCR = 0x002A3580 | (MCF_FBCS1_CSCR&0x200); MCF_FBCS1_CSMR = (MCF_FBCS_CSMR_BAM_2M | MCF_FBCS_CSMR_V); /* Initialize latch to drive signals to inactive states */ - *((u16 *)(0x10080000)) = 0xFFFF; + *((u16 *)(0x10080000)) = 0xD3FF; - /* External SRAM */ - MCF_FBCS1_CSAR = EXT_SRAM_ADDRESS; - MCF_FBCS1_CSCR = (MCF_FBCS_CSCR_PS_16 - | MCF_FBCS_CSCR_AA - | MCF_FBCS_CSCR_SBM - | MCF_FBCS_CSCR_WS(1)); - MCF_FBCS1_CSMR = (MCF_FBCS_CSMR_BAM_512K - | MCF_FBCS_CSMR_V); +// /* External SRAM */ +// MCF_FBCS1_CSAR = 
EXT_SRAM_ADDRESS; +// MCF_FBCS1_CSCR = (MCF_FBCS_CSCR_PS_16 +// | MCF_FBCS_CSCR_AA +// | MCF_FBCS_CSCR_SBM +// | MCF_FBCS_CSCR_WS(1)); +// MCF_FBCS1_CSMR = (MCF_FBCS_CSMR_BAM_512K +// | MCF_FBCS_CSMR_V); /* Boot Flash connected to FBCS0 */ MCF_FBCS0_CSAR = FLASH_ADDRESS; @@ -236,6 +322,7 @@ void fbcs_init(void) | MCF_FBCS_CSCR_WS(7)); MCF_FBCS0_CSMR = (MCF_FBCS_CSMR_BAM_32M | MCF_FBCS_CSMR_V); +#endif } void sdramc_init(void) Index: linux-2.6.24.7/arch/m68knommu/platform/532x/spi-mcf532x.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/532x/spi-mcf532x.c @@ -0,0 +1,176 @@ +/***************************************************************************/ +/* + * linux/arch/m68knommu/platform/532x/spi-mcf532x.c + * + * Sub-architcture dependant initialization code for the Freescale + * 532x SPI module + * + * Yaroslav Vinogradov yaroslav.vinogradov@freescale.com + * Copyright Freescale Semiconductor, Inc 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ +/***************************************************************************/ + + +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/param.h> +#include <linux/init.h> +#include <linux/interrupt.h> +#include <linux/device.h> +#include <linux/platform_device.h> +#include <linux/spi/spi.h> +#include <linux/spi/mcfqspi.h> +#include <linux/spi/ads7843.h> + +#include <asm/dma.h> +#include <asm/traps.h> +#include <asm/machdep.h> +#include <asm/coldfire.h> +#include <asm/mcfsim.h> +#include <asm/mcfdma.h> + +#define SPI_NUM_CHIPSELECTS 0x04 +#define SPI_PAR_VAL 0xFFF0 /* Enable DIN, DOUT, CLK */ + +#define MCF532x_QSPI_IRQ_SOURCE (31) +#define MCF532x_QSPI_IRQ_VECTOR (64 + MCF532x_QSPI_IRQ_SOURCE) + +#define MCF532x_QSPI_PAR (0xFC0A405A) +#define MCF532x_QSPI_QMR (0xFC05C000) +#define MCF532x_INTC0_ICR (0xFC048040) +#define MCF532x_INTC0_IMRL (0xFC04800C) + +/* on 5329 EVB ADS7843 is connected to IRQ4 */ +#define ADS784x_IRQ_SOURCE 4 +#define ADS784x_IRQ_VECTOR (64+ADS784x_IRQ_SOURCE) +#define ADS7843_IRQ_LEVEL 2 + + +void coldfire_qspi_cs_control(u8 cs, u8 command) +{ +} + +#if defined(CONFIG_TOUCHSCREEN_ADS7843) +static struct coldfire_spi_chip ads784x_chip_info = { + .mode = SPI_MODE_0, + .bits_per_word = 8, + .del_cs_to_clk = 17, + .del_after_trans = 1, + .void_write_data = 0 +}; + +static struct ads7843_platform_data ads784x_platform_data = { + .model = 7843, + .vref_delay_usecs = 0, + .x_plate_ohms = 580, + .y_plate_ohms = 410 +}; +#endif + + +static struct spi_board_info spi_board_info[] = { +#if defined(CONFIG_TOUCHSCREEN_ADS7843) + { + .modalias = "ads7843", + .max_speed_hz = 125000 * 16, + .bus_num = 1, + .chip_select = 1, + .irq = ADS784x_IRQ_VECTOR, + .platform_data = &ads784x_platform_data, + .controller_data = &ads784x_chip_info + } +#endif +}; + +static struct coldfire_spi_master coldfire_master_info = { + .bus_num = 1, + .num_chipselect = SPI_NUM_CHIPSELECTS, + .irq_source = MCF532x_QSPI_IRQ_SOURCE, + .irq_vector = MCF532x_QSPI_IRQ_VECTOR, + .irq_mask = (0x01 << MCF532x_QSPI_IRQ_SOURCE), + .irq_lp = 0x5, /* Level */ + .par_val = 0, /* not used on 532x */ + .par_val16 = SPI_PAR_VAL, + .cs_control = coldfire_qspi_cs_control, +}; + +static struct resource coldfire_spi_resources[] = { + [0] = { + .name = "qspi-par", + .start = 
MCF532x_QSPI_PAR, + .end = MCF532x_QSPI_PAR, + .flags = IORESOURCE_MEM + }, + + [1] = { + .name = "qspi-module", + .start = MCF532x_QSPI_QMR, + .end = MCF532x_QSPI_QMR + 0x18, + .flags = IORESOURCE_MEM + }, + + [2] = { + .name = "qspi-int-level", + .start = MCF532x_INTC0_ICR + MCF532x_QSPI_IRQ_SOURCE, + .end = MCF532x_INTC0_ICR + MCF532x_QSPI_IRQ_SOURCE, + .flags = IORESOURCE_MEM + }, + + [3] = { + .name = "qspi-int-mask", + .start = MCF532x_INTC0_IMRL, + .end = MCF532x_INTC0_IMRL, + .flags = IORESOURCE_MEM + } +}; + +static struct platform_device coldfire_spi = { + .name = "coldfire-qspi", + .id = -1, + .resource = coldfire_spi_resources, + .num_resources = ARRAY_SIZE(coldfire_spi_resources), + .dev = { + .platform_data = &coldfire_master_info, + } +}; + +#if defined(CONFIG_TOUCHSCREEN_ADS7843) +static int __init init_ads7843(void) +{ + /* GPIO initiaalization */ + MCF_GPIO_PAR_IRQ = MCF_GPIO_PAR_IRQ_PAR_IRQ4(0); + /* EPORT initialization */ + MCF_EPORT_EPPAR = MCF_EPORT_EPPAR_EPPA4(MCF_EPORT_EPPAR_FALLING); + MCF_EPORT_EPDDR = 0; + MCF_EPORT_EPIER = MCF_EPORT_EPIER_EPIE4; + /* enable interrupt source */ + MCF_INTC0_ICR4 = ADS7843_IRQ_LEVEL; + MCF_INTC0_CIMR = ADS784x_IRQ_SOURCE; +} +#endif + +static int __init spi_dev_init(void) +{ + int retval = 0; +#if defined(CONFIG_TOUCHSCREEN_ADS7843) + init_ads7843(); +#endif + + retval = platform_device_register(&coldfire_spi); + if (retval < 0) + goto out; + + if (ARRAY_SIZE(spi_board_info)) + retval = spi_register_board_info(spi_board_info, ARRAY_SIZE(spi_board_info)); + + +out: + return retval; +} + +arch_initcall(spi_dev_init); Index: linux-2.6.24.7/arch/m68knommu/platform/532x/usb-mcf532x.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/532x/usb-mcf532x.c @@ -0,0 +1,171 @@ +/*************************************************************************** + * usb-mcf532x.c - Platform level (mcf532x) USB initialization. + * + * Andrey Butok Andrey.Butok@freescale.com. + * Copyright Freescale Semiconductor, Inc 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License + * for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software Foundation, + * Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + *************************************************************************** + * Changes: + * v0.01 31 March 2006 Andrey Butok + * Initial Release - developed on uClinux with 2.6.15.6 kernel + * + * WARNING: The MCF532x USB functionality was tested + * only with low-speed USB devices (cause of HW bugs). 
+ */
+
+#undef DEBUG
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/platform_device.h>
+
+/* Start address of HC registers.*/
+#define MCF532x_USB_HOST_REG_START (0xfc0b4000)
+/* End address of HC registers */
+#define MCF532x_USB_HOST_REG_END (MCF532x_USB_HOST_REG_START+0x200)
+/* USB Host Interrupt number */
+#define MCF532x_USB_HOST_INT_NUMBER (128+48)
+
+#ifdef CONFIG_USB_OTG
+/* Start address of OTG module registers.*/
+#define MCF532x_USB_OTG_REG_START (0xfc0b0000)
+/* End address of OTG module registers */
+#define MCF532x_USB_OTG_REG_END (MCF532x_USB_OTG_REG_START+0x200)
+/* USB OTG Interrupt number */
+#define MCF532x_USB_OTG_INT_NUMBER (128+47)
+#endif
+
+/*-------------------------------------------------------------------------*/
+
+static void
+usb_release(struct device *dev)
+{
+ /* normally not freed */
+}
+
+/*
+ * USB Host module structures
+ */
+static struct resource ehci_host_resources[] = {
+ {
+ .start = MCF532x_USB_HOST_REG_START,
+ .end = MCF532x_USB_HOST_REG_END,
+ .flags = IORESOURCE_MEM,
+ },
+ {
+ .start = MCF532x_USB_HOST_INT_NUMBER,
+ .flags = IORESOURCE_IRQ,
+ },
+};
+
+static struct platform_device ehci_host_device = {
+ .name = "ehci",
+ .id = 1,
+ .dev = {
+ .release = usb_release,
+ .dma_mask = 0x0},
+ .num_resources = ARRAY_SIZE(ehci_host_resources),
+ .resource = ehci_host_resources,
+};
+
+/*
+ * USB OTG module structures.
+ */
+#ifdef CONFIG_USB_OTG
+static struct resource ehci_otg_resources[] = {
+ {
+ .start = MCF532x_USB_OTG_REG_START,
+ .end = MCF532x_USB_OTG_REG_END,
+ .flags = IORESOURCE_MEM,
+ },
+ {
+ .start = MCF532x_USB_OTG_INT_NUMBER,
+ .flags = IORESOURCE_IRQ,
+ },
+};
+
+static struct platform_device ehci_otg_device = {
+ .name = "ehci",
+ .id = 0,
+ .dev = {
+ .release = usb_release,
+ .dma_mask = 0x0},
+ .num_resources = ARRAY_SIZE(ehci_otg_resources),
+ .resource = ehci_otg_resources,
+};
+#endif
+
+typedef volatile u8 vuint8; /* 8 bits */
+
+static int __init
+mcf532x_usb_init(void)
+{
+ int status;
+
+ /*
+ * Initialize the clock divider for the USB:
+ */
+#if CONFIG_CLOCK_FREQ == 240000000
+ /*
+ * CPU operating at 240MHz (MISCCR[USBDIV]=1),
+ * this is the default
+ */
+ (*(volatile u16 *) (0xFC0A0010)) |= (0x0002);
+#elif CONFIG_CLOCK_FREQ == 180000000
+ /*
+ * CPU operating at 180MHz (MISCCR[USBDIV]=0)
+ */
+ (*(volatile u16 *) (0xFC0A0010)) &= ~(0x0002);
+#else
+ #error "CLOCK must be 240MHz or 180Mhz"
+#endif
+ /*
+ * Register USB Host device:
+ */
+ status = platform_device_register(&ehci_host_device);
+ if (status) {
+ pr_info
+ ("USB-MCF532x: Can't register MCF532x USB Host device, %d\n",
+ status);
+ return -ENODEV;
+ }
+ pr_info("USB-MCF532x: MCF532x USB Host device is registered\n");
+
+#ifdef CONFIG_USB_OTG
+ /*
+ * Register USB OTG device:
+ * Currently only the USB Host is done.
+ * TODO: Device and OTG functionality.
+ */ + status = platform_device_register(&ehci_otg_device); + if (status) { + pr_info + ("USB-MCF532x: Can't register MCF532x USB OTG device, %d\n", + status); + return -ENODEV; + } + pr_info("USB-MCF532x: MCF532x USB OTG device is registered\n"); +#endif + + return 0; +} + +subsys_initcall(mcf532x_usb_init); Index: linux-2.6.24.7/arch/m68knommu/platform/5407/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/5407/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/5407/config.c @@ -13,11 +13,11 @@ #include <linux/param.h> #include <linux/init.h> #include <linux/interrupt.h> -#include <asm/dma.h> +#include <linux/io.h> #include <asm/machdep.h> #include <asm/coldfire.h> #include <asm/mcfsim.h> -#include <asm/mcfdma.h> +#include <asm/mcfuart.h> /***************************************************************************/ @@ -29,17 +29,51 @@ extern unsigned int mcf_timerlevel; /***************************************************************************/ -/* - * DMA channel base address table. - */ -unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = { - MCF_MBAR + MCFDMA_BASE0, - MCF_MBAR + MCFDMA_BASE1, - MCF_MBAR + MCFDMA_BASE2, - MCF_MBAR + MCFDMA_BASE3, +static struct mcf_platform_uart m5407_uart_platform[] = { + { + .mapbase = MCF_MBAR + MCFUART_BASE1, + .irq = 73, + }, + { + .mapbase = MCF_MBAR + MCFUART_BASE2, + .irq = 74, + }, + { }, }; -unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS]; +static struct platform_device m5407_uart = { + .name = "mcfuart", + .id = 0, + .dev.platform_data = m5407_uart_platform, +}; + +static struct platform_device *m5407_devices[] __initdata = { + &m5407_uart, +}; + +/***************************************************************************/ + +static void __init m5407_uart_init_line(int line, int irq) +{ + if (line == 0) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI1, MCF_MBAR + MCFSIM_UART1ICR); + writeb(irq, MCFUART_BASE1 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART1); + } else if (line == 1) { + writel(MCFSIM_ICR_LEVEL6 | MCFSIM_ICR_PRI2, MCF_MBAR + MCFSIM_UART2ICR); + writeb(irq, MCFUART_BASE2 + MCFUART_UIVR); + mcf_setimr(mcf_getimr() & ~MCFSIM_IMR_UART2); + } +} + +static void __init m5407_uarts_init(void) +{ + const int nrlines = ARRAY_SIZE(m5407_uart_platform); + int line; + + for (line = 0; (line < nrlines); line++) + m5407_uart_init_line(line, m5407_uart_platform[line].irq); +} /***************************************************************************/ @@ -76,21 +110,7 @@ void mcf_settimericr(unsigned int timer, /***************************************************************************/ -int mcf_timerirqpending(int timer) -{ - unsigned int imr = 0; - - switch (timer) { - case 1: imr = MCFSIM_IMR_TIMER1; break; - case 2: imr = MCFSIM_IMR_TIMER2; break; - default: break; - } - return (mcf_getipr() & imr); -} - -/***************************************************************************/ - -void config_BSP(char *commandp, int size) +void __init config_BSP(char *commandp, int size) { mcf_setimr(MCFSIM_IMR_MASKALL); @@ -105,3 +125,14 @@ void config_BSP(char *commandp, int size } /***************************************************************************/ + +static int __init init_BSP(void) +{ + m5407_uarts_init(); + platform_add_devices(m5407_devices, ARRAY_SIZE(m5407_devices)); + return 0; +} + +arch_initcall(init_BSP); + +/***************************************************************************/ Index: 
linux-2.6.24.7/arch/m68knommu/platform/68328/ints.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/68328/ints.c +++ linux-2.6.24.7/arch/m68knommu/platform/68328/ints.c @@ -101,6 +101,8 @@ void __init init_vectors(void) IMR = ~0; } +void do_IRQ(int irq, struct pt_regs *fp); + /* The 68k family did not have a good way to determine the source * of interrupts until later in the family. The EC000 core does * not provide the vector number on the stack, we vector everything Index: linux-2.6.24.7/arch/m68knommu/platform/68328/timers.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/68328/timers.c +++ linux-2.6.24.7/arch/m68knommu/platform/68328/timers.c @@ -19,6 +19,7 @@ #include <linux/mm.h> #include <linux/interrupt.h> #include <linux/irq.h> +#include <linux/clocksource.h> #include <asm/setup.h> #include <asm/system.h> #include <asm/pgtable.h> @@ -51,6 +52,19 @@ #define TICKS_PER_JIFFY 10 #endif +static u32 m68328_tick_cnt; + +/***************************************************************************/ + +static irqreturn_t hw_tick(int irq, void *dummy) +{ + /* Reset Timer1 */ + TSTAT &= 0; + + m68328_tick_cnt += TICKS_PER_JIFFY; + return arch_timer_interrupt(irq, dummy); +} + /***************************************************************************/ static irqreturn_t hw_tick(int irq, void *dummy) @@ -69,6 +83,33 @@ static struct irqaction m68328_timer_irq .handler = hw_tick, }; +/***************************************************************************/ + +static cycle_t m68328_read_clk(void) +{ + unsigned long flags; + u32 cycles; + + local_irq_save(flags); + cycles = m68328_tick_cnt + TCN; + local_irq_restore(flags); + + return cycles; +} + +/***************************************************************************/ + +static struct clocksource m68328_clk = { + .name = "timer", + .rating = 250, + .read = m68328_read_clk, + .shift = 20, + .mask = CLOCKSOURCE_MASK(32), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, +}; + +/***************************************************************************/ + void hw_timer_init(void) { /* disable timer 1 */ @@ -84,19 +125,8 @@ void hw_timer_init(void) /* Enable timer 1 */ TCTL |= TCTL_TEN; -} - -/***************************************************************************/ - -unsigned long hw_timer_offset(void) -{ - unsigned long ticks = TCN, offset = 0; - - /* check for pending interrupt */ - if (ticks < (TICKS_PER_JIFFY >> 1) && (ISR & (1 << TMR_IRQ_NUM))) - offset = 1000000 / HZ; - ticks = (ticks * 1000000 / HZ) / TICKS_PER_JIFFY; - return ticks + offset; + m68328_clk.mult = clocksource_hz2mult(TICKS_PER_JIFFY*HZ, m68328_clk.shift); + clocksource_register(&m68328_clk); } /***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/68360/config.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/68360/config.c +++ linux-2.6.24.7/arch/m68knommu/platform/68360/config.c @@ -103,11 +103,6 @@ void hw_timer_init(void) pquicc->timer_tgcr = tgcr_save; } -unsigned long hw_timer_offset(void) -{ - return 0; -} - void BSP_gettod (int *yearp, int *monp, int *dayp, int *hourp, int *minp, int *secp) { Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/Makefile =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/Makefile 
@@ -0,0 +1,32 @@
+#
+# Makefile for the m68knommu kernel.
+#
+
+#
+# If you want to play with the HW breakpoints then you will
+# need to define this, which will give you a stack backtrace
+# on the console port whenever a DBG interrupt occurs. You have to
+# set up your HW breakpoints to trigger a DBG interrupt:
+#
+# EXTRA_CFLAGS += -DTRAP_DBG_INTERRUPT
+# EXTRA_AFLAGS += -DTRAP_DBG_INTERRUPT
+#
+
+ifdef CONFIG_FULLDEBUG
+AFLAGS += -DDEBUGGER_COMPATIBLE_CACHE=1
+endif
+
+obj-$(CONFIG_COLDFIRE) += dma.o entry.o vectors.o
+obj-$(CONFIG_M5206) += timers.o
+obj-$(CONFIG_M5206e) += timers.o
+obj-$(CONFIG_M520x) += pit.o
+obj-$(CONFIG_M523x) += pit.o dma_timer.o irq_chip.o
+obj-$(CONFIG_M5249) += timers.o
+obj-$(CONFIG_M527x) += pit.o
+obj-$(CONFIG_M5272) += timers.o
+obj-$(CONFIG_M528x) += pit.o
+obj-$(CONFIG_M5307) += timers.o
+obj-$(CONFIG_M532x) += timers.o
+obj-$(CONFIG_M5407) += timers.o
+
+extra-y := head.o
Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/dma.c
===================================================================
--- /dev/null
+++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/dma.c
@@ -0,0 +1,39 @@
+/***************************************************************************/
+
+/*
+ * dma.c -- Freescale ColdFire DMA support
+ *
+ * Copyright (C) 2007, Greg Ungerer (gerg@snapgear.com)
+ */
+
+/***************************************************************************/
+
+#include <linux/kernel.h>
+#include <asm/dma.h>
+#include <asm/coldfire.h>
+#include <asm/mcfsim.h>
+#include <asm/mcfdma.h>
+
+/***************************************************************************/
+
+/*
+ * DMA channel base address table.
+ */
+unsigned int dma_base_addr[MAX_M68K_DMA_CHANNELS] = {
+#ifdef MCFDMA_BASE0
+ MCF_MBAR + MCFDMA_BASE0,
+#endif
+#ifdef MCFDMA_BASE1
+ MCF_MBAR + MCFDMA_BASE1,
+#endif
+#ifdef MCFDMA_BASE2
+ MCF_MBAR + MCFDMA_BASE2,
+#endif
+#ifdef MCFDMA_BASE3
+ MCF_MBAR + MCFDMA_BASE3,
+#endif
+};
+
+unsigned int dma_device_address[MAX_M68K_DMA_CHANNELS];
+
+/***************************************************************************/
Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/dma_timer.c
===================================================================
--- /dev/null
+++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/dma_timer.c
@@ -0,0 +1,84 @@
+/*
+ * dma_timer.c -- Freescale ColdFire DMA Timer.
+ *
+ * Copyright (C) 2007, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright (C) 2008.
Sebastian Siewior, Linutronix + * + */ + +#include <linux/clocksource.h> +#include <linux/io.h> + +#include <asm/machdep.h> +#include <asm/coldfire.h> +#include <asm/mcfpit.h> +#include <asm/mcfsim.h> + +#define DMA_TIMER_0 (0x00) +#define DMA_TIMER_1 (0x40) +#define DMA_TIMER_2 (0x80) +#define DMA_TIMER_3 (0xc0) + +#define DTMR0 (MCF_IPSBAR + DMA_TIMER_0 + 0x400) +#define DTXMR0 (MCF_IPSBAR + DMA_TIMER_0 + 0x402) +#define DTER0 (MCF_IPSBAR + DMA_TIMER_0 + 0x403) +#define DTRR0 (MCF_IPSBAR + DMA_TIMER_0 + 0x404) +#define DTCR0 (MCF_IPSBAR + DMA_TIMER_0 + 0x408) +#define DTCN0 (MCF_IPSBAR + DMA_TIMER_0 + 0x40c) + +#define DMA_FREQ ((MCF_CLK / 2) / 16) + +/* DTMR */ +#define DMA_DTMR_RESTART (1 << 3) +#define DMA_DTMR_CLK_DIV_1 (1 << 1) +#define DMA_DTMR_CLK_DIV_16 (2 << 1) +#define DMA_DTMR_ENABLE (1 << 0) + +static cycle_t cf_dt_get_cycles(void) +{ + return __raw_readl(DTCN0); +} + +static struct clocksource clocksource_cf_dt = { + .name = "coldfire_dma_timer", + .rating = 200, + .read = cf_dt_get_cycles, + .mask = CLOCKSOURCE_MASK(32), + .shift = 20, + .flags = CLOCK_SOURCE_IS_CONTINUOUS, +}; + +static int __init init_cf_dt_clocksource(void) +{ + /* + * We setup DMA timer 0 in free run mode. This incrementing counter is + * used as a highly precious clock source. With MCF_CLOCK = 150 MHz we + * get a ~213 ns resolution and the 32bit register will overflow almost + * every 15 minutes. + */ + __raw_writeb(0x00, DTXMR0); + __raw_writeb(0x00, DTER0); + __raw_writel(0x00000000, DTRR0); + __raw_writew(DMA_DTMR_CLK_DIV_16 | DMA_DTMR_ENABLE, DTMR0); + clocksource_cf_dt.mult = clocksource_hz2mult(DMA_FREQ, + clocksource_cf_dt.shift); + return clocksource_register(&clocksource_cf_dt); +} + +arch_initcall(init_cf_dt_clocksource); + +#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen in tsc / x86 */ +#define CYC2NS_SCALE ((1000000 << CYC2NS_SCALE_FACTOR) / (DMA_FREQ / 1000)) + +static unsigned long long cycles2ns(unsigned long cycl) +{ + return (unsigned long long) ((unsigned long long)cycl * CYC2NS_SCALE) + >> CYC2NS_SCALE_FACTOR; +} + +unsigned long long sched_clock(void) +{ + unsigned long cycl = __raw_readl(DTCN0); + + return cycles2ns(cycl); +} Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S @@ -0,0 +1,241 @@ +/* + * linux/arch/m68knommu/platform/5307/entry.S + * + * Copyright (C) 1999-2007, Greg Ungerer (gerg@snapgear.com) + * Copyright (C) 1998 D. Jeff Dionne <jeff@lineo.ca>, + * Kenneth Albanowski <kjahds@kjahds.com>, + * Copyright (C) 2000 Lineo Inc. (www.lineo.com) + * Copyright (C) 2004-2006 Macq Electronique SA. (www.macqel.com) + * + * Based on: + * + * linux/arch/m68k/kernel/entry.S + * + * Copyright (C) 1991, 1992 Linus Torvalds + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file README.legal in the main directory of this archive + * for more details. + * + * Linux/m68k support by Hamish Macdonald + * + * 68060 fixes by Jesper Skov + * ColdFire support by Greg Ungerer (gerg@snapgear.com) + * 5307 fixes by David W. 
Miller + * linux 2.4 support David McCullough <davidm@snapgear.com> + * Bug, speed and maintainability fixes by Philippe De Muyter <phdm@macqel.be> + */ + +#include <linux/sys.h> +#include <linux/linkage.h> +#include <asm/unistd.h> +#include <asm/thread_info.h> +#include <asm/errno.h> +#include <asm/setup.h> +#include <asm/segment.h> +#include <asm/asm-offsets.h> +#include <asm/entry.h> + +.bss + +sw_ksp: +.long 0 + +sw_usp: +.long 0 + +.text + +.globl system_call +.globl resume +.globl ret_from_exception +.globl ret_from_signal +.globl sys_call_table +.globl ret_from_interrupt +.globl inthandler +.globl fasthandler + +enosys: + mov.l #sys_ni_syscall,%d3 + bra 1f + +ENTRY(system_call) + SAVE_ALL + move #0x2000,%sr /* enable intrs again */ + + cmpl #NR_syscalls,%d0 + jcc enosys + lea sys_call_table,%a0 + lsll #2,%d0 /* movel %a0@(%d0:l:4),%d3 */ + movel %a0@(%d0),%d3 + jeq enosys + +1: + movel %sp,%d2 /* get thread_info pointer */ + andl #-THREAD_SIZE,%d2 /* at start of kernel stack */ + movel %d2,%a0 + movel %a0@,%a1 /* save top of frame */ + movel %sp,%a1@(TASK_THREAD+THREAD_ESP0) + btst #(TIF_SYSCALL_TRACE%8),%a0@(TI_FLAGS+(31-TIF_SYSCALL_TRACE)/8) + bnes 1f + + movel %d3,%a0 + jbsr %a0@ + movel %d0,%sp@(PT_D0) /* save the return value */ + jra ret_from_exception +1: + movel #-ENOSYS,%d2 /* strace needs -ENOSYS in PT_D0 */ + movel %d2,PT_D0(%sp) /* on syscall entry */ + subql #4,%sp + SAVE_SWITCH_STACK + jbsr syscall_trace + RESTORE_SWITCH_STACK + addql #4,%sp + movel %d3,%a0 + jbsr %a0@ + movel %d0,%sp@(PT_D0) /* save the return value */ + subql #4,%sp /* dummy return address */ + SAVE_SWITCH_STACK + jbsr syscall_trace + +ret_from_signal: + RESTORE_SWITCH_STACK + addql #4,%sp + +ret_from_exception: + move #0x2700,%sr /* disable intrs */ + btst #5,%sp@(PT_SR) /* check if returning to kernel */ + jeq Luser_return /* if so, skip resched, signals */ + +#ifdef CONFIG_PREEMPT + movel %sp,%d1 /* get thread_info pointer */ + andl #-THREAD_SIZE,%d1 /* at base of kernel stack */ + movel %d1,%a0 + movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ + andl #_TIF_NEED_RESCHED,%d1 + jeq Lkernel_return + + movel %a0@(TI_PREEMPTCOUNT),%d1 + cmpl #0,%d1 + jne Lkernel_return + + pea Lkernel_return + jmp preempt_schedule_irq /* preempt the kernel */ +#endif + +Lkernel_return: + moveml %sp@,%d1-%d5/%a0-%a2 + lea %sp@(32),%sp /* space for 8 regs */ + movel %sp@+,%d0 + addql #4,%sp /* orig d0 */ + addl %sp@+,%sp /* stk adj */ + rte + +Luser_return: + movel %sp,%d1 /* get thread_info pointer */ + andl #-THREAD_SIZE,%d1 /* at base of kernel stack */ + movel %d1,%a0 + movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ + andl #_TIF_WORK_MASK,%d1 + jne Lwork_to_do /* still work to do */ + +Lreturn: + move #0x2700,%sr /* disable intrs */ + movel sw_usp,%a0 /* get usp */ + movel %sp@(PT_PC),%a0@- /* copy exception program counter */ + movel %sp@(PT_FORMATVEC),%a0@-/* copy exception format/vector/sr */ + moveml %sp@,%d1-%d5/%a0-%a2 + lea %sp@(32),%sp /* space for 8 regs */ + movel %sp@+,%d0 + addql #4,%sp /* orig d0 */ + addl %sp@+,%sp /* stk adj */ + addql #8,%sp /* remove exception */ + movel %sp,sw_ksp /* save ksp */ + subql #8,sw_usp /* set exception */ + movel sw_usp,%sp /* restore usp */ + rte + +Lwork_to_do: + movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ + move #0x2000,%sr /* enable intrs again */ + btst #TIF_NEED_RESCHED,%d1 + jne reschedule + + /* GERG: do we need something here for TRACEing?? 
*/ + +Lsignal_return: + subql #4,%sp /* dummy return address */ + SAVE_SWITCH_STACK + pea %sp@(SWITCH_STACK_SIZE) + clrl %sp@- + jsr do_signal + addql #8,%sp + RESTORE_SWITCH_STACK + addql #4,%sp + jmp Lreturn + +/* + * This is the generic interrupt handler (for all hardware interrupt + * sources). Calls upto high level code to do all the work. + */ +ENTRY(inthandler) + SAVE_ALL + moveq #-1,%d0 + movel %d0,%sp@(PT_ORIG_D0) + + movew %sp@(PT_FORMATVEC),%d0 /* put exception # in d0 */ + andl #0x03fc,%d0 /* mask out vector only */ + + movel %sp,%sp@- /* push regs arg */ + lsrl #2,%d0 /* calculate real vector # */ + movel %d0,%sp@- /* push vector number */ + jbsr do_IRQ /* call high level irq handler */ + lea %sp@(8),%sp /* pop args off stack */ + + bra ret_from_interrupt /* this was fallthrough */ + +/* + * This is the fast interrupt handler (for certain hardware interrupt + * sources). Unlike the normal interrupt handler it just uses the + * current stack (doesn't care if it is user or kernel). It also + * doesn't bother doing the bottom half handlers. + */ +ENTRY(fasthandler) + SAVE_LOCAL + + movew %sp@(PT_FORMATVEC),%d0 + andl #0x03fc,%d0 /* mask out vector only */ + + movel %sp,%sp@- /* push regs arg */ + lsrl #2,%d0 /* calculate real vector # */ + movel %d0,%sp@- /* push vector number */ + jbsr do_IRQ /* call high level irq handler */ + lea %sp@(8),%sp /* pop args off stack */ + + RESTORE_LOCAL + +ENTRY(ret_from_interrupt) + /* the fasthandler is confusing me, haven't seen any user */ + jmp ret_from_exception + +/* + * Beware - when entering resume, prev (the current task) is + * in a0, next (the new task) is in a1,so don't change these + * registers until their contents are no longer needed. + * This is always called in supervisor mode, so don't bother to save + * and restore sr; user's process sr is actually in the stack. + */ +ENTRY(resume) + movel %a0, %d1 /* get prev thread in d1 */ + + movel sw_usp,%d0 /* save usp */ + movel %d0,%a0@(TASK_THREAD+THREAD_USP) + + SAVE_SWITCH_STACK + movel %sp,%a0@(TASK_THREAD+THREAD_KSP) /* save kernel stack pointer */ + movel %a1@(TASK_THREAD+THREAD_KSP),%sp /* restore new thread stack */ + RESTORE_SWITCH_STACK + + movel %a1@(TASK_THREAD+THREAD_USP),%a0 /* restore thread user stack */ + movel %a0, sw_usp + rts Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/head.S =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/head.S @@ -0,0 +1,222 @@ +/*****************************************************************************/ + +/* + * head.S -- common startup code for ColdFire CPUs. + * + * (C) Copyright 1999-2006, Greg Ungerer <gerg@snapgear.com>. + */ + +/*****************************************************************************/ + +#include <linux/sys.h> +#include <linux/linkage.h> +#include <asm/asm-offsets.h> +#include <asm/coldfire.h> +#include <asm/mcfcache.h> +#include <asm/mcfsim.h> + +/*****************************************************************************/ + +/* + * If we don't have a fixed memory size, then lets build in code + * to auto detect the DRAM size. Obviously this is the prefered + * method, and should work for most boards. It won't work for those + * that do not have their RAM starting at address 0, and it only + * works on SDRAM (not boards fitted with SRAM). 
+ */ +#if CONFIG_RAMSIZE != 0 +.macro GET_MEM_SIZE + movel #CONFIG_RAMSIZE,%d0 /* hard coded memory size */ +.endm + +#elif defined(CONFIG_M5206) || defined(CONFIG_M5206e) || \ + defined(CONFIG_M5249) || defined(CONFIG_M527x) || \ + defined(CONFIG_M528x) || defined(CONFIG_M5307) || \ + defined(CONFIG_M5407) +/* + * Not all these devices have exactly the same DRAM controller, + * but the DCMR register is virtually identical - give or take + * a couple of bits. The only exception is the 5272 devices, their + * DRAM controller is quite different. + */ +.macro GET_MEM_SIZE + movel MCF_MBAR+MCFSIM_DMR0,%d0 /* get mask for 1st bank */ + btst #0,%d0 /* check if region enabled */ + beq 1f + andl #0xfffc0000,%d0 + beq 1f + addl #0x00040000,%d0 /* convert mask to size */ +1: + movel MCF_MBAR+MCFSIM_DMR1,%d1 /* get mask for 2nd bank */ + btst #0,%d1 /* check if region enabled */ + beq 2f + andl #0xfffc0000, %d1 + beq 2f + addl #0x00040000,%d1 + addl %d1,%d0 /* total mem size in d0 */ +2: +.endm + +#elif defined(CONFIG_M5272) +.macro GET_MEM_SIZE + movel MCF_MBAR+MCFSIM_CSOR7,%d0 /* get SDRAM address mask */ + andil #0xfffff000,%d0 /* mask out chip select options */ + negl %d0 /* negate bits */ +.endm + +#elif defined(CONFIG_M520x) +.macro GET_MEM_SIZE + clrl %d0 + movel MCF_MBAR+MCFSIM_SDCS0, %d2 /* Get SDRAM chip select 0 config */ + andl #0x1f, %d2 /* Get only the chip select size */ + beq 3f /* Check if it is enabled */ + addql #1, %d2 /* Form exponent */ + moveql #1, %d0 + lsll %d2, %d0 /* 2 ^ exponent */ +3: + movel MCF_MBAR+MCFSIM_SDCS1, %d2 /* Get SDRAM chip select 1 config */ + andl #0x1f, %d2 /* Get only the chip select size */ + beq 4f /* Check if it is enabled */ + addql #1, %d2 /* Form exponent */ + moveql #1, %d1 + lsll %d2, %d1 /* 2 ^ exponent */ + addl %d1, %d0 /* Total size of SDRAM in d0 */ +4: +.endm + +#else +#error "ERROR: I don't know how to probe your boards memory size?" +#endif + +/*****************************************************************************/ + +/* + * Boards and platforms can do specific early hardware setup if + * they need to. Most don't need this, define away if not required. + */ +#ifndef PLATFORM_SETUP +#define PLATFORM_SETUP +#endif + +/*****************************************************************************/ + +.global _start +.global _rambase +.global _ramvec +.global _ramstart +.global _ramend + +/*****************************************************************************/ + +.data + +/* + * During startup we store away the RAM setup. These are not in the + * bss, since their values are determined and written before the bss + * has been cleared. + */ +_rambase: +.long 0 +_ramvec: +.long 0 +_ramstart: +.long 0 +_ramend: +.long 0 + +/*****************************************************************************/ + +.text + +/* + * This is the codes first entry point. This is where it all + * begins... + */ + +_start: + nop /* filler */ + movew #0x2700, %sr /* no interrupts */ + + /* + * Do any platform or board specific setup now. Most boards + * don't need anything. Those exceptions are define this in + * their board specific includes. + */ + PLATFORM_SETUP + + /* + * Create basic memory configuration. Set VBR accordingly, + * and size memory. 
+ */ + movel #CONFIG_VECTORBASE,%a7 + movec %a7,%VBR /* set vectors addr */ + movel %a7,_ramvec + + movel #CONFIG_RAMBASE,%a7 /* mark the base of RAM */ + movel %a7,_rambase + + GET_MEM_SIZE /* macro code determines size */ + addl %a7,%d0 + movel %d0,_ramend /* set end ram addr */ + + /* + * Now that we know what the memory is, lets enable cache + * and get things moving. This is Coldfire CPU specific. + */ + CACHE_ENABLE /* enable CPU cache */ + + +#ifdef CONFIG_ROMFS_FS + /* + * Move ROM filesystem above bss :-) + */ + lea _sbss,%a0 /* get start of bss */ + lea _ebss,%a1 /* set up destination */ + movel %a0,%a2 /* copy of bss start */ + + movel 8(%a0),%d0 /* get size of ROMFS */ + addql #8,%d0 /* allow for rounding */ + andl #0xfffffffc, %d0 /* whole words */ + + addl %d0,%a0 /* copy from end */ + addl %d0,%a1 /* copy from end */ + movel %a1,_ramstart /* set start of ram */ + +_copy_romfs: + movel -(%a0),%d0 /* copy dword */ + movel %d0,-(%a1) + cmpl %a0,%a2 /* check if at end */ + bne _copy_romfs + +#else /* CONFIG_ROMFS_FS */ + lea _ebss,%a1 + movel %a1,_ramstart +#endif /* CONFIG_ROMFS_FS */ + + + /* + * Zero out the bss region. + */ + lea _sbss,%a0 /* get start of bss */ + lea _ebss,%a1 /* get end of bss */ + clrl %d0 /* set value */ +_clear_bss: + movel %d0,(%a0)+ /* clear each word */ + cmpl %a0,%a1 /* check if at end */ + bne _clear_bss + + /* + * Load the current task pointer and stack. + */ + lea init_thread_union,%a0 + lea THREAD_SIZE(%a0),%sp + + /* + * Assember start up done, start code proper. + */ + jsr start_kernel /* start Linux kernel */ + +_exit: + jmp _exit /* should never get here */ + +/*****************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/irq_chip.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/irq_chip.c @@ -0,0 +1,110 @@ +/* + * IRQ-Chip implementation for Coldfire + * + * Author: Sebastian Siewior <bigeasy@linutronix.de> + */ + +#include <linux/types.h> +#include <linux/irq.h> +#include <asm/coldfire.h> +#include <asm/mcfsim.h> + +static inline void *coldfire_irqnum_to_mem(unsigned int irq) +{ + u32 imrp; + + imrp = MCF_IPSBAR; +#if defined(MCFINT_INTC1_VECBASE) + if (irq > MCFINT_INTC1_VECBASE) { + imrp += MCFICM_INTC1; + irq -= MCFINT_PER_INTC; + } else +#endif + imrp += MCFICM_INTC0; + + irq -= MCFINT_VECBASE; + + if (irq > 32) + imrp += MCFINTC_IMRH; + else + imrp += MCFINTC_IMRL; + + return (void *)imrp; +} + +static inline unsigned int coldfire_irqnum_to_bit(unsigned int irq) +{ + irq -= MCFINT_VECBASE; + + if (irq > 32) + irq -= 32; + + return irq; +} + +static void coldfire_mask(unsigned int irq) +{ + volatile unsigned long *imrp; + u32 mask; + u32 irq_bit; + + imrp = coldfire_irqnum_to_mem(irq); + irq_bit = coldfire_irqnum_to_bit(irq); + + mask = 1 << irq_bit; + *imrp |= mask; +} + +static void coldfire_unmask(unsigned int irq) +{ + volatile unsigned long *imrp; + u32 mask; + u32 irq_bit; + + imrp = coldfire_irqnum_to_mem(irq); + irq_bit = coldfire_irqnum_to_bit(irq); + + mask = 1 << irq_bit; + *imrp &= ~mask; +} + +static void coldfire_nop(unsigned int irq) +{ +} + +static struct irq_chip m_irq_chip = { + .name = "M68K-INTC", + .ack = coldfire_nop, + .mask = coldfire_mask, + .unmask = coldfire_unmask, +}; + +void __init coldfire_init_irq_chip(void) +{ + volatile u32 *imrp; + volatile u8 *icrp; + u32 irq; + u32 i; + + for (irq = 0; irq < NR_IRQS; irq++) + set_irq_chip_and_handler_name(irq, 
&m_irq_chip, + handle_level_irq, m_irq_chip.name); + + /* setup prios for interrupt sources (first field is reserved) */ + icrp = (u8 *)MCF_IPSBAR + MCFICM_INTC0 + MCFINTC_ICR0; + for (i = 1; i <= 63; i++) + icrp[i] = i; + + /* remove the disable all flag, disable all interrupt sources */ + imrp = coldfire_irqnum_to_mem(MCFINT_VECBASE); + *imrp = 0xfffffffe; + +#if defined(MCFINT_INTC1_VECBASE) + icrp = (u8 *)MCF_IPSBAR + MCFICM_INTC1 + MCFINTC_ICR0; + for (i = 1; i <= 63; i++) + icrp[i] = i; + + imrp = coldfire_irqnum_to_mem(MCFINT_INTC1_VECBASE); + *imrp = 0xfffffffe; +#endif +} Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/pit.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/pit.c @@ -0,0 +1,180 @@ +/***************************************************************************/ + +/* + * pit.c -- Freescale ColdFire PIT timer. Currently this type of + * hardware timer only exists in the Freescale ColdFire + * 5270/5271, 5282 and 5208 CPUs. No doubt newer ColdFire + * family members will probably use it too. + * + * Copyright (C) 1999-2008, Greg Ungerer (gerg@snapgear.com) + * Copyright (C) 2001-2004, SnapGear Inc. (www.snapgear.com) + */ + +/***************************************************************************/ + +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/param.h> +#include <linux/init.h> +#include <linux/interrupt.h> +#include <linux/irq.h> +#include <linux/clockchips.h> +#include <asm/machdep.h> +#include <asm/io.h> +#include <asm/coldfire.h> +#include <asm/mcfpit.h> +#include <asm/mcfsim.h> + +/***************************************************************************/ + +/* + * By default use timer1 as the system clock timer. + */ +#define FREQ ((MCF_CLK / 2) / 64) +#define TA(a) (MCF_IPSBAR + MCFPIT_BASE1 + (a)) +#define INTC0 (MCF_IPSBAR + MCFICM_INTC0) +#define PIT_CYCLES_PER_JIFFY (FREQ / HZ) + +static u32 pit_cnt; + +/* + * Initialize the PIT timer. + * + * This is also called after resume to bring the PIT into operation again. 
+ */ + +static void init_cf_pit_timer(enum clock_event_mode mode, + struct clock_event_device *evt) +{ + switch (mode) { + case CLOCK_EVT_MODE_PERIODIC: + + __raw_writew(MCFPIT_PCSR_DISABLE, TA(MCFPIT_PCSR)); + __raw_writew(PIT_CYCLES_PER_JIFFY, TA(MCFPIT_PMR)); + __raw_writew(MCFPIT_PCSR_EN | MCFPIT_PCSR_PIE | \ + MCFPIT_PCSR_OVW | MCFPIT_PCSR_RLD | \ + MCFPIT_PCSR_CLK64, TA(MCFPIT_PCSR)); + break; + + case CLOCK_EVT_MODE_SHUTDOWN: + case CLOCK_EVT_MODE_UNUSED: + + __raw_writew(MCFPIT_PCSR_DISABLE, TA(MCFPIT_PCSR)); + break; + + case CLOCK_EVT_MODE_ONESHOT: + + __raw_writew(MCFPIT_PCSR_DISABLE, TA(MCFPIT_PCSR)); + __raw_writew(MCFPIT_PCSR_EN | MCFPIT_PCSR_PIE | \ + MCFPIT_PCSR_OVW | MCFPIT_PCSR_CLK64, \ + TA(MCFPIT_PCSR)); + break; + + case CLOCK_EVT_MODE_RESUME: + /* Nothing to do here */ + break; + } +} + +/* + * Program the next event in oneshot mode + * + * Delta is given in PIT ticks + */ +static int cf_pit_next_event(unsigned long delta, + struct clock_event_device *evt) +{ + __raw_writew(delta, TA(MCFPIT_PMR)); + return 0; +} + +struct clock_event_device cf_pit_clockevent = { + .name = "pit", + .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT, + .set_mode = init_cf_pit_timer, + .set_next_event = cf_pit_next_event, + .shift = 32, + .irq = MCFINT_VECBASE + MCFINT_PIT1, +}; + + + +/***************************************************************************/ + +static irqreturn_t pit_tick(int irq, void *dummy) +{ + struct clock_event_device *evt = &cf_pit_clockevent; + u16 pcsr; + + /* Reset the ColdFire timer */ + pcsr = __raw_readw(TA(MCFPIT_PCSR)); + __raw_writew(pcsr | MCFPIT_PCSR_PIF, TA(MCFPIT_PCSR)); + + pit_cnt += PIT_CYCLES_PER_JIFFY; + evt->event_handler(evt); + return IRQ_HANDLED; +} + +/***************************************************************************/ + +static struct irqaction pit_irq = { + .name = "timer", + .flags = IRQF_DISABLED | IRQF_TIMER, + .handler = pit_tick, +}; + +/***************************************************************************/ + +static cycle_t pit_read_clk(void) +{ + unsigned long flags; + u32 cycles; + u16 pcntr; + + local_irq_save(flags); + pcntr = __raw_readw(TA(MCFPIT_PCNTR)); + cycles = pit_cnt; + local_irq_restore(flags); + + return cycles + PIT_CYCLES_PER_JIFFY - pcntr; +} + +/***************************************************************************/ + +static struct clocksource pit_clk = { + .name = "pit", + .rating = 100, + .read = pit_read_clk, + .shift = 20, + .mask = CLOCKSOURCE_MASK(32), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, +}; + +/***************************************************************************/ + +void hw_timer_init(void) +{ + u32 imr; + + cf_pit_clockevent.cpumask = cpumask_of_cpu(smp_processor_id()); + cf_pit_clockevent.mult = div_sc(FREQ, NSEC_PER_SEC, 32); + cf_pit_clockevent.max_delta_ns = + clockevent_delta2ns(0xFFFF, &cf_pit_clockevent); + cf_pit_clockevent.min_delta_ns = + clockevent_delta2ns(0x3f, &cf_pit_clockevent); + clockevents_register_device(&cf_pit_clockevent); + + setup_irq(MCFINT_VECBASE + MCFINT_PIT1, &pit_irq); + +#if !defined(CONFIG_M523x) + __raw_writeb(ICR_INTRCONF, INTC0 + MCFINTC_ICR0 + MCFINT_PIT1); + imr = __raw_readl(INTC0 + MCFPIT_IMR); + imr &= ~MCFPIT_IMR_IBIT; + __raw_writel(imr, INTC0 + MCFPIT_IMR); + +#endif + pit_clk.mult = clocksource_hz2mult(FREQ, pit_clk.shift); + clocksource_register(&pit_clk); +} + +/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/timers.c 
=================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/timers.c @@ -0,0 +1,182 @@ +/***************************************************************************/ + +/* + * timers.c -- generic ColdFire hardware timer support. + * + * Copyright (C) 1999-2008, Greg Ungerer <gerg@snapgear.com> + */ + +/***************************************************************************/ + +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/sched.h> +#include <linux/interrupt.h> +#include <linux/irq.h> +#include <linux/profile.h> +#include <linux/clocksource.h> +#include <asm/io.h> +#include <asm/traps.h> +#include <asm/machdep.h> +#include <asm/coldfire.h> +#include <asm/mcftimer.h> +#include <asm/mcfsim.h> + +/***************************************************************************/ + +/* + * By default use timer1 as the system clock timer. + */ +#define FREQ (MCF_BUSCLK / 16) +#define TA(a) (MCF_MBAR + MCFTIMER_BASE1 + (a)) + +/* + * Default the timer and vector to use for ColdFire. Some ColdFire + * CPU's and some boards may want different. Their sub-architecture + * startup code (in config.c) can change these if they want. + */ +unsigned int mcf_timervector = 29; +unsigned int mcf_profilevector = 31; +unsigned int mcf_timerlevel = 5; + +/* + * These provide the underlying interrupt vector support. + * Unfortunately it is a little different on each ColdFire. + */ +extern void mcf_settimericr(int timer, int level); +void coldfire_profile_init(void); + +#if defined(CONFIG_M532x) +#define __raw_readtrr __raw_readl +#define __raw_writetrr __raw_writel +#else +#define __raw_readtrr __raw_readw +#define __raw_writetrr __raw_writew +#endif + +static u32 mcftmr_cycles_per_jiffy; +static u32 mcftmr_cnt; + +/***************************************************************************/ + +static irqreturn_t mcftmr_tick(int irq, void *dummy) +{ + /* Reset the ColdFire timer */ + __raw_writeb(MCFTIMER_TER_CAP | MCFTIMER_TER_REF, TA(MCFTIMER_TER)); + + mcftmr_cnt += mcftmr_cycles_per_jiffy; + return arch_timer_interrupt(irq, dummy); +} + +/***************************************************************************/ + +static struct irqaction mcftmr_timer_irq = { + .name = "timer", + .flags = IRQF_DISABLED | IRQF_TIMER, + .handler = mcftmr_tick, +}; + +/***************************************************************************/ + +static cycle_t mcftmr_read_clk(void) +{ + unsigned long flags; + u32 cycles; + u16 tcn; + + local_irq_save(flags); + tcn = __raw_readw(TA(MCFTIMER_TCN)); + cycles = mcftmr_cnt; + local_irq_restore(flags); + + return cycles + tcn; +} + +/***************************************************************************/ + +static struct clocksource mcftmr_clk = { + .name = "tmr", + .rating = 250, + .read = mcftmr_read_clk, + .shift = 20, + .mask = CLOCKSOURCE_MASK(32), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, +}; + +/***************************************************************************/ + +void hw_timer_init(void) +{ + setup_irq(mcf_timervector, &mcftmr_timer_irq); + + __raw_writew(MCFTIMER_TMR_DISABLE, TA(MCFTIMER_TMR)); + mcftmr_cycles_per_jiffy = FREQ / HZ; + __raw_writetrr(mcftmr_cycles_per_jiffy, TA(MCFTIMER_TRR)); + __raw_writew(MCFTIMER_TMR_ENORI | MCFTIMER_TMR_CLK16 | + MCFTIMER_TMR_RESTART | MCFTIMER_TMR_ENABLE, TA(MCFTIMER_TMR)); + + mcftmr_clk.mult = clocksource_hz2mult(FREQ, mcftmr_clk.shift); + clocksource_register(&mcftmr_clk); + + mcf_settimericr(1, mcf_timerlevel); + +#ifdef 
CONFIG_HIGHPROFILE + coldfire_profile_init(); +#endif +} + +/***************************************************************************/ +#ifdef CONFIG_HIGHPROFILE +/***************************************************************************/ + +/* + * By default use timer2 as the profiler clock timer. + */ +#define PA(a) (MCF_MBAR + MCFTIMER_BASE2 + (a)) + +/* + * Choose a reasonably fast profile timer. Make it an odd value to + * try and get good coverage of kernel operations. + */ +#define PROFILEHZ 1013 + +/* + * Use the other timer to provide high accuracy profiling info. + */ +irqreturn_t coldfire_profile_tick(int irq, void *dummy) +{ + /* Reset ColdFire timer2 */ + __raw_writeb(MCFTIMER_TER_CAP | MCFTIMER_TER_REF, PA(MCFTIMER_TER)); + if (current->pid) + profile_tick(CPU_PROFILING); + return IRQ_HANDLED; +} + +/***************************************************************************/ + +static struct irqaction coldfire_profile_irq = { + .name = "profile timer", + .flags = IRQF_DISABLED | IRQF_TIMER, + .handler = coldfire_profile_tick, +}; + +void coldfire_profile_init(void) +{ + printk(KERN_INFO "PROFILE: lodging TIMER2 @ %dHz as profile timer\n", + PROFILEHZ); + + setup_irq(mcf_profilevector, &coldfire_profile_irq); + + /* Set up TIMER 2 as high speed profile clock */ + __raw_writew(MCFTIMER_TMR_DISABLE, PA(MCFTIMER_TMR)); + + __raw_writetrr(((MCF_BUSCLK / 16) / PROFILEHZ), PA(MCFTIMER_TRR)); + __raw_writew(MCFTIMER_TMR_ENORI | MCFTIMER_TMR_CLK16 | + MCFTIMER_TMR_RESTART | MCFTIMER_TMR_ENABLE, PA(MCFTIMER_TMR)); + + mcf_settimericr(2, 7); +} + +/***************************************************************************/ +#endif /* CONFIG_HIGHPROFILE */ +/***************************************************************************/ Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/vectors.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/vectors.c @@ -0,0 +1,105 @@ +/***************************************************************************/ + +/* + * linux/arch/m68knommu/platform/5307/vectors.c + * + * Copyright (C) 1999-2007, Greg Ungerer <gerg@snapgear.com> + */ + +/***************************************************************************/ + +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/irq.h> +#include <asm/traps.h> +#include <asm/machdep.h> +#include <asm/coldfire.h> +#include <asm/mcfsim.h> +#include <asm/mcfdma.h> +#include <asm/mcfwdebug.h> + +/***************************************************************************/ + +#ifdef TRAP_DBG_INTERRUPT + +asmlinkage void dbginterrupt_c(struct frame *fp) +{ + extern void dump(struct pt_regs *fp); + printk(KERN_DEBUG "%s(%d): BUS ERROR TRAP\n", __FILE__, __LINE__); + dump((struct pt_regs *) fp); + asm("halt"); +} + +#endif + +/***************************************************************************/ + +extern e_vector *_ramvec; + +void set_evector(int vecnum, void (*handler)(void)) +{ + if (vecnum >= 0 && vecnum <= 255) + _ramvec[vecnum] = handler; +} + +/***************************************************************************/ + +/* Assembler routines */ +asmlinkage void buserr(void); +asmlinkage void trap(void); +asmlinkage void system_call(void); +asmlinkage void inthandler(void); + +void __init init_vectors(void) +{ + int i; + + /* + * There is a common trap handler and common interrupt + * handler that handle almost every vector. 
We treat + * the system call and bus error special, they get their + * own first level handlers. + */ + for (i = 3; (i <= 23); i++) + _ramvec[i] = trap; + for (i = 33; (i <= 63); i++) + _ramvec[i] = trap; + for (i = 24; (i <= 31); i++) + _ramvec[i] = inthandler; + for (i = 64; (i < 255); i++) + _ramvec[i] = inthandler; + _ramvec[255] = 0; + + _ramvec[2] = buserr; + _ramvec[32] = system_call; + +#ifdef TRAP_DBG_INTERRUPT + _ramvec[12] = dbginterrupt; +#endif +} + +/***************************************************************************/ + +void enable_vector(unsigned int irq) +{ + /* Currently no action on ColdFire */ +} + +void disable_vector(unsigned int irq) +{ + /* Currently no action on ColdFire */ +} + +void ack_vector(unsigned int irq) +{ + /* Currently no action on ColdFire */ +} + +/***************************************************************************/ + +void coldfire_reset(void) +{ + HARD_RESET_NOW(); +} + +/***************************************************************************/ Index: linux-2.6.24.7/drivers/net/fec.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/fec.c +++ linux-2.6.24.7/drivers/net/fec.c @@ -2,12 +2,6 @@ * Fast Ethernet Controller (FEC) driver for Motorola MPC8xx. * Copyright (c) 1997 Dan Malek (dmalek@jlc.net) * - * This version of the driver is specific to the FADS implementation, - * since the board contains control registers external to the processor - * for the control of the LevelOne LXT970 transceiver. The MPC860T manual - * describes connections using the internal parallel port I/O, which - * is basically all of Port D. - * * Right now, I am very wasteful with the buffers. I allocate memory * pages and then divide them into 2K frame buffers. This way I know I * have buffers large enough to hold one frame within one buffer descriptor. @@ -49,17 +43,9 @@ #include <asm/pgtable.h> #include <asm/cacheflush.h> -#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || \ - defined(CONFIG_M5272) || defined(CONFIG_M528x) || \ - defined(CONFIG_M520x) || defined(CONFIG_M532x) #include <asm/coldfire.h> #include <asm/mcfsim.h> #include "fec.h" -#else -#include <asm/8xx_immap.h> -#include <asm/mpc8xx.h> -#include "commproc.h" -#endif #if defined(CONFIG_FEC2) #define FEC_MAX_PORTS 2 @@ -67,6 +53,7 @@ #define FEC_MAX_PORTS 1 #endif + /* * Define the fixed address of the FEC hardware. 
*/ @@ -79,15 +66,15 @@ static unsigned int fec_hw[] = { #elif defined(CONFIG_M523x) || defined(CONFIG_M528x) (MCF_MBAR + 0x1000), #elif defined(CONFIG_M520x) - (MCF_MBAR+0x30000), + (MCF_MBAR + 0x30000), #elif defined(CONFIG_M532x) - (MCF_MBAR+0xfc030000), + (MCF_MBAR + 0xfc030000), #else - &(((immap_t *)IMAP_ADDR)->im_cpm.cp_fec), + &(((immap_t *) IMAP_ADDR)->im_cpm.cp_fec), #endif }; -static unsigned char fec_mac_default[] = { +static unsigned char fec_mac_default[] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, }; @@ -101,20 +88,20 @@ static unsigned char fec_mac_default[] = #define FEC_FLASHMAC 0xf0006000 #elif defined(CONFIG_CANCam) #define FEC_FLASHMAC 0xf0020000 -#elif defined (CONFIG_M5272C3) +#elif defined(CONFIG_M5272C3) #define FEC_FLASHMAC (0xffe04000 + 4) #elif defined(CONFIG_MOD5272) -#define FEC_FLASHMAC 0xffc0406b +#define FEC_FLASHMAC 0xffc0406b #else #define FEC_FLASHMAC 0 #endif /* Forward declarations of some structures to support different PHYs */ - +typedef void (mii_func)(uint val, struct net_device *dev); typedef struct { uint mii_data; - void (*funct)(uint mii_reg, struct net_device *dev); + mii_func *funct; } phy_cmd_t; typedef struct { @@ -165,7 +152,6 @@ typedef struct { #define PKT_MINBUF_SIZE 64 #define PKT_MAXBLR_SIZE 1520 - /* * The 5270/5271/5280/5282/532x RX control register also contains maximum frame * size bits. Other FEC hardware does not, so we need to take that into @@ -188,75 +174,67 @@ typedef struct { */ struct fec_enet_private { /* Hardware registers of the FEC device */ - volatile fec_t *hwp; + volatile fec_t *hwp; struct net_device *netdev; /* The saved address of a sent-in-place packet/buffer, for skfree(). */ unsigned char *tx_bounce[TX_RING_SIZE]; - struct sk_buff* tx_skbuff[TX_RING_SIZE]; - ushort skb_cur; - ushort skb_dirty; + struct sk_buff *tx_skbuff[TX_RING_SIZE]; + ushort skb_cur; + ushort skb_dirty; /* CPM dual port RAM relative addresses. - */ - cbd_t *rx_bd_base; /* Address of Rx and Tx buffers. */ - cbd_t *tx_bd_base; - cbd_t *cur_rx, *cur_tx; /* The next free ring entry */ - cbd_t *dirty_tx; /* The ring entries to be free()ed. */ - uint tx_full; - spinlock_t lock; - - uint phy_id; - uint phy_id_done; - uint phy_status; - uint phy_speed; - phy_info_t const *phy; + */ + cbd_t *rx_bd_base; /* Address of Rx and Tx buffers. */ + cbd_t *tx_bd_base; + cbd_t *cur_rx, *cur_tx; /* The next free ring entry */ + cbd_t *dirty_tx; /* The ring entries to be free()ed. 
*/ + uint tx_full; + /* hold while accessing the HW like ringbuffer for tx/rx but not MAC */ + spinlock_t hw_lock; + /* hold while accessing the mii_list_t() elements */ + spinlock_t mii_lock; + + uint phy_id; + uint phy_id_done; + uint phy_status; + uint phy_speed; + phy_info_t const *phy; struct work_struct phy_task; - uint sequence_done; - uint mii_phy_task_queued; + uint sequence_done; + uint mii_phy_task_queued; + + uint phy_addr; - uint phy_addr; + int index; + int opened; + int link; + int old_link; + int full_duplex; +}; - int index; - int opened; - int link; - int old_link; - int full_duplex; -}; - -static int fec_enet_open(struct net_device *dev); -static int fec_enet_start_xmit(struct sk_buff *skb, struct net_device *dev); -static void fec_enet_mii(struct net_device *dev); -static irqreturn_t fec_enet_interrupt(int irq, void * dev_id); -static void fec_enet_tx(struct net_device *dev); -static void fec_enet_rx(struct net_device *dev); -static int fec_enet_close(struct net_device *dev); -static void set_multicast_list(struct net_device *dev); static void fec_restart(struct net_device *dev, int duplex); static void fec_stop(struct net_device *dev); -static void fec_set_mac_address(struct net_device *dev); - /* MII processing. We keep this as simple as possible. Requests are * placed on the list (if there is room). When the request is finished * by the MII, an optional function may be called. */ typedef struct mii_list { - uint mii_regval; - void (*mii_func)(uint val, struct net_device *dev); - struct mii_list *mii_next; + uint mii_regval; + void (*mii_func)(uint val, struct net_device *dev); + struct mii_list *mii_next; } mii_list_t; -#define NMII 20 -static mii_list_t mii_cmds[NMII]; -static mii_list_t *mii_free; -static mii_list_t *mii_head; -static mii_list_t *mii_tail; +#define NMII 20 +static mii_list_t mii_cmds[NMII]; +static mii_list_t *mii_free; +static mii_list_t *mii_head; +static mii_list_t *mii_tail; -static int mii_queue(struct net_device *dev, int request, - void (*func)(uint, struct net_device *)); +static int mii_queue(struct net_device *dev, int request, mii_func *funct); /* Make MII read/write commands for the FEC. */ @@ -272,52 +250,52 @@ static int mii_queue(struct net_device * /* Register definitions for the PHY. */ -#define MII_REG_CR 0 /* Control Register */ -#define MII_REG_SR 1 /* Status Register */ -#define MII_REG_PHYIR1 2 /* PHY Identification Register 1 */ -#define MII_REG_PHYIR2 3 /* PHY Identification Register 2 */ -#define MII_REG_ANAR 4 /* A-N Advertisement Register */ -#define MII_REG_ANLPAR 5 /* A-N Link Partner Ability Register */ -#define MII_REG_ANER 6 /* A-N Expansion Register */ -#define MII_REG_ANNPTR 7 /* A-N Next Page Transmit Register */ -#define MII_REG_ANLPRNPR 8 /* A-N Link Partner Received Next Page Reg. */ +#define MII_REG_CR 0 /* Control Register */ +#define MII_REG_SR 1 /* Status Register */ +#define MII_REG_PHYIR1 2 /* PHY Identification Register 1 */ +#define MII_REG_PHYIR2 3 /* PHY Identification Register 2 */ +#define MII_REG_ANAR 4 /* A-N Advertisement Register */ +#define MII_REG_ANLPAR 5 /* A-N Link Partner Ability Register */ +#define MII_REG_ANER 6 /* A-N Expansion Register */ +#define MII_REG_ANNPTR 7 /* A-N Next Page Transmit Register */ +#define MII_REG_ANLPRNPR 8 /* A-N Link Partner Received Next Page Reg. 
*/ /* values for phy_status */ -#define PHY_CONF_ANE 0x0001 /* 1 auto-negotiation enabled */ -#define PHY_CONF_LOOP 0x0002 /* 1 loopback mode enabled */ -#define PHY_CONF_SPMASK 0x00f0 /* mask for speed */ -#define PHY_CONF_10HDX 0x0010 /* 10 Mbit half duplex supported */ -#define PHY_CONF_10FDX 0x0020 /* 10 Mbit full duplex supported */ -#define PHY_CONF_100HDX 0x0040 /* 100 Mbit half duplex supported */ -#define PHY_CONF_100FDX 0x0080 /* 100 Mbit full duplex supported */ - -#define PHY_STAT_LINK 0x0100 /* 1 up - 0 down */ -#define PHY_STAT_FAULT 0x0200 /* 1 remote fault */ -#define PHY_STAT_ANC 0x0400 /* 1 auto-negotiation complete */ -#define PHY_STAT_SPMASK 0xf000 /* mask for speed */ -#define PHY_STAT_10HDX 0x1000 /* 10 Mbit half duplex selected */ -#define PHY_STAT_10FDX 0x2000 /* 10 Mbit full duplex selected */ -#define PHY_STAT_100HDX 0x4000 /* 100 Mbit half duplex selected */ -#define PHY_STAT_100FDX 0x8000 /* 100 Mbit full duplex selected */ +#define PHY_CONF_ANE 0x0001 /* 1 auto-negotiation enabled */ +#define PHY_CONF_LOOP 0x0002 /* 1 loopback mode enabled */ +#define PHY_CONF_SPMASK 0x00f0 /* mask for speed */ +#define PHY_CONF_10HDX 0x0010 /* 10 Mbit half duplex supported */ +#define PHY_CONF_10FDX 0x0020 /* 10 Mbit full duplex supported */ +#define PHY_CONF_100HDX 0x0040 /* 100 Mbit half duplex supported */ +#define PHY_CONF_100FDX 0x0080 /* 100 Mbit full duplex supported */ + +#define PHY_STAT_LINK 0x0100 /* 1 up - 0 down */ +#define PHY_STAT_FAULT 0x0200 /* 1 remote fault */ +#define PHY_STAT_ANC 0x0400 /* 1 auto-negotiation complete */ +#define PHY_STAT_SPMASK 0xf000 /* mask for speed */ +#define PHY_STAT_10HDX 0x1000 /* 10 Mbit half duplex selected */ +#define PHY_STAT_10FDX 0x2000 /* 10 Mbit full duplex selected */ +#define PHY_STAT_100HDX 0x4000 /* 100 Mbit half duplex selected */ +#define PHY_STAT_100FDX 0x8000 /* 100 Mbit full duplex selected */ - -static int -fec_enet_start_xmit(struct sk_buff *skb, struct net_device *dev) +static int fec_enet_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct fec_enet_private *fep; - volatile fec_t *fecp; - volatile cbd_t *bdp; - unsigned short status; + volatile fec_t *fecp; + volatile cbd_t *bdp; + unsigned short status; + unsigned long flags; fep = netdev_priv(dev); - fecp = (volatile fec_t*)dev->base_addr; + fecp = (volatile fec_t *)dev->base_addr; if (!fep->link) { /* Link is down or autonegotiation is in progress. */ return 1; } + spin_lock_irqsave(&fep->hw_lock, flags); /* Fill in a Tx ring entry */ bdp = fep->cur_tx; @@ -328,6 +306,7 @@ fec_enet_start_xmit(struct sk_buff *skb, * This should not happen, since dev->tbusy should be set. */ printk("%s: tx queue full!.\n", dev->name); + spin_unlock_irqrestore(&fep->hw_lock, flags); return 1; } #endif @@ -337,28 +316,29 @@ fec_enet_start_xmit(struct sk_buff *skb, status &= ~BD_ENET_TX_STATS; /* Set buffer length and buffer pointer. - */ + */ bdp->cbd_bufaddr = __pa(skb->data); bdp->cbd_datlen = skb->len; /* - * On some FEC implementations data must be aligned on - * 4-byte boundaries. Use bounce buffers to copy data - * and get it aligned. Ugh. + * On some FEC implementations data must be aligned on + * 4-byte boundaries. Use bounce buffers to copy data + * and get it aligned. Ugh. 
*/ if (bdp->cbd_bufaddr & 0x3) { unsigned int index; index = bdp - fep->tx_bd_base; - memcpy(fep->tx_bounce[index], (void *) bdp->cbd_bufaddr, bdp->cbd_datlen); + memcpy(fep->tx_bounce[index], (void *)bdp->cbd_bufaddr, + bdp->cbd_datlen); bdp->cbd_bufaddr = __pa(fep->tx_bounce[index]); } /* Save skb pointer. - */ + */ fep->tx_skbuff[fep->skb_cur] = skb; dev->stats.tx_bytes += skb->len; - fep->skb_cur = (fep->skb_cur+1) & TX_RING_MOD_MASK; + fep->skb_cur = (fep->skb_cur + 1) & TX_RING_MOD_MASK; /* Push the data cache so the CPM does not get stale memory * data. @@ -366,14 +346,13 @@ fec_enet_start_xmit(struct sk_buff *skb, flush_dcache_range((unsigned long)skb->data, (unsigned long)skb->data + skb->len); - spin_lock_irq(&fep->lock); /* Send it on its way. Tell FEC it's ready, interrupt when done, * it's the last BD of the frame, and to put the CRC on the end. */ status |= (BD_ENET_TX_READY | BD_ENET_TX_INTR - | BD_ENET_TX_LAST | BD_ENET_TX_TC); + | BD_ENET_TX_LAST | BD_ENET_TX_TC); bdp->cbd_sc = status; dev->trans_start = jiffies; @@ -382,7 +361,7 @@ fec_enet_start_xmit(struct sk_buff *skb, fecp->fec_x_des_active = 0; /* If this was the last BD in the ring, start at the beginning again. - */ + */ if (status & BD_ENET_TX_WRAP) { bdp = fep->tx_bd_base; } else { @@ -394,15 +373,14 @@ fec_enet_start_xmit(struct sk_buff *skb, netif_stop_queue(dev); } - fep->cur_tx = (cbd_t *)bdp; + fep->cur_tx = (cbd_t *) bdp; - spin_unlock_irq(&fep->lock); + spin_unlock_irqrestore(&fep->hw_lock, flags); return 0; } -static void -fec_timeout(struct net_device *dev) +static void fec_timeout(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); @@ -410,115 +388,200 @@ fec_timeout(struct net_device *dev) dev->stats.tx_errors++; #ifndef final_version { - int i; - cbd_t *bdp; + int i; + cbd_t *bdp; - printk("Ring data dump: cur_tx %lx%s, dirty_tx %lx cur_rx: %lx\n", - (unsigned long)fep->cur_tx, fep->tx_full ? " (full)" : "", - (unsigned long)fep->dirty_tx, - (unsigned long)fep->cur_rx); + printk + ("Ring data dump: cur_tx %lx%s, dirty_tx %lx cur_rx: %lx\n", + (unsigned long)fep->cur_tx, fep->tx_full ? " (full)" : "", + (unsigned long)fep->dirty_tx, (unsigned long)fep->cur_rx); - bdp = fep->tx_bd_base; - printk(" tx: %u buffers\n", TX_RING_SIZE); - for (i = 0 ; i < TX_RING_SIZE; i++) { - printk(" %08x: %04x %04x %08x\n", - (uint) bdp, - bdp->cbd_sc, - bdp->cbd_datlen, - (int) bdp->cbd_bufaddr); - bdp++; - } + bdp = fep->tx_bd_base; + printk(" tx: %u buffers\n", TX_RING_SIZE); + for (i = 0; i < TX_RING_SIZE; i++) { + printk(" %08x: %04x %04x %08x\n", + (uint) bdp, + bdp->cbd_sc, + bdp->cbd_datlen, (int)bdp->cbd_bufaddr); + bdp++; + } - bdp = fep->rx_bd_base; - printk(" rx: %lu buffers\n", (unsigned long) RX_RING_SIZE); - for (i = 0 ; i < RX_RING_SIZE; i++) { - printk(" %08x: %04x %04x %08x\n", - (uint) bdp, - bdp->cbd_sc, - bdp->cbd_datlen, - (int) bdp->cbd_bufaddr); - bdp++; - } + bdp = fep->rx_bd_base; + printk(" rx: %lu buffers\n", (unsigned long)RX_RING_SIZE); + for (i = 0; i < RX_RING_SIZE; i++) { + printk(" %08x: %04x %04x %08x\n", + (uint) bdp, + bdp->cbd_sc, + bdp->cbd_datlen, (int)bdp->cbd_bufaddr); + bdp++; + } } #endif fec_restart(dev, fep->full_duplex); netif_wake_queue(dev); } -/* The interrupt handler. - * This is called from the MPC core interrupt. +/* During a receive, the cur_rx points to the current incoming buffer. 
+ * When we update through the ring, if the next incoming buffer has + * not been given to the system, we just set the empty indicator, + * effectively tossing the packet. */ -static irqreturn_t -fec_enet_interrupt(int irq, void * dev_id) +static void fec_enet_rx(struct net_device *dev) { - struct net_device *dev = dev_id; - volatile fec_t *fecp; - uint int_events; - int handled = 0; + struct fec_enet_private *fep; + volatile fec_t *fecp; + volatile cbd_t *bdp; + unsigned short status; + struct sk_buff *skb; + ushort pkt_len; + __u8 *data; - fecp = (volatile fec_t*)dev->base_addr; +#ifdef CONFIG_M532x + flush_cache_all(); +#endif - /* Get the interrupt events that caused us to be here. - */ - while ((int_events = fecp->fec_ievent) != 0) { - fecp->fec_ievent = int_events; + fep = netdev_priv(dev); + spin_lock_irq(&fep->hw_lock); + fecp = (volatile fec_t *)dev->base_addr; - /* Handle receive event in its own function. + /* First, grab all of the stats for the incoming packet. + * These get messed up if we get called due to a busy condition. + */ + bdp = fep->cur_rx; + + while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) { + +#ifndef final_version + /* Since we have allocated space to hold a complete frame, + * the last indicator should be set. */ - if (int_events & FEC_ENET_RXF) { - handled = 1; - fec_enet_rx(dev); + if ((status & BD_ENET_RX_LAST) == 0) + printk("FEC ENET: rcv is not +last\n"); +#endif + + if (!fep->opened) + goto rx_processing_done; + + /* Check for errors. */ + if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO | + BD_ENET_RX_CR | BD_ENET_RX_OV)) { + dev->stats.rx_errors++; + if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH)) { + /* Frame too long or too short. */ + dev->stats.rx_length_errors++; + } + if (status & BD_ENET_RX_NO) /* Frame alignment */ + dev->stats.rx_frame_errors++; + if (status & BD_ENET_RX_CR) /* CRC Error */ + dev->stats.rx_crc_errors++; + if (status & BD_ENET_RX_OV) /* FIFO overrun */ + dev->stats.rx_fifo_errors++; } - /* Transmit OK, or non-fatal error. Update the buffer - descriptors. FEC handles all errors, we just discover - them as part of the transmit process. - */ - if (int_events & FEC_ENET_TXF) { - handled = 1; - fec_enet_tx(dev); + /* Report late collisions as a frame error. + * On this error, the BD is closed, but we don't know what we + * have in the buffer. So, just drop this frame on the floor. + */ + if (status & BD_ENET_RX_CL) { + dev->stats.rx_errors++; + dev->stats.rx_frame_errors++; + goto rx_processing_done; } - if (int_events & FEC_ENET_MII) { - handled = 1; - fec_enet_mii(dev); + /* Process the incoming frame. + */ + dev->stats.rx_packets++; + pkt_len = bdp->cbd_datlen; + dev->stats.rx_bytes += pkt_len; + data = (__u8 *) __va(bdp->cbd_bufaddr); + + /* This does 16 byte alignment, exactly what we need. + * The packet length includes FCS, but we don't want to + * include that when passing upstream as it messes up + * bridging applications. + */ + skb = dev_alloc_skb(pkt_len - 4); + + if (skb == NULL) { + printk("%s: Memory squeeze, dropping packet.\n", + dev->name); + dev->stats.rx_dropped++; + } else { + skb_put(skb, pkt_len - 4); /* Make room */ + skb_copy_to_linear_data(skb, data, pkt_len - 4); + skb->protocol = eth_type_trans(skb, dev); + netif_rx(skb); } +rx_processing_done: - } - return IRQ_RETVAL(handled); -} + /* Clear the status flags for this buffer. + */ + status &= ~BD_ENET_RX_STATS; + /* Mark the buffer empty. 
+ */ + status |= BD_ENET_RX_EMPTY; + bdp->cbd_sc = status; -static void -fec_enet_tx(struct net_device *dev) + /* Update BD pointer to next entry. + */ + if (status & BD_ENET_RX_WRAP) + bdp = fep->rx_bd_base; + else + bdp++; + +#if 1 + /* Doing this here will keep the FEC running while we process + * incoming frames. On a heavily loaded network, we should be + * able to keep up at the expense of system resources. + */ + fecp->fec_r_des_active = 0; +#endif + } /* while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) */ + fep->cur_rx = (cbd_t *) bdp; + +#if 0 + /* Doing this here will allow us to process all frames in the + * ring before the FEC is allowed to put more there. On a heavily + * loaded network, some frames may be lost. Unfortunately, this + * increases the interrupt overhead since we can potentially work + * our way back to the interrupt return only to come right back + * here. + */ + fecp->fec_r_des_active = 0; +#endif + spin_unlock_irq(&fep->hw_lock); +} + +static void fec_enet_tx(struct net_device *dev) { - struct fec_enet_private *fep; - volatile cbd_t *bdp; + struct fec_enet_private *fep; + volatile cbd_t *bdp; unsigned short status; - struct sk_buff *skb; + struct sk_buff *skb; fep = netdev_priv(dev); - spin_lock(&fep->lock); + spin_lock_irq(&fep->hw_lock); bdp = fep->dirty_tx; while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) { - if (bdp == fep->cur_tx && fep->tx_full == 0) break; + if (bdp == fep->cur_tx && fep->tx_full == 0) + break; skb = fep->tx_skbuff[fep->skb_dirty]; /* Check for errors. */ if (status & (BD_ENET_TX_HB | BD_ENET_TX_LC | - BD_ENET_TX_RL | BD_ENET_TX_UN | - BD_ENET_TX_CSL)) { + BD_ENET_TX_RL | BD_ENET_TX_UN | BD_ENET_TX_CSL)) { dev->stats.tx_errors++; - if (status & BD_ENET_TX_HB) /* No heartbeat */ + if (status & BD_ENET_TX_HB) /* No heartbeat */ dev->stats.tx_heartbeat_errors++; - if (status & BD_ENET_TX_LC) /* Late collision */ + if (status & BD_ENET_TX_LC) /* Late collision */ dev->stats.tx_window_errors++; - if (status & BD_ENET_TX_RL) /* Retrans limit */ + if (status & BD_ENET_TX_RL) /* Retrans limit */ dev->stats.tx_aborted_errors++; - if (status & BD_ENET_TX_UN) /* Underrun */ + if (status & BD_ENET_TX_UN) /* Underrun */ dev->stats.tx_fifo_errors++; - if (status & BD_ENET_TX_CSL) /* Carrier lost */ + if (status & BD_ENET_TX_CSL) /* Carrier lost */ dev->stats.tx_carrier_errors++; } else { dev->stats.tx_packets++; @@ -556,164 +619,32 @@ fec_enet_tx(struct net_device *dev) netif_wake_queue(dev); } } - fep->dirty_tx = (cbd_t *)bdp; - spin_unlock(&fep->lock); -} - - -/* During a receive, the cur_rx points to the current incoming buffer. - * When we update through the ring, if the next incoming buffer has - * not been given to the system, we just set the empty indicator, - * effectively tossing the packet. - */ -static void -fec_enet_rx(struct net_device *dev) -{ - struct fec_enet_private *fep; - volatile fec_t *fecp; - volatile cbd_t *bdp; - unsigned short status; - struct sk_buff *skb; - ushort pkt_len; - __u8 *data; - -#ifdef CONFIG_M532x - flush_cache_all(); -#endif - - fep = netdev_priv(dev); - fecp = (volatile fec_t*)dev->base_addr; - - /* First, grab all of the stats for the incoming packet. - * These get messed up if we get called due to a busy condition. - */ - bdp = fep->cur_rx; - -while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) { - -#ifndef final_version - /* Since we have allocated space to hold a complete frame, - * the last indicator should be set. 
- */ - if ((status & BD_ENET_RX_LAST) == 0) - printk("FEC ENET: rcv is not +last\n"); -#endif - - if (!fep->opened) - goto rx_processing_done; - - /* Check for errors. */ - if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO | - BD_ENET_RX_CR | BD_ENET_RX_OV)) { - dev->stats.rx_errors++; - if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH)) { - /* Frame too long or too short. */ - dev->stats.rx_length_errors++; - } - if (status & BD_ENET_RX_NO) /* Frame alignment */ - dev->stats.rx_frame_errors++; - if (status & BD_ENET_RX_CR) /* CRC Error */ - dev->stats.rx_crc_errors++; - if (status & BD_ENET_RX_OV) /* FIFO overrun */ - dev->stats.rx_fifo_errors++; - } - - /* Report late collisions as a frame error. - * On this error, the BD is closed, but we don't know what we - * have in the buffer. So, just drop this frame on the floor. - */ - if (status & BD_ENET_RX_CL) { - dev->stats.rx_errors++; - dev->stats.rx_frame_errors++; - goto rx_processing_done; - } - - /* Process the incoming frame. - */ - dev->stats.rx_packets++; - pkt_len = bdp->cbd_datlen; - dev->stats.rx_bytes += pkt_len; - data = (__u8*)__va(bdp->cbd_bufaddr); - - /* This does 16 byte alignment, exactly what we need. - * The packet length includes FCS, but we don't want to - * include that when passing upstream as it messes up - * bridging applications. - */ - skb = dev_alloc_skb(pkt_len-4); - - if (skb == NULL) { - printk("%s: Memory squeeze, dropping packet.\n", dev->name); - dev->stats.rx_dropped++; - } else { - skb_put(skb,pkt_len-4); /* Make room */ - skb_copy_to_linear_data(skb, data, pkt_len-4); - skb->protocol=eth_type_trans(skb,dev); - netif_rx(skb); - } - rx_processing_done: - - /* Clear the status flags for this buffer. - */ - status &= ~BD_ENET_RX_STATS; - - /* Mark the buffer empty. - */ - status |= BD_ENET_RX_EMPTY; - bdp->cbd_sc = status; - - /* Update BD pointer to next entry. - */ - if (status & BD_ENET_RX_WRAP) - bdp = fep->rx_bd_base; - else - bdp++; - -#if 1 - /* Doing this here will keep the FEC running while we process - * incoming frames. On a heavily loaded network, we should be - * able to keep up at the expense of system resources. - */ - fecp->fec_r_des_active = 0; -#endif - } /* while (!((status = bdp->cbd_sc) & BD_ENET_RX_EMPTY)) */ - fep->cur_rx = (cbd_t *)bdp; - -#if 0 - /* Doing this here will allow us to process all frames in the - * ring before the FEC is allowed to put more there. On a heavily - * loaded network, some frames may be lost. Unfortunately, this - * increases the interrupt overhead since we can potentially work - * our way back to the interrupt return only to come right back - * here. 
- */ - fecp->fec_r_des_active = 0; -#endif + fep->dirty_tx = (cbd_t *) bdp; + spin_unlock_irq(&fep->hw_lock); } - /* called from interrupt context */ -static void -fec_enet_mii(struct net_device *dev) +static void fec_enet_mii(struct net_device *dev) { - struct fec_enet_private *fep; - volatile fec_t *ep; - mii_list_t *mip; - uint mii_reg; + struct fec_enet_private *fep; + volatile fec_t *ep; + mii_list_t *mip; + uint mii_reg; + mii_func *mii_func = NULL; fep = netdev_priv(dev); + spin_lock_irq(&fep->mii_lock); + ep = fep->hwp; mii_reg = ep->fec_mii_data; - spin_lock(&fep->lock); - if ((mip = mii_head) == NULL) { printk("MII and no head!\n"); goto unlock; } if (mip->mii_func != NULL) - (*(mip->mii_func))(mii_reg, dev); + mii_func = *(mip->mii_func); mii_head = mip->mii_next; mip->mii_next = mii_free; @@ -723,26 +654,71 @@ fec_enet_mii(struct net_device *dev) ep->fec_mii_data = mip->mii_regval; unlock: - spin_unlock(&fep->lock); + spin_unlock_irq(&fep->mii_lock); + if (mii_func) + mii_func(mii_reg, dev); } -static int -mii_queue(struct net_device *dev, int regval, void (*func)(uint, struct net_device *)) +/* The interrupt handler. + * This is called from the MPC core interrupt. + */ +static irqreturn_t fec_enet_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + volatile fec_t *fecp; + uint int_events; + irqreturn_t ret = IRQ_NONE; + + fecp = (volatile fec_t *)dev->base_addr; + + /* Get the interrupt events that caused us to be here. + */ + do { + int_events = fecp->fec_ievent; + fecp->fec_ievent = int_events; + + /* Handle receive event in its own function. + */ + if (int_events & FEC_ENET_RXF) { + ret = IRQ_HANDLED; + fec_enet_rx(dev); + } + + /* Transmit OK, or non-fatal error. Update the buffer + descriptors. FEC handles all errors, we just discover + them as part of the transmit process. + */ + if (int_events & FEC_ENET_TXF) { + ret = IRQ_HANDLED; + fec_enet_tx(dev); + } + + if (int_events & FEC_ENET_MII) { + ret = IRQ_HANDLED; + fec_enet_mii(dev); + } + + } while (int_events); + + return ret; +} + + +static int mii_queue(struct net_device *dev, int regval, mii_func *func) { struct fec_enet_private *fep; - unsigned long flags; - mii_list_t *mip; - int retval; + unsigned long flags; + mii_list_t *mip; + int retval; /* Add PHY address to register command. 
- */ + */ fep = netdev_priv(dev); - regval |= fep->phy_addr << 23; + spin_lock_irqsave(&fep->mii_lock, flags); + regval |= fep->phy_addr << 23; retval = 0; - spin_lock_irqsave(&fep->lock,flags); - if ((mip = mii_free) != NULL) { mii_free = mip->mii_next; mip->mii_regval = regval; @@ -759,14 +735,13 @@ mii_queue(struct net_device *dev, int re retval = 1; } - spin_unlock_irqrestore(&fep->lock,flags); - - return(retval); + spin_unlock_irqrestore(&fep->mii_lock, flags); + return retval; } static void mii_do_cmd(struct net_device *dev, const phy_cmd_t *c) { - if(!c) + if (!c) return; for (; c->mii_data != mk_mii_end; c++) @@ -827,11 +802,11 @@ static void mii_parse_anar(uint mii_reg, /* ------------------------------------------------------------------------- */ /* The Level one LXT970 is used by many boards */ -#define MII_LXT970_MIRROR 16 /* Mirror register */ -#define MII_LXT970_IER 17 /* Interrupt Enable Register */ -#define MII_LXT970_ISR 18 /* Interrupt Status Register */ -#define MII_LXT970_CONFIG 19 /* Configuration Register */ -#define MII_LXT970_CSR 20 /* Chip Status Register */ +#define MII_LXT970_MIRROR 16 /* Mirror register */ +#define MII_LXT970_IER 17 /* Interrupt Enable Register */ +#define MII_LXT970_ISR 18 /* Interrupt Status Register */ +#define MII_LXT970_CONFIG 19 /* Configuration Register */ +#define MII_LXT970_CSR 20 /* Chip Status Register */ static void mii_parse_lxt970_csr(uint mii_reg, struct net_device *dev) { @@ -855,28 +830,28 @@ static void mii_parse_lxt970_csr(uint mi } static phy_cmd_t const phy_cmd_lxt970_config[] = { - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_lxt970_startup[] = { /* enable interrupts */ - { mk_mii_write(MII_LXT970_IER, 0x0002), NULL }, - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_end, } - }; + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_lxt970_startup[] = { /* enable interrupts */ + {mk_mii_write(MII_LXT970_IER, 0x0002), NULL}, + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_end,} +}; static phy_cmd_t const phy_cmd_lxt970_ack_int[] = { - /* read SR and ISR to acknowledge */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_read(MII_LXT970_ISR), NULL }, - - /* find out the current status */ - { mk_mii_read(MII_LXT970_CSR), mii_parse_lxt970_csr }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_lxt970_shutdown[] = { /* disable interrupts */ - { mk_mii_write(MII_LXT970_IER, 0x0000), NULL }, - { mk_mii_end, } - }; + /* read SR and ISR to acknowledge */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_read(MII_LXT970_ISR), NULL}, + + /* find out the current status */ + {mk_mii_read(MII_LXT970_CSR), mii_parse_lxt970_csr}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_lxt970_shutdown[] = { /* disable interrupts */ + {mk_mii_write(MII_LXT970_IER, 0x0000), NULL}, + {mk_mii_end,} +}; static phy_info_t const phy_info_lxt970 = { .id = 0x07810000, .name = "LXT970", @@ -891,12 +866,12 @@ static phy_info_t const phy_info_lxt970 /* register definitions for the 971 */ -#define MII_LXT971_PCR 16 /* Port Control Register */ -#define MII_LXT971_SR2 17 /* Status Register 2 */ -#define MII_LXT971_IER 18 /* Interrupt Enable Register */ -#define MII_LXT971_ISR 19 /* Interrupt Status Register */ -#define MII_LXT971_LCR 20 /* LED Control Register */ -#define MII_LXT971_TCR 30 /* Transmit 
Control Register */ +#define MII_LXT971_PCR 16 /* Port Control Register */ +#define MII_LXT971_SR2 17 /* Status Register 2 */ +#define MII_LXT971_IER 18 /* Interrupt Enable Register */ +#define MII_LXT971_ISR 19 /* Interrupt Status Register */ +#define MII_LXT971_LCR 20 /* LED Control Register */ +#define MII_LXT971_TCR 30 /* Transmit Control Register */ /* * I had some nice ideas of running the MDIO faster... @@ -938,35 +913,35 @@ static void mii_parse_lxt971_sr2(uint mi } static phy_cmd_t const phy_cmd_lxt971_config[] = { - /* limit to 10MBit because my prototype board - * doesn't work with 100. */ - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_read(MII_LXT971_SR2), mii_parse_lxt971_sr2 }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_lxt971_startup[] = { /* enable interrupts */ - { mk_mii_write(MII_LXT971_IER, 0x00f2), NULL }, - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_write(MII_LXT971_LCR, 0xd422), NULL }, /* LED config */ - /* Somehow does the 971 tell me that the link is down - * the first read after power-up. - * read here to get a valid value in ack_int */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_end, } - }; + /* limit to 10MBit because my prototype board + * doesn't work with 100. */ + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_read(MII_LXT971_SR2), mii_parse_lxt971_sr2}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_lxt971_startup[] = { /* enable interrupts */ + {mk_mii_write(MII_LXT971_IER, 0x00f2), NULL}, + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_write(MII_LXT971_LCR, 0xd422), NULL}, /* LED config */ + /* Somehow does the 971 tell me that the link is down + * the first read after power-up. + * read here to get a valid value in ack_int */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_end,} +}; static phy_cmd_t const phy_cmd_lxt971_ack_int[] = { - /* acknowledge the int before reading status ! */ - { mk_mii_read(MII_LXT971_ISR), NULL }, - /* find out the current status */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_read(MII_LXT971_SR2), mii_parse_lxt971_sr2 }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_lxt971_shutdown[] = { /* disable interrupts */ - { mk_mii_write(MII_LXT971_IER, 0x0000), NULL }, - { mk_mii_end, } - }; + /* acknowledge the int before reading status ! */ + {mk_mii_read(MII_LXT971_ISR), NULL}, + /* find out the current status */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_read(MII_LXT971_SR2), mii_parse_lxt971_sr2}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_lxt971_shutdown[] = { /* disable interrupts */ + {mk_mii_write(MII_LXT971_IER, 0x0000), NULL}, + {mk_mii_end,} +}; static phy_info_t const phy_info_lxt971 = { .id = 0x0001378e, .name = "LXT971", @@ -981,12 +956,12 @@ static phy_info_t const phy_info_lxt971 /* register definitions */ -#define MII_QS6612_MCR 17 /* Mode Control Register */ -#define MII_QS6612_FTR 27 /* Factory Test Register */ -#define MII_QS6612_MCO 28 /* Misc. Control Register */ -#define MII_QS6612_ISR 29 /* Interrupt Source Register */ -#define MII_QS6612_IMR 30 /* Interrupt Mask Register */ -#define MII_QS6612_PCR 31 /* 100BaseTx PHY Control Reg. */ +#define MII_QS6612_MCR 17 /* Mode Control Register */ +#define MII_QS6612_FTR 27 /* Factory Test Register */ +#define MII_QS6612_MCO 28 /* Misc. 
Control Register */ +#define MII_QS6612_ISR 29 /* Interrupt Source Register */ +#define MII_QS6612_IMR 30 /* Interrupt Mask Register */ +#define MII_QS6612_PCR 31 /* 100BaseTx PHY Control Reg. */ static void mii_parse_qs6612_pcr(uint mii_reg, struct net_device *dev) { @@ -996,46 +971,54 @@ static void mii_parse_qs6612_pcr(uint mi status = *s & ~(PHY_STAT_SPMASK); - switch((mii_reg >> 2) & 7) { - case 1: status |= PHY_STAT_10HDX; break; - case 2: status |= PHY_STAT_100HDX; break; - case 5: status |= PHY_STAT_10FDX; break; - case 6: status |= PHY_STAT_100FDX; break; -} + switch ((mii_reg >> 2) & 7) { + case 1: + status |= PHY_STAT_10HDX; + break; + case 2: + status |= PHY_STAT_100HDX; + break; + case 5: + status |= PHY_STAT_10FDX; + break; + case 6: + status |= PHY_STAT_100FDX; + break; + } *s = status; } static phy_cmd_t const phy_cmd_qs6612_config[] = { - /* The PHY powers up isolated on the RPX, - * so send a command to allow operation. - */ - { mk_mii_write(MII_QS6612_PCR, 0x0dc0), NULL }, + /* The PHY powers up isolated on the RPX, + * so send a command to allow operation. + */ + {mk_mii_write(MII_QS6612_PCR, 0x0dc0), NULL}, - /* parse cr and anar to get some info */ - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_qs6612_startup[] = { /* enable interrupts */ - { mk_mii_write(MII_QS6612_IMR, 0x003a), NULL }, - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_end, } - }; + /* parse cr and anar to get some info */ + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_qs6612_startup[] = { /* enable interrupts */ + {mk_mii_write(MII_QS6612_IMR, 0x003a), NULL}, + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_end,} +}; static phy_cmd_t const phy_cmd_qs6612_ack_int[] = { - /* we need to read ISR, SR and ANER to acknowledge */ - { mk_mii_read(MII_QS6612_ISR), NULL }, - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_read(MII_REG_ANER), NULL }, - - /* read pcr to get info */ - { mk_mii_read(MII_QS6612_PCR), mii_parse_qs6612_pcr }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_qs6612_shutdown[] = { /* disable interrupts */ - { mk_mii_write(MII_QS6612_IMR, 0x0000), NULL }, - { mk_mii_end, } - }; + /* we need to read ISR, SR and ANER to acknowledge */ + {mk_mii_read(MII_QS6612_ISR), NULL}, + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_read(MII_REG_ANER), NULL}, + + /* read pcr to get info */ + {mk_mii_read(MII_QS6612_PCR), mii_parse_qs6612_pcr}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_qs6612_shutdown[] = { /* disable interrupts */ + {mk_mii_write(MII_QS6612_IMR, 0x0000), NULL}, + {mk_mii_end,} +}; static phy_info_t const phy_info_qs6612 = { .id = 0x00181440, .name = "QS6612", @@ -1050,13 +1033,13 @@ static phy_info_t const phy_info_qs6612 /* register definitions for the 874 */ -#define MII_AM79C874_MFR 16 /* Miscellaneous Feature Register */ -#define MII_AM79C874_ICSR 17 /* Interrupt/Status Register */ -#define MII_AM79C874_DR 18 /* Diagnostic Register */ -#define MII_AM79C874_PMLR 19 /* Power and Loopback Register */ -#define MII_AM79C874_MCR 21 /* ModeControl Register */ -#define MII_AM79C874_DC 23 /* Disconnect Counter */ -#define MII_AM79C874_REC 24 /* Recieve Error Counter */ +#define MII_AM79C874_MFR 16 /* Miscellaneous Feature Register */ +#define MII_AM79C874_ICSR 17 /* Interrupt/Status Register */ +#define 
MII_AM79C874_DR 18 /* Diagnostic Register */ +#define MII_AM79C874_PMLR 19 /* Power and Loopback Register */ +#define MII_AM79C874_MCR 21 /* ModeControl Register */ +#define MII_AM79C874_DC 23 /* Disconnect Counter */ +#define MII_AM79C874_REC 24 /* Recieve Error Counter */ static void mii_parse_am79c874_dr(uint mii_reg, struct net_device *dev) { @@ -1069,37 +1052,39 @@ static void mii_parse_am79c874_dr(uint m if (mii_reg & 0x0080) status |= PHY_STAT_ANC; if (mii_reg & 0x0400) - status |= ((mii_reg & 0x0800) ? PHY_STAT_100FDX : PHY_STAT_100HDX); + status |= + ((mii_reg & 0x0800) ? PHY_STAT_100FDX : PHY_STAT_100HDX); else - status |= ((mii_reg & 0x0800) ? PHY_STAT_10FDX : PHY_STAT_10HDX); + status |= + ((mii_reg & 0x0800) ? PHY_STAT_10FDX : PHY_STAT_10HDX); *s = status; } static phy_cmd_t const phy_cmd_am79c874_config[] = { - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_read(MII_AM79C874_DR), mii_parse_am79c874_dr }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_am79c874_startup[] = { /* enable interrupts */ - { mk_mii_write(MII_AM79C874_ICSR, 0xff00), NULL }, - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_end, } - }; + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_read(MII_AM79C874_DR), mii_parse_am79c874_dr}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_am79c874_startup[] = { /* enable interrupts */ + {mk_mii_write(MII_AM79C874_ICSR, 0xff00), NULL}, + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_end,} +}; static phy_cmd_t const phy_cmd_am79c874_ack_int[] = { - /* find out the current status */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_read(MII_AM79C874_DR), mii_parse_am79c874_dr }, - /* we only need to read ISR to acknowledge */ - { mk_mii_read(MII_AM79C874_ICSR), NULL }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_am79c874_shutdown[] = { /* disable interrupts */ - { mk_mii_write(MII_AM79C874_ICSR, 0x0000), NULL }, - { mk_mii_end, } - }; + /* find out the current status */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_read(MII_AM79C874_DR), mii_parse_am79c874_dr}, + /* we only need to read ISR to acknowledge */ + {mk_mii_read(MII_AM79C874_ICSR), NULL}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_am79c874_shutdown[] = { /* disable interrupts */ + {mk_mii_write(MII_AM79C874_ICSR, 0x0000), NULL}, + {mk_mii_end,} +}; static phy_info_t const phy_info_am79c874 = { .id = 0x00022561, .name = "AM79C874", @@ -1109,7 +1094,6 @@ static phy_info_t const phy_info_am79c87 .shutdown = phy_cmd_am79c874_shutdown }; - /* ------------------------------------------------------------------------- */ /* Kendin KS8721BL phy */ @@ -1120,27 +1104,27 @@ static phy_info_t const phy_info_am79c87 #define MII_KS8721BL_PHYCR 31 static phy_cmd_t const phy_cmd_ks8721bl_config[] = { - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_ks8721bl_startup[] = { /* enable interrupts */ - { mk_mii_write(MII_KS8721BL_ICSR, 0xff00), NULL }, - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_end, } - }; + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_ks8721bl_startup[] = 
{ /* enable interrupts */ + {mk_mii_write(MII_KS8721BL_ICSR, 0xff00), NULL}, + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_end,} +}; static phy_cmd_t const phy_cmd_ks8721bl_ack_int[] = { - /* find out the current status */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - /* we only need to read ISR to acknowledge */ - { mk_mii_read(MII_KS8721BL_ICSR), NULL }, - { mk_mii_end, } - }; -static phy_cmd_t const phy_cmd_ks8721bl_shutdown[] = { /* disable interrupts */ - { mk_mii_write(MII_KS8721BL_ICSR, 0x0000), NULL }, - { mk_mii_end, } - }; + /* find out the current status */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + /* we only need to read ISR to acknowledge */ + {mk_mii_read(MII_KS8721BL_ICSR), NULL}, + {mk_mii_end,} +}; +static phy_cmd_t const phy_cmd_ks8721bl_shutdown[] = { /* disable interrupts */ + {mk_mii_write(MII_KS8721BL_ICSR, 0x0000), NULL}, + {mk_mii_end,} +}; static phy_info_t const phy_info_ks8721bl = { .id = 0x00022161, .name = "KS8721BL", @@ -1153,7 +1137,7 @@ static phy_info_t const phy_info_ks8721b /* ------------------------------------------------------------------------- */ /* register definitions for the DP83848 */ -#define MII_DP8384X_PHYSTST 16 /* PHY Status Register */ +#define MII_DP8384X_PHYSTST 16 /* PHY Status Register */ static void mii_parse_dp8384x_sr2(uint mii_reg, struct net_device *dev) { @@ -1169,15 +1153,19 @@ static void mii_parse_dp8384x_sr2(uint m } else fep->link = 0; /* Status of link */ - if (mii_reg & 0x0010) /* Autonegotioation complete */ + if (mii_reg & 0x0010) /* Autonegotioation complete */ *s |= PHY_STAT_ANC; - if (mii_reg & 0x0002) { /* 10MBps? */ - if (mii_reg & 0x0004) /* Full Duplex? */ + /* 10MBps? */ + if (mii_reg & 0x0002) { + /* Full Duplex? */ + if (mii_reg & 0x0004) *s |= PHY_STAT_10FDX; else *s |= PHY_STAT_10HDX; - } else { /* 100 Mbps? */ - if (mii_reg & 0x0004) /* Full Duplex? */ + } else { + /* 100 Mbps then */ + /* Full Duplex? 
*/ + if (mii_reg & 0x0004) *s |= PHY_STAT_100FDX; else *s |= PHY_STAT_100HDX; @@ -1186,32 +1174,33 @@ static void mii_parse_dp8384x_sr2(uint m *s |= PHY_STAT_FAULT; } -static phy_info_t phy_info_dp83848= { +static phy_info_t phy_info_dp83848 = { 0x020005c9, "DP83848", - (const phy_cmd_t []) { /* config */ - { mk_mii_read(MII_REG_CR), mii_parse_cr }, - { mk_mii_read(MII_REG_ANAR), mii_parse_anar }, - { mk_mii_read(MII_DP8384X_PHYSTST), mii_parse_dp8384x_sr2 }, - { mk_mii_end, } + (const phy_cmd_t[]){ /* config */ + {mk_mii_read(MII_REG_CR), mii_parse_cr}, + {mk_mii_read(MII_REG_ANAR), mii_parse_anar}, + {mk_mii_read(MII_DP8384X_PHYSTST), + mii_parse_dp8384x_sr2}, + {mk_mii_end,} }, - (const phy_cmd_t []) { /* startup - enable interrupts */ - { mk_mii_write(MII_REG_CR, 0x1200), NULL }, /* autonegotiate */ - { mk_mii_read(MII_REG_SR), mii_parse_sr }, - { mk_mii_end, } + (const phy_cmd_t[]){ /* startup - enable interrupts */ + {mk_mii_write(MII_REG_CR, 0x1200), NULL}, /* autonegotiate */ + {mk_mii_read(MII_REG_SR), mii_parse_sr}, + {mk_mii_end,} }, - (const phy_cmd_t []) { /* ack_int - never happens, no interrupt */ - { mk_mii_end, } + (const phy_cmd_t[]){ /* ack_int - never happens, no interrupt */ + {mk_mii_end,} }, - (const phy_cmd_t []) { /* shutdown */ - { mk_mii_end, } + (const phy_cmd_t[]){ /* shutdown */ + {mk_mii_end,} }, }; /* ------------------------------------------------------------------------- */ -static phy_info_t const * const phy_info[] = { +static phy_info_t const *const phy_info[] = { &phy_info_lxt970, &phy_info_lxt971, &phy_info_qs6612, @@ -1221,22 +1210,38 @@ static phy_info_t const * const phy_info NULL }; -/* ------------------------------------------------------------------------- */ -#if !defined(CONFIG_M532x) -#ifdef CONFIG_RPXCLASSIC -static void -mii_link_interrupt(void *dev_id); -#else -static irqreturn_t -mii_link_interrupt(int irq, void * dev_id); -#endif +#if defined(CONFIG_M5272) +static void fec_phy_ack_intr(void) +{ + volatile unsigned long *icrp; + /* Acknowledge the interrupt */ + icrp = (volatile unsigned long *)(MCF_MBAR + MCFSIM_ICR1); + *icrp = 0x0d000000; +} + +/* This interrupt occurs when the PHY detects a link change. +*/ +static irqreturn_t mii_link_interrupt(int irq, void *dev_id) +{ + struct net_device *dev = dev_id; + struct fec_enet_private *fep = netdev_priv(dev); + + fec_phy_ack_intr(); + +#if 0 + disable_irq(fep->mii_irq); /* disable now, enable later */ #endif -#if defined(CONFIG_M5272) + mii_do_cmd(dev, fep->phy->ack_int); + mii_do_cmd(dev, phy_cmd_relink); /* restart and display status */ + + return IRQ_HANDLED; +} + /* * Code specific to Coldfire 5272 setup. */ -static void __inline__ fec_request_intrs(struct net_device *dev) +static void __init fec_request_intrs(struct net_device *dev) { volatile unsigned long *icrp; static const struct idesc { @@ -1244,27 +1249,36 @@ static void __inline__ fec_request_intrs unsigned short irq; irq_handler_t handler; } *idp, id[] = { - { "fec(RX)", 86, fec_enet_interrupt }, - { "fec(TX)", 87, fec_enet_interrupt }, - { "fec(OTHER)", 88, fec_enet_interrupt }, - { "fec(MII)", 66, mii_link_interrupt }, - { NULL }, + /* + * Available but not allocated because not handled: + * fec(OTHER) 88 + */ + { "fec(RX)", 86, fec_enet_interrupt}, + { "fec(TX)", 87, fec_enet_interrupt}, + { "fec(MII)", 66, mii_link_interrupt}, + { NULL, 0 }, }; /* Setup interrupt handlers. 
*/ for (idp = id; idp->name; idp++) { - if (request_irq(idp->irq, idp->handler, 0, idp->name, dev) != 0) - printk("FEC: Could not allocate %s IRQ(%d)!\n", idp->name, idp->irq); + int ret; + + ret =request_irq(idp->irq, idp->handler, IRQF_DISABLED, idp->name, + dev); + if (ret) + printk("FEC: Could not allocate %s IRQ(%d)!\n", + idp->name, idp->irq); } /* Unmask interrupt at ColdFire 5272 SIM */ - icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR3); + icrp = (volatile unsigned long *)(MCF_MBAR + MCFSIM_ICR3); *icrp = 0x00000ddd; - icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR1); + icrp = (volatile unsigned long *)(MCF_MBAR + MCFSIM_ICR1); *icrp = 0x0d000000; } -static void __inline__ fec_set_mii(struct net_device *dev, struct fec_enet_private *fep) +static void __init fec_set_mii(struct net_device *dev, + struct fec_enet_private *fep) { volatile fec_t *fecp; @@ -1282,7 +1296,7 @@ static void __inline__ fec_set_mii(struc fec_restart(dev, 0); } -static void __inline__ fec_get_mac(struct net_device *dev) +static void __init fec_get_mac(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); volatile fec_t *fecp; @@ -1303,8 +1317,8 @@ static void __inline__ fec_get_mac(struc (iap[3] == 0xff) && (iap[4] == 0xff) && (iap[5] == 0xff)) iap = fec_mac_default; } else { - *((unsigned long *) &tmpaddr[0]) = fecp->fec_addr_low; - *((unsigned short *) &tmpaddr[4]) = (fecp->fec_addr_high >> 16); + *((unsigned long *)&tmpaddr[0]) = fecp->fec_addr_low; + *((unsigned short *)&tmpaddr[4]) = (fecp->fec_addr_high >> 16); iap = &tmpaddr[0]; } @@ -1312,36 +1326,29 @@ static void __inline__ fec_get_mac(struc /* Adjust MAC if using default MAC address */ if (iap == fec_mac_default) - dev->dev_addr[ETH_ALEN-1] = fec_mac_default[ETH_ALEN-1] + fep->index; + dev->dev_addr[ETH_ALEN - 1] = + fec_mac_default[ETH_ALEN - 1] + fep->index; } -static void __inline__ fec_enable_phy_intr(void) +static void fec_enable_phy_intr(void) { } -static void __inline__ fec_disable_phy_intr(void) +static void fec_disable_phy_intr(void) { volatile unsigned long *icrp; - icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR1); + icrp = (volatile unsigned long *)(MCF_MBAR + MCFSIM_ICR1); *icrp = 0x08000000; } -static void __inline__ fec_phy_ack_intr(void) -{ - volatile unsigned long *icrp; - /* Acknowledge the interrupt */ - icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR1); - *icrp = 0x0d000000; -} - -static void __inline__ fec_localhw_setup(void) +static void fec_localhw_setup(void) { } /* * Do not need to make region uncached on 5272. */ -static void __inline__ fec_uncache(unsigned long addr) +static void __init fec_uncache(unsigned long addr) { } @@ -1353,7 +1360,7 @@ static void __inline__ fec_uncache(unsig * Code specific to Coldfire 5230/5231/5232/5234/5235, * the 5270/5271/5274/5275 and 5280/5282 setups. 
*/ -static void __inline__ fec_request_intrs(struct net_device *dev) +static void __init fec_request_intrs(struct net_device *dev) { struct fec_enet_private *fep; int b; @@ -1361,20 +1368,16 @@ static void __inline__ fec_request_intrs char *name; unsigned short irq; } *idp, id[] = { - { "fec(TXF)", 23 }, - { "fec(TXB)", 24 }, - { "fec(TXFIFO)", 25 }, - { "fec(TXCR)", 26 }, - { "fec(RXF)", 27 }, - { "fec(RXB)", 28 }, - { "fec(MII)", 29 }, - { "fec(LC)", 30 }, - { "fec(HBERR)", 31 }, - { "fec(GRA)", 32 }, - { "fec(EBERR)", 33 }, - { "fec(BABT)", 34 }, - { "fec(BABR)", 35 }, - { NULL }, + /* + * Available but not allocated because not handled: + * fec(TXB) 24, fec(TXFIFO) 25, fec(TXCR) 26, fec(RXB) 28, + * fec(LC) 30, fec(HBERR) 31, fec(GRA) 32, fec(EBERR) 33, + * fec(BABT) 34, fec(BABR), 35 + */ + { "fec(TXF)", 23}, + { "fec(RXF)", 27}, + { "fec(MII)", 29}, + { NULL, 0}, }; fep = netdev_priv(dev); @@ -1382,43 +1385,47 @@ static void __inline__ fec_request_intrs /* Setup interrupt handlers. */ for (idp = id; idp->name; idp++) { - if (request_irq(b+idp->irq, fec_enet_interrupt, 0, idp->name, dev) != 0) - printk("FEC: Could not allocate %s IRQ(%d)!\n", idp->name, b+idp->irq); - } + int ret; + ret = request_irq(b + idp->irq, fec_enet_interrupt, IRQF_DISABLED, + idp->name, dev); + if (ret) + printk("FEC: Could not allocate %s IRQ(%d)!\n", + idp->name, b + idp->irq); + } +#if defined(CONFIG_M527x) || defined(CONFIG_M528x) /* Unmask interrupts at ColdFire 5280/5282 interrupt controller */ { - volatile unsigned char *icrp; - volatile unsigned long *imrp; + volatile unsigned char *icrp; + volatile unsigned long *imrp; int i, ilip; b = (fep->index) ? MCFICM_INTC1 : MCFICM_INTC0; - icrp = (volatile unsigned char *) (MCF_IPSBAR + b + - MCFINTC_ICR0); + icrp = (volatile unsigned char *)(MCF_IPSBAR + b + + MCFINTC_ICR0); for (i = 23, ilip = 0x28; (i < 36); i++) icrp[i] = ilip--; - imrp = (volatile unsigned long *) (MCF_IPSBAR + b + - MCFINTC_IMRH); + imrp = (volatile unsigned long *)(MCF_IPSBAR + b + + MCFINTC_IMRH); *imrp &= ~0x0000000f; - imrp = (volatile unsigned long *) (MCF_IPSBAR + b + - MCFINTC_IMRL); + imrp = (volatile unsigned long *)(MCF_IPSBAR + b + + MCFINTC_IMRL); *imrp &= ~0xff800001; } - +#endif #if defined(CONFIG_M528x) /* Set up gpio outputs for MII lines */ { volatile u16 *gpio_paspar; volatile u8 *gpio_pehlpar; - gpio_paspar = (volatile u16 *) (MCF_IPSBAR + 0x100056); - gpio_pehlpar = (volatile u16 *) (MCF_IPSBAR + 0x100058); + gpio_paspar = (volatile u16 *)(MCF_IPSBAR + 0x100056); + gpio_pehlpar = (volatile u16 *)(MCF_IPSBAR + 0x100058); *gpio_paspar |= 0x0f00; *gpio_pehlpar = 0xc0; } #endif - #if defined(CONFIG_M527x) /* Set up gpio outputs for MII lines */ { @@ -1443,7 +1450,8 @@ static void __inline__ fec_request_intrs #endif /* CONFIG_M527x */ } -static void __inline__ fec_set_mii(struct net_device *dev, struct fec_enet_private *fep) +static void __init fec_set_mii(struct net_device *dev, + struct fec_enet_private *fep) { volatile fec_t *fecp; @@ -1461,7 +1469,7 @@ static void __inline__ fec_set_mii(struc fec_restart(dev, 0); } -static void __inline__ fec_get_mac(struct net_device *dev) +static void __init fec_get_mac(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); volatile fec_t *fecp; @@ -1482,8 +1490,8 @@ static void __inline__ fec_get_mac(struc (iap[3] == 0xff) && (iap[4] == 0xff) && (iap[5] == 0xff)) iap = fec_mac_default; } else { - *((unsigned long *) &tmpaddr[0]) = fecp->fec_addr_low; - *((unsigned short *) &tmpaddr[4]) = (fecp->fec_addr_high >> 
16); + *((unsigned long *)&tmpaddr[0]) = fecp->fec_addr_low; + *((unsigned short *)&tmpaddr[4]) = (fecp->fec_addr_high >> 16); iap = &tmpaddr[0]; } @@ -1491,29 +1499,26 @@ static void __inline__ fec_get_mac(struc /* Adjust MAC if using default MAC address */ if (iap == fec_mac_default) - dev->dev_addr[ETH_ALEN-1] = fec_mac_default[ETH_ALEN-1] + fep->index; + dev->dev_addr[ETH_ALEN - 1] = + fec_mac_default[ETH_ALEN - 1] + fep->index; } -static void __inline__ fec_enable_phy_intr(void) +static void fec_enable_phy_intr(void) { } -static void __inline__ fec_disable_phy_intr(void) +static void fec_disable_phy_intr(void) { } -static void __inline__ fec_phy_ack_intr(void) -{ -} - -static void __inline__ fec_localhw_setup(void) +static void fec_localhw_setup(void) { } /* * Do not need to make region uncached on 5272. */ -static void __inline__ fec_uncache(unsigned long addr) +static void __init fec_uncache(unsigned long addr) { } @@ -1524,7 +1529,7 @@ static void __inline__ fec_uncache(unsig /* * Code specific to Coldfire 520x */ -static void __inline__ fec_request_intrs(struct net_device *dev) +static void __init fec_request_intrs(struct net_device *dev) { struct fec_enet_private *fep; int b; @@ -1532,20 +1537,16 @@ static void __inline__ fec_request_intrs char *name; unsigned short irq; } *idp, id[] = { - { "fec(TXF)", 23 }, - { "fec(TXB)", 24 }, - { "fec(TXFIFO)", 25 }, - { "fec(TXCR)", 26 }, - { "fec(RXF)", 27 }, - { "fec(RXB)", 28 }, - { "fec(MII)", 29 }, - { "fec(LC)", 30 }, - { "fec(HBERR)", 31 }, - { "fec(GRA)", 32 }, - { "fec(EBERR)", 33 }, - { "fec(BABT)", 34 }, - { "fec(BABR)", 35 }, - { NULL }, + /* + * Available but not allocated because not handled: + * fec(TXB) 24, fec(TXFIFO) 25, fec(TXCR) 26, fec(RXB) 28, + * fec(LC) 30, fec(HBERR) 31, fec(GRA) 32, fec(EBERR) 33, + * fec(BABT) 34, fec(BABR) 35 + */ + { "fec(TXF)", 23}, + { "fec(RXF)", 27}, + { "fec(MII)", 29}, + { NULL, 0}, }; fep = netdev_priv(dev); @@ -1553,28 +1554,34 @@ static void __inline__ fec_request_intrs /* Setup interrupt handlers. 
*/ for (idp = id; idp->name; idp++) { - if (request_irq(b+idp->irq,fec_enet_interrupt,0,idp->name,dev)!=0) - printk("FEC: Could not allocate %s IRQ(%d)!\n", idp->name, b+idp->irq); + int ret; + + ret = request_irq(b + idp->irq, fec_enet_interrupt, IRQF_DISABLED, + idp->name, dev); + if (ret) + printk("FEC: Could not allocate %s IRQ(%d)!\n", + idp->name, b + idp->irq); } /* Unmask interrupts at ColdFire interrupt controller */ { - volatile unsigned char *icrp; - volatile unsigned long *imrp; + volatile unsigned char *icrp; + volatile unsigned long *imrp; - icrp = (volatile unsigned char *) (MCF_IPSBAR + MCFICM_INTC0 + - MCFINTC_ICR0); + icrp = (volatile unsigned char *)(MCF_IPSBAR + MCFICM_INTC0 + + MCFINTC_ICR0); for (b = 36; (b < 49); b++) icrp[b] = 0x04; - imrp = (volatile unsigned long *) (MCF_IPSBAR + MCFICM_INTC0 + - MCFINTC_IMRH); + imrp = (volatile unsigned long *)(MCF_IPSBAR + MCFICM_INTC0 + + MCFINTC_IMRH); *imrp &= ~0x0001FFF0; } *(volatile unsigned char *)(MCF_IPSBAR + MCF_GPIO_PAR_FEC) |= 0xf0; *(volatile unsigned char *)(MCF_IPSBAR + MCF_GPIO_PAR_FECI2C) |= 0x0f; } -static void __inline__ fec_set_mii(struct net_device *dev, struct fec_enet_private *fep) +static void __init fec_set_mii(struct net_device *dev, + struct fec_enet_private *fep) { volatile fec_t *fecp; @@ -1592,7 +1599,7 @@ static void __inline__ fec_set_mii(struc fec_restart(dev, 0); } -static void __inline__ fec_get_mac(struct net_device *dev) +static void __init fec_get_mac(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); volatile fec_t *fecp; @@ -1607,14 +1614,14 @@ static void __inline__ fec_get_mac(struc */ iap = FEC_FLASHMAC; if ((iap[0] == 0) && (iap[1] == 0) && (iap[2] == 0) && - (iap[3] == 0) && (iap[4] == 0) && (iap[5] == 0)) + (iap[3] == 0) && (iap[4] == 0) && (iap[5] == 0)) iap = fec_mac_default; if ((iap[0] == 0xff) && (iap[1] == 0xff) && (iap[2] == 0xff) && - (iap[3] == 0xff) && (iap[4] == 0xff) && (iap[5] == 0xff)) + (iap[3] == 0xff) && (iap[4] == 0xff) && (iap[5] == 0xff)) iap = fec_mac_default; } else { - *((unsigned long *) &tmpaddr[0]) = fecp->fec_addr_low; - *((unsigned short *) &tmpaddr[4]) = (fecp->fec_addr_high >> 16); + *((unsigned long *)&tmpaddr[0]) = fecp->fec_addr_low; + *((unsigned short *)&tmpaddr[4]) = (fecp->fec_addr_high >> 16); iap = &tmpaddr[0]; } @@ -1622,26 +1629,23 @@ static void __inline__ fec_get_mac(struc /* Adjust MAC if using default MAC address */ if (iap == fec_mac_default) - dev->dev_addr[ETH_ALEN-1] = fec_mac_default[ETH_ALEN-1] + fep->index; -} - -static void __inline__ fec_enable_phy_intr(void) -{ + dev->dev_addr[ETH_ALEN - 1] = + fec_mac_default[ETH_ALEN - 1] + fep->index; } -static void __inline__ fec_disable_phy_intr(void) +static void fec_enable_phy_intr(void) { } -static void __inline__ fec_phy_ack_intr(void) +static void fec_disable_phy_intr(void) { } -static void __inline__ fec_localhw_setup(void) +static void fec_localhw_setup(void) { } -static void __inline__ fec_uncache(unsigned long addr) +static void __init fec_uncache(unsigned long addr) { } @@ -1651,7 +1655,7 @@ static void __inline__ fec_uncache(unsig /* * Code specific for M532x */ -static void __inline__ fec_request_intrs(struct net_device *dev) +static void __init fec_request_intrs(struct net_device *dev) { struct fec_enet_private *fep; int b; @@ -1659,20 +1663,16 @@ static void __inline__ fec_request_intrs char *name; unsigned short irq; } *idp, id[] = { - { "fec(TXF)", 36 }, - { "fec(TXB)", 37 }, - { "fec(TXFIFO)", 38 }, - { "fec(TXCR)", 39 }, - { "fec(RXF)", 40 }, - { 
"fec(RXB)", 41 }, - { "fec(MII)", 42 }, - { "fec(LC)", 43 }, - { "fec(HBERR)", 44 }, - { "fec(GRA)", 45 }, - { "fec(EBERR)", 46 }, - { "fec(BABT)", 47 }, - { "fec(BABR)", 48 }, - { NULL }, + /* + * Available but not allocated because not handled: + * fec(TXB) 37, fec(TXFIFO) 38, fec(TXCR) 39, fec(RXB) 41, + * fec(LC) 43, fec(HBERR) 44, fec(GRA) 45, fec(EBERR) 46, + * fec(BABT) 47, fec(BABR) 48 + */ + { "fec(TXF)", 36}, + { "fec(RXF)", 40}, + { "fec(MII)", 42}, + { NULL, 0}, }; fep = netdev_priv(dev); @@ -1680,9 +1680,13 @@ static void __inline__ fec_request_intrs /* Setup interrupt handlers. */ for (idp = id; idp->name; idp++) { - if (request_irq(b+idp->irq,fec_enet_interrupt,0,idp->name,dev)!=0) - printk("FEC: Could not allocate %s IRQ(%d)!\n", - idp->name, b+idp->irq); + int ret; + + ret = request_irq(b + idp->irq, fec_enet_interrupt, IRQF_DISABLED, + idp->name, dev); + if (ret) + printk("FEC: Could not allocate %s IRQ(%d)!\n", + idp->name, b + idp->irq); } /* Unmask interrupts */ @@ -1700,31 +1704,31 @@ static void __inline__ fec_request_intrs MCF_INTC0_ICR47 = 0x2; MCF_INTC0_ICR48 = 0x2; - MCF_INTC0_IMRH &= ~( - MCF_INTC_IMRH_INT_MASK36 | - MCF_INTC_IMRH_INT_MASK37 | - MCF_INTC_IMRH_INT_MASK38 | - MCF_INTC_IMRH_INT_MASK39 | - MCF_INTC_IMRH_INT_MASK40 | - MCF_INTC_IMRH_INT_MASK41 | - MCF_INTC_IMRH_INT_MASK42 | - MCF_INTC_IMRH_INT_MASK43 | - MCF_INTC_IMRH_INT_MASK44 | - MCF_INTC_IMRH_INT_MASK45 | - MCF_INTC_IMRH_INT_MASK46 | - MCF_INTC_IMRH_INT_MASK47 | - MCF_INTC_IMRH_INT_MASK48 ); + MCF_INTC0_IMRH &= ~(MCF_INTC_IMRH_INT_MASK36 | + MCF_INTC_IMRH_INT_MASK37 | + MCF_INTC_IMRH_INT_MASK38 | + MCF_INTC_IMRH_INT_MASK39 | + MCF_INTC_IMRH_INT_MASK40 | + MCF_INTC_IMRH_INT_MASK41 | + MCF_INTC_IMRH_INT_MASK42 | + MCF_INTC_IMRH_INT_MASK43 | + MCF_INTC_IMRH_INT_MASK44 | + MCF_INTC_IMRH_INT_MASK45 | + MCF_INTC_IMRH_INT_MASK46 | + MCF_INTC_IMRH_INT_MASK47 | + MCF_INTC_IMRH_INT_MASK48); /* Set up gpio outputs for MII lines */ MCF_GPIO_PAR_FECI2C |= (0 | - MCF_GPIO_PAR_FECI2C_PAR_MDC_EMDC | - MCF_GPIO_PAR_FECI2C_PAR_MDIO_EMDIO); + MCF_GPIO_PAR_FECI2C_PAR_MDC_EMDC | + MCF_GPIO_PAR_FECI2C_PAR_MDIO_EMDIO); MCF_GPIO_PAR_FEC = (0 | - MCF_GPIO_PAR_FEC_PAR_FEC_7W_FEC | - MCF_GPIO_PAR_FEC_PAR_FEC_MII_FEC); + MCF_GPIO_PAR_FEC_PAR_FEC_7W_FEC | + MCF_GPIO_PAR_FEC_PAR_FEC_MII_FEC); } -static void __inline__ fec_set_mii(struct net_device *dev, struct fec_enet_private *fep) +static void __init fec_set_mii(struct net_device *dev, + struct fec_enet_private *fep) { volatile fec_t *fecp; @@ -1741,7 +1745,7 @@ static void __inline__ fec_set_mii(struc fec_restart(dev, 0); } -static void __inline__ fec_get_mac(struct net_device *dev) +static void __init fec_get_mac(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); volatile fec_t *fecp; @@ -1762,8 +1766,8 @@ static void __inline__ fec_get_mac(struc (iap[3] == 0xff) && (iap[4] == 0xff) && (iap[5] == 0xff)) iap = fec_mac_default; } else { - *((unsigned long *) &tmpaddr[0]) = fecp->fec_addr_low; - *((unsigned short *) &tmpaddr[4]) = (fecp->fec_addr_high >> 16); + *((unsigned long *)&tmpaddr[0]) = fecp->fec_addr_low; + *((unsigned short *)&tmpaddr[4]) = (fecp->fec_addr_high >> 16); iap = &tmpaddr[0]; } @@ -1771,143 +1775,109 @@ static void __inline__ fec_get_mac(struc /* Adjust MAC if using default MAC address */ if (iap == fec_mac_default) - dev->dev_addr[ETH_ALEN-1] = fec_mac_default[ETH_ALEN-1] + fep->index; -} - -static void __inline__ fec_enable_phy_intr(void) -{ + dev->dev_addr[ETH_ALEN - 1] = + fec_mac_default[ETH_ALEN - 1] + fep->index; } -static 
void __inline__ fec_disable_phy_intr(void) +static void fec_enable_phy_intr(void) { } -static void __inline__ fec_phy_ack_intr(void) +static void fec_disable_phy_intr(void) { } -static void __inline__ fec_localhw_setup(void) +static void fec_localhw_setup(void) { } /* * Do not need to make region uncached on 532x. */ -static void __inline__ fec_uncache(unsigned long addr) +static void __init fec_uncache(unsigned long addr) { } /* ------------------------------------------------------------------------- */ - #else /* * Code specific to the MPC860T setup. */ -static void __inline__ fec_request_intrs(struct net_device *dev) +static void __init fec_request_intrs(struct net_device *dev) { volatile immap_t *immap; - immap = (immap_t *)IMAP_ADDR; /* pointer to internal registers */ + immap = (immap_t *) IMAP_ADDR; /* pointer to internal registers */ - if (request_8xxirq(FEC_INTERRUPT, fec_enet_interrupt, 0, "fec", dev) != 0) + if (request_8xxirq(FEC_INTERRUPT, fec_enet_interrupt, 0, "fec", dev) != + 0) panic("Could not allocate FEC IRQ!"); - -#ifdef CONFIG_RPXCLASSIC - /* Make Port C, bit 15 an input that causes interrupts. - */ - immap->im_ioport.iop_pcpar &= ~0x0001; - immap->im_ioport.iop_pcdir &= ~0x0001; - immap->im_ioport.iop_pcso &= ~0x0001; - immap->im_ioport.iop_pcint |= 0x0001; - cpm_install_handler(CPMVEC_PIO_PC15, mii_link_interrupt, dev); - - /* Make LEDS reflect Link status. - */ - *((uint *) RPX_CSR_ADDR) &= ~BCSR2_FETHLEDMODE; -#endif -#ifdef CONFIG_FADS - if (request_8xxirq(SIU_IRQ2, mii_link_interrupt, 0, "mii", dev) != 0) - panic("Could not allocate MII IRQ!"); -#endif } -static void __inline__ fec_get_mac(struct net_device *dev) +static void __init fec_get_mac(struct net_device *dev) { bd_t *bd; - bd = (bd_t *)__res; + bd = (bd_t *) __res; memcpy(dev->dev_addr, bd->bi_enetaddr, ETH_ALEN); - -#ifdef CONFIG_RPXCLASSIC - /* The Embedded Planet boards have only one MAC address in - * the EEPROM, but can have two Ethernet ports. For the - * FEC port, we create another address by setting one of - * the address bits above something that would have (up to - * now) been allocated. - */ - dev->dev_adrd[3] |= 0x80; -#endif } -static void __inline__ fec_set_mii(struct net_device *dev, struct fec_enet_private *fep) +static void __init fec_set_mii(struct net_device *dev, + struct fec_enet_private *fep) { extern uint _get_IMMR(void); volatile immap_t *immap; volatile fec_t *fecp; fecp = fep->hwp; - immap = (immap_t *)IMAP_ADDR; /* pointer to internal registers */ + immap = (immap_t *) IMAP_ADDR; /* pointer to internal registers */ /* Configure all of port D for MII. - */ + */ immap->im_ioport.iop_pdpar = 0x1fff; /* Bits moved from Rev. D onward. - */ + */ if ((_get_IMMR() & 0xffff) < 0x0501) immap->im_ioport.iop_pddir = 0x1c58; /* Pre rev. D */ else immap->im_ioport.iop_pddir = 0x1fff; /* Rev. 
D and later */ /* Set MII speed to 2.5 MHz - */ + */ fecp->fec_mii_speed = fep->phy_speed = - ((bd->bi_busfreq * 1000000) / 2500000) & 0x7e; + ((bd->bi_busfreq * 1000000) / 2500000) & 0x7e; } -static void __inline__ fec_enable_phy_intr(void) +static void fec_enable_phy_intr(void) { volatile fec_t *fecp; fecp = fep->hwp; /* Enable MII command finished interrupt - */ - fecp->fec_ivec = (FEC_INTERRUPT/2) << 29; -} - -static void __inline__ fec_disable_phy_intr(void) -{ + */ + fecp->fec_ivec = (FEC_INTERRUPT / 2) << 29; } -static void __inline__ fec_phy_ack_intr(void) +static void fec_disable_phy_intr(void) { } -static void __inline__ fec_localhw_setup(void) +static void fec_localhw_setup(void) { volatile fec_t *fecp; fecp = fep->hwp; fecp->fec_r_hash = PKT_MAXBUF_SIZE; /* Enable big endian and don't care about SDMA FC. - */ + */ fecp->fec_fun_code = 0x78000000; } -static void __inline__ fec_uncache(unsigned long addr) +static void __init fec_uncache(unsigned long addr) { pte_t *pte; pte = va_to_pte(mem_addr); @@ -1936,11 +1906,19 @@ static void mii_display_status(struct ne } else { printk("link up"); - switch(*s & PHY_STAT_SPMASK) { - case PHY_STAT_100FDX: printk(", 100MBit Full Duplex"); break; - case PHY_STAT_100HDX: printk(", 100MBit Half Duplex"); break; - case PHY_STAT_10FDX: printk(", 10MBit Full Duplex"); break; - case PHY_STAT_10HDX: printk(", 10MBit Half Duplex"); break; + switch (*s & PHY_STAT_SPMASK) { + case PHY_STAT_100FDX: + printk(", 100MBit Full Duplex"); + break; + case PHY_STAT_100HDX: + printk(", 100MBit Half Duplex"); + break; + case PHY_STAT_10FDX: + printk(", 10MBit Full Duplex"); + break; + case PHY_STAT_10HDX: + printk(", 10MBit Half Duplex"); + break; default: printk(", Unknown speed/duplex"); } @@ -1957,14 +1935,15 @@ static void mii_display_status(struct ne static void mii_display_config(struct work_struct *work) { - struct fec_enet_private *fep = container_of(work, struct fec_enet_private, phy_task); + struct fec_enet_private *fep = + container_of(work, struct fec_enet_private, phy_task); struct net_device *dev = fep->netdev; uint status = fep->phy_status; /* - ** When we get here, phy_task is already removed from - ** the workqueue. It is thus safe to allow to reuse it. - */ + ** When we get here, phy_task is already removed from + ** the workqueue. It is thus safe to allow to reuse it. + */ fep->mii_phy_task_queued = 0; printk("%s: config: auto-negotiation ", dev->name); @@ -1994,14 +1973,15 @@ static void mii_display_config(struct wo static void mii_relink(struct work_struct *work) { - struct fec_enet_private *fep = container_of(work, struct fec_enet_private, phy_task); + struct fec_enet_private *fep = + container_of(work, struct fec_enet_private, phy_task); struct net_device *dev = fep->netdev; int duplex; /* - ** When we get here, phy_task is already removed from - ** the workqueue. It is thus safe to allow to reuse it. - */ + ** When we get here, phy_task is already removed from + ** the workqueue. It is thus safe to allow to reuse it. + */ fep->mii_phy_task_queued = 0; fep->link = (fep->phy_status & PHY_STAT_LINK) ? 
1 : 0; mii_display_status(dev); @@ -2009,8 +1989,7 @@ static void mii_relink(struct work_struc if (fep->link) { duplex = 0; - if (fep->phy_status - & (PHY_STAT_100FDX | PHY_STAT_10FDX)) + if (fep->phy_status & (PHY_STAT_100FDX | PHY_STAT_10FDX)) duplex = 1; fec_restart(dev, duplex); } else @@ -2028,12 +2007,12 @@ static void mii_queue_relink(uint mii_re struct fec_enet_private *fep = netdev_priv(dev); /* - ** We cannot queue phy_task twice in the workqueue. It - ** would cause an endless loop in the workqueue. - ** Fortunately, if the last mii_relink entry has not yet been - ** executed now, it will do the job for the current interrupt, - ** which is just what we want. - */ + ** We cannot queue phy_task twice in the workqueue. It + ** would cause an endless loop in the workqueue. + ** Fortunately, if the last mii_relink entry has not yet been + ** executed now, it will do the job for the current interrupt, + ** which is just what we want. + */ if (fep->mii_phy_task_queued) return; @@ -2056,18 +2035,17 @@ static void mii_queue_config(uint mii_re } phy_cmd_t const phy_cmd_relink[] = { - { mk_mii_read(MII_REG_CR), mii_queue_relink }, - { mk_mii_end, } - }; + {mk_mii_read(MII_REG_CR), mii_queue_relink}, + {mk_mii_end,} +}; phy_cmd_t const phy_cmd_config[] = { - { mk_mii_read(MII_REG_CR), mii_queue_config }, - { mk_mii_end, } - }; + {mk_mii_read(MII_REG_CR), mii_queue_config}, + {mk_mii_end,} +}; /* Read remainder of PHY ID. */ -static void -mii_discover_phy3(uint mii_reg, struct net_device *dev) +static void mii_discover_phy3(uint mii_reg, struct net_device *dev) { struct fec_enet_private *fep; int i; @@ -2076,8 +2054,8 @@ mii_discover_phy3(uint mii_reg, struct n fep->phy_id |= (mii_reg & 0xffff); printk("fec: PHY @ 0x%x, ID 0x%08x", fep->phy_addr, fep->phy_id); - for(i = 0; phy_info[i]; i++) { - if(phy_info[i]->id == (fep->phy_id >> 4)) + for (i = 0; phy_info[i]; i++) { + if (phy_info[i]->id == (fep->phy_id >> 4)) break; } @@ -2093,8 +2071,7 @@ mii_discover_phy3(uint mii_reg, struct n /* Scan all of the MII PHY addresses looking for someone to respond * with a valid ID. This usually happens quickly. */ -static void -mii_discover_phy(uint mii_reg, struct net_device *dev) +static void mii_discover_phy(uint mii_reg, struct net_device *dev) { struct fec_enet_private *fep; volatile fec_t *fecp; @@ -2107,14 +2084,14 @@ mii_discover_phy(uint mii_reg, struct ne if ((phytype = (mii_reg & 0xffff)) != 0xffff && phytype != 0) { /* Got first part of ID, now get remainder. - */ + */ fep->phy_id = phytype << 16; mii_queue(dev, mk_mii_read(MII_REG_PHYIR2), - mii_discover_phy3); + mii_discover_phy3); } else { fep->phy_addr++; mii_queue(dev, mk_mii_read(MII_REG_PHYIR1), - mii_discover_phy); + mii_discover_phy); } } else { printk("FEC: No PHY device found.\n"); @@ -2124,33 +2101,23 @@ mii_discover_phy(uint mii_reg, struct ne } } -/* This interrupt occurs when the PHY detects a link change. -*/ -#ifdef CONFIG_RPXCLASSIC -static void -mii_link_interrupt(void *dev_id) -#else -static irqreturn_t -mii_link_interrupt(int irq, void * dev_id) -#endif +/* Set a MAC change in hardware. 
+ */ +static void fec_set_mac_address(struct net_device *dev) { - struct net_device *dev = dev_id; - struct fec_enet_private *fep = netdev_priv(dev); - - fec_phy_ack_intr(); + volatile fec_t *fecp; -#if 0 - disable_irq(fep->mii_irq); /* disable now, enable later */ -#endif + fecp = ((struct fec_enet_private *)netdev_priv(dev))->hwp; - mii_do_cmd(dev, fep->phy->ack_int); - mii_do_cmd(dev, phy_cmd_relink); /* restart and display status */ + /* Set station address. */ + fecp->fec_addr_low = dev->dev_addr[3] | (dev->dev_addr[2] << 8) | + (dev->dev_addr[1] << 16) | (dev->dev_addr[0] << 24); + fecp->fec_addr_high = (dev->dev_addr[5] << 16) | + (dev->dev_addr[4] << 24); - return IRQ_HANDLED; } -static int -fec_enet_open(struct net_device *dev) +static int fec_enet_open(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); @@ -2165,7 +2132,7 @@ fec_enet_open(struct net_device *dev) if (fep->phy) { mii_do_cmd(dev, fep->phy->ack_int); mii_do_cmd(dev, fep->phy->config); - mii_do_cmd(dev, phy_cmd_config); /* display configuration */ + mii_do_cmd(dev, phy_cmd_config); /* display configuration */ /* Poll until the PHY tells us its configuration * (not link state). @@ -2174,7 +2141,7 @@ fec_enet_open(struct net_device *dev) * This should take about 25 usec per register at 2.5 MHz, * and we read approximately 5 registers. */ - while(!fep->sequence_done) + while (!fep->sequence_done) schedule(); mii_do_cmd(dev, fep->phy->startup); @@ -2185,7 +2152,7 @@ fec_enet_open(struct net_device *dev) */ fep->link = 1; } else { - fep->link = 1; /* lets just try it and see */ + fep->link = 1; /* lets just try it and see */ /* no phy, go full duplex, it's most likely a hub chip */ fec_restart(dev, 1); } @@ -2195,13 +2162,12 @@ fec_enet_open(struct net_device *dev) return 0; /* Success */ } -static int -fec_enet_close(struct net_device *dev) +static int fec_enet_close(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); /* Don't know what to do yet. - */ + */ fep->opened = 0; netif_stop_queue(dev); fec_stop(dev); @@ -2219,7 +2185,7 @@ fec_enet_close(struct net_device *dev) * this kind of feature?). */ -#define HASH_BITS 6 /* #bits in hash */ +#define HASH_BITS 6 /* #bits in hash */ #define CRC32_POLY 0xEDB88320 static void set_multicast_list(struct net_device *dev) @@ -2233,76 +2199,61 @@ static void set_multicast_list(struct ne fep = netdev_priv(dev); ep = fep->hwp; - if (dev->flags&IFF_PROMISC) { + if (dev->flags & IFF_PROMISC) { ep->fec_r_cntrl |= 0x0008; - } else { + return ; + } - ep->fec_r_cntrl &= ~0x0008; + ep->fec_r_cntrl &= ~0x0008; - if (dev->flags & IFF_ALLMULTI) { - /* Catch all multicast addresses, so set the - * filter to all 1's. - */ - ep->fec_hash_table_high = 0xffffffff; - ep->fec_hash_table_low = 0xffffffff; - } else { - /* Clear filter and add the addresses in hash register. - */ - ep->fec_hash_table_high = 0; - ep->fec_hash_table_low = 0; - - dmi = dev->mc_list; - - for (j = 0; j < dev->mc_count; j++, dmi = dmi->next) - { - /* Only support group multicast for now. - */ - if (!(dmi->dmi_addr[0] & 1)) - continue; - - /* calculate crc32 value of mac address - */ - crc = 0xffffffff; - - for (i = 0; i < dmi->dmi_addrlen; i++) - { - data = dmi->dmi_addr[i]; - for (bit = 0; bit < 8; bit++, data >>= 1) - { - crc = (crc >> 1) ^ - (((crc ^ data) & 1) ? 
CRC32_POLY : 0); - } - } - - /* only upper 6 bits (HASH_BITS) are used - which point to specific bit in he hash registers - */ - hash = (crc >> (32 - HASH_BITS)) & 0x3f; - - if (hash > 31) - ep->fec_hash_table_high |= 1 << (hash - 32); - else - ep->fec_hash_table_low |= 1 << hash; - } - } + if (dev->flags & IFF_ALLMULTI) { + /* Catch all multicast addresses, so set the + * filter to all 1's. + */ + ep->fec_hash_table_high = 0xffffffff; + ep->fec_hash_table_low = 0xffffffff; + return ; } -} + /* + * Clear filter and add the addresses in hash register. + */ + ep->fec_hash_table_high = 0; + ep->fec_hash_table_low = 0; -/* Set a MAC change in hardware. - */ -static void -fec_set_mac_address(struct net_device *dev) -{ - volatile fec_t *fecp; + dmi = dev->mc_list; - fecp = ((struct fec_enet_private *)netdev_priv(dev))->hwp; + for (j = 0; j < dev->mc_count; j++, dmi = dmi->next) { + /* Only support group multicast for now. + */ + if (!(dmi->dmi_addr[0] & 1)) + continue; - /* Set station address. */ - fecp->fec_addr_low = dev->dev_addr[3] | (dev->dev_addr[2] << 8) | - (dev->dev_addr[1] << 16) | (dev->dev_addr[0] << 24); - fecp->fec_addr_high = (dev->dev_addr[5] << 16) | - (dev->dev_addr[4] << 24); + /* calculate crc32 value of mac address + */ + crc = 0xffffffff; + + for (i = 0; i < dmi->dmi_addrlen; i++) { + data = dmi->dmi_addr[i]; + for (bit = 0; bit < 8; + bit++, data >>= 1) { + crc = + (crc >> 1) ^ + (((crc ^ data) & 1) ? + CRC32_POLY : 0); + } + } + /* only upper 6 bits (HASH_BITS) are used + which point to specific bit in he hash registers + */ + hash = (crc >> (32 - HASH_BITS)) & 0x3f; + + if (hash > 31) + ep->fec_hash_table_high |= + 1 << (hash - 32); + else + ep->fec_hash_table_low |= 1 << hash; + } } /* Initialize the FEC Ethernet on 860T (or ColdFire 5272). @@ -2310,38 +2261,40 @@ fec_set_mac_address(struct net_device *d /* * XXX: We need to clean up on failure exits here. */ +static int index; int __init fec_enet_init(struct net_device *dev) { struct fec_enet_private *fep = netdev_priv(dev); - unsigned long mem_addr; - volatile cbd_t *bdp; - cbd_t *cbd_base; - volatile fec_t *fecp; - int i, j; - static int index = 0; + unsigned long mem_addr; + volatile cbd_t *bdp; + cbd_t *cbd_base; + volatile fec_t *fecp; + int i, j; /* Only allow us to be probed once. */ if (index >= FEC_MAX_PORTS) return -ENXIO; /* Allocate memory for buffer descriptors. - */ + */ mem_addr = __get_free_page(GFP_KERNEL); if (mem_addr == 0) { printk("FEC: allocate descriptor memory failed?\n"); return -ENOMEM; } + spin_lock_init(&fep->hw_lock); + spin_lock_init(&fep->mii_lock); /* Create an Ethernet device instance. - */ - fecp = (volatile fec_t *) fec_hw[index]; + */ + fecp = (volatile fec_t *)fec_hw[index]; fep->index = index; fep->hwp = fecp; fep->netdev = dev; /* Whack a reset. We should wait for this. - */ + */ fecp->fec_ecntrl = 1; udelay(10); @@ -2353,13 +2306,12 @@ int __init fec_enet_init(struct net_devi */ fec_get_mac(dev); - cbd_base = (cbd_t *)mem_addr; - /* XXX: missing check for allocation failure */ + cbd_base = (cbd_t *) mem_addr; fec_uncache(mem_addr); /* Set receive and transmit descriptor base. - */ + */ fep->rx_bd_base = cbd_base; fep->tx_bd_base = cbd_base + RX_RING_SIZE; @@ -2369,20 +2321,20 @@ int __init fec_enet_init(struct net_devi fep->skb_cur = fep->skb_dirty = 0; /* Initialize the receive buffer descriptors. - */ + */ bdp = fep->rx_bd_base; - for (i=0; i<FEC_ENET_RX_PAGES; i++) { + for (i = 0; i < FEC_ENET_RX_PAGES; i++) { /* Allocate a page. 
- */ + */ mem_addr = __get_free_page(GFP_KERNEL); /* XXX: missing check for allocation failure */ fec_uncache(mem_addr); /* Initialize the BD for every fragment in the page. - */ - for (j=0; j<FEC_ENET_RX_FRPPG; j++) { + */ + for (j = 0; j < FEC_ENET_RX_FRPPG; j++) { bdp->cbd_sc = BD_ENET_RX_EMPTY; bdp->cbd_bufaddr = __pa(mem_addr); mem_addr += FEC_ENET_RX_FRSIZE; @@ -2391,43 +2343,44 @@ int __init fec_enet_init(struct net_devi } /* Set the last buffer to wrap. - */ + */ bdp--; bdp->cbd_sc |= BD_SC_WRAP; /* ...and the same for transmmit. - */ + */ bdp = fep->tx_bd_base; - for (i=0, j=FEC_ENET_TX_FRPPG; i<TX_RING_SIZE; i++) { + for (i = 0, j = FEC_ENET_TX_FRPPG; i < TX_RING_SIZE; i++) { if (j >= FEC_ENET_TX_FRPPG) { + /* XXX: missing check for allocation failure */ mem_addr = __get_free_page(GFP_KERNEL); j = 1; } else { mem_addr += FEC_ENET_TX_FRSIZE; j++; } - fep->tx_bounce[i] = (unsigned char *) mem_addr; + fep->tx_bounce[i] = (unsigned char *)mem_addr; /* Initialize the BD for every fragment in the page. - */ + */ bdp->cbd_sc = 0; bdp->cbd_bufaddr = 0; bdp++; } /* Set the last buffer to wrap. - */ + */ bdp--; bdp->cbd_sc |= BD_SC_WRAP; /* Set receive and transmit descriptor base. - */ - fecp->fec_r_des_start = __pa((uint)(fep->rx_bd_base)); - fecp->fec_x_des_start = __pa((uint)(fep->tx_bd_base)); + */ + fecp->fec_r_des_start = __pa((uint) (fep->rx_bd_base)); + fecp->fec_x_des_start = __pa((uint) (fep->tx_bd_base)); /* Install our interrupt handlers. This varies depending on * the architecture. - */ + */ fec_request_intrs(dev); fecp->fec_hash_table_high = 0; @@ -2446,8 +2399,8 @@ int __init fec_enet_init(struct net_devi dev->stop = fec_enet_close; dev->set_multicast_list = set_multicast_list; - for (i=0; i<NMII-1; i++) - mii_cmds[i].mii_next = &mii_cmds[i+1]; + for (i = 0; i < NMII - 1; i++) + mii_cmds[i].mii_next = &mii_cmds[i + 1]; mii_free = mii_cmds; /* setup MII interface */ @@ -2455,8 +2408,7 @@ int __init fec_enet_init(struct net_devi /* Clear and enable interrupts */ fecp->fec_ievent = 0xffc00000; - fecp->fec_imask = (FEC_ENET_TXF | FEC_ENET_TXB | - FEC_ENET_RXF | FEC_ENET_RXB | FEC_ENET_MII); + fecp->fec_imask = (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII); /* Queue up command to detect the PHY and initialize the * remainder of the interface. @@ -2473,8 +2425,7 @@ int __init fec_enet_init(struct net_devi * change. This only happens when switching between half and full * duplex. */ -static void -fec_restart(struct net_device *dev, int duplex) +static void fec_restart(struct net_device *dev, int duplex) { struct fec_enet_private *fep; volatile cbd_t *bdp; @@ -2485,42 +2436,42 @@ fec_restart(struct net_device *dev, int fecp = fep->hwp; /* Whack a reset. We should wait for this. - */ + */ fecp->fec_ecntrl = 1; udelay(10); /* Clear any outstanding interrupt. - */ + */ fecp->fec_ievent = 0xffc00000; fec_enable_phy_intr(); /* Set station address. - */ + */ fec_set_mac_address(dev); /* Reset all multicast. - */ + */ fecp->fec_hash_table_high = 0; fecp->fec_hash_table_low = 0; /* Set maximum receive buffer size. - */ + */ fecp->fec_r_buff_size = PKT_MAXBLR_SIZE; fec_localhw_setup(); /* Set receive and transmit descriptor base. - */ - fecp->fec_r_des_start = __pa((uint)(fep->rx_bd_base)); - fecp->fec_x_des_start = __pa((uint)(fep->tx_bd_base)); + */ + fecp->fec_r_des_start = __pa((uint) (fep->rx_bd_base)); + fecp->fec_x_des_start = __pa((uint) (fep->tx_bd_base)); fep->dirty_tx = fep->cur_tx = fep->tx_bd_base; fep->cur_rx = fep->rx_bd_base; /* Reset SKB transmit buffers. 
- */ + */ fep->skb_cur = fep->skb_dirty = 0; - for (i=0; i<=TX_RING_MOD_MASK; i++) { + for (i = 0; i <= TX_RING_MOD_MASK; i++) { if (fep->tx_skbuff[i] != NULL) { dev_kfree_skb_any(fep->tx_skbuff[i]); fep->tx_skbuff[i] = NULL; @@ -2528,43 +2479,43 @@ fec_restart(struct net_device *dev, int } /* Initialize the receive buffer descriptors. - */ + */ bdp = fep->rx_bd_base; - for (i=0; i<RX_RING_SIZE; i++) { + for (i = 0; i < RX_RING_SIZE; i++) { /* Initialize the BD for every fragment in the page. - */ + */ bdp->cbd_sc = BD_ENET_RX_EMPTY; bdp++; } /* Set the last buffer to wrap. - */ + */ bdp--; bdp->cbd_sc |= BD_SC_WRAP; /* ...and the same for transmmit. - */ + */ bdp = fep->tx_bd_base; - for (i=0; i<TX_RING_SIZE; i++) { + for (i = 0; i < TX_RING_SIZE; i++) { /* Initialize the BD for every fragment in the page. - */ + */ bdp->cbd_sc = 0; bdp->cbd_bufaddr = 0; bdp++; } /* Set the last buffer to wrap. - */ + */ bdp--; bdp->cbd_sc |= BD_SC_WRAP; /* Enable MII mode. - */ + */ if (duplex) { - fecp->fec_r_cntrl = OPT_FRAME_SIZE | 0x04;/* MII enable */ - fecp->fec_x_cntrl = 0x04; /* FD enable */ + fecp->fec_r_cntrl = OPT_FRAME_SIZE | 0x04; /* MII enable */ + fecp->fec_x_cntrl = 0x04; /* FD enable */ } else { /* MII enable|No Rcv on Xmit */ fecp->fec_r_cntrl = OPT_FRAME_SIZE | 0x06; @@ -2573,22 +2524,20 @@ fec_restart(struct net_device *dev, int fep->full_duplex = duplex; /* Set MII speed. - */ + */ fecp->fec_mii_speed = fep->phy_speed; /* And last, enable the transmit and receive processing. - */ + */ fecp->fec_ecntrl = 2; fecp->fec_r_des_active = 0; /* Enable interrupts we wish to service. - */ - fecp->fec_imask = (FEC_ENET_TXF | FEC_ENET_TXB | - FEC_ENET_RXF | FEC_ENET_RXB | FEC_ENET_MII); + */ + fecp->fec_imask = (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII); } -static void -fec_stop(struct net_device *dev) +static void fec_stop(struct net_device *dev) { volatile fec_t *fecp; struct fec_enet_private *fep; @@ -2597,23 +2546,23 @@ fec_stop(struct net_device *dev) fecp = fep->hwp; /* - ** We cannot expect a graceful transmit stop without link !!! - */ - if (fep->link) - { + ** We cannot expect a graceful transmit stop without link !!! + */ + if (fep->link) { fecp->fec_x_cntrl = 0x01; /* Graceful transmit stop */ udelay(10); if (!(fecp->fec_ievent & FEC_ENET_GRA)) - printk("fec_stop : Graceful transmit stop did not complete !\n"); - } + printk + ("fec_stop : Graceful transmit stop did not complete !\n"); + } /* Whack a reset. We should wait for this. - */ + */ fecp->fec_ecntrl = 1; udelay(10); /* Clear outstanding MII command interrupts. 
- */ + */ fecp->fec_ievent = FEC_ENET_MII; fec_enable_phy_intr(); @@ -2624,7 +2573,7 @@ fec_stop(struct net_device *dev) static int __init fec_enet_module_init(void) { struct net_device *dev; - int i, j, err; + int i, err; DECLARE_MAC_BUF(mac); printk("FEC ENET Version 0.2\n"); @@ -2651,5 +2600,4 @@ static int __init fec_enet_module_init(v } module_init(fec_enet_module_init); - MODULE_LICENSE("GPL"); Index: linux-2.6.24.7/drivers/serial/68328serial.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/68328serial.c +++ linux-2.6.24.7/drivers/serial/68328serial.c @@ -1410,7 +1410,7 @@ rs68328_init(void) if (request_irq(uart_irqs[i], rs_interrupt, - IRQ_FLG_STD, + IRQF_DISABLED, "M68328_UART", NULL)) panic("Unable to attach 68328 serial interrupt\n"); } Index: linux-2.6.24.7/drivers/serial/mcf.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/mcf.c +++ linux-2.6.24.7/drivers/serial/mcf.c @@ -69,7 +69,7 @@ static unsigned int mcf_tx_empty(struct static unsigned int mcf_get_mctrl(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; unsigned int sigs; @@ -87,7 +87,7 @@ static unsigned int mcf_get_mctrl(struct static void mcf_set_mctrl(struct uart_port *port, unsigned int sigs) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -104,7 +104,7 @@ static void mcf_set_mctrl(struct uart_po static void mcf_start_tx(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -117,7 +117,7 @@ static void mcf_start_tx(struct uart_por static void mcf_stop_tx(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -130,7 +130,7 @@ static void mcf_stop_tx(struct uart_port static void mcf_stop_rx(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -163,7 +163,7 @@ static void mcf_enable_ms(struct uart_po static int mcf_startup(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -189,7 +189,7 @@ static int mcf_startup(struct uart_port static void mcf_shutdown(struct uart_port *port) { - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned long flags; spin_lock_irqsave(&port->lock, flags); @@ -273,7 +273,7 @@ static void mcf_set_termios(struct uart_ static void mcf_rx_chars(struct mcf_uart *pp) { - struct uart_port *port = (struct uart_port *) pp; + struct uart_port *port = &pp->port; unsigned char status, ch, flag; while ((status = readb(port->membase + MCFUART_USR)) & MCFUART_USR_RXREADY) { @@ -319,7 +319,7 @@ static void mcf_rx_chars(struct mcf_uart static void mcf_tx_chars(struct mcf_uart *pp) { - struct uart_port *port = (struct uart_port *) pp; + struct uart_port *port = &pp->port; struct 
circ_buf *xmit = &port->info->xmit; if (port->x_char) { @@ -352,7 +352,7 @@ static void mcf_tx_chars(struct mcf_uart static irqreturn_t mcf_interrupt(int irq, void *data) { struct uart_port *port = data; - struct mcf_uart *pp = (struct mcf_uart *) port; + struct mcf_uart *pp = container_of(port, struct mcf_uart, port); unsigned int isr; isr = readb(port->membase + MCFUART_UISR) & pp->imr; @@ -434,7 +434,7 @@ static struct uart_ops mcf_uart_ops = { static struct mcf_uart mcf_ports[3]; -#define MCF_MAXPORTS (sizeof(mcf_ports) / sizeof(struct mcf_uart)) +#define MCF_MAXPORTS ARRAY_SIZE(mcf_ports) /****************************************************************************/ #if defined(CONFIG_SERIAL_MCF_CONSOLE) Index: linux-2.6.24.7/drivers/serial/mcfserial.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/mcfserial.c +++ linux-2.6.24.7/drivers/serial/mcfserial.c @@ -65,7 +65,8 @@ struct timer_list mcfrs_timer_struct; #define CONSOLE_BAUD_RATE 115200 #define DEFAULT_CBAUD B115200 #elif defined(CONFIG_ARNEWSH) || defined(CONFIG_FREESCALE) || \ - defined(CONFIG_senTec) || defined(CONFIG_SNEHA) || defined(CONFIG_AVNET) + defined(CONFIG_senTec) || defined(CONFIG_SNEHA) || defined(CONFIG_AVNET) || \ + defined(CONFIG_SAVANT) #define CONSOLE_BAUD_RATE 19200 #define DEFAULT_CBAUD B19200 #endif @@ -324,7 +325,7 @@ static void mcfrs_start(struct tty_struc * ----------------------------------------------------------------------- */ -static inline void receive_chars(struct mcf_serial *info) +static noinline void receive_chars(struct mcf_serial *info) { volatile unsigned char *uartp; struct tty_struct *tty = info->tty; @@ -369,7 +370,7 @@ static inline void receive_chars(struct return; } -static inline void transmit_chars(struct mcf_serial *info) +static noinline void transmit_chars(struct mcf_serial *info) { volatile unsigned char *uartp; @@ -1489,14 +1490,28 @@ int mcfrs_open(struct tty_struct *tty, s /* * Based on the line number set up the internal interrupt stuff. */ -static void mcfrs_irqinit(struct mcf_serial *info) +static int mcfrs_irqinit(struct mcf_serial *info) { + volatile unsigned char *uartp; + int ret; + + uartp = info->addr; + /* Clear mask, so no surprise interrupts. 
*/ + uartp[MCFUART_UIMR] = 0; + + ret = request_irq(info->irq, mcfrs_interrupt, IRQF_DISABLED, + "ColdFire UART", NULL); + if (ret) { + printk("MCFRS: Unable to attach ColdFire UART %d interrupt " + "vector=%d, error: %d\n", info->line, + info->irq, ret); + return ret; + } + #if defined(CONFIG_M5272) volatile unsigned long *icrp; volatile unsigned long *portp; - volatile unsigned char *uartp; - uartp = info->addr; icrp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_ICR2); switch (info->line) { @@ -1518,11 +1533,10 @@ static void mcfrs_irqinit(struct mcf_ser portp = (volatile unsigned long *) (MCF_MBAR + MCFSIM_PDCNT); *portp = (*portp & ~0x000003fc) | 0x000002a8; #elif defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) - volatile unsigned char *icrp, *uartp; +#if !defined(CONFIG_M523x) + volatile unsigned char *icrp; volatile unsigned long *imrp; - uartp = info->addr; - icrp = (volatile unsigned char *) (MCF_MBAR + MCFICM_INTC0 + MCFINTC_ICR0 + MCFINT_UART0 + info->line); *icrp = 0x30 + info->line; /* level 6, line based priority */ @@ -1530,6 +1544,14 @@ static void mcfrs_irqinit(struct mcf_ser imrp = (volatile unsigned long *) (MCF_MBAR + MCFICM_INTC0 + MCFINTC_IMRL); *imrp &= ~((1 << (info->irq - MCFINT_VECBASE)) | 1); +#endif +#if defined(CONFIG_M523x) + { + volatile unsigned short *par_uartp; + par_uartp = (volatile unsigned short *) (MCF_MBAR + MCF523x_GPIO_PAR_UART); + *par_uartp = 0x3FFF; /* setup GPIO for UART0, UART1 & UART2 */ + } +#endif #if defined(CONFIG_M527x) { /* @@ -1554,37 +1576,38 @@ static void mcfrs_irqinit(struct mcf_ser } #endif #elif defined(CONFIG_M520x) - volatile unsigned char *icrp, *uartp; - volatile unsigned long *imrp; - - uartp = info->addr; - - icrp = (volatile unsigned char *) (MCF_MBAR + MCFICM_INTC0 + - MCFINTC_ICR0 + MCFINT_UART0 + info->line); - *icrp = 0x03; + { + volatile unsigned char *icrp; + volatile unsigned long *imrp; - imrp = (volatile unsigned long *) (MCF_MBAR + MCFICM_INTC0 + - MCFINTC_IMRL); - *imrp &= ~((1 << (info->irq - MCFINT_VECBASE)) | 1); - if (info->line < 2) { - unsigned short *uart_par; - uart_par = (unsigned short *)(MCF_IPSBAR + MCF_GPIO_PAR_UART); - if (info->line == 0) - *uart_par |= MCF_GPIO_PAR_UART_PAR_UTXD0 - | MCF_GPIO_PAR_UART_PAR_URXD0; - else if (info->line == 1) - *uart_par |= MCF_GPIO_PAR_UART_PAR_UTXD1 - | MCF_GPIO_PAR_UART_PAR_URXD1; + icrp = (volatile unsigned char *) (MCF_MBAR + MCFICM_INTC0 + + MCFINTC_ICR0 + MCFINT_UART0 + info->line); + *icrp = 0x03; + + imrp = (volatile unsigned long *) (MCF_MBAR + MCFICM_INTC0 + + MCFINTC_IMRL); + *imrp &= ~((1 << (info->irq - MCFINT_VECBASE)) | 1); + if (info->line < 2) { + unsigned short *uart_par; + uart_par = (unsigned short *)(MCF_IPSBAR + + MCF_GPIO_PAR_UART); + if (info->line == 0) + *uart_par |= MCF_GPIO_PAR_UART_PAR_UTXD0 + | MCF_GPIO_PAR_UART_PAR_URXD0; + else if (info->line == 1) + *uart_par |= MCF_GPIO_PAR_UART_PAR_UTXD1 + | MCF_GPIO_PAR_UART_PAR_URXD1; } else if (info->line == 2) { unsigned char *feci2c_par; - feci2c_par = (unsigned char *)(MCF_IPSBAR + MCF_GPIO_PAR_FECI2C); + feci2c_par = (unsigned char *)(MCF_IPSBAR + + MCF_GPIO_PAR_FECI2C); *feci2c_par &= ~0x0F; *feci2c_par |= MCF_GPIO_PAR_FECI2C_PAR_SCL_UTXD2 - | MCF_GPIO_PAR_FECI2C_PAR_SDA_URXD2; + | MCF_GPIO_PAR_FECI2C_PAR_SDA_URXD2; } + } #elif defined(CONFIG_M532x) - volatile unsigned char *uartp; - uartp = info->addr; + switch (info->line) { case 0: MCF_INTC0_ICR26 = 0x3; @@ -1605,7 +1628,6 @@ static void mcfrs_irqinit(struct mcf_ser break; } #else - volatile unsigned char *icrp, 
*uartp; switch (info->line) { case 0: @@ -1623,23 +1645,12 @@ static void mcfrs_irqinit(struct mcf_ser default: printk("MCFRS: don't know how to handle UART %d interrupt?\n", info->line); - return; + return -ENODEV; } - uartp = info->addr; uartp[MCFUART_UIVR] = info->irq; #endif - - /* Clear mask, so no surprise interrupts. */ - uartp[MCFUART_UIMR] = 0; - - if (request_irq(info->irq, mcfrs_interrupt, IRQF_DISABLED, - "ColdFire UART", NULL)) { - printk("MCFRS: Unable to attach ColdFire UART %d interrupt " - "vector=%d\n", info->line, info->irq); - } - - return; + return 0; } @@ -1729,7 +1740,6 @@ static int __init mcfrs_init(void) { struct mcf_serial *info; - unsigned long flags; int i; /* Setup base handler, and timer table. */ @@ -1769,12 +1779,12 @@ mcfrs_init(void) return(-EBUSY); } - local_irq_save(flags); - /* * Configure all the attached serial ports. */ for (i = 0, info = mcfrs_table; (i < NR_PORTS); i++, info++) { + int ret; + info->magic = SERIAL_MAGIC; info->line = i; info->tty = 0; @@ -1792,14 +1802,11 @@ mcfrs_init(void) info->imr = 0; mcfrs_setsignals(info, 0, 0); - mcfrs_irqinit(info); - - printk("ttyS%d at 0x%04x (irq = %d)", info->line, - (unsigned int) info->addr, info->irq); - printk(" is a builtin ColdFire UART\n"); + ret = mcfrs_irqinit(info); + if (!ret) + printk("ttyS%d at 0x%p (irq = %d) is a builtin " + "ColdFire UART\n", info->line, info->addr, info->irq); } - - local_irq_restore(flags); return 0; } Index: linux-2.6.24.7/fs/nfs/file.c =================================================================== --- linux-2.6.24.7.orig/fs/nfs/file.c +++ linux-2.6.24.7/fs/nfs/file.c @@ -64,7 +64,11 @@ const struct file_operations nfs_file_op .write = do_sync_write, .aio_read = nfs_file_read, .aio_write = nfs_file_write, +#ifdef CONFIG_MMU .mmap = nfs_file_mmap, +#else + .mmap = generic_file_mmap, +#endif .open = nfs_file_open, .flush = nfs_file_flush, .release = nfs_file_release, Index: linux-2.6.24.7/include/asm-generic/vmlinux.lds.h =================================================================== --- linux-2.6.24.7.orig/include/asm-generic/vmlinux.lds.h +++ linux-2.6.24.7/include/asm-generic/vmlinux.lds.h @@ -6,6 +6,10 @@ #define VMLINUX_SYMBOL(_sym_) _sym_ #endif +#ifndef OUTPUT_DATA_SECTION +#define OUTPUT_DATA_SECTION +#endif + /* Align . to a 8 byte boundary equals to maximum function alignment. */ #define ALIGN_FUNCTION() . 
= ALIGN(8) @@ -25,11 +29,11 @@ *(.rodata) *(.rodata.*) \ *(__vermagic) /* Kernel version magic */ \ *(__markers_strings) /* Markers: strings */ \ - } \ + } OUTPUT_DATA_SECTION \ \ .rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \ *(.rodata1) \ - } \ + } OUTPUT_DATA_SECTION \ \ /* PCI quirks */ \ .pci_fixup : AT(ADDR(.pci_fixup) - LOAD_OFFSET) { \ @@ -48,89 +52,89 @@ VMLINUX_SYMBOL(__start_pci_fixups_resume) = .; \ *(.pci_fixup_resume) \ VMLINUX_SYMBOL(__end_pci_fixups_resume) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* RapidIO route ops */ \ .rio_route : AT(ADDR(.rio_route) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start_rio_route_ops) = .; \ *(.rio_route_ops) \ VMLINUX_SYMBOL(__end_rio_route_ops) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: Normal symbols */ \ __ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___ksymtab) = .; \ *(__ksymtab) \ VMLINUX_SYMBOL(__stop___ksymtab) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-only symbols */ \ __ksymtab_gpl : AT(ADDR(__ksymtab_gpl) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___ksymtab_gpl) = .; \ *(__ksymtab_gpl) \ VMLINUX_SYMBOL(__stop___ksymtab_gpl) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: Normal unused symbols */ \ __ksymtab_unused : AT(ADDR(__ksymtab_unused) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___ksymtab_unused) = .; \ *(__ksymtab_unused) \ VMLINUX_SYMBOL(__stop___ksymtab_unused) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-only unused symbols */ \ __ksymtab_unused_gpl : AT(ADDR(__ksymtab_unused_gpl) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___ksymtab_unused_gpl) = .; \ *(__ksymtab_unused_gpl) \ VMLINUX_SYMBOL(__stop___ksymtab_unused_gpl) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-future-only symbols */ \ __ksymtab_gpl_future : AT(ADDR(__ksymtab_gpl_future) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___ksymtab_gpl_future) = .; \ *(__ksymtab_gpl_future) \ VMLINUX_SYMBOL(__stop___ksymtab_gpl_future) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: Normal symbols */ \ __kcrctab : AT(ADDR(__kcrctab) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___kcrctab) = .; \ *(__kcrctab) \ VMLINUX_SYMBOL(__stop___kcrctab) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-only symbols */ \ __kcrctab_gpl : AT(ADDR(__kcrctab_gpl) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___kcrctab_gpl) = .; \ *(__kcrctab_gpl) \ VMLINUX_SYMBOL(__stop___kcrctab_gpl) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: Normal unused symbols */ \ __kcrctab_unused : AT(ADDR(__kcrctab_unused) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___kcrctab_unused) = .; \ *(__kcrctab_unused) \ VMLINUX_SYMBOL(__stop___kcrctab_unused) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-only unused symbols */ \ __kcrctab_unused_gpl : AT(ADDR(__kcrctab_unused_gpl) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___kcrctab_unused_gpl) = .; \ *(__kcrctab_unused_gpl) \ VMLINUX_SYMBOL(__stop___kcrctab_unused_gpl) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: GPL-future-only symbols */ \ __kcrctab_gpl_future : AT(ADDR(__kcrctab_gpl_future) - LOAD_OFFSET) { \ VMLINUX_SYMBOL(__start___kcrctab_gpl_future) = .; \ *(__kcrctab_gpl_future) \ VMLINUX_SYMBOL(__stop___kcrctab_gpl_future) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Kernel symbol table: strings */ \ __ksymtab_strings : AT(ADDR(__ksymtab_strings) - LOAD_OFFSET) { \ *(__ksymtab_strings) \ - } \ + } OUTPUT_DATA_SECTION \ \ /* Built-in module parameters. 
*/ \ __param : AT(ADDR(__param) - LOAD_OFFSET) { \ @@ -138,7 +142,7 @@ *(__param) \ VMLINUX_SYMBOL(__stop___param) = .; \ VMLINUX_SYMBOL(__end_rodata) = .; \ - } \ + } OUTPUT_DATA_SECTION \ \ . = ALIGN((align)); @@ -227,7 +231,7 @@ __start___bug_table = .; \ *(__bug_table) \ __stop___bug_table = .; \ - } + } OUTPUT_DATA_SECTION #define NOTES \ .notes : AT(ADDR(.notes) - LOAD_OFFSET) { \ @@ -261,5 +265,5 @@ .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \ *(.data.percpu) \ *(.data.percpu.shared_aligned) \ - } \ + } OUTPUT_DATA_SECTION \ __per_cpu_end = .; Index: linux-2.6.24.7/include/asm-m68knommu/bitops.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/bitops.h +++ linux-2.6.24.7/include/asm-m68knommu/bitops.h @@ -14,8 +14,38 @@ #error only <linux/bitops.h> can be included directly #endif +#if defined (__mcfisaaplus__) || defined (__mcfisac__) +static inline int ffs(unsigned int val) +{ + if (!val) + return 0; + + asm volatile( + "bitrev %0\n\t" + "ff1 %0\n\t" + : "=d" (val) + : "0" (val) + ); + val++; + return val; +} + +static inline int __ffs(unsigned int val) +{ + asm volatile( + "bitrev %0\n\t" + "ff1 %0\n\t" + : "=d" (val) + : "0" (val) + ); + return val; +} + +#else #include <asm-generic/bitops/ffs.h> #include <asm-generic/bitops/__ffs.h> +#endif + #include <asm-generic/bitops/sched.h> #include <asm-generic/bitops/ffz.h> Index: linux-2.6.24.7/include/asm-m68knommu/byteorder.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/byteorder.h +++ linux-2.6.24.7/include/asm-m68knommu/byteorder.h @@ -1,13 +1,27 @@ #ifndef _M68KNOMMU_BYTEORDER_H #define _M68KNOMMU_BYTEORDER_H -#include <asm/types.h> +#include <linux/types.h> #if defined(__GNUC__) && !defined(__STRICT_ANSI__) || defined(__KERNEL__) # define __BYTEORDER_HAS_U64__ # define __SWAB_64_THRU_32__ #endif +#if defined (__mcfisaaplus__) || defined (__mcfisac__) +static inline __attribute_const__ __u32 ___arch__swab32(__u32 val) +{ + asm( + "byterev %0" + : "=d" (val) + : "0" (val) + ); + return val; +} + +#define __arch__swab32(x) ___arch__swab32(x) +#endif + #include <linux/byteorder/big_endian.h> #endif /* _M68KNOMMU_BYTEORDER_H */ Index: linux-2.6.24.7/include/asm-m68knommu/cacheflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/cacheflush.h +++ linux-2.6.24.7/include/asm-m68knommu/cacheflush.h @@ -53,7 +53,7 @@ static inline void __flush_cache_all(voi #endif /* CONFIG_M5407 */ #if defined(CONFIG_M527x) || defined(CONFIG_M528x) __asm__ __volatile__ ( - "movel #0x81400100, %%d0\n\t" + "movel #0x81000200, %%d0\n\t" "movec %%d0, %%CACR\n\t" "nop\n\t" : : : "d0" ); Index: linux-2.6.24.7/include/asm-m68knommu/commproc.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/commproc.h +++ linux-2.6.24.7/include/asm-m68knommu/commproc.h @@ -519,25 +519,6 @@ typedef struct scc_enet { #define SICR_ENET_CLKRT ((uint)0x00002c00) #endif -#ifdef CONFIG_RPXCLASSIC -/* Bits in parallel I/O port registers that have to be set/cleared - * to configure the pins for SCC1 use. 
- */ -#define PA_ENET_RXD ((ushort)0x0001) -#define PA_ENET_TXD ((ushort)0x0002) -#define PA_ENET_TCLK ((ushort)0x0200) -#define PA_ENET_RCLK ((ushort)0x0800) -#define PB_ENET_TENA ((uint)0x00001000) -#define PC_ENET_CLSN ((ushort)0x0010) -#define PC_ENET_RENA ((ushort)0x0020) - -/* Control bits in the SICR to route TCLK (CLK2) and RCLK (CLK4) to - * SCC1. Also, make sure GR1 (bit 24) and SC1 (bit 25) are zero. - */ -#define SICR_ENET_MASK ((uint)0x000000ff) -#define SICR_ENET_CLKRT ((uint)0x0000003d) -#endif - /* SCC Event register as used by Ethernet. */ #define SCCE_ENET_GRA ((ushort)0x0080) /* Graceful stop complete */ Index: linux-2.6.24.7/include/asm-m68knommu/dma.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/dma.h +++ linux-2.6.24.7/include/asm-m68knommu/dma.h @@ -35,7 +35,8 @@ /* * Set number of channels of DMA on ColdFire for different implementations. */ -#if defined(CONFIG_M5249) || defined(CONFIG_M5307) || defined(CONFIG_M5407) +#if defined(CONFIG_M5249) || defined(CONFIG_M5307) || defined(CONFIG_M5407) || \ + defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) #define MAX_M68K_DMA_CHANNELS 4 #elif defined(CONFIG_M5272) #define MAX_M68K_DMA_CHANNELS 1 Index: linux-2.6.24.7/include/asm-m68knommu/m523xsim.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/m523xsim.h +++ linux-2.6.24.7/include/asm-m68knommu/m523xsim.h @@ -11,7 +11,6 @@ #define m523xsim_h /****************************************************************************/ - /* * Define the 523x SIM register set addresses. */ @@ -27,10 +26,35 @@ #define MCFINTC_IACKL 0x19 /* */ #define MCFINTC_ICR0 0x40 /* Base ICR register */ +/* INTC0 - interrupt numbers */ #define MCFINT_VECBASE 64 /* Vector base number */ -#define MCFINT_UART0 13 /* Interrupt number for UART0 */ -#define MCFINT_PIT1 36 /* Interrupt number for PIT1 */ -#define MCFINT_QSPI 18 /* Interrupt number for QSPI */ +#define MCFINT_EPF4 4 /* EPORT4 */ +#define MCFINT_EPF5 5 /* EPORT5 */ +#define MCFINT_EPF6 6 /* EPORT6 */ +#define MCFINT_EPF7 7 /* EPORT7 */ +#define MCFINT_UART0 13 /* UART0 */ +#define MCFINT_QSPI 18 /* QSPI */ +#define MCFINT_PIT1 36 /* PIT1 */ +#define MCFINT_PER_INTC 64 + +/* INTC1 - interrupt numbers */ +#define MCFINT_INTC1_VECBASE (MCFINT_VECBASE + MCFINT_PER_INTC) +#define MCFINT_TC0F 27 /* eTPU Channel 0 */ +#define MCFINT_TC1F 28 /* eTPU Channel 1 */ +#define MCFINT_TC2F 29 /* eTPU Channel 2 */ +#define MCFINT_TC3F 30 /* eTPU Channel 3 */ +#define MCFINT_TC4F 31 /* eTPU Channel 4 */ +#define MCFINT_TC5F 32 /* eTPU Channel 5 */ +#define MCFINT_TC6F 33 /* eTPU Channel 6 */ +#define MCFINT_TC7F 34 /* eTPU Channel 7 */ +#define MCFINT_TC8F 35 /* eTPU Channel 8 */ +#define MCFINT_TC9F 36 /* eTPU Channel 9 */ +#define MCFINT_TC10F 37 /* eTPU Channel 10 */ +#define MCFINT_TC11F 38 /* eTPU Channel 11 */ +#define MCFINT_TC12F 39 /* eTPU Channel 12 */ +#define MCFINT_TC13F 40 /* eTPU Channel 13 */ +#define MCFINT_TC14F 41 /* eTPU Channel 14 */ +#define MCFINT_TC15F 42 /* eTPU Channel 15 */ /* * SDRAM configuration registers. 
@@ -41,5 +65,120 @@ #define MCFSIM_DACR1 0x50 /* SDRAM base address 1 */ #define MCFSIM_DMR1 0x54 /* SDRAM address mask 1 */ +/* + * GPIO Registers and Pin Assignments + */ +#define MCF_GPIO_PAR_FECI2C 0x100047 /* FEC Pin Assignment reg */ +#define MCF523x_GPIO_PAR_UART 0x100048 /* UART Pin Assignment reg */ +#define MCF523x_GPIO_PAR_QSPI 0x10004a /* QSPI Pin Assignment reg */ +#define MCF523x_GPIO_PAR_TIMER 0x10004c /* TIMER Pin Assignment reg */ +#define MCF523x_GPIO_PDDR_QSPI 0x10001a /* QSPI Pin Direction reg */ +#define MCF523x_GPIO_PDDR_TIMER 0x10001b /* TIMER Pin Direction reg */ +#define MCF523x_GPIO_PPDSDR_QSPI 0x10002a /* QSPI Pin Data reg */ +#define MCF523x_GPIO_PPDSDR_TIMER 0x10002b /* TIMER Pin Data reg */ + +#define MCF_GPIO_PAR_FECI2C_PAR_SDA(x) (((x) & 0x03) << 0) +#define MCF_GPIO_PAR_FECI2C_PAR_SCL(x) (((x) & 0x03) << 2) + +/* + * eTPU Registers + */ +#define MCF523x_ETPU 0x1d0000 /* eTPU Base */ +#define MCF523x_ETPU_CIOSR 0x00220 /* eTPU Intr Overflow Status */ +#define MCF523x_ETPU_CIER 0x00240 /* eTPU Intr Enable */ +#define MCF523x_ETPU_CR(c) (0x00400 + ((c) * 0x10)) /* eTPU c Config */ +#define MCF523x_ETPU_SCR(c) (0x00404 + ((c) * 0x10)) /* eTPU c Status & Ctrl */ +#define MCF523x_ETPU_SDM 0x08000 /* eTPU Shared Data Memory */ + +/* + * WDOG registers + */ +#define MCF523x_WCR ((volatile uint16_t *) (MCF_IPSBAR + 0x140000)) /* control register 16 bits */ +#define MCF523x_WMR ((volatile uint16_t *) (MCF_IPSBAR + 0x140002)) /* modulus status 16 bits */ +#define MCF523x_MCNTR ((volatile uint16_t *) (MCF_IPSBAR + 0x140004)) /* count register 16 bits */ +#define MCF523x_WSR ((volatile uint16_t *) (MCF_IPSBAR + 0x140006)) /* service register 16 bits */ + +/* + * Reset registers + */ +#define MCF523x_RSR ((volatile uint8_t *) (MCF_IPSBAR + 0x110001)) /* reset reason codes */ + +/* + * WDOG bit level definitions and macros. 
+ */ +#define MCF523x_WCR_ENABLE_BIT 0x0001 + +#define MCF523x_WCR_ENABLE 0x0001 +#define MCF523x_WCR_DISABLE 0x0000 +#define MCF523x_WCR_HALTEDSTOP 0x0002 +#define MCF523x_WCR_HALTEDRUN 0x0000 +#define MCF523x_WCR_DOZESTOP 0x0004 +#define MCF523x_WCR_DOZERUN 0x0000 +#define MCF523x_WCR_WAITSTOP 0x0008 +#define MCF523x_WCR_WAITRUN 0x0000 + +#define MCF523x_WMR_DEFAULT_VALUE 0xffff + +/* + * Inter-IC (I2C) Module + * Read/Write access macros for general use + */ +#define MCF_I2C_I2ADR ((volatile u8 *) (MCF_IPSBAR + 0x0300)) /* Address */ +#define MCF_I2C_I2FDR ((volatile u8 *) (MCF_IPSBAR + 0x0304)) /* Freq Divider */ +#define MCF_I2C_I2CR ((volatile u8 *) (MCF_IPSBAR + 0x0308)) /* Control */ +#define MCF_I2C_I2SR ((volatile u8 *) (MCF_IPSBAR + 0x030C)) /* Status */ +#define MCF_I2C_I2DR ((volatile u8 *) (MCF_IPSBAR + 0x0310)) /* Data I/O */ + +/* + * Bit level definitions and macros + */ +#define MCF_I2C_I2ADR_ADDR(x) (((x) & 0x7F) << 0x01) +#define MCF_I2C_I2FDR_IC(x) ((x) & 0x3F) + +#define MCF_I2C_I2CR_IEN 0x80 /* I2C enable */ +#define MCF_I2C_I2CR_IIEN 0x40 /* interrupt enable */ +#define MCF_I2C_I2CR_MSTA 0x20 /* master/slave mode */ +#define MCF_I2C_I2CR_MTX 0x10 /* transmit/receive mode */ +#define MCF_I2C_I2CR_TXAK 0x08 /* transmit acknowledge enable */ +#define MCF_I2C_I2CR_RSTA 0x04 /* repeat start */ + +#define MCF_I2C_I2SR_ICF 0x80 /* data transfer bit */ +#define MCF_I2C_I2SR_IAAS 0x40 /* I2C addressed as a slave */ +#define MCF_I2C_I2SR_IBB 0x20 /* I2C bus busy */ +#define MCF_I2C_I2SR_IAL 0x10 /* aribitration lost */ +#define MCF_I2C_I2SR_SRW 0x04 /* slave read/write */ +#define MCF_I2C_I2SR_IIF 0x02 /* I2C interrupt */ +#define MCF_I2C_I2SR_RXAK 0x01 /* received acknowledge */ + +/* + * Edge Port (EPORT) Module + */ +#define MCF523x_EPPAR 0x130000 +#define MCF523x_EPDDR 0x130002 +#define MCF523x_EPIER 0x130003 +#define MCF523x_EPDR 0x130004 +#define MCF523x_EPPDR 0x130005 +#define MCF523x_EPFR 0x130006 + +/* + * Chip Select (CS) Module + */ +#define MCF523x_CSAR0 0x80 +#define MCF523x_CSAR3 0xA4 +#define MCF523x_CSMR3 0xA8 + +/* + * System Access Control Unit (SACU) + */ +#define MCF523x_PACR1 0x25 +#define MCF523x_PACR2 0x26 +#define MCF523x_PACR3 0x27 +#define MCF523x_PACR4 0x28 +#define MCF523x_PACR5 0x2A +#define MCF523x_PACR6 0x2B +#define MCF523x_PACR7 0x2C +#define MCF523x_PACR8 0x2E +#define MCF523x_GPACR 0x30 + /****************************************************************************/ #endif /* m523xsim_h */ Index: linux-2.6.24.7/include/asm-m68knommu/m528xsim.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/m528xsim.h +++ linux-2.6.24.7/include/asm-m68knommu/m528xsim.h @@ -30,6 +30,9 @@ #define MCFINT_VECBASE 64 /* Vector base number */ #define MCFINT_UART0 13 /* Interrupt number for UART0 */ #define MCFINT_PIT1 55 /* Interrupt number for PIT1 */ +#define MCFINT_QSPI 18 /* Interrupt number for QSPI */ + +#define MCF5282_INTC0 (MCF_IPSBAR + MCFICM_INTC0) /* * SDRAM configuration registers. 
@@ -50,44 +53,53 @@ /* Port UA Pin Assignment Register (8 Bit) */ #define MCF5282_GPIO_PUAPAR 0x10005C +#define MCF5282_GPIO_PORTQS (*(volatile u8 *) (MCF_IPSBAR + 0x0010000D)) +#define MCF5282_GPIO_DDRQS (*(volatile u8 *) (MCF_IPSBAR + 0x00100021)) +#define MCF5282_GPIO_PORTQSP (*(volatile u8 *) (MCF_IPSBAR + 0x00100035)) +#define MCF5282_GPIO_PQSPAR (*(volatile u8 *) (MCF_IPSBAR + 0x00100059)) + +#define MCF5282_GPIO_PEPAR (*(volatile u16 *) (MCF_IPSBAR + 0x00100052)) + +#define MCF5282_GPIO_PORTE (*(volatile u8 *) (MCF_IPSBAR + 0x00100004)) +#define MCF5282_GPIO_DDRE (*(volatile u8 *) (MCF_IPSBAR + 0x00100018)) +#define MCF5282_GPIO_PORTEP (*(volatile u8 *) (MCF_IPSBAR + 0x0010002C)) + /* Interrupt Mask Register Register Low */ #define MCF5282_INTC0_IMRL (volatile u32 *) (MCF_IPSBAR + 0x0C0C) /* Interrupt Control Register 7 */ #define MCF5282_INTC0_ICR17 (volatile u8 *) (MCF_IPSBAR + 0x0C51) - - /********************************************************************* * * Inter-IC (I2C) Module * *********************************************************************/ /* Read/Write access macros for general use */ -#define MCF5282_I2C_I2ADR (volatile u8 *) (MCF_IPSBAR + 0x0300) // Address -#define MCF5282_I2C_I2FDR (volatile u8 *) (MCF_IPSBAR + 0x0304) // Freq Divider -#define MCF5282_I2C_I2CR (volatile u8 *) (MCF_IPSBAR + 0x0308) // Control -#define MCF5282_I2C_I2SR (volatile u8 *) (MCF_IPSBAR + 0x030C) // Status -#define MCF5282_I2C_I2DR (volatile u8 *) (MCF_IPSBAR + 0x0310) // Data I/O +#define MCF_I2C_I2ADR (volatile u8 *) (MCF_IPSBAR + 0x0300) // Address +#define MCF_I2C_I2FDR (volatile u8 *) (MCF_IPSBAR + 0x0304) // Freq Divider +#define MCF_I2C_I2CR (volatile u8 *) (MCF_IPSBAR + 0x0308) // Control +#define MCF_I2C_I2SR (volatile u8 *) (MCF_IPSBAR + 0x030C) // Status +#define MCF_I2C_I2DR (volatile u8 *) (MCF_IPSBAR + 0x0310) // Data I/O /* Bit level definitions and macros */ -#define MCF5282_I2C_I2ADR_ADDR(x) (((x)&0x7F)<<0x01) +#define MCF_I2C_I2ADR_ADDR(x) (((x)&0x7F)<<0x01) -#define MCF5282_I2C_I2FDR_IC(x) (((x)&0x3F)) +#define MCF_I2C_I2FDR_IC(x) (((x)&0x3F)) -#define MCF5282_I2C_I2CR_IEN (0x80) // I2C enable -#define MCF5282_I2C_I2CR_IIEN (0x40) // interrupt enable -#define MCF5282_I2C_I2CR_MSTA (0x20) // master/slave mode -#define MCF5282_I2C_I2CR_MTX (0x10) // transmit/receive mode -#define MCF5282_I2C_I2CR_TXAK (0x08) // transmit acknowledge enable -#define MCF5282_I2C_I2CR_RSTA (0x04) // repeat start - -#define MCF5282_I2C_I2SR_ICF (0x80) // data transfer bit -#define MCF5282_I2C_I2SR_IAAS (0x40) // I2C addressed as a slave -#define MCF5282_I2C_I2SR_IBB (0x20) // I2C bus busy -#define MCF5282_I2C_I2SR_IAL (0x10) // aribitration lost -#define MCF5282_I2C_I2SR_SRW (0x04) // slave read/write -#define MCF5282_I2C_I2SR_IIF (0x02) // I2C interrupt -#define MCF5282_I2C_I2SR_RXAK (0x01) // received acknowledge +#define MCF_I2C_I2CR_IEN (0x80) // I2C enable +#define MCF_I2C_I2CR_IIEN (0x40) // interrupt enable +#define MCF_I2C_I2CR_MSTA (0x20) // master/slave mode +#define MCF_I2C_I2CR_MTX (0x10) // transmit/receive mode +#define MCF_I2C_I2CR_TXAK (0x08) // transmit acknowledge enable +#define MCF_I2C_I2CR_RSTA (0x04) // repeat start + +#define MCF_I2C_I2SR_ICF (0x80) // data transfer bit +#define MCF_I2C_I2SR_IAAS (0x40) // I2C addressed as a slave +#define MCF_I2C_I2SR_IBB (0x20) // I2C bus busy +#define MCF_I2C_I2SR_IAL (0x10) // aribitration lost +#define MCF_I2C_I2SR_SRW (0x04) // slave read/write +#define MCF_I2C_I2SR_IIF (0x02) // I2C interrupt +#define MCF_I2C_I2SR_RXAK (0x01) 
// received acknowledge @@ -107,6 +119,11 @@ #define MCF5282_QSPI_QDR MCF_IPSBAR + 0x0354 #define MCF5282_QSPI_QCR MCF_IPSBAR + 0x0354 +#define MCF5282_QSPI_PAR (MCF_IPSBAR + 0x00100059) + +#define MCF5282_QSPI_IRQ_SOURCE 18 +#define MCF5282_QSPI_IRQ_VECTOR (64 + MCF5282_QSPI_IRQ_SOURCE) + /* Bit level definitions and macros */ #define MCF5282_QSPI_QMR_MSTR (0x8000) #define MCF5282_QSPI_QMR_DOHIE (0x4000) Index: linux-2.6.24.7/include/asm-m68knommu/m532xsim.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/m532xsim.h +++ linux-2.6.24.7/include/asm-m68knommu/m532xsim.h @@ -16,6 +16,7 @@ #define MCFINT_VECBASE 64 #define MCFINT_UART0 26 /* Interrupt number for UART0 */ #define MCFINT_UART1 27 /* Interrupt number for UART1 */ +#define MCFINT_UART2 28 /* Interrupt number for UART2 */ #define MCF_WTM_WCR MCF_REG16(0xFC098000) @@ -72,9 +73,21 @@ #define mcf_getimr() \ *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMR)) +#define mcf_getimrh() \ + *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMRH)) + +#define mcf_getimrl() \ + *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMRL)) + #define mcf_setimr(imr) \ *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMR)) = (imr); +#define mcf_setimrh(imr) \ + *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMRH)) = (imr); + +#define mcf_setimrl(imr) \ + *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IMRL)) = (imr); + #define mcf_getipr() \ *((volatile unsigned long *) (MCF_MBAR + MCFSIM_IPR)) @@ -131,31 +144,31 @@ *********************************************************************/ /* Read/Write access macros for general use */ -#define MCF532x_I2C_I2ADR (volatile u8 *) (0xFC058000) // Address -#define MCF532x_I2C_I2FDR (volatile u8 *) (0xFC058004) // Freq Divider -#define MCF532x_I2C_I2CR (volatile u8 *) (0xFC058008) // Control -#define MCF532x_I2C_I2SR (volatile u8 *) (0xFC05800C) // Status -#define MCF532x_I2C_I2DR (volatile u8 *) (0xFC058010) // Data I/O +#define MCF_I2C_I2ADR (volatile u8 *) (0xFC058000) /* Address */ +#define MCF_I2C_I2FDR (volatile u8 *) (0xFC058004) /* Freq Divider */ +#define MCF_I2C_I2CR (volatile u8 *) (0xFC058008) /* Control */ +#define MCF_I2C_I2SR (volatile u8 *) (0xFC05800C) /* Status */ +#define MCF_I2C_I2DR (volatile u8 *) (0xFC058010) /* Data I/O */ /* Bit level definitions and macros */ -#define MCF532x_I2C_I2ADR_ADDR(x) (((x)&0x7F)<<0x01) +#define MCF_I2C_I2ADR_ADDR(x) (((x)&0x7F)<<0x01) -#define MCF532x_I2C_I2FDR_IC(x) (((x)&0x3F)) +#define MCF_I2C_I2FDR_IC(x) (((x)&0x3F)) -#define MCF532x_I2C_I2CR_IEN (0x80) // I2C enable -#define MCF532x_I2C_I2CR_IIEN (0x40) // interrupt enable -#define MCF532x_I2C_I2CR_MSTA (0x20) // master/slave mode -#define MCF532x_I2C_I2CR_MTX (0x10) // transmit/receive mode -#define MCF532x_I2C_I2CR_TXAK (0x08) // transmit acknowledge enable -#define MCF532x_I2C_I2CR_RSTA (0x04) // repeat start - -#define MCF532x_I2C_I2SR_ICF (0x80) // data transfer bit -#define MCF532x_I2C_I2SR_IAAS (0x40) // I2C addressed as a slave -#define MCF532x_I2C_I2SR_IBB (0x20) // I2C bus busy -#define MCF532x_I2C_I2SR_IAL (0x10) // aribitration lost -#define MCF532x_I2C_I2SR_SRW (0x04) // slave read/write -#define MCF532x_I2C_I2SR_IIF (0x02) // I2C interrupt -#define MCF532x_I2C_I2SR_RXAK (0x01) // received acknowledge +#define MCF_I2C_I2CR_IEN (0x80) /* I2C enable */ +#define MCF_I2C_I2CR_IIEN (0x40) /* interrupt enable */ +#define MCF_I2C_I2CR_MSTA (0x20) /* master/slave mode */ +#define MCF_I2C_I2CR_MTX (0x10) /* transmit/receive mode */ 
+#define MCF_I2C_I2CR_TXAK (0x08) /* transmit acknowledge enable */ +#define MCF_I2C_I2CR_RSTA (0x04) /* repeat start */ + +#define MCF_I2C_I2SR_ICF (0x80) /* data transfer bit */ +#define MCF_I2C_I2SR_IAAS (0x40) /* I2C addressed as a slave */ +#define MCF_I2C_I2SR_IBB (0x20) /* I2C bus busy */ +#define MCF_I2C_I2SR_IAL (0x10) /* aribitration lost */ +#define MCF_I2C_I2SR_SRW (0x04) /* slave read/write */ +#define MCF_I2C_I2SR_IIF (0x02) /* I2C interrupt */ +#define MCF_I2C_I2SR_RXAK (0x01) /* received acknowledge */ #define MCF532x_PAR_FECI2C (volatile u8 *) (0xFC0A4053) @@ -2234,5 +2247,36 @@ #define MCF_EPORT_EPFR_EPF6 (0x40) #define MCF_EPORT_EPFR_EPF7 (0x80) +/********************************************************************* + * + * Cross-Bar Switch (XBS) + * + *********************************************************************/ +#define MCF_XBS_PRS1 MCF_REG32(0xFC004100) +#define MCF_XBS_CRS1 MCF_REG32(0xFC004110) +#define MCF_XBS_PRS4 MCF_REG32(0xFC004400) +#define MCF_XBS_CRS4 MCF_REG32(0xFC004410) +#define MCF_XBS_PRS6 MCF_REG32(0xFC004600) +#define MCF_XBS_CRS6 MCF_REG32(0xFC004610) +#define MCF_XBS_PRS7 MCF_REG32(0xFC004700) +#define MCF_XBS_CRS7 MCF_REG32(0xFC004710) + +#define MCF_XBS_PRIO_FACTTEST(x) (((x)&0x7) << 28) +#define MCF_XBS_PRIO_USBOTG(x) (((x)&0x7) << 24) +#define MCF_XBS_PRIO_USBHOST(x) (((x)&0x7) << 20) +#define MCF_XBS_PRIO_LCD(x) (((x)&0x7) << 16) +#define MCF_XBS_PRIO_FEC(x) (((x)&0x7) << 8) +#define MCF_XBS_PRIO_EDMA(x) (((x)&0x7) << 4) +#define MCF_XBS_PRIO_CORE(x) (((x)&0x7) << 0) + +#define MCF_PRIO_LVL_1 (0) +#define MCF_PRIO_LVL_2 (1) +#define MCF_PRIO_LVL_3 (2) +#define MCF_PRIO_LVL_4 (3) +#define MCF_PRIO_LVL_5 (4) +#define MCF_PRIO_LVL_6 (5) +#define MCF_PRIO_LVL_7 (6) + + /********************************************************************/ #endif /* m532xsim_h */ Index: linux-2.6.24.7/include/asm-m68knommu/mcfcache.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/mcfcache.h +++ linux-2.6.24.7/include/asm-m68knommu/mcfcache.h @@ -60,7 +60,7 @@ nop movel #0x0000c020, %d0 /* Set SDRAM cached only */ movec %d0, %ACR0 - movel #0xff00c000, %d0 /* Cache Flash also */ + movel #0x00000000, %d0 /* No other regions cached */ movec %d0, %ACR1 movel #0x80000200, %d0 /* Setup cache mask */ movec %d0, %CACR /* Enable cache */ Index: linux-2.6.24.7/include/asm-m68knommu/mcfuart.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/mcfuart.h +++ linux-2.6.24.7/include/asm-m68knommu/mcfuart.h @@ -12,7 +12,6 @@ #define mcfuart_h /****************************************************************************/ - /* * Define the base address of the UARTS within the MBAR address * space. 
@@ -33,7 +32,7 @@ #define MCFUART_BASE2 0x240 /* Base address of UART2 */ #define MCFUART_BASE3 0x280 /* Base address of UART3 */ #elif defined(CONFIG_M5249) || defined(CONFIG_M5307) || defined(CONFIG_M5407) -#if defined(CONFIG_NETtel) || defined(CONFIG_DISKtel) || defined(CONFIG_SECUREEDGEMP3) +#if defined(CONFIG_NETtel) || defined(CONFIG_SECUREEDGEMP3) #define MCFUART_BASE1 0x200 /* Base address of UART1 */ #define MCFUART_BASE2 0x1c0 /* Base address of UART2 */ #else Index: linux-2.6.24.7/mm/nommu.c =================================================================== --- linux-2.6.24.7.orig/mm/nommu.c +++ linux-2.6.24.7/mm/nommu.c @@ -952,6 +952,16 @@ unsigned long do_mmap_pgoff(struct file if (ret < 0) goto error; + /* + * If the driver implemented his own mmap(), the + * base addr could have changed. Therefor + * vm_end musst be updated to. + * + * See comment of DaveM in mm/mmap.c as reference + */ + if(addr != vma->vm_start) + vma->vm_end = vma->vm_start + len; + /* okay... we have a mapping; now we have to register it */ result = (void *) vma->vm_start; Index: linux-2.6.24.7/mm/page_alloc.c =================================================================== --- linux-2.6.24.7.orig/mm/page_alloc.c +++ linux-2.6.24.7/mm/page_alloc.c @@ -4317,6 +4317,14 @@ void *__init alloc_large_system_hash(con if (numentries > max) numentries = max; + /* + * we will allocate at least a page (even on low memory systems) + * so do a fixup here to ensure we utilise the space that will be + * allocated, this also prevents us reporting -ve orders + */ + if (bucketsize * numentries < PAGE_SIZE) + numentries = (PAGE_SIZE + bucketsize - 1) / bucketsize; + log2qty = ilog2(numentries); do { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0001-sched-count-of-queued-RT-tasks.patch���������������������������������������������������0000664�0000764�0000764�00000003736�11041657735�020430� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 1e52dc93cd65ee86b162c01342fdd68b41512292 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: count # of queued RT tasks This patch adds accounting to keep track of the number of RT tasks running on a runqueue. This information will be used in later patches. 
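A minimal, compile-only sketch of the accounting described above (rt_rq_model is an illustrative stand-in; the helper names mirror the ones added below, but this is not the kernel code): the counter goes up on enqueue, down on dequeue, and must never underflow.

    #include <assert.h>

    struct rt_rq_model {
            unsigned long rt_nr_running;    /* queued RT tasks on this runqueue */
    };

    static void inc_rt_tasks(struct rt_rq_model *rt)
    {
            rt->rt_nr_running++;
    }

    static void dec_rt_tasks(struct rt_rq_model *rt)
    {
            assert(rt->rt_nr_running);      /* mirrors the WARN_ON() in the patch */
            rt->rt_nr_running--;
    }

In the patch itself these hooks are called from enqueue_task_rt() and dequeue_task_rt(), as the diff shows.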
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 1 + kernel/sched_rt.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -266,6 +266,7 @@ struct rt_rq { struct rt_prio_array active; int rt_load_balance_idx; struct list_head *rt_load_balance_head, *rt_load_balance_curr; + unsigned long rt_nr_running; }; /* Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -26,12 +26,26 @@ static void update_curr_rt(struct rq *rq cpuacct_charge(curr, delta_exec); } +static inline void inc_rt_tasks(struct task_struct *p, struct rq *rq) +{ + WARN_ON(!rt_task(p)); + rq->rt.rt_nr_running++; +} + +static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq) +{ + WARN_ON(!rt_task(p)); + WARN_ON(!rq->rt.rt_nr_running); + rq->rt.rt_nr_running--; +} + static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup) { struct rt_prio_array *array = &rq->rt.active; list_add_tail(&p->run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); + inc_rt_tasks(p, rq); } /* @@ -46,6 +60,7 @@ static void dequeue_task_rt(struct rq *r list_del(&p->run_list); if (list_empty(array->queue + p->prio)) __clear_bit(p->prio, array->bitmap); + dec_rt_tasks(p, rq); } /* ����������������������������������patches/0002-sched-track-highest-prio-task-queued.patch���������������������������������������������0000664�0000764�0000764�00000004633�11041657734�021671� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 723ba8d72594ec8ebe985797f71c71345d4c6c64 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: track highest prio task queued This patch adds accounting to each runqueue to keep track of the highest prio task queued on the run queue. We only care about RT tasks, so if the run queue does not contain any active RT tasks its priority will be considered MAX_RT_PRIO. This information will be used for later patches. 
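As a rough, user-space model of the bookkeeping described above (rt_rq_model, nr_queued and the helpers are illustrative; the kernel recalculates with sched_find_first_bit() on the priority bitmap rather than keeping a per-priority count): lower numbers mean higher priority, and MAX_RT_PRIO stands for "no RT task queued".

    #define MAX_RT_PRIO     100

    struct rt_rq_model {
            unsigned long   rt_nr_running;
            unsigned int    nr_queued[MAX_RT_PRIO]; /* tasks queued per priority */
            int             highest_prio;           /* MAX_RT_PRIO when empty */
    };

    /* model assumes 0 <= prio < MAX_RT_PRIO for every queued task */
    static void enqueue_model(struct rt_rq_model *rt, int prio)
    {
            rt->nr_queued[prio]++;
            rt->rt_nr_running++;
            if (prio < rt->highest_prio)
                    rt->highest_prio = prio;
    }

    static void dequeue_model(struct rt_rq_model *rt, int prio)
    {
            rt->nr_queued[prio]--;
            rt->rt_nr_running--;
            if (!rt->rt_nr_running) {
                    rt->highest_prio = MAX_RT_PRIO;
            } else if (prio == rt->highest_prio && !rt->nr_queued[prio]) {
                    /* recalculate: scan for the next occupied priority */
                    while (!rt->nr_queued[rt->highest_prio])
                            rt->highest_prio++;
            }
    }

Only dequeueing the current highest-priority task forces a recalculation; every other case is a cheap compare or a reset.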
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 3 +++ kernel/sched_rt.c | 18 ++++++++++++++++++ 2 files changed, 21 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -267,6 +267,8 @@ struct rt_rq { int rt_load_balance_idx; struct list_head *rt_load_balance_head, *rt_load_balance_curr; unsigned long rt_nr_running; + /* highest queued rt task prio */ + int highest_prio; }; /* @@ -6838,6 +6840,7 @@ void __init sched_init(void) rq->cpu = i; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); + rq->rt.highest_prio = MAX_RT_PRIO; #endif atomic_set(&rq->nr_iowait, 0); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -30,6 +30,10 @@ static inline void inc_rt_tasks(struct t { WARN_ON(!rt_task(p)); rq->rt.rt_nr_running++; +#ifdef CONFIG_SMP + if (p->prio < rq->rt.highest_prio) + rq->rt.highest_prio = p->prio; +#endif /* CONFIG_SMP */ } static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq) @@ -37,6 +41,20 @@ static inline void dec_rt_tasks(struct t WARN_ON(!rt_task(p)); WARN_ON(!rq->rt.rt_nr_running); rq->rt.rt_nr_running--; +#ifdef CONFIG_SMP + if (rq->rt.rt_nr_running) { + struct rt_prio_array *array; + + WARN_ON(p->prio < rq->rt.highest_prio); + if (p->prio == rq->rt.highest_prio) { + /* recalculate */ + array = &rq->rt.active; + rq->rt.highest_prio = + sched_find_first_bit(array->bitmap); + } /* otherwise leave rq->highest prio alone */ + } else + rq->rt.highest_prio = MAX_RT_PRIO; +#endif /* CONFIG_SMP */ } static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup) �����������������������������������������������������������������������������������������������������patches/0003-sched-add-RT-task-pushing.patch��������������������������������������������������������0000664�0000764�0000764�00000022463�11041657734�017427� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 6d3e7dfa47b1d3ae3abd608e5d89b56cad72cccb Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: add RT task pushing This patch adds an algorithm to push extra RT tasks off a run queue to other CPU runqueues. When more than one RT task is added to a run queue, this algorithm takes an assertive approach to push the RT tasks that are not running onto other run queues that have lower priority. The way this works is that the highest RT task that is not running is looked at and we examine the runqueues on the CPUS for that tasks affinity mask. We find the runqueue with the lowest prio in the CPU affinity of the picked task, and if it is lower in prio than the picked task, we push the task onto that CPU runqueue. We continue pushing RT tasks off the current runqueue until we don't push any more. The algorithm stops when the next highest RT task can't preempt any other processes on other CPUS. 
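To make the selection step concrete, here is a simplified, lock-free user-space model of "find the lowest-priority runqueue within the task's affinity" (the bitmask affinity, rq_model and find_lowest_rq_model are illustrative only; the real code is find_lock_lowest_rq() in the diff below, which also handles locking and retries):

    #define NR_CPUS         4
    #define MAX_RT_PRIO     100

    struct rq_model {
            int highest_prio;       /* MAX_RT_PRIO if no RT task queued */
    };

    /*
     * Return the CPU whose runqueue holds the lowest-priority work that
     * a task of priority 'task_prio' would preempt, restricted to
     * 'cpus_allowed'; -1 if it cannot preempt anything, i.e. no push.
     */
    static int find_lowest_rq_model(const struct rq_model rq[NR_CPUS],
                                    int task_prio, unsigned int cpus_allowed,
                                    int this_cpu)
    {
            int cpu, lowest = -1;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                    if (cpu == this_cpu || !(cpus_allowed & (1u << cpu)))
                            continue;
                    if (rq[cpu].highest_prio <= task_prio)
                            continue;       /* would not preempt anything there */
                    if (lowest < 0 ||
                        rq[cpu].highest_prio > rq[lowest].highest_prio)
                            lowest = cpu;
            }
            return lowest;
    }

push_rt_task() then migrates the picked task to that runqueue and the loop repeats until no remaining candidate would preempt anything.
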
TODO: The algorithm may stop when there are still RT tasks that can be migrated. Specifically, if the highest non running RT task CPU affinity is restricted to CPUs that are running higher priority tasks, there may be a lower priority task queued that has an affinity with a CPU that is running a lower priority task that it could be migrated to. This patch set does not address this issue. Note: checkpatch reveals two over 80 character instances. I'm not sure that breaking them up will help visually, so I left them as is. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 8 + kernel/sched_rt.c | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 231 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1937,6 +1937,8 @@ static void finish_task_switch(struct rq prev_state = prev->state; finish_arch_switch(prev); finish_lock_switch(rq, prev); + schedule_tail_balance_rt(rq); + fire_sched_in_preempt_notifiers(current); if (mm) mmdrop(mm); @@ -2170,11 +2172,13 @@ static void double_rq_unlock(struct rq * /* * double_lock_balance - lock the busiest runqueue, this_rq is locked already. */ -static void double_lock_balance(struct rq *this_rq, struct rq *busiest) +static int double_lock_balance(struct rq *this_rq, struct rq *busiest) __releases(this_rq->lock) __acquires(busiest->lock) __acquires(this_rq->lock) { + int ret = 0; + if (unlikely(!irqs_disabled())) { /* printk() doesn't work good under rq->lock */ spin_unlock(&this_rq->lock); @@ -2185,9 +2189,11 @@ static void double_lock_balance(struct r spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); + ret = 1; } else spin_lock(&busiest->lock); } + return ret; } /* Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -133,6 +133,227 @@ static void put_prev_task_rt(struct rq * } #ifdef CONFIG_SMP +/* Only try algorithms three times */ +#define RT_MAX_TRIES 3 + +static int double_lock_balance(struct rq *this_rq, struct rq *busiest); +static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep); + +/* Return the second highest RT task, NULL otherwise */ +static struct task_struct *pick_next_highest_task_rt(struct rq *rq) +{ + struct rt_prio_array *array = &rq->rt.active; + struct task_struct *next; + struct list_head *queue; + int idx; + + assert_spin_locked(&rq->lock); + + if (likely(rq->rt.rt_nr_running < 2)) + return NULL; + + idx = sched_find_first_bit(array->bitmap); + if (unlikely(idx >= MAX_RT_PRIO)) { + WARN_ON(1); /* rt_nr_running is bad */ + return NULL; + } + + queue = array->queue + idx; + next = list_entry(queue->next, struct task_struct, run_list); + if (unlikely(next != rq->curr)) + return next; + + if (queue->next->next != queue) { + /* same prio task */ + next = list_entry(queue->next->next, struct task_struct, run_list); + return next; + } + + /* slower, but more flexible */ + idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); + if (unlikely(idx >= MAX_RT_PRIO)) { + WARN_ON(1); /* rt_nr_running was 2 and above! 
*/ + return NULL; + } + + queue = array->queue + idx; + next = list_entry(queue->next, struct task_struct, run_list); + + return next; +} + +static DEFINE_PER_CPU(cpumask_t, local_cpu_mask); + +/* Will lock the rq it finds */ +static struct rq *find_lock_lowest_rq(struct task_struct *task, + struct rq *this_rq) +{ + struct rq *lowest_rq = NULL; + int cpu; + int tries; + cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask); + + cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed); + + for (tries = 0; tries < RT_MAX_TRIES; tries++) { + /* + * Scan each rq for the lowest prio. + */ + for_each_cpu_mask(cpu, *cpu_mask) { + struct rq *rq = &per_cpu(runqueues, cpu); + + if (cpu == this_rq->cpu) + continue; + + /* We look for lowest RT prio or non-rt CPU */ + if (rq->rt.highest_prio >= MAX_RT_PRIO) { + lowest_rq = rq; + break; + } + + /* no locking for now */ + if (rq->rt.highest_prio > task->prio && + (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) { + lowest_rq = rq; + } + } + + if (!lowest_rq) + break; + + /* if the prio of this runqueue changed, try again */ + if (double_lock_balance(this_rq, lowest_rq)) { + /* + * We had to unlock the run queue. In + * the mean time, task could have + * migrated already or had its affinity changed. + * Also make sure that it wasn't scheduled on its rq. + */ + if (unlikely(task_rq(task) != this_rq || + !cpu_isset(lowest_rq->cpu, task->cpus_allowed) || + task_running(this_rq, task) || + !task->se.on_rq)) { + spin_unlock(&lowest_rq->lock); + lowest_rq = NULL; + break; + } + } + + /* If this rq is still suitable use it. */ + if (lowest_rq->rt.highest_prio > task->prio) + break; + + /* try again */ + spin_unlock(&lowest_rq->lock); + lowest_rq = NULL; + } + + return lowest_rq; +} + +/* + * If the current CPU has more than one RT task, see if the non + * running task can migrate over to a CPU that is running a task + * of lesser priority. + */ +static int push_rt_task(struct rq *this_rq) +{ + struct task_struct *next_task; + struct rq *lowest_rq; + int ret = 0; + int paranoid = RT_MAX_TRIES; + + assert_spin_locked(&this_rq->lock); + + next_task = pick_next_highest_task_rt(this_rq); + if (!next_task) + return 0; + + retry: + if (unlikely(next_task == this_rq->curr)) + return 0; + + /* + * It's possible that the next_task slipped in of + * higher priority than current. If that's the case + * just reschedule current. + */ + if (unlikely(next_task->prio < this_rq->curr->prio)) { + resched_task(this_rq->curr); + return 0; + } + + /* We might release this_rq lock */ + get_task_struct(next_task); + + /* find_lock_lowest_rq locks the rq if found */ + lowest_rq = find_lock_lowest_rq(next_task, this_rq); + if (!lowest_rq) { + struct task_struct *task; + /* + * find lock_lowest_rq releases this_rq->lock + * so it is possible that next_task has changed. + * If it has, then try again. + */ + task = pick_next_highest_task_rt(this_rq); + if (unlikely(task != next_task) && task && paranoid--) { + put_task_struct(next_task); + next_task = task; + goto retry; + } + goto out; + } + + assert_spin_locked(&lowest_rq->lock); + + deactivate_task(this_rq, next_task, 0); + set_task_cpu(next_task, lowest_rq->cpu); + activate_task(lowest_rq, next_task, 0); + + resched_task(lowest_rq->curr); + + spin_unlock(&lowest_rq->lock); + + ret = 1; +out: + put_task_struct(next_task); + + return ret; +} + +/* + * TODO: Currently we just use the second highest prio task on + * the queue, and stop when it can't migrate (or there's + * no more RT tasks). 
There may be a case where a lower + * priority RT task has a different affinity than the + * higher RT task. In this case the lower RT task could + * possibly be able to migrate where as the higher priority + * RT task could not. We currently ignore this issue. + * Enhancements are welcome! + */ +static void push_rt_tasks(struct rq *rq) +{ + /* push_rt_task will return true if it moved an RT */ + while (push_rt_task(rq)) + ; +} + +static void schedule_tail_balance_rt(struct rq *rq) +{ + /* + * If we have more than one rt_task queued, then + * see if we can push the other rt_tasks off to other CPUS. + * Note we may release the rq lock, and since + * the lock was owned by prev, we need to release it + * first via finish_lock_switch and then reaquire it here. + */ + if (unlikely(rq->rt.rt_nr_running > 1)) { + spin_lock_irq(&rq->lock); + push_rt_tasks(rq); + spin_unlock_irq(&rq->lock); + } +} + /* * Load-balancing iterator. Note: while the runqueue stays locked * during the whole iteration, the current task might be @@ -237,7 +458,9 @@ move_one_task_rt(struct rq *this_rq, int return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle, &rt_rq_iterator); } -#endif +#else /* CONFIG_SMP */ +# define schedule_tail_balance_rt(rq) do { } while (0) +#endif /* CONFIG_SMP */ static void task_tick_rt(struct rq *rq, struct task_struct *p) { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0004-sched-add-rt-overload-tracking.patch���������������������������������������������������0000664�0000764�0000764�00000004253�11041657735�020524� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 957d2c0535a8ea69f7f625a97ba64a3af1070c01 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: add rt-overload tracking This patch adds an RT overload accounting system. When a runqueue has more than one RT task queued, it is marked as overloaded. That is that it is a candidate to have RT tasks pulled from it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -3,6 +3,38 @@ * policies) */ +#ifdef CONFIG_SMP +static cpumask_t rt_overload_mask; +static atomic_t rto_count; +static inline int rt_overloaded(void) +{ + return atomic_read(&rto_count); +} +static inline cpumask_t *rt_overload(void) +{ + return &rt_overload_mask; +} +static inline void rt_set_overload(struct rq *rq) +{ + cpu_set(rq->cpu, rt_overload_mask); + /* + * Make sure the mask is visible before we set + * the overload count. That is checked to determine + * if we should look at the mask. It would be a shame + * if we looked at the mask, but the mask was not + * updated yet. 
+ */ + wmb(); + atomic_inc(&rto_count); +} +static inline void rt_clear_overload(struct rq *rq) +{ + /* the order here really doesn't matter */ + atomic_dec(&rto_count); + cpu_clear(rq->cpu, rt_overload_mask); +} +#endif /* CONFIG_SMP */ + /* * Update the current task's runtime statistics. Skip current tasks that * are not in our scheduling class. @@ -33,6 +65,8 @@ static inline void inc_rt_tasks(struct t #ifdef CONFIG_SMP if (p->prio < rq->rt.highest_prio) rq->rt.highest_prio = p->prio; + if (rq->rt.rt_nr_running > 1) + rt_set_overload(rq); #endif /* CONFIG_SMP */ } @@ -54,6 +88,8 @@ static inline void dec_rt_tasks(struct t } /* otherwise leave rq->highest prio alone */ } else rq->rt.highest_prio = MAX_RT_PRIO; + if (rq->rt.rt_nr_running < 2) + rt_clear_overload(rq); #endif /* CONFIG_SMP */ } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0005-sched-pull-RT-tasks-from-overloaded-runqueues.patch������������������������������������0000664�0000764�0000764�00000017431�11041657733�023500� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dffbb24fa9f221779f42ad2a7633608a7d6a2148 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: pull RT tasks from overloaded runqueues This patch adds the algorithm to pull tasks from RT overloaded runqueues. When a pull RT is initiated, all overloaded runqueues are examined for a RT task that is higher in prio than the highest prio task queued on the target runqueue. If another runqueue holds a RT task that is of higher prio than the highest prio task on the target runqueue is found it is pulled to the target runqueue. 
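The pull test described above reduces to a priority comparison; a hedged, user-space illustration (worth_pulling, rq_model and src_pullable_prio are made-up names, and the real pull_rt_task() below additionally deals with lock ordering, re-picking the local next task and clearing stale overload bits):

    #include <stdbool.h>

    struct rq_model {
            unsigned long   rt_nr_running;  /* queued RT tasks */
            int             highest_prio;   /* best queued prio, lower = higher */
    };

    /*
     * 'src_pullable_prio' stands for the priority of the best task on the
     * source runqueue that is not running and may run on this CPU (what
     * pick_next_highest_task_rt() returns in the real code).
     */
    static bool worth_pulling(const struct rq_model *this_rq,
                              const struct rq_model *src_rq,
                              int src_pullable_prio)
    {
            if (src_rq->rt_nr_running <= 1)
                    return false;           /* nothing spare on the source */
            return src_pullable_prio < this_rq->highest_prio;
    }
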
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 2 kernel/sched_rt.c | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 178 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3706,6 +3706,8 @@ need_resched_nonpreemptible: switch_count = &prev->nvcsw; } + schedule_balance_rt(rq, prev); + if (unlikely(!rq->nr_running)) idle_balance(cpu, rq); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -175,8 +175,17 @@ static void put_prev_task_rt(struct rq * static int double_lock_balance(struct rq *this_rq, struct rq *busiest); static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep); +static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) +{ + if (!task_running(rq, p) && + (cpu < 0 || cpu_isset(cpu, p->cpus_allowed))) + return 1; + return 0; +} + /* Return the second highest RT task, NULL otherwise */ -static struct task_struct *pick_next_highest_task_rt(struct rq *rq) +static struct task_struct *pick_next_highest_task_rt(struct rq *rq, + int cpu) { struct rt_prio_array *array = &rq->rt.active; struct task_struct *next; @@ -195,26 +204,36 @@ static struct task_struct *pick_next_hig } queue = array->queue + idx; + BUG_ON(list_empty(queue)); + next = list_entry(queue->next, struct task_struct, run_list); - if (unlikely(next != rq->curr)) - return next; + if (unlikely(pick_rt_task(rq, next, cpu))) + goto out; if (queue->next->next != queue) { /* same prio task */ next = list_entry(queue->next->next, struct task_struct, run_list); - return next; + if (pick_rt_task(rq, next, cpu)) + goto out; } + retry: /* slower, but more flexible */ idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); - if (unlikely(idx >= MAX_RT_PRIO)) { - WARN_ON(1); /* rt_nr_running was 2 and above! */ + if (unlikely(idx >= MAX_RT_PRIO)) return NULL; - } queue = array->queue + idx; - next = list_entry(queue->next, struct task_struct, run_list); + BUG_ON(list_empty(queue)); + + list_for_each_entry(next, queue, run_list) { + if (pick_rt_task(rq, next, cpu)) + goto out; + } + + goto retry; + out: return next; } @@ -301,13 +320,15 @@ static int push_rt_task(struct rq *this_ assert_spin_locked(&this_rq->lock); - next_task = pick_next_highest_task_rt(this_rq); + next_task = pick_next_highest_task_rt(this_rq, -1); if (!next_task) return 0; retry: - if (unlikely(next_task == this_rq->curr)) + if (unlikely(next_task == this_rq->curr)) { + WARN_ON(1); return 0; + } /* * It's possible that the next_task slipped in of @@ -331,7 +352,7 @@ static int push_rt_task(struct rq *this_ * so it is possible that next_task has changed. * If it has, then try again. 
*/ - task = pick_next_highest_task_rt(this_rq); + task = pick_next_highest_task_rt(this_rq, -1); if (unlikely(task != next_task) && task && paranoid--) { put_task_struct(next_task); next_task = task; @@ -374,6 +395,149 @@ static void push_rt_tasks(struct rq *rq) ; } +static int pull_rt_task(struct rq *this_rq) +{ + struct task_struct *next; + struct task_struct *p; + struct rq *src_rq; + cpumask_t *rto_cpumask; + int this_cpu = this_rq->cpu; + int cpu; + int ret = 0; + + assert_spin_locked(&this_rq->lock); + + /* + * If cpusets are used, and we have overlapping + * run queue cpusets, then this algorithm may not catch all. + * This is just the price you pay on trying to keep + * dirtying caches down on large SMP machines. + */ + if (likely(!rt_overloaded())) + return 0; + + next = pick_next_task_rt(this_rq); + + rto_cpumask = rt_overload(); + + for_each_cpu_mask(cpu, *rto_cpumask) { + if (this_cpu == cpu) + continue; + + src_rq = cpu_rq(cpu); + if (unlikely(src_rq->rt.rt_nr_running <= 1)) { + /* + * It is possible that overlapping cpusets + * will miss clearing a non overloaded runqueue. + * Clear it now. + */ + if (double_lock_balance(this_rq, src_rq)) { + /* unlocked our runqueue lock */ + struct task_struct *old_next = next; + next = pick_next_task_rt(this_rq); + if (next != old_next) + ret = 1; + } + if (likely(src_rq->rt.rt_nr_running <= 1)) + /* + * Small chance that this_rq->curr changed + * but it's really harmless here. + */ + rt_clear_overload(this_rq); + else + /* + * Heh, the src_rq is now overloaded, since + * we already have the src_rq lock, go straight + * to pulling tasks from it. + */ + goto try_pulling; + spin_unlock(&src_rq->lock); + continue; + } + + /* + * We can potentially drop this_rq's lock in + * double_lock_balance, and another CPU could + * steal our next task - hence we must cause + * the caller to recalculate the next task + * in that case: + */ + if (double_lock_balance(this_rq, src_rq)) { + struct task_struct *old_next = next; + next = pick_next_task_rt(this_rq); + if (next != old_next) + ret = 1; + } + + /* + * Are there still pullable RT tasks? + */ + if (src_rq->rt.rt_nr_running <= 1) { + spin_unlock(&src_rq->lock); + continue; + } + + try_pulling: + p = pick_next_highest_task_rt(src_rq, this_cpu); + + /* + * Do we have an RT task that preempts + * the to-be-scheduled task? + */ + if (p && (!next || (p->prio < next->prio))) { + WARN_ON(p == src_rq->curr); + WARN_ON(!p->se.on_rq); + + /* + * There's a chance that p is higher in priority + * than what's currently running on its cpu. + * This is just that p is wakeing up and hasn't + * had a chance to schedule. We only pull + * p if it is lower in priority than the + * current task on the run queue or + * this_rq next task is lower in prio than + * the current task on that rq. + */ + if (p->prio < src_rq->curr->prio || + (next && next->prio < src_rq->curr->prio)) + goto bail; + + ret = 1; + + deactivate_task(src_rq, p, 0); + set_task_cpu(p, this_cpu); + activate_task(this_rq, p, 0); + /* + * We continue with the search, just in + * case there's an even higher prio task + * in another runqueue. (low likelyhood + * but possible) + */ + + /* + * Update next so that we won't pick a task + * on another cpu with a priority lower (or equal) + * than the one we just picked. 
+ */ + next = p; + + } + bail: + spin_unlock(&src_rq->lock); + } + + return ret; +} + +static void schedule_balance_rt(struct rq *rq, + struct task_struct *prev) +{ + /* Try to pull RT tasks here if we lower this rq's prio */ + if (unlikely(rt_task(prev)) && + rq->rt.highest_prio > prev->prio) + pull_rt_task(rq); +} + static void schedule_tail_balance_rt(struct rq *rq) { /* @@ -496,6 +660,7 @@ move_one_task_rt(struct rq *this_rq, int } #else /* CONFIG_SMP */ # define schedule_tail_balance_rt(rq) do { } while (0) +# define schedule_balance_rt(rq, prev) do { } while (0) #endif /* CONFIG_SMP */ static void task_tick_rt(struct rq *rq, struct task_struct *p) ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0006-sched-push-RT-tasks-from-overloaded-CPUs.patch�����������������������������������������0000664�0000764�0000764�00000003500�11041657731�022250� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 5ac6a4953efd2499a20899eecddf8160a1d9855d Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: push RT tasks from overloaded CPUs This patch adds pushing of overloaded RT tasks from a runqueue that is having tasks (most likely RT tasks) added to the run queue. TODO: We don't cover the case of waking of new RT tasks (yet). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 1 + kernel/sched_rt.c | 10 ++++++++++ 2 files changed, 11 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1695,6 +1695,7 @@ out_activate: out_running: p->state = TASK_RUNNING; + wakeup_balance_rt(rq, p); out: task_rq_unlock(rq, &flags); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -554,6 +554,15 @@ static void schedule_tail_balance_rt(str } } + +static void wakeup_balance_rt(struct rq *rq, struct task_struct *p) +{ + if (unlikely(rt_task(p)) && + !task_running(rq, p) && + (p->prio >= rq->curr->prio)) + push_rt_tasks(rq); +} + /* * Load-balancing iterator. 
Note: while the runqueue stays locked * during the whole iteration, the current task might be @@ -661,6 +670,7 @@ move_one_task_rt(struct rq *this_rq, int #else /* CONFIG_SMP */ # define schedule_tail_balance_rt(rq) do { } while (0) # define schedule_balance_rt(rq, prev) do { } while (0) +# define wakeup_balance_rt(rq, p) do { } while (0) #endif /* CONFIG_SMP */ static void task_tick_rt(struct rq *rq, struct task_struct *p) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0007-sched-disable-standard-balancer-for-RT-tasks.patch�������������������������������������0000664�0000764�0000764�00000007404�11041657733�023142� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From a32200df1e2862b2ecb7a0367abbad6df466e536 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: disable standard balancer for RT tasks Since we now take an active approach to load balancing, we don't need to balance RT tasks via the normal task balancer. In fact, this code was found to pull RT tasks away from CPUS that the active movement performed, resulting in large latencies. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 95 ++---------------------------------------------------- 1 file changed, 4 insertions(+), 91 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -563,109 +563,22 @@ static void wakeup_balance_rt(struct rq push_rt_tasks(rq); } -/* - * Load-balancing iterator. Note: while the runqueue stays locked - * during the whole iteration, the current task might be - * dequeued so the iterator has to be dequeue-safe. 
Here we - * achieve that by always pre-iterating before returning - * the current task: - */ -static struct task_struct *load_balance_start_rt(void *arg) -{ - struct rq *rq = arg; - struct rt_prio_array *array = &rq->rt.active; - struct list_head *head, *curr; - struct task_struct *p; - int idx; - - idx = sched_find_first_bit(array->bitmap); - if (idx >= MAX_RT_PRIO) - return NULL; - - head = array->queue + idx; - curr = head->prev; - - p = list_entry(curr, struct task_struct, run_list); - - curr = curr->prev; - - rq->rt.rt_load_balance_idx = idx; - rq->rt.rt_load_balance_head = head; - rq->rt.rt_load_balance_curr = curr; - - return p; -} - -static struct task_struct *load_balance_next_rt(void *arg) -{ - struct rq *rq = arg; - struct rt_prio_array *array = &rq->rt.active; - struct list_head *head, *curr; - struct task_struct *p; - int idx; - - idx = rq->rt.rt_load_balance_idx; - head = rq->rt.rt_load_balance_head; - curr = rq->rt.rt_load_balance_curr; - - /* - * If we arrived back to the head again then - * iterate to the next queue (if any): - */ - if (unlikely(head == curr)) { - int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); - - if (next_idx >= MAX_RT_PRIO) - return NULL; - - idx = next_idx; - head = array->queue + idx; - curr = head->prev; - - rq->rt.rt_load_balance_idx = idx; - rq->rt.rt_load_balance_head = head; - } - - p = list_entry(curr, struct task_struct, run_list); - - curr = curr->prev; - - rq->rt.rt_load_balance_curr = curr; - - return p; -} - static unsigned long load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, int *all_pinned, int *this_best_prio) { - struct rq_iterator rt_rq_iterator; - - rt_rq_iterator.start = load_balance_start_rt; - rt_rq_iterator.next = load_balance_next_rt; - /* pass 'busiest' rq argument into - * load_balance_[start|next]_rt iterators - */ - rt_rq_iterator.arg = busiest; - - return balance_tasks(this_rq, this_cpu, busiest, max_load_move, sd, - idle, all_pinned, this_best_prio, &rt_rq_iterator); + /* don't touch RT tasks */ + return 0; } static int move_one_task_rt(struct rq *this_rq, int this_cpu, struct rq *busiest, struct sched_domain *sd, enum cpu_idle_type idle) { - struct rq_iterator rt_rq_iterator; - - rt_rq_iterator.start = load_balance_start_rt; - rt_rq_iterator.next = load_balance_next_rt; - rt_rq_iterator.arg = busiest; - - return iter_move_one_task(this_rq, this_cpu, busiest, sd, idle, - &rt_rq_iterator); + /* don't touch RT tasks */ + return 0; } #else /* CONFIG_SMP */ # define schedule_tail_balance_rt(rq) do { } while (0) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0008-sched-add-RT-balance-cpu-weight.patch��������������������������������������������������0000664�0000764�0000764�00000017115�11041657730�020450� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 70e614272ba7e0dbc23621e800972099f729447f Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 
Subject: [PATCH] sched: add RT-balance cpu-weight Some RT tasks (particularly kthreads) are bound to one specific CPU. It is fairly common for two or more bound tasks to get queued up at the same time. Consider, for instance, softirq_timer and softirq_sched. A timer goes off in an ISR which schedules softirq_thread to run at RT50. Then the timer handler determines that it's time to smp-rebalance the system so it schedules softirq_sched to run. So we are in a situation where we have two RT50 tasks queued, and the system will go into rt-overload condition to request other CPUs for help. This causes two problems in the current code: 1) If a high-priority bound task and a low-priority unbounded task queue up behind the running task, we will fail to ever relocate the unbounded task because we terminate the search on the first unmovable task. 2) We spend precious futile cycles in the fast-path trying to pull overloaded tasks over. It is therefore optimial to strive to avoid the overhead all together if we can cheaply detect the condition before overload even occurs. This patch tries to achieve this optimization by utilizing the hamming weight of the task->cpus_allowed mask. A weight of 1 indicates that the task cannot be migrated. We will then utilize this information to skip non-migratable tasks and to eliminate uncessary rebalance attempts. We introduce a per-rq variable to count the number of migratable tasks that are currently running. We only go into overload if we have more than one rt task, AND at least one of them is migratable. In addition, we introduce a per-task variable to cache the cpus_allowed weight, since the hamming calculation is probably relatively expensive. We only update the cached value when the mask is updated which should be relatively infrequent, especially compared to scheduling frequency in the fast path. 
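A small user-space illustration of the caching idea described above (task_model, set_cpus_allowed_model and is_migratable are illustrative names; the kernel computes the weight with cpus_weight() on a cpumask_t and stores it in p->nr_cpus_allowed):

    #include <stdbool.h>

    struct task_model {
            unsigned long   cpus_allowed;           /* one bit per allowed CPU */
            int             nr_cpus_allowed;        /* cached hamming weight */
    };

    static void set_cpus_allowed_model(struct task_model *p, unsigned long mask)
    {
            p->cpus_allowed = mask;
            /* recomputed only when the mask changes, never in the fast path */
            p->nr_cpus_allowed = __builtin_popcountl(mask);
    }

    /* A weight of 1 means the task is pinned and never worth pushing. */
    static bool is_migratable(const struct task_model *p)
    {
            return p->nr_cpus_allowed > 1;
    }

A runqueue then goes into overload only when it has more than one RT task queued and at least one of them is migratable, which is what update_rt_migration() in the diff below checks.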
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/init_task.h | 1 include/linux/sched.h | 2 + kernel/fork.c | 1 kernel/sched.c | 9 +++++++- kernel/sched_rt.c | 50 +++++++++++++++++++++++++++++++++++++++++----- 5 files changed, 57 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/include/linux/init_task.h =================================================================== --- linux-2.6.24.7.orig/include/linux/init_task.h +++ linux-2.6.24.7/include/linux/init_task.h @@ -130,6 +130,7 @@ extern struct group_info init_groups; .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ + .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ .active_mm = &init_mm, \ .run_list = LIST_HEAD_INIT(tsk.run_list), \ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -847,6 +847,7 @@ struct sched_class { void (*set_curr_task) (struct rq *rq); void (*task_tick) (struct rq *rq, struct task_struct *p); void (*task_new) (struct rq *rq, struct task_struct *p); + void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); }; struct load_weight { @@ -956,6 +957,7 @@ struct task_struct { unsigned int policy; cpumask_t cpus_allowed; + int nr_cpus_allowed; unsigned int time_slice; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1237,6 +1237,7 @@ static struct task_struct *copy_process( * parent's CPU). This avoids alot of nasty races. */ p->cpus_allowed = current->cpus_allowed; + p->nr_cpus_allowed = current->nr_cpus_allowed; if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) || !cpu_online(task_cpu(p)))) set_task_cpu(p, smp_processor_id()); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -267,6 +267,7 @@ struct rt_rq { int rt_load_balance_idx; struct list_head *rt_load_balance_head, *rt_load_balance_curr; unsigned long rt_nr_running; + unsigned long rt_nr_migratory; /* highest queued rt task prio */ int highest_prio; }; @@ -5130,7 +5131,13 @@ int set_cpus_allowed(struct task_struct goto out; } - p->cpus_allowed = new_mask; + if (p->sched_class->set_cpus_allowed) + p->sched_class->set_cpus_allowed(p, &new_mask); + else { + p->cpus_allowed = new_mask; + p->nr_cpus_allowed = cpus_weight(new_mask); + } + /* Can the task run on the task's current CPU? 
If so, we're done */ if (cpu_isset(task_cpu(p), new_mask)) goto out; Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -33,6 +33,14 @@ static inline void rt_clear_overload(str atomic_dec(&rto_count); cpu_clear(rq->cpu, rt_overload_mask); } + +static void update_rt_migration(struct rq *rq) +{ + if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) + rt_set_overload(rq); + else + rt_clear_overload(rq); +} #endif /* CONFIG_SMP */ /* @@ -65,8 +73,10 @@ static inline void inc_rt_tasks(struct t #ifdef CONFIG_SMP if (p->prio < rq->rt.highest_prio) rq->rt.highest_prio = p->prio; - if (rq->rt.rt_nr_running > 1) - rt_set_overload(rq); + if (p->nr_cpus_allowed > 1) + rq->rt.rt_nr_migratory++; + + update_rt_migration(rq); #endif /* CONFIG_SMP */ } @@ -88,8 +98,10 @@ static inline void dec_rt_tasks(struct t } /* otherwise leave rq->highest prio alone */ } else rq->rt.highest_prio = MAX_RT_PRIO; - if (rq->rt.rt_nr_running < 2) - rt_clear_overload(rq); + if (p->nr_cpus_allowed > 1) + rq->rt.rt_nr_migratory--; + + update_rt_migration(rq); #endif /* CONFIG_SMP */ } @@ -178,7 +190,8 @@ static void deactivate_task(struct rq *r static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) { if (!task_running(rq, p) && - (cpu < 0 || cpu_isset(cpu, p->cpus_allowed))) + (cpu < 0 || cpu_isset(cpu, p->cpus_allowed)) && + (p->nr_cpus_allowed > 1)) return 1; return 0; } @@ -580,6 +593,32 @@ move_one_task_rt(struct rq *this_rq, int /* don't touch RT tasks */ return 0; } +static void set_cpus_allowed_rt(struct task_struct *p, cpumask_t *new_mask) +{ + int weight = cpus_weight(*new_mask); + + BUG_ON(!rt_task(p)); + + /* + * Update the migration status of the RQ if we have an RT task + * which is running AND changing its weight value. 
+ */ + if (p->se.on_rq && (weight != p->nr_cpus_allowed)) { + struct rq *rq = task_rq(p); + + if ((p->nr_cpus_allowed <= 1) && (weight > 1)) + rq->rt.rt_nr_migratory++; + else if((p->nr_cpus_allowed > 1) && (weight <= 1)) { + BUG_ON(!rq->rt.rt_nr_migratory); + rq->rt.rt_nr_migratory--; + } + + update_rt_migration(rq); + } + + p->cpus_allowed = *new_mask; + p->nr_cpus_allowed = weight; +} #else /* CONFIG_SMP */ # define schedule_tail_balance_rt(rq) do { } while (0) # define schedule_balance_rt(rq, prev) do { } while (0) @@ -633,6 +672,7 @@ const struct sched_class rt_sched_class #ifdef CONFIG_SMP .load_balance = load_balance_rt, .move_one_task = move_one_task_rt, + .set_cpus_allowed = set_cpus_allowed_rt, #endif .set_curr_task = set_curr_task_rt, ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0009-sched-clean-up-this_rq-use-in-kernel-sched_rt.c.patch����������������������������������0000664�0000764�0000764�00000005431�11041657731�023601� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 320d79d4305922c7ddcc268321245a7d572f4349 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:37 +0100 Subject: [PATCH] sched: clean up this_rq use in kernel/sched_rt.c "this_rq" is normally used to denote the RQ on the current cpu (i.e. "cpu_rq(this_cpu)"). So clean up the usage of this_rq to be more consistent with the rest of the code. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -324,21 +324,21 @@ static struct rq *find_lock_lowest_rq(st * running task can migrate over to a CPU that is running a task * of lesser priority. */ -static int push_rt_task(struct rq *this_rq) +static int push_rt_task(struct rq *rq) { struct task_struct *next_task; struct rq *lowest_rq; int ret = 0; int paranoid = RT_MAX_TRIES; - assert_spin_locked(&this_rq->lock); + assert_spin_locked(&rq->lock); - next_task = pick_next_highest_task_rt(this_rq, -1); + next_task = pick_next_highest_task_rt(rq, -1); if (!next_task) return 0; retry: - if (unlikely(next_task == this_rq->curr)) { + if (unlikely(next_task == rq->curr)) { WARN_ON(1); return 0; } @@ -348,24 +348,24 @@ static int push_rt_task(struct rq *this_ * higher priority than current. If that's the case * just reschedule current. 
*/ - if (unlikely(next_task->prio < this_rq->curr->prio)) { - resched_task(this_rq->curr); + if (unlikely(next_task->prio < rq->curr->prio)) { + resched_task(rq->curr); return 0; } - /* We might release this_rq lock */ + /* We might release rq lock */ get_task_struct(next_task); /* find_lock_lowest_rq locks the rq if found */ - lowest_rq = find_lock_lowest_rq(next_task, this_rq); + lowest_rq = find_lock_lowest_rq(next_task, rq); if (!lowest_rq) { struct task_struct *task; /* - * find lock_lowest_rq releases this_rq->lock + * find lock_lowest_rq releases rq->lock * so it is possible that next_task has changed. * If it has, then try again. */ - task = pick_next_highest_task_rt(this_rq, -1); + task = pick_next_highest_task_rt(rq, -1); if (unlikely(task != next_task) && task && paranoid--) { put_task_struct(next_task); next_task = task; @@ -376,7 +376,7 @@ static int push_rt_task(struct rq *this_ assert_spin_locked(&lowest_rq->lock); - deactivate_task(this_rq, next_task, 0); + deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0010-sched-de-SCHED_OTHER-ize-the-RT-path.patch���������������������������������������������0000664�0000764�0000764�00000032157�11041673266�020733� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From c73997c8271a00a6a2e65c90e817f1093247eb1b Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: de-SCHED_OTHER-ize the RT path The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. 
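The shape of the new hook, as a compile-only sketch (sched_class_model and the _model-suffixed names are illustrative; the real interface added below is the ->select_task_rq() member of struct sched_class):

    struct sched_class_model {
            /* pre-wakeup routing decision, one implementation per class */
            int (*select_task_rq)(int task_cpu, int sync);
    };

    /*
     * RT policy: priority is king, so keep the task where it is and do no
     * load arithmetic on the wakeup path; push/pull handles migration.
     */
    static int select_task_rq_rt_model(int task_cpu, int sync)
    {
            (void)sync;
            return task_cpu;
    }

    static const struct sched_class_model rt_class_model = {
            .select_task_rq = select_task_rq_rt_model,
    };

select_task_rq_fair() keeps the affine-wakeup and passive-balancing load calculations, while the RT and idle classes simply return the task's current CPU without touching the load figures.
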
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 1 kernel/sched.c | 167 +++++++----------------------------------------- kernel/sched_fair.c | 148 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched_idletask.c | 9 ++ kernel/sched_rt.c | 10 ++ 5 files changed, 195 insertions(+), 140 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -827,6 +827,7 @@ struct sched_class { void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); void (*yield_task) (struct rq *rq); + int (*select_task_rq)(struct task_struct *p, int sync); void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -918,6 +918,13 @@ static void cpuacct_charge(struct task_s static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {} #endif +#ifdef CONFIG_SMP +static unsigned long source_load(int cpu, int type); +static unsigned long target_load(int cpu, int type); +static unsigned long cpu_avg_load_per_task(int cpu); +static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd); +#endif /* CONFIG_SMP */ + #include "sched_stats.h" #include "sched_idletask.c" #include "sched_fair.c" @@ -1103,7 +1110,7 @@ static inline void __set_task_cpu(struct /* * Is this task likely cache-hot: */ -static inline int +static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd) { s64 delta; @@ -1328,7 +1335,7 @@ static unsigned long target_load(int cpu /* * Return the average load per task on the cpu's run queue */ -static inline unsigned long cpu_avg_load_per_task(int cpu) +static unsigned long cpu_avg_load_per_task(int cpu) { struct rq *rq = cpu_rq(cpu); unsigned long total = weighted_cpuload(cpu); @@ -1485,58 +1492,6 @@ static int sched_balance_self(int cpu, i #endif /* CONFIG_SMP */ -/* - * wake_idle() will wake a task on an idle cpu if task->cpu is - * not idle and an idle cpu is available. The span of cpus to - * search starts with cpus closest then further out as needed, - * so we always favor a closer, idle cpu. - * - * Returns the CPU we should wake onto. - */ -#if defined(ARCH_HAS_SCHED_WAKE_IDLE) -static int wake_idle(int cpu, struct task_struct *p) -{ - cpumask_t tmp; - struct sched_domain *sd; - int i; - - /* - * If it is idle, then it is the best cpu to run this task. - * - * This cpu is also the best, if it has more than one task already. - * Siblings must be also busy(in most cases) as they didn't already - * pickup the extra load from this cpu and hence we need not check - * sibling runqueue info. This will avoid the checks and cache miss - * penalities associated with that. 
- */ - if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1) - return cpu; - - for_each_domain(cpu, sd) { - if (sd->flags & SD_WAKE_IDLE) { - cpus_and(tmp, sd->span, p->cpus_allowed); - for_each_cpu_mask(i, tmp) { - if (idle_cpu(i)) { - if (i != task_cpu(p)) { - schedstat_inc(p, - se.nr_wakeups_idle); - } - return i; - } - } - } else { - break; - } - } - return cpu; -} -#else -static inline int wake_idle(int cpu, struct task_struct *p) -{ - return cpu; -} -#endif - /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -1558,8 +1513,6 @@ static int try_to_wake_up(struct task_st long old_state; struct rq *rq; #ifdef CONFIG_SMP - struct sched_domain *sd, *this_sd = NULL; - unsigned long load, this_load; int new_cpu; #endif @@ -1579,90 +1532,7 @@ static int try_to_wake_up(struct task_st if (unlikely(task_running(rq, p))) goto out_activate; - new_cpu = cpu; - - schedstat_inc(rq, ttwu_count); - if (cpu == this_cpu) { - schedstat_inc(rq, ttwu_local); - goto out_set_cpu; - } - - for_each_domain(this_cpu, sd) { - if (cpu_isset(cpu, sd->span)) { - schedstat_inc(sd, ttwu_wake_remote); - this_sd = sd; - break; - } - } - - if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed))) - goto out_set_cpu; - - /* - * Check for affine wakeup and passive balancing possibilities. - */ - if (this_sd) { - int idx = this_sd->wake_idx; - unsigned int imbalance; - - imbalance = 100 + (this_sd->imbalance_pct - 100) / 2; - - load = source_load(cpu, idx); - this_load = target_load(this_cpu, idx); - - new_cpu = this_cpu; /* Wake to this CPU if we can */ - - if (this_sd->flags & SD_WAKE_AFFINE) { - unsigned long tl = this_load; - unsigned long tl_per_task; - - /* - * Attract cache-cold tasks on sync wakeups: - */ - if (sync && !task_hot(p, rq->clock, this_sd)) - goto out_set_cpu; - - schedstat_inc(p, se.nr_wakeups_affine_attempts); - tl_per_task = cpu_avg_load_per_task(this_cpu); - - /* - * If sync wakeup then subtract the (maximum possible) - * effect of the currently running task from the load - * of the current CPU: - */ - if (sync) - tl -= current->se.load.weight; - - if ((tl <= load && - tl + target_load(cpu, idx) <= tl_per_task) || - 100*(tl + p->se.load.weight) <= imbalance*load) { - /* - * This domain has SD_WAKE_AFFINE and - * p is cache cold in this domain, and - * there is no bad imbalance. - */ - schedstat_inc(this_sd, ttwu_move_affine); - schedstat_inc(p, se.nr_wakeups_affine); - goto out_set_cpu; - } - } - - /* - * Start passive balancing when half the imbalance_pct - * limit is reached. - */ - if (this_sd->flags & SD_WAKE_BALANCE) { - if (imbalance*this_load <= 100*load) { - schedstat_inc(this_sd, ttwu_move_balance); - schedstat_inc(p, se.nr_wakeups_passive); - goto out_set_cpu; - } - } - } - - new_cpu = cpu; /* Could not wake to this_cpu. 
Wake to cpu instead */ -out_set_cpu: - new_cpu = wake_idle(new_cpu, p); + new_cpu = p->sched_class->select_task_rq(p, sync); if (new_cpu != cpu) { set_task_cpu(p, new_cpu); task_rq_unlock(rq, &flags); @@ -1678,6 +1548,23 @@ out_set_cpu: cpu = task_cpu(p); } +#ifdef CONFIG_SCHEDSTATS + schedstat_inc(rq, ttwu_count); + if (cpu == this_cpu) + schedstat_inc(rq, ttwu_local); + else { + struct sched_domain *sd; + for_each_domain(this_cpu, sd) { + if (cpu_isset(cpu, sd->span)) { + schedstat_inc(sd, ttwu_wake_remote); + break; + } + } + } + +#endif + + out_activate: #endif /* CONFIG_SMP */ schedstat_inc(p, se.nr_wakeups); Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -832,6 +832,151 @@ static void yield_task_fair(struct rq *r } /* + * wake_idle() will wake a task on an idle cpu if task->cpu is + * not idle and an idle cpu is available. The span of cpus to + * search starts with cpus closest then further out as needed, + * so we always favor a closer, idle cpu. + * + * Returns the CPU we should wake onto. + */ +#if defined(ARCH_HAS_SCHED_WAKE_IDLE) +static int wake_idle(int cpu, struct task_struct *p) +{ + cpumask_t tmp; + struct sched_domain *sd; + int i; + + /* + * If it is idle, then it is the best cpu to run this task. + * + * This cpu is also the best, if it has more than one task already. + * Siblings must be also busy(in most cases) as they didn't already + * pickup the extra load from this cpu and hence we need not check + * sibling runqueue info. This will avoid the checks and cache miss + * penalities associated with that. + */ + if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1) + return cpu; + + for_each_domain(cpu, sd) { + if (sd->flags & SD_WAKE_IDLE) { + cpus_and(tmp, sd->span, p->cpus_allowed); + for_each_cpu_mask(i, tmp) { + if (idle_cpu(i)) { + if (i != task_cpu(p)) { + schedstat_inc(p, + se.nr_wakeups_idle); + } + return i; + } + } + } else { + break; + } + } + return cpu; +} +#else +static inline int wake_idle(int cpu, struct task_struct *p) +{ + return cpu; +} +#endif + +#ifdef CONFIG_SMP +static int select_task_rq_fair(struct task_struct *p, int sync) +{ + int cpu, this_cpu; + struct rq *rq; + struct sched_domain *sd, *this_sd = NULL; + int new_cpu; + + cpu = task_cpu(p); + rq = task_rq(p); + this_cpu = smp_processor_id(); + new_cpu = cpu; + + for_each_domain(this_cpu, sd) { + if (cpu_isset(cpu, sd->span)) { + this_sd = sd; + break; + } + } + + if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed))) + goto out_set_cpu; + + /* + * Check for affine wakeup and passive balancing possibilities. 
+ */ + if (this_sd) { + int idx = this_sd->wake_idx; + unsigned int imbalance; + unsigned long load, this_load; + + imbalance = 100 + (this_sd->imbalance_pct - 100) / 2; + + load = source_load(cpu, idx); + this_load = target_load(this_cpu, idx); + + new_cpu = this_cpu; /* Wake to this CPU if we can */ + + if (this_sd->flags & SD_WAKE_AFFINE) { + unsigned long tl = this_load; + unsigned long tl_per_task; + + /* + * Attract cache-cold tasks on sync wakeups: + */ + if (sync && !task_hot(p, rq->clock, this_sd)) + goto out_set_cpu; + + schedstat_inc(p, se.nr_wakeups_affine_attempts); + tl_per_task = cpu_avg_load_per_task(this_cpu); + + /* + * If sync wakeup then subtract the (maximum possible) + * effect of the currently running task from the load + * of the current CPU: + */ + if (sync) + tl -= current->se.load.weight; + + if ((tl <= load && + tl + target_load(cpu, idx) <= tl_per_task) || + 100*(tl + p->se.load.weight) <= imbalance*load) { + /* + * This domain has SD_WAKE_AFFINE and + * p is cache cold in this domain, and + * there is no bad imbalance. + */ + schedstat_inc(this_sd, ttwu_move_affine); + schedstat_inc(p, se.nr_wakeups_affine); + goto out_set_cpu; + } + } + + /* + * Start passive balancing when half the imbalance_pct + * limit is reached. + */ + if (this_sd->flags & SD_WAKE_BALANCE) { + if (imbalance*this_load <= 100*load) { + schedstat_inc(this_sd, ttwu_move_balance); + schedstat_inc(p, se.nr_wakeups_passive); + goto out_set_cpu; + } + } + } + + new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */ +out_set_cpu: + return wake_idle(new_cpu, p); +} +#endif /* CONFIG_SMP */ + + +/* * Preempt the current task with a newly woken task if needed: */ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p) @@ -1108,6 +1253,9 @@ static const struct sched_class fair_sch .enqueue_task = enqueue_task_fair, .dequeue_task = dequeue_task_fair, .yield_task = yield_task_fair, +#ifdef CONFIG_SMP + .select_task_rq = select_task_rq_fair, +#endif /* CONFIG_SMP */ .check_preempt_curr = check_preempt_wakeup, Index: linux-2.6.24.7/kernel/sched_idletask.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_idletask.c +++ linux-2.6.24.7/kernel/sched_idletask.c @@ -5,6 +5,12 @@ * handled in sched_fair.c) */ +#ifdef CONFIG_SMP +static int select_task_rq_idle(struct task_struct *p, int sync) +{ + return task_cpu(p); /* IDLE tasks as never migrated */ +} +#endif /* CONFIG_SMP */ /* * Idle tasks are unconditionally rescheduled: */ @@ -72,6 +78,9 @@ const struct sched_class idle_sched_clas /* dequeue is not valid, we print a debug message there: */ .dequeue_task = dequeue_task_idle, +#ifdef CONFIG_SMP + .select_task_rq = select_task_rq_idle, +#endif /* CONFIG_SMP */ .check_preempt_curr = check_preempt_curr_idle, Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -146,6 +146,13 @@ yield_task_rt(struct rq *rq) requeue_task_rt(rq, rq->curr); } +#ifdef CONFIG_SMP +static int select_task_rq_rt(struct task_struct *p, int sync) +{ + return task_cpu(p); +} +#endif /* CONFIG_SMP */ + /* * Preempt the current task with a newly woken task if needed: */ @@ -663,6 +670,9 @@ const struct sched_class rt_sched_class .enqueue_task = enqueue_task_rt, .dequeue_task = dequeue_task_rt, .yield_task = yield_task_rt, +#ifdef CONFIG_SMP + .select_task_rq = select_task_rq_rt, +#endif /* CONFIG_SMP */ 
.check_preempt_curr = check_preempt_curr_rt, �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0011-sched-break-out-search-for-RT-tasks.patch����������������������������������������������0000664�0000764�0000764�00000006472�11041657731�021325� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 34addd81b2d8c437daf1f295b924459a6bc34f5e Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: break out search for RT tasks Isolate the search logic into a function so that it can be used later in places other than find_locked_lowest_rq(). Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 66 +++++++++++++++++++++++++++++++----------------------- 1 file changed, 39 insertions(+), 27 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -259,54 +259,66 @@ static struct task_struct *pick_next_hig static DEFINE_PER_CPU(cpumask_t, local_cpu_mask); -/* Will lock the rq it finds */ -static struct rq *find_lock_lowest_rq(struct task_struct *task, - struct rq *this_rq) +static int find_lowest_rq(struct task_struct *task) { - struct rq *lowest_rq = NULL; int cpu; - int tries; cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask); + struct rq *lowest_rq = NULL; cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed); - for (tries = 0; tries < RT_MAX_TRIES; tries++) { - /* - * Scan each rq for the lowest prio. - */ - for_each_cpu_mask(cpu, *cpu_mask) { - struct rq *rq = &per_cpu(runqueues, cpu); + /* + * Scan each rq for the lowest prio. + */ + for_each_cpu_mask(cpu, *cpu_mask) { + struct rq *rq = cpu_rq(cpu); - if (cpu == this_rq->cpu) - continue; + if (cpu == rq->cpu) + continue; - /* We look for lowest RT prio or non-rt CPU */ - if (rq->rt.highest_prio >= MAX_RT_PRIO) { - lowest_rq = rq; - break; - } + /* We look for lowest RT prio or non-rt CPU */ + if (rq->rt.highest_prio >= MAX_RT_PRIO) { + lowest_rq = rq; + break; + } - /* no locking for now */ - if (rq->rt.highest_prio > task->prio && - (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) { - lowest_rq = rq; - } + /* no locking for now */ + if (rq->rt.highest_prio > task->prio && + (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) { + lowest_rq = rq; } + } + + return lowest_rq ? 
lowest_rq->cpu : -1; +} + +/* Will lock the rq it finds */ +static struct rq *find_lock_lowest_rq(struct task_struct *task, + struct rq *rq) +{ + struct rq *lowest_rq = NULL; + int cpu; + int tries; - if (!lowest_rq) + for (tries = 0; tries < RT_MAX_TRIES; tries++) { + cpu = find_lowest_rq(task); + + if (cpu == -1) break; + lowest_rq = cpu_rq(cpu); + /* if the prio of this runqueue changed, try again */ - if (double_lock_balance(this_rq, lowest_rq)) { + if (double_lock_balance(rq, lowest_rq)) { /* * We had to unlock the run queue. In * the mean time, task could have * migrated already or had its affinity changed. * Also make sure that it wasn't scheduled on its rq. */ - if (unlikely(task_rq(task) != this_rq || + if (unlikely(task_rq(task) != rq || !cpu_isset(lowest_rq->cpu, task->cpus_allowed) || - task_running(this_rq, task) || + task_running(rq, task) || !task->se.on_rq)) { spin_unlock(&lowest_rq->lock); lowest_rq = NULL; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0012-sched-RT-balancing-include-current-CPU.patch�������������������������������������������0000664�0000764�0000764�00000002517�11041657734�021726� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 3d9bbf7a0350b6457a2e68ea419e87e7445b0200 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: RT balancing: include current CPU It doesn't hurt if we allow the current CPU to be included in the search. We will just simply skip it later if the current CPU turns out to be the lowest. 
We will use this later in the series Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -273,9 +273,6 @@ static int find_lowest_rq(struct task_st for_each_cpu_mask(cpu, *cpu_mask) { struct rq *rq = cpu_rq(cpu); - if (cpu == rq->cpu) - continue; - /* We look for lowest RT prio or non-rt CPU */ if (rq->rt.highest_prio >= MAX_RT_PRIO) { lowest_rq = rq; @@ -303,7 +300,7 @@ static struct rq *find_lock_lowest_rq(st for (tries = 0; tries < RT_MAX_TRIES; tries++) { cpu = find_lowest_rq(task); - if (cpu == -1) + if ((cpu == -1) || (cpu == rq->cpu)) break; lowest_rq = cpu_rq(cpu); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0013-sched-pre-route-RT-tasks-on-wakeup.patch�����������������������������������������������0000664�0000764�0000764�00000004624�11041657732�021233� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 57fa76b638dc155c4ff60c9b73d347e97345fcfc Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: pre-route RT tasks on wakeup In the original patch series that Steven Rostedt and I worked on together, we both took different approaches to low-priority wakeup path. I utilized "pre-routing" (push the task away to a less important RQ before activating) approach, while Steve utilized a "post-routing" approach. The advantage of my approach is that you avoid the overhead of a wasted activate/deactivate cycle and peripherally related burdens. The advantage of Steve's method is that it neatly solves an issue preventing a "pull" optimization from being deployed. In the end, we ended up deploying Steve's idea. But it later dawned on me that we could get the best of both worlds by deploying both ideas together, albeit slightly modified. The idea is simple: Use a "light-weight" lookup for pre-routing, since we only need to approximate a good home for the task. And we also retain the post-routing push logic to clean up any inaccuracies caused by a condition of "priority mistargeting" caused by the lightweight lookup. Most of the time, the pre-routing should work and yield lower overhead. In the cases where it doesnt, the post-router will bat cleanup. 
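As an aside to the description above, the pre-routing idea can be modelled outside the kernel tree. The following is a minimal user-space sketch, not the patch's code: struct mock_rq, preroute_cpu() and find_lowest_cpu() are made-up names for illustration, and the real logic is the select_task_rq_rt()/find_lowest_rq() pair in the diff that follows. As in the kernel, a lower prio value means a higher priority.

/* Illustrative user-space model of RT wakeup pre-routing (not kernel code). */
#include <stdio.h>

#define NR_CPUS      4
#define MAX_RT_PRIO  100   /* highest_prio >= MAX_RT_PRIO means "no RT task queued" */

struct mock_rq {
    int cpu;
    int highest_prio;      /* lowest numeric value = most important queued RT task */
};

static struct mock_rq rqs[NR_CPUS] = {
    { 0, 10 }, { 1, MAX_RT_PRIO }, { 2, 40 }, { 3, 25 },
};

/* Lightweight lookup: the rq whose best queued RT task is least important. */
static int find_lowest_cpu(void)
{
    int cpu, best = -1, best_prio = -1;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (rqs[cpu].highest_prio > best_prio) {
            best_prio = rqs[cpu].highest_prio;
            best = cpu;
        }
    }
    return best;
}

/*
 * Pre-route: if the waking task would not preempt its current rq (its prio
 * value is not below the rq's highest_prio) and it is allowed to migrate,
 * send it to the least loaded rq before activation; otherwise leave it in
 * place and let the post-schedule push logic clean up any mistargeting.
 */
static int preroute_cpu(int task_prio, int task_cpu, int nr_cpus_allowed)
{
    if (task_prio >= rqs[task_cpu].highest_prio && nr_cpus_allowed > 1) {
        int cpu = find_lowest_cpu();
        return cpu == -1 ? task_cpu : cpu;
    }
    return task_cpu;
}

int main(void)
{
    /* prio 30 task waking on CPU 0 (busy with prio 10): pre-routed to idle CPU 1 */
    printf("woken on cpu %d\n", preroute_cpu(30, 0, NR_CPUS));
    /* prio 5 task preempts CPU 0 directly: stays put */
    printf("woken on cpu %d\n", preroute_cpu(5, 0, NR_CPUS));
    return 0;
}

Running the sketch prints CPU 1 for the non-preempting wakeup and CPU 0 for the high-priority one, which mirrors the light-weight pre-route / post-route split the changelog describes.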
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -147,8 +147,27 @@ yield_task_rt(struct rq *rq) } #ifdef CONFIG_SMP +static int find_lowest_rq(struct task_struct *task); + static int select_task_rq_rt(struct task_struct *p, int sync) { + struct rq *rq = task_rq(p); + + /* + * If the task will not preempt the RQ, try to find a better RQ + * before we even activate the task + */ + if ((p->prio >= rq->rt.highest_prio) + && (p->nr_cpus_allowed > 1)) { + int cpu = find_lowest_rq(p); + + return (cpu == -1) ? task_cpu(p) : cpu; + } + + /* + * Otherwise, just let it ride on the affined RQ and the + * post-schedule router will push the preempted task away + */ return task_cpu(p); } #endif /* CONFIG_SMP */ ������������������������������������������������������������������������������������������������������������patches/0014-sched-optimize-RT-affinity.patch�������������������������������������������������������0000664�0000764�0000764�00000010765�11041657734�017737� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ed30a37a14cdeac811d99baaa8039ecad9527ace Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: optimize RT affinity The current code base assumes a relatively flat CPU/core topology and will route RT tasks to any CPU fairly equally. In the real world, there are various toplogies and affinities that govern where a task is best suited to run with the smallest amount of overhead. NUMA and multi-core CPUs are prime examples of topologies that can impact cache performance. Fortunately, linux is already structured to represent these topologies via the sched_domains interface. So we change our RT router to consult a combination of topology and affinity policy to best place tasks during migration. 
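The affinity-aware selection described above can likewise be sketched in plain user space. The code below is only an approximation under simplified assumptions (bitmask cpu masks instead of cpumask_t, a flat domain_spans[] array standing in for the sched_domain list, and made-up helper names); it is not the code added by this patch, which appears in the diff that follows.

/* Illustrative user-space sketch of topology-aware target selection (not kernel code). */
#include <stdio.h>

#define NR_CPUS 8

/* cpu masks modelled as plain bitmasks */
typedef unsigned int cpumask;

static int pick_optimal_cpu(int this_cpu, cpumask mask)
{
    int cpu;

    /* preempting locally is cheaper than signalling a remote CPU */
    if (this_cpu != -1 && (mask & (1u << this_cpu)))
        return this_cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask & (1u << cpu))
            return cpu;
    return -1;
}

/*
 * domain_spans[] stands in for the domain hierarchy around the task's last
 * CPU, ordered from the smallest span (e.g. shared cache) to the widest;
 * lowest_mask is the set of lowest-priority candidate CPUs.
 */
static int find_lowest_cpu(int task_cpu, int this_cpu, cpumask lowest_mask,
                           const cpumask *domain_spans, int nr_domains)
{
    int i;

    /* 1) the last CPU the task ran on is most likely cache-hot */
    if (lowest_mask & (1u << task_cpu))
        return task_cpu;

    if (this_cpu == task_cpu)
        this_cpu = -1;      /* already covered above */

    /* 2) walk the domains outwards, staying as close to the hot cache as possible */
    for (i = 0; i < nr_domains; i++) {
        int best = pick_optimal_cpu(this_cpu, lowest_mask & domain_spans[i]);
        if (best != -1)
            return best;
    }

    /* 3) fall back to any compatible CPU */
    return pick_optimal_cpu(this_cpu, lowest_mask);
}

int main(void)
{
    /* two domains around CPU 2: {2,3} then {0..3}; candidate CPUs are 3 and 6 */
    cpumask spans[] = { 0x0c, 0x0f };
    cpumask lowest = (1u << 3) | (1u << 6);

    printf("target cpu %d\n", find_lowest_cpu(2, 5, lowest, spans, 2));
    return 0;
}

With the sample masks in main(), CPU 3 wins over CPU 6 because it shares the smallest domain with the task's previous CPU, which is the cache-locality preference the changelog argues for.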
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 88 insertions(+), 12 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -277,35 +277,111 @@ static struct task_struct *pick_next_hig } static DEFINE_PER_CPU(cpumask_t, local_cpu_mask); +static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask); -static int find_lowest_rq(struct task_struct *task) +static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask) { - int cpu; - cpumask_t *cpu_mask = &__get_cpu_var(local_cpu_mask); - struct rq *lowest_rq = NULL; + int cpu; + cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask); + int lowest_prio = -1; + int ret = 0; - cpus_and(*cpu_mask, cpu_online_map, task->cpus_allowed); + cpus_clear(*lowest_mask); + cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed); /* * Scan each rq for the lowest prio. */ - for_each_cpu_mask(cpu, *cpu_mask) { + for_each_cpu_mask(cpu, *valid_mask) { struct rq *rq = cpu_rq(cpu); /* We look for lowest RT prio or non-rt CPU */ if (rq->rt.highest_prio >= MAX_RT_PRIO) { - lowest_rq = rq; - break; + if (ret) + cpus_clear(*lowest_mask); + cpu_set(rq->cpu, *lowest_mask); + return 1; } /* no locking for now */ - if (rq->rt.highest_prio > task->prio && - (!lowest_rq || rq->rt.highest_prio > lowest_rq->rt.highest_prio)) { - lowest_rq = rq; + if ((rq->rt.highest_prio > task->prio) + && (rq->rt.highest_prio >= lowest_prio)) { + if (rq->rt.highest_prio > lowest_prio) { + /* new low - clear old data */ + lowest_prio = rq->rt.highest_prio; + cpus_clear(*lowest_mask); + } + cpu_set(rq->cpu, *lowest_mask); + ret = 1; + } + } + + return ret; +} + +static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask) +{ + int first; + + /* "this_cpu" is cheaper to preempt than a remote processor */ + if ((this_cpu != -1) && cpu_isset(this_cpu, *mask)) + return this_cpu; + + first = first_cpu(*mask); + if (first != NR_CPUS) + return first; + + return -1; +} + +static int find_lowest_rq(struct task_struct *task) +{ + struct sched_domain *sd; + cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask); + int this_cpu = smp_processor_id(); + int cpu = task_cpu(task); + + if (!find_lowest_cpus(task, lowest_mask)) + return -1; + + /* + * At this point we have built a mask of cpus representing the + * lowest priority tasks in the system. Now we want to elect + * the best one based on our affinity and topology. + * + * We prioritize the last cpu that the task executed on since + * it is most likely cache-hot in that location. + */ + if (cpu_isset(cpu, *lowest_mask)) + return cpu; + + /* + * Otherwise, we consult the sched_domains span maps to figure + * out which cpu is logically closest to our hot cache data. + */ + if (this_cpu == cpu) + this_cpu = -1; /* Skip this_cpu opt if the same */ + + for_each_domain(cpu, sd) { + if (sd->flags & SD_WAKE_AFFINE) { + cpumask_t domain_mask; + int best_cpu; + + cpus_and(domain_mask, sd->span, *lowest_mask); + + best_cpu = pick_optimal_cpu(this_cpu, + &domain_mask); + if (best_cpu != -1) + return best_cpu; } } - return lowest_rq ? lowest_rq->cpu : -1; + /* + * And finally, if there were no matches within the domains + * just give the caller *something* to work with from the compatible + * locations. 
+ */ + return pick_optimal_cpu(this_cpu, lowest_mask); } /* Will lock the rq it finds */ �����������patches/0015-sched-wake-balance-fixes.patch���������������������������������������������������������0000664�0000764�0000764�00000005217�11041657735�017371� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ec30f584b4d095d00067850f07f3dc65d587939b Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: wake-balance fixes We have logic to detect whether the system has migratable tasks, but we are not using it when deciding whether to push tasks away. So we add support for considering this new information. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 2 ++ kernel/sched_rt.c | 10 ++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -270,6 +270,7 @@ struct rt_rq { unsigned long rt_nr_migratory; /* highest queued rt task prio */ int highest_prio; + int overloaded; }; /* @@ -6744,6 +6745,7 @@ void __init sched_init(void) rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); rq->rt.highest_prio = MAX_RT_PRIO; + rq->rt.overloaded = 0; #endif atomic_set(&rq->nr_iowait, 0); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -16,6 +16,7 @@ static inline cpumask_t *rt_overload(voi } static inline void rt_set_overload(struct rq *rq) { + rq->rt.overloaded = 1; cpu_set(rq->cpu, rt_overload_mask); /* * Make sure the mask is visible before we set @@ -32,6 +33,7 @@ static inline void rt_clear_overload(str /* the order here really doesn't matter */ atomic_dec(&rto_count); cpu_clear(rq->cpu, rt_overload_mask); + rq->rt.overloaded = 0; } static void update_rt_migration(struct rq *rq) @@ -444,6 +446,9 @@ static int push_rt_task(struct rq *rq) assert_spin_locked(&rq->lock); + if (!rq->rt.overloaded) + return 0; + next_task = pick_next_highest_task_rt(rq, -1); if (!next_task) return 0; @@ -671,7 +676,7 @@ static void schedule_tail_balance_rt(str * the lock was owned by prev, we need to release it * first via finish_lock_switch and then reaquire it here. 
*/ - if (unlikely(rq->rt.rt_nr_running > 1)) { + if (unlikely(rq->rt.overloaded)) { spin_lock_irq(&rq->lock); push_rt_tasks(rq); spin_unlock_irq(&rq->lock); @@ -683,7 +688,8 @@ static void wakeup_balance_rt(struct rq { if (unlikely(rt_task(p)) && !task_running(rq, p) && - (p->prio >= rq->curr->prio)) + (p->prio >= rq->rt.highest_prio) && + rq->rt.overloaded) push_rt_tasks(rq); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0016-sched-RT-balance-avoid-overloading.patch�����������������������������������������������0000664�0000764�0000764�00000005630�11041657731�021256� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 363a36e818e14bec32470f0d9e196fdef49ca293 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: RT-balance, avoid overloading This patch changes the searching for a run queue by a waking RT task to try to pick another runqueue if the currently running task is an RT task. The reason is that RT tasks behave different than normal tasks. Preempting a normal task to run a RT task to keep its cache hot is fine, because the preempted non-RT task may wait on that same runqueue to run again unless the migration thread comes along and pulls it off. RT tasks behave differently. If one is preempted, it makes an active effort to continue to run. So by having a high priority task preempt a lower priority RT task, that lower RT task will then quickly try to run on another runqueue. This will cause that lower RT task to replace its nice hot cache (and TLB) with a completely cold one. This is for the hope that the new high priority RT task will keep its cache hot. Remeber that this high priority RT task was just woken up. So it may likely have been sleeping for several milliseconds, and will end up with a cold cache anyway. RT tasks run till they voluntarily stop, or are preempted by a higher priority task. This means that it is unlikely that the woken RT task will have a hot cache to wake up to. So pushing off a lower RT task is just killing its cache for no good reason. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -156,11 +156,23 @@ static int select_task_rq_rt(struct task struct rq *rq = task_rq(p); /* - * If the task will not preempt the RQ, try to find a better RQ - * before we even activate the task + * If the current task is an RT task, then + * try to see if we can wake this RT task up on another + * runqueue. Otherwise simply start this RT task + * on its current runqueue. + * + * We want to avoid overloading runqueues. 
Even if + * the RT task is of higher priority than the current RT task. + * RT tasks behave differently than other tasks. If + * one gets preempted, we try to push it off to another queue. + * So trying to keep a preempting RT task on the same + * cache hot CPU will force the running RT task to + * a cold CPU. So we waste all the cache for the lower + * RT task in hopes of saving some of a RT task + * that is just being woken and probably will have + * cold cache anyway. */ - if ((p->prio >= rq->rt.highest_prio) - && (p->nr_cpus_allowed > 1)) { + if (unlikely(rt_task(rq->curr))) { int cpu = find_lowest_rq(p); return (cpu == -1) ? task_cpu(p) : cpu; ��������������������������������������������������������������������������������������������������������patches/0017-sched-break-out-early-if-RT-task-cannot-be-migrated.patch������������������������������0000664�0000764�0000764�00000002016�11041657731�024253� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 8c0e147455278f9b1ea9f102dfe9d1961ff4c8fe Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: break out early if RT task cannot be migrated We don't need to bother searching if the task cannot be migrated Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -172,7 +172,8 @@ static int select_task_rq_rt(struct task * that is just being woken and probably will have * cold cache anyway. */ - if (unlikely(rt_task(rq->curr))) { + if (unlikely(rt_task(rq->curr)) && + (p->nr_cpus_allowed > 1)) { int cpu = find_lowest_rq(p); return (cpu == -1) ? 
task_cpu(p) : cpu; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0018-sched-RT-balance-optimize.patch��������������������������������������������������������0000664�0000764�0000764�00000004612�11041657730�017505� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 1dcf532e3660c064d4ff53deabcd6167ff854af8 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: RT-balance, optimize We can cheaply track the number of bits set in the cpumask for the lowest priority CPUs. Therefore, compute the mask's weight and use it to skip the optimal domain search logic when there is only one CPU available. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -299,7 +299,7 @@ static int find_lowest_cpus(struct task_ int cpu; cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask); int lowest_prio = -1; - int ret = 0; + int count = 0; cpus_clear(*lowest_mask); cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed); @@ -312,7 +312,7 @@ static int find_lowest_cpus(struct task_ /* We look for lowest RT prio or non-rt CPU */ if (rq->rt.highest_prio >= MAX_RT_PRIO) { - if (ret) + if (count) cpus_clear(*lowest_mask); cpu_set(rq->cpu, *lowest_mask); return 1; @@ -324,14 +324,17 @@ static int find_lowest_cpus(struct task_ if (rq->rt.highest_prio > lowest_prio) { /* new low - clear old data */ lowest_prio = rq->rt.highest_prio; - cpus_clear(*lowest_mask); + if (count) { + cpus_clear(*lowest_mask); + count = 0; + } } cpu_set(rq->cpu, *lowest_mask); - ret = 1; + count++; } } - return ret; + return count; } static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask) @@ -355,9 +358,17 @@ static int find_lowest_rq(struct task_st cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask); int this_cpu = smp_processor_id(); int cpu = task_cpu(task); + int count = find_lowest_cpus(task, lowest_mask); - if (!find_lowest_cpus(task, lowest_mask)) - return -1; + if (!count) + return -1; /* No targets found */ + + /* + * There is no sense in performing an optimal search if only one + * target is found. 
+ */ + if (count == 1) + return first_cpu(*lowest_mask); /* * At this point we have built a mask of cpus representing the ����������������������������������������������������������������������������������������������������������������������patches/0019-sched-RT-balance-optimize-cpu-search.patch���������������������������������������������0000664�0000764�0000764�00000005732�11041657734�021546� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 7c271179e3834348e09c96826a6209c8e122ca9a Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: RT-balance, optimize cpu search This patch removes several cpumask operations by keeping track of the first of the CPUS that is of the lowest priority. When the search for the lowest priority runqueue is completed, all the bits up to the first CPU with the lowest priority runqueue is cleared. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 49 ++++++++++++++++++++++++++++++++++++------------- 1 file changed, 36 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -292,29 +292,36 @@ static struct task_struct *pick_next_hig } static DEFINE_PER_CPU(cpumask_t, local_cpu_mask); -static DEFINE_PER_CPU(cpumask_t, valid_cpu_mask); static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask) { - int cpu; - cpumask_t *valid_mask = &__get_cpu_var(valid_cpu_mask); int lowest_prio = -1; + int lowest_cpu = -1; int count = 0; + int cpu; - cpus_clear(*lowest_mask); - cpus_and(*valid_mask, cpu_online_map, task->cpus_allowed); + cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed); /* * Scan each rq for the lowest prio. */ - for_each_cpu_mask(cpu, *valid_mask) { + for_each_cpu_mask(cpu, *lowest_mask) { struct rq *rq = cpu_rq(cpu); /* We look for lowest RT prio or non-rt CPU */ if (rq->rt.highest_prio >= MAX_RT_PRIO) { - if (count) + /* + * if we already found a low RT queue + * and now we found this non-rt queue + * clear the mask and set our bit. + * Otherwise just return the queue as is + * and the count==1 will cause the algorithm + * to use the first bit found. + */ + if (lowest_cpu != -1) { cpus_clear(*lowest_mask); - cpu_set(rq->cpu, *lowest_mask); + cpu_set(rq->cpu, *lowest_mask); + } return 1; } @@ -324,13 +331,29 @@ static int find_lowest_cpus(struct task_ if (rq->rt.highest_prio > lowest_prio) { /* new low - clear old data */ lowest_prio = rq->rt.highest_prio; - if (count) { - cpus_clear(*lowest_mask); - count = 0; - } + lowest_cpu = cpu; + count = 0; } - cpu_set(rq->cpu, *lowest_mask); count++; + } else + cpu_clear(cpu, *lowest_mask); + } + + /* + * Clear out all the set bits that represent + * runqueues that were of higher prio than + * the lowest_prio. + */ + if (lowest_cpu > 0) { + /* + * Perhaps we could add another cpumask op to + * zero out bits. Like cpu_zero_bits(cpumask, nrbits); + * Then that could be optimized to use memset and such. 
+ */ + for_each_cpu_mask(cpu, *lowest_mask) { + if (cpu >= lowest_cpu) + break; + cpu_clear(cpu, *lowest_mask); } } ��������������������������������������patches/0020-sched-RT-balance-on-new-task.patch�����������������������������������������������������0000664�0000764�0000764�00000001302�11041657735�017777� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 05bf2130444fa78c75f46b47b39c39f0e337f647 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <srostedt@redhat.com> Date: Tue, 11 Dec 2007 10:02:38 +0100 Subject: [PATCH] sched: RT-balance on new task rt-balance when creating new tasks. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1707,6 +1707,7 @@ void fastcall wake_up_new_task(struct ta inc_nr_running(p, rq); } check_preempt_curr(rq, p); + wakeup_balance_rt(rq, p); task_rq_unlock(rq, &flags); } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0021-sched-clean-up-pick_next_highest_task_rt.patch�����������������������������������������0000664�0000764�0000764�00000002352�11041657735�022665� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From f5e7fef1687db918dab55b3196bd5c0c7b3060b0 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: clean up pick_next_highest_task_rt() clean up pick_next_highest_task_rt(). 
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -238,8 +238,7 @@ static int pick_rt_task(struct rq *rq, s } /* Return the second highest RT task, NULL otherwise */ -static struct task_struct *pick_next_highest_task_rt(struct rq *rq, - int cpu) +static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu) { struct rt_prio_array *array = &rq->rt.active; struct task_struct *next; @@ -266,7 +265,8 @@ static struct task_struct *pick_next_hig if (queue->next->next != queue) { /* same prio task */ - next = list_entry(queue->next->next, struct task_struct, run_list); + next = list_entry(queue->next->next, struct task_struct, + run_list); if (pick_rt_task(rq, next, cpu)) goto out; } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0022-sched-clean-up-find_lock_lowest_rq.patch�����������������������������������������������0000664�0000764�0000764�00000002604�11041657732�021464� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 0779dd52a040e82f8572fc18b9d56f5cd30a286e Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: clean up find_lock_lowest_rq() clean up find_lock_lowest_rq(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -434,12 +434,11 @@ static int find_lowest_rq(struct task_st } /* Will lock the rq it finds */ -static struct rq *find_lock_lowest_rq(struct task_struct *task, - struct rq *rq) +static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) { struct rq *lowest_rq = NULL; - int cpu; int tries; + int cpu; for (tries = 0; tries < RT_MAX_TRIES; tries++) { cpu = find_lowest_rq(task); @@ -458,9 +457,11 @@ static struct rq *find_lock_lowest_rq(st * Also make sure that it wasn't scheduled on its rq. 
*/ if (unlikely(task_rq(task) != rq || - !cpu_isset(lowest_rq->cpu, task->cpus_allowed) || + !cpu_isset(lowest_rq->cpu, + task->cpus_allowed) || task_running(rq, task) || !task->se.on_rq)) { + spin_unlock(&lowest_rq->lock); lowest_rq = NULL; break; ����������������������������������������������������������������������������������������������������������������������������patches/0024-sched-clean-up-kernel-sched_rt.c.patch�������������������������������������������������0000664�0000764�0000764�00000002305�11041657733�020730� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 602b4d5727366261c3a6aca52189ae6304adf38c Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: clean up kernel/sched_rt.c clean up whitespace damage and missing comments in kernel/sched_rt.c. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -4,16 +4,24 @@ */ #ifdef CONFIG_SMP + +/* + * The "RT overload" flag: it gets set if a CPU has more than + * one runnable RT task. + */ static cpumask_t rt_overload_mask; static atomic_t rto_count; + static inline int rt_overloaded(void) { return atomic_read(&rto_count); } + static inline cpumask_t *rt_overload(void) { return &rt_overload_mask; } + static inline void rt_set_overload(struct rq *rq) { rq->rt.overloaded = 1; @@ -28,6 +36,7 @@ static inline void rt_set_overload(struc wmb(); atomic_inc(&rto_count); } + static inline void rt_clear_overload(struct rq *rq) { /* the order here really doesn't matter */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0025-sched-remove-rt_overload.patch���������������������������������������������������������0000664�0000764�0000764�00000002425�11041657734�017554� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 03c269c753ca432cc33c3039c743bfceba10a3e9 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: remove rt_overload() remove rt_overload() - it's an unnecessary indirection. 
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -17,11 +17,6 @@ static inline int rt_overloaded(void) return atomic_read(&rto_count); } -static inline cpumask_t *rt_overload(void) -{ - return &rt_overload_mask; -} - static inline void rt_set_overload(struct rq *rq) { rq->rt.overloaded = 1; @@ -586,7 +581,6 @@ static int pull_rt_task(struct rq *this_ struct task_struct *next; struct task_struct *p; struct rq *src_rq; - cpumask_t *rto_cpumask; int this_cpu = this_rq->cpu; int cpu; int ret = 0; @@ -604,9 +598,7 @@ static int pull_rt_task(struct rq *this_ next = pick_next_task_rt(this_rq); - rto_cpumask = rt_overload(); - - for_each_cpu_mask(cpu, *rto_cpumask) { + for_each_cpu_mask(cpu, rt_overload_mask) { if (this_cpu == cpu) continue; �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0026-sched-remove-leftover-debugging.patch��������������������������������������������������0000664�0000764�0000764�00000002557�11041657731�021017� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ef0ff6d04e7609127b8227f8c0b5be2977aae986 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: remove leftover debugging remove leftover debugging. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 8 -------- 1 file changed, 8 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -249,8 +249,6 @@ static struct task_struct *pick_next_hig struct list_head *queue; int idx; - assert_spin_locked(&rq->lock); - if (likely(rq->rt.rt_nr_running < 2)) return NULL; @@ -496,8 +494,6 @@ static int push_rt_task(struct rq *rq) int ret = 0; int paranoid = RT_MAX_TRIES; - assert_spin_locked(&rq->lock); - if (!rq->rt.overloaded) return 0; @@ -542,8 +538,6 @@ static int push_rt_task(struct rq *rq) goto out; } - assert_spin_locked(&lowest_rq->lock); - deactivate_task(rq, next_task, 0); set_task_cpu(next_task, lowest_rq->cpu); activate_task(lowest_rq, next_task, 0); @@ -585,8 +579,6 @@ static int pull_rt_task(struct rq *this_ int cpu; int ret = 0; - assert_spin_locked(&this_rq->lock); - /* * If cpusets are used, and we have overlapping * run queue cpusets, then this algorithm may not catch all. 
�������������������������������������������������������������������������������������������������������������������������������������������������patches/0027-sched-clean-up-pull_rt_task.patch������������������������������������������������������0000664�0000764�0000764�00000005065�11041657731�020150� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dec733755cbc9c260d74ae3163029a87f98932f7 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:39 +0100 Subject: [PATCH] sched: clean up pull_rt_task() clean up pull_rt_task(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -572,12 +572,9 @@ static void push_rt_tasks(struct rq *rq) static int pull_rt_task(struct rq *this_rq) { - struct task_struct *next; - struct task_struct *p; + int this_cpu = this_rq->cpu, ret = 0, cpu; + struct task_struct *p, *next; struct rq *src_rq; - int this_cpu = this_rq->cpu; - int cpu; - int ret = 0; /* * If cpusets are used, and we have overlapping @@ -604,23 +601,25 @@ static int pull_rt_task(struct rq *this_ if (double_lock_balance(this_rq, src_rq)) { /* unlocked our runqueue lock */ struct task_struct *old_next = next; + next = pick_next_task_rt(this_rq); if (next != old_next) ret = 1; } - if (likely(src_rq->rt.rt_nr_running <= 1)) + if (likely(src_rq->rt.rt_nr_running <= 1)) { /* * Small chance that this_rq->curr changed * but it's really harmless here. */ rt_clear_overload(this_rq); - else + } else { /* * Heh, the src_rq is now overloaded, since * we already have the src_rq lock, go straight * to pulling tasks from it. */ goto try_pulling; + } spin_unlock(&src_rq->lock); continue; } @@ -634,6 +633,7 @@ static int pull_rt_task(struct rq *this_ */ if (double_lock_balance(this_rq, src_rq)) { struct task_struct *old_next = next; + next = pick_next_task_rt(this_rq); if (next != old_next) ret = 1; @@ -670,7 +670,7 @@ static int pull_rt_task(struct rq *this_ */ if (p->prio < src_rq->curr->prio || (next && next->prio < src_rq->curr->prio)) - goto bail; + goto out; ret = 1; @@ -682,9 +682,7 @@ static int pull_rt_task(struct rq *this_ * case there's an even higher prio task * in another runqueue. (low likelyhood * but possible) - */ - - /* + * * Update next so that we won't pick a task * on another cpu with a priority lower (or equal) * than the one we just picked. 
@@ -692,7 +690,7 @@ static int pull_rt_task(struct rq *this_ next = p; } - bail: + out: spin_unlock(&src_rq->lock); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0028-sched-clean-up-schedule_balance_rt.patch�����������������������������������������������0000664�0000764�0000764�00000001767�11041657733�021423� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From d9896b729e0088c858793d1f38a4ab19d4ba7739 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:42 +0100 Subject: [PATCH] sched: clean up schedule_balance_rt() clean up schedule_balance_rt(). Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -697,12 +697,10 @@ static int pull_rt_task(struct rq *this_ return ret; } -static void schedule_balance_rt(struct rq *rq, - struct task_struct *prev) +static void schedule_balance_rt(struct rq *rq, struct task_struct *prev) { /* Try to pull RT tasks here if we lower this rq's prio */ - if (unlikely(rt_task(prev)) && - rq->rt.highest_prio > prev->prio) + if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio) pull_rt_task(rq); } ���������patches/0029-sched-add-sched-domain-roots.patch�����������������������������������������������������0000664�0000764�0000764�00000016400�11041657730�020164� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From d744c377ea0bb1daff17fdefc048c9ae30787873 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:43 +0100 Subject: [PATCH] sched: add sched-domain roots We add the notion of a root-domain which will be used later to rescope global variables to per-domain variables. Each exclusive cpuset essentially defines an island domain by fully partitioning the member cpus from any other cpuset. However, we currently still maintain some policy/state as global variables which transcend all cpusets. Consider, for instance, rt-overload state. Whenever a new exclusive cpuset is created, we also create a new root-domain object and move each cpu member to the root-domain's span. By default the system creates a single root-domain with all cpus as members (mimicking the global state we have today). We add some plumbing for storing class specific data in our root-domain. 
Whenever a RQ is switching root-domains (because of repartitioning) we give each sched_class the opportunity to remove any state from its old domain and add state to the new one. This logic doesn't have any clients yet but it will later in the series. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> CC: Paul Jackson <pj@sgi.com> CC: Simon Derr <simon.derr@bull.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 3 + kernel/sched.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 121 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -849,6 +849,9 @@ struct sched_class { void (*task_tick) (struct rq *rq, struct task_struct *p); void (*task_new) (struct rq *rq, struct task_struct *p); void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); + + void (*join_domain)(struct rq *rq); + void (*leave_domain)(struct rq *rq); }; struct load_weight { Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -273,6 +273,28 @@ struct rt_rq { int overloaded; }; +#ifdef CONFIG_SMP + +/* + * We add the notion of a root-domain which will be used to define per-domain + * variables. Each exclusive cpuset essentially defines an island domain by + * fully partitioning the member cpus from any other cpuset. Whenever a new + * exclusive cpuset is created, we also create and attach a new root-domain + * object. + * + * By default the system creates a single root-domain with all cpus as + * members (mimicking the global state we have today). + */ +struct root_domain { + atomic_t refcount; + cpumask_t span; + cpumask_t online; +}; + +static struct root_domain def_root_domain; + +#endif + /* * This is the main, per-CPU runqueue data structure. * @@ -330,6 +352,7 @@ struct rq { atomic_t nr_iowait; #ifdef CONFIG_SMP + struct root_domain *rd; struct sched_domain *sd; /* For active balancing */ @@ -5539,6 +5562,15 @@ migration_call(struct notifier_block *nf case CPU_ONLINE_FROZEN: /* Strictly unnecessary, as first user will wake it. 
*/ wake_up_process(cpu_rq(cpu)->migration_thread); + + /* Update our root-domain */ + rq = cpu_rq(cpu); + spin_lock_irqsave(&rq->lock, flags); + if (rq->rd) { + BUG_ON(!cpu_isset(cpu, rq->rd->span)); + cpu_set(cpu, rq->rd->online); + } + spin_unlock_irqrestore(&rq->lock, flags); break; #ifdef CONFIG_HOTPLUG_CPU @@ -5589,6 +5621,17 @@ migration_call(struct notifier_block *nf } spin_unlock_irq(&rq->lock); break; + + case CPU_DOWN_PREPARE: + /* Update our root-domain */ + rq = cpu_rq(cpu); + spin_lock_irqsave(&rq->lock, flags); + if (rq->rd) { + BUG_ON(!cpu_isset(cpu, rq->rd->span)); + cpu_clear(cpu, rq->rd->online); + } + spin_unlock_irqrestore(&rq->lock, flags); + break; #endif case CPU_LOCK_RELEASE: mutex_unlock(&sched_hotcpu_mutex); @@ -5780,11 +5823,69 @@ sd_parent_degenerate(struct sched_domain return 1; } +static void rq_attach_root(struct rq *rq, struct root_domain *rd) +{ + unsigned long flags; + const struct sched_class *class; + + spin_lock_irqsave(&rq->lock, flags); + + if (rq->rd) { + struct root_domain *old_rd = rq->rd; + + for (class = sched_class_highest; class; class = class->next) + if (class->leave_domain) + class->leave_domain(rq); + + if (atomic_dec_and_test(&old_rd->refcount)) + kfree(old_rd); + } + + atomic_inc(&rd->refcount); + rq->rd = rd; + + for (class = sched_class_highest; class; class = class->next) + if (class->join_domain) + class->join_domain(rq); + + spin_unlock_irqrestore(&rq->lock, flags); +} + +static void init_rootdomain(struct root_domain *rd, const cpumask_t *map) +{ + memset(rd, 0, sizeof(*rd)); + + rd->span = *map; + cpus_and(rd->online, rd->span, cpu_online_map); +} + +static void init_defrootdomain(void) +{ + cpumask_t cpus = CPU_MASK_ALL; + + init_rootdomain(&def_root_domain, &cpus); + atomic_set(&def_root_domain.refcount, 1); +} + +static struct root_domain *alloc_rootdomain(const cpumask_t *map) +{ + struct root_domain *rd; + + rd = kmalloc(sizeof(*rd), GFP_KERNEL); + if (!rd) + return NULL; + + init_rootdomain(rd, map); + + return rd; +} + /* * Attach the domain 'sd' to 'cpu' as its base domain. Callers must * hold the hotplug lock. */ -static void cpu_attach_domain(struct sched_domain *sd, int cpu) +static void cpu_attach_domain(struct sched_domain *sd, + struct root_domain *rd, int cpu) { struct rq *rq = cpu_rq(cpu); struct sched_domain *tmp; @@ -5809,6 +5910,7 @@ static void cpu_attach_domain(struct sch sched_domain_debug(sd, cpu); + rq_attach_root(rq, rd); rcu_assign_pointer(rq->sd, sd); } @@ -6177,6 +6279,7 @@ static void init_sched_groups_power(int static int build_sched_domains(const cpumask_t *cpu_map) { int i; + struct root_domain *rd; #ifdef CONFIG_NUMA struct sched_group **sched_group_nodes = NULL; int sd_allnodes = 0; @@ -6193,6 +6296,12 @@ static int build_sched_domains(const cpu sched_group_nodes_bycpu[first_cpu(*cpu_map)] = sched_group_nodes; #endif + rd = alloc_rootdomain(cpu_map); + if (!rd) { + printk(KERN_WARNING "Cannot alloc root domain\n"); + return -ENOMEM; + } + /* * Set up domains for cpus specified by the cpu_map. 
*/ @@ -6409,7 +6518,7 @@ static int build_sched_domains(const cpu #else sd = &per_cpu(phys_domains, i); #endif - cpu_attach_domain(sd, i); + cpu_attach_domain(sd, rd, i); } return 0; @@ -6467,7 +6576,7 @@ static void detach_destroy_domains(const unregister_sched_domain_sysctl(); for_each_cpu_mask(i, *cpu_map) - cpu_attach_domain(NULL, i); + cpu_attach_domain(NULL, &def_root_domain, i); synchronize_sched(); arch_destroy_sched_domains(cpu_map); } @@ -6700,6 +6809,10 @@ void __init sched_init(void) int highest_cpu = 0; int i, j; +#ifdef CONFIG_SMP + init_defrootdomain(); +#endif + for_each_possible_cpu(i) { struct rt_prio_array *array; struct rq *rq; @@ -6739,6 +6852,8 @@ void __init sched_init(void) rq->cpu_load[j] = 0; #ifdef CONFIG_SMP rq->sd = NULL; + rq->rd = NULL; + rq_attach_root(rq, &def_root_domain); rq->active_balance = 0; rq->next_balance = jiffies; rq->push_cpu = 0; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0030-sched-update-root-domain-spans-upon-departure.patch������������������������������������0000664�0000764�0000764�00000001613�11041657733�023534� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From e2f365b65c687741cd14bac47d2c5c2a28a136c2 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:43 +0100 Subject: [PATCH] sched: update root-domain spans upon departure We shouldnt leave cpus enabled in the spans if that RQ has left the domain. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -5837,6 +5837,9 @@ static void rq_attach_root(struct rq *rq if (class->leave_domain) class->leave_domain(rq); + cpu_clear(rq->cpu, old_rd->span); + cpu_clear(rq->cpu, old_rd->online); + if (atomic_dec_and_test(&old_rd->refcount)) kfree(old_rd); } ���������������������������������������������������������������������������������������������������������������������patches/0031-Subject-SCHED-Only-balance-our-RT-tasks-within-ou.patch��������������������������������0000664�0000764�0000764�00000011572�11041657732�023572� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 04746d899536b279a7bf3298d4a84a1d5baf9090 Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:43 +0100 Subject: [PATCH] Subject: SCHED - Only balance our RT tasks within our We move the rt-overload data as the first global to per-domain reclassification. 
This limits the scope of overload related cache-line bouncing to stay with a specified partition instead of affecting all cpus in the system. Finally, we limit the scope of find_lowest_cpu searches to the domain instead of the entire system. Note that we would always respect domain boundaries even without this patch, but we first would scan potentially all cpus before whittling the list down. Now we can avoid looking at RQs that are out of scope, again reducing cache-line hits. Note: In some cases, task->cpus_allowed will effectively reduce our search to within our domain. However, I believe there are cases where the cpus_allowed mask may be all ones and therefore we err on the side of caution. If it can be optimized later, so be it. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 7 ++++++ kernel/sched_rt.c | 57 +++++++++++++++++++++++++++++------------------------- 2 files changed, 38 insertions(+), 26 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -289,6 +289,13 @@ struct root_domain { atomic_t refcount; cpumask_t span; cpumask_t online; + + /* + * The "RT overload" flag: it gets set if a CPU has more than + * one runnable RT task. + */ + cpumask_t rto_mask; + atomic_t rto_count; }; static struct root_domain def_root_domain; Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -5,22 +5,14 @@ #ifdef CONFIG_SMP -/* - * The "RT overload" flag: it gets set if a CPU has more than - * one runnable RT task. - */ -static cpumask_t rt_overload_mask; -static atomic_t rto_count; - -static inline int rt_overloaded(void) +static inline int rt_overloaded(struct rq *rq) { - return atomic_read(&rto_count); + return atomic_read(&rq->rd->rto_count); } static inline void rt_set_overload(struct rq *rq) { - rq->rt.overloaded = 1; - cpu_set(rq->cpu, rt_overload_mask); + cpu_set(rq->cpu, rq->rd->rto_mask); /* * Make sure the mask is visible before we set * the overload count. That is checked to determine @@ -29,23 +21,25 @@ static inline void rt_set_overload(struc * updated yet. */ wmb(); - atomic_inc(&rto_count); + atomic_inc(&rq->rd->rto_count); } static inline void rt_clear_overload(struct rq *rq) { /* the order here really doesn't matter */ - atomic_dec(&rto_count); - cpu_clear(rq->cpu, rt_overload_mask); - rq->rt.overloaded = 0; + atomic_dec(&rq->rd->rto_count); + cpu_clear(rq->cpu, rq->rd->rto_mask); } static void update_rt_migration(struct rq *rq) { - if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) + if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) { rt_set_overload(rq); - else + rq->rt.overloaded = 1; + } else { rt_clear_overload(rq); + rq->rt.overloaded = 0; + } } #endif /* CONFIG_SMP */ @@ -302,7 +296,7 @@ static int find_lowest_cpus(struct task_ int count = 0; int cpu; - cpus_and(*lowest_mask, cpu_online_map, task->cpus_allowed); + cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed); /* * Scan each rq for the lowest prio. @@ -576,18 +570,12 @@ static int pull_rt_task(struct rq *this_ struct task_struct *p, *next; struct rq *src_rq; - /* - * If cpusets are used, and we have overlapping - * run queue cpusets, then this algorithm may not catch all. 
- * This is just the price you pay on trying to keep - * dirtying caches down on large SMP machines. - */ - if (likely(!rt_overloaded())) + if (likely(!rt_overloaded(this_rq))) return 0; next = pick_next_task_rt(this_rq); - for_each_cpu_mask(cpu, rt_overload_mask) { + for_each_cpu_mask(cpu, this_rq->rd->rto_mask) { if (this_cpu == cpu) continue; @@ -805,6 +793,20 @@ static void task_tick_rt(struct rq *rq, } } +/* Assumes rq->lock is held */ +static void join_domain_rt(struct rq *rq) +{ + if (rq->rt.overloaded) + rt_set_overload(rq); +} + +/* Assumes rq->lock is held */ +static void leave_domain_rt(struct rq *rq) +{ + if (rq->rt.overloaded) + rt_clear_overload(rq); +} + static void set_curr_task_rt(struct rq *rq) { struct task_struct *p = rq->curr; @@ -834,4 +836,7 @@ const struct sched_class rt_sched_class .set_curr_task = set_curr_task_rt, .task_tick = task_tick_rt, + + .join_domain = join_domain_rt, + .leave_domain = leave_domain_rt, }; ��������������������������������������������������������������������������������������������������������������������������������������patches/0032-sched-fix-sched_rt.c-join-leave_domain.patch�������������������������������������������0000664�0000764�0000764�00000004005�11041657732�022107� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From d64bfbadc24903458938fed1704488eb4eef0487 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:43 +0100 Subject: [PATCH] sched: fix sched_rt.c:join/leave_domain fix build bug in sched_rt.c:join/leave_domain and make them only be included on SMP builds. 
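Put differently, everything that touches the rt overload tracking has to live under the same CONFIG_SMP conditional as the tracking itself, so a UP build never references rt_set_overload()/rt_clear_overload(). A tiny stand-alone sketch of that placement (toy structure and names, not the real sched_class; only the #ifdef pattern matters here):

struct rq;

struct toy_class {
        void (*task_tick)(struct rq *rq);
        void (*join_domain)(struct rq *rq);     /* only wired up on SMP */
        void (*leave_domain)(struct rq *rq);    /* only wired up on SMP */
};

static void toy_tick(struct rq *rq)
{
}

#ifdef CONFIG_SMP
/* These would call the SMP-only overload helpers. */
static void toy_join_domain(struct rq *rq)
{
        /* rt_set_overload(rq) if this runqueue is overloaded */
}

static void toy_leave_domain(struct rq *rq)
{
        /* rt_clear_overload(rq) if this runqueue was overloaded */
}
#endif

static const struct toy_class toy_rt_class = {
        .task_tick      = toy_tick,
#ifdef CONFIG_SMP
        .join_domain    = toy_join_domain,
        .leave_domain   = toy_leave_domain,
#endif
};

On UP the two pointers simply stay NULL, so nothing in the build ever needs the overload machinery.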
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -761,6 +761,20 @@ static void set_cpus_allowed_rt(struct t p->cpus_allowed = *new_mask; p->nr_cpus_allowed = weight; } +/* Assumes rq->lock is held */ +static void join_domain_rt(struct rq *rq) +{ + if (rq->rt.overloaded) + rt_set_overload(rq); +} + +/* Assumes rq->lock is held */ +static void leave_domain_rt(struct rq *rq) +{ + if (rq->rt.overloaded) + rt_clear_overload(rq); +} + #else /* CONFIG_SMP */ # define schedule_tail_balance_rt(rq) do { } while (0) # define schedule_balance_rt(rq, prev) do { } while (0) @@ -793,20 +807,6 @@ static void task_tick_rt(struct rq *rq, } } -/* Assumes rq->lock is held */ -static void join_domain_rt(struct rq *rq) -{ - if (rq->rt.overloaded) - rt_set_overload(rq); -} - -/* Assumes rq->lock is held */ -static void leave_domain_rt(struct rq *rq) -{ - if (rq->rt.overloaded) - rt_clear_overload(rq); -} - static void set_curr_task_rt(struct rq *rq) { struct task_struct *p = rq->curr; @@ -832,11 +832,10 @@ const struct sched_class rt_sched_class .load_balance = load_balance_rt, .move_one_task = move_one_task_rt, .set_cpus_allowed = set_cpus_allowed_rt, + .join_domain = join_domain_rt, + .leave_domain = leave_domain_rt, #endif .set_curr_task = set_curr_task_rt, .task_tick = task_tick_rt, - - .join_domain = join_domain_rt, - .leave_domain = leave_domain_rt, }; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0034-sched-style-cleanup-2.patch������������������������������������������������������������0000664�0000764�0000764�00000007233�11041657734�016667� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 9a98470df9d0bd5860ea0c9a1c16d373c242e248 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:43 +0100 Subject: [PATCH] sched: style cleanup, #2 style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 26399 2578 48 29025 7161 sched.o.before 26399 2578 48 29025 7161 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 28 +++++++++++++++------------- 1 file changed, 15 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -186,12 +186,12 @@ static struct cfs_rq *init_cfs_rq_p[NR_C * Every task in system belong to this group at bootup. 
*/ struct task_group init_task_group = { - .se = init_sched_entity_p, + .se = init_sched_entity_p, .cfs_rq = init_cfs_rq_p, }; #ifdef CONFIG_FAIR_USER_SCHED -# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD +# define INIT_TASK_GRP_LOAD (2*NICE_0_LOAD) #else # define INIT_TASK_GRP_LOAD NICE_0_LOAD #endif @@ -277,8 +277,8 @@ struct rt_rq { /* * We add the notion of a root-domain which will be used to define per-domain - * variables. Each exclusive cpuset essentially defines an island domain by - * fully partitioning the member cpus from any other cpuset. Whenever a new + * variables. Each exclusive cpuset essentially defines an island domain by + * fully partitioning the member cpus from any other cpuset. Whenever a new * exclusive cpuset is created, we also create and attach a new root-domain * object. * @@ -290,12 +290,12 @@ struct root_domain { cpumask_t span; cpumask_t online; - /* + /* * The "RT overload" flag: it gets set if a CPU has more than * one runnable RT task. */ cpumask_t rto_mask; - atomic_t rto_count; + atomic_t rto_count; }; static struct root_domain def_root_domain; @@ -359,7 +359,7 @@ struct rq { atomic_t nr_iowait; #ifdef CONFIG_SMP - struct root_domain *rd; + struct root_domain *rd; struct sched_domain *sd; /* For active balancing */ @@ -5053,7 +5053,7 @@ int set_cpus_allowed(struct task_struct if (p->sched_class->set_cpus_allowed) p->sched_class->set_cpus_allowed(p, &new_mask); else { - p->cpus_allowed = new_mask; + p->cpus_allowed = new_mask; p->nr_cpus_allowed = cpus_weight(new_mask); } @@ -5840,9 +5840,10 @@ static void rq_attach_root(struct rq *rq if (rq->rd) { struct root_domain *old_rd = rq->rd; - for (class = sched_class_highest; class; class = class->next) + for (class = sched_class_highest; class; class = class->next) { if (class->leave_domain) class->leave_domain(rq); + } cpu_clear(rq->cpu, old_rd->span); cpu_clear(rq->cpu, old_rd->online); @@ -5854,9 +5855,10 @@ static void rq_attach_root(struct rq *rq atomic_inc(&rd->refcount); rq->rd = rd; - for (class = sched_class_highest; class; class = class->next) + for (class = sched_class_highest; class; class = class->next) { if (class->join_domain) class->join_domain(rq); + } spin_unlock_irqrestore(&rq->lock, flags); } @@ -5891,11 +5893,11 @@ static struct root_domain *alloc_rootdom } /* - * Attach the domain 'sd' to 'cpu' as its base domain. Callers must + * Attach the domain 'sd' to 'cpu' as its base domain. Callers must * hold the hotplug lock. 
*/ -static void cpu_attach_domain(struct sched_domain *sd, - struct root_domain *rd, int cpu) +static void +cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu) { struct rq *rq = cpu_rq(cpu); struct sched_domain *tmp; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0035-sched-add-credits-for-RT-balancing-improvements.patch����������������������������������0000664�0000764�0000764�00000001567�11041657735�023705� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ce3b1244b9da66db0c51602a59bf1e4de9a75686 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:46 +0100 Subject: [PATCH] sched: add credits for RT balancing improvements add credits for RT balancing improvements. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -22,6 +22,8 @@ * by Peter Williams * 2007-05-06 Interactivity improvements to CFS by Mike Galbraith * 2007-07-01 Group scheduling enhancements by Srivatsa Vaddagiri + * 2007-11-29 RT balancing improvements by Steven Rostedt, Gregory Haskins, + * Thomas Gleixner, Mike Kravetz */ #include <linux/mm.h> �����������������������������������������������������������������������������������������������������������������������������������������patches/0037-sched-whitespace-cleanups-in-topology.h.patch������������������������������������������0000664�0000764�0000764�00000001516�11041657735�022415� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From e7e60ed8c8913b8bcebab7e9147d32b3aa430fb8 Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Tue, 11 Dec 2007 10:02:47 +0100 Subject: [PATCH] sched: whitespace cleanups in topology.h whitespace cleanups in topology.h. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/topology.h =================================================================== --- linux-2.6.24.7.orig/include/linux/topology.h +++ linux-2.6.24.7/include/linux/topology.h @@ -5,7 +5,7 @@ * * Copyright (C) 2002, IBM Corp. * - * All rights reserved. + * All rights reserved. 
* * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0038-sched-no-need-for-affine-wakeup-balancing-in.patch�������������������������������������0000664�0000764�0000764�00000002072�11041673262�023106� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From fad7c246ce9627cc06fa936f41b5cd40f5097f13 Mon Sep 17 00:00:00 2001 From: Dmitry Adamushko <dmitry.adamushko@gmail.com> Date: Tue, 11 Dec 2007 10:02:47 +0100 Subject: [PATCH] sched: no need for 'affine wakeup' balancing in No need to do a check for 'affine wakeup and passive balancing possibilities' in select_task_rq_fair() when task_cpu(p) == this_cpu. I guess, this part got missed upon introduction of per-sched_class select_task_rq() in try_to_wake_up(). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_fair.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -896,6 +896,9 @@ static int select_task_rq_fair(struct ta this_cpu = smp_processor_id(); new_cpu = cpu; + if (cpu == this_cpu) + goto out_set_cpu; + for_each_domain(this_cpu, sd) { if (cpu_isset(cpu, sd->span)) { this_sd = sd; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0039-sched-get-rid-of-new_cpu-in-try_to_wake_up.patch���������������������������������������0000664�0000764�0000764�00000003053�11041657734�022774� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 48bbd36b0a1a82e2601ad726d9f1e1338e2af12b Mon Sep 17 00:00:00 2001 From: Dmitry Adamushko <dmitry.adamushko@gmail.com> Date: Tue, 11 Dec 2007 10:02:47 +0100 Subject: [PATCH] sched: get rid of 'new_cpu' in try_to_wake_up() Clean-up try_to_wake_up(). Get rid of the 'new_cpu' variable in try_to_wake_up() [ that's, one #ifdef section less ]. Also remove a few redundant blank lines. 
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1545,9 +1545,6 @@ static int try_to_wake_up(struct task_st unsigned long flags; long old_state; struct rq *rq; -#ifdef CONFIG_SMP - int new_cpu; -#endif rq = task_rq_lock(p, &flags); old_state = p->state; @@ -1565,9 +1562,9 @@ static int try_to_wake_up(struct task_st if (unlikely(task_running(rq, p))) goto out_activate; - new_cpu = p->sched_class->select_task_rq(p, sync); - if (new_cpu != cpu) { - set_task_cpu(p, new_cpu); + cpu = p->sched_class->select_task_rq(p, sync); + if (cpu != orig_cpu) { + set_task_cpu(p, cpu); task_rq_unlock(rq, &flags); /* might preempt at this point */ rq = task_rq_lock(p, &flags); @@ -1594,10 +1591,8 @@ static int try_to_wake_up(struct task_st } } } - #endif - out_activate: #endif /* CONFIG_SMP */ schedstat_inc(p, se.nr_wakeups); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0041-sched-RT-balance-replace-hooks-with-pre-post-sched.patch�������������������������������0000664�0000764�0000764�00000011201�11041657732�024173� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 28c1e054fff4eda4bea9501e420392da81f25000 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <rostedt@goodmis.org> Date: Tue, 11 Dec 2007 10:02:47 +0100 Subject: [PATCH] sched: RT-balance, replace hooks with pre/post schedule and wakeup methods To make the main sched.c code more agnostic to the schedule classes. Instead of having specific hooks in the schedule code for the RT class balancing. They are replaced with a pre_schedule, post_schedule and task_wake_up methods. These methods may be used by any of the classes but currently, only the sched_rt class implements them. 
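Conceptually, the core scheduler now dispatches through the class pointer with a NULL check instead of calling rt-specific helpers directly. The following stand-alone sketch models that pattern; the hook names and signatures follow the patch, while the toy types and function bodies are purely illustrative:

struct rq;                      /* opaque stand-ins, not the real types */
struct task_struct;

struct sched_class {
        /* optional balance hooks; a class may leave any of them NULL */
        void (*pre_schedule)(struct rq *rq, struct task_struct *prev);
        void (*post_schedule)(struct rq *rq);
        void (*task_wake_up)(struct rq *rq, struct task_struct *p);
};

struct task_struct {
        const struct sched_class *sched_class;
};

/* Toy rt class: the only class that implements a hook, as in the patch. */
static void pre_schedule_rt(struct rq *rq, struct task_struct *prev)
{
        /* e.g. try to pull higher-priority rt tasks onto this runqueue */
}

static const struct sched_class toy_rt_class = {
        .pre_schedule = pre_schedule_rt,
        /* .post_schedule and .task_wake_up deliberately left NULL */
};

/* What the core does instead of calling an rt-specific helper directly. */
static void core_schedule(struct rq *rq, struct task_struct *prev)
{
        if (prev->sched_class->pre_schedule)
                prev->sched_class->pre_schedule(rq, prev);
        /*
         * ... pick_next_task(), context switch; post_schedule and
         * task_wake_up are dispatched the same NULL-checked way ...
         */
}

A class with no interest in balancing pays only a pointer test at each of the three call sites.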
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 3 +++ kernel/sched.c | 20 ++++++++++++++++---- kernel/sched_rt.c | 17 +++++++---------- 3 files changed, 26 insertions(+), 14 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -843,6 +843,9 @@ struct sched_class { int (*move_one_task) (struct rq *this_rq, int this_cpu, struct rq *busiest, struct sched_domain *sd, enum cpu_idle_type idle); + void (*pre_schedule) (struct rq *this_rq, struct task_struct *task); + void (*post_schedule) (struct rq *this_rq); + void (*task_wake_up) (struct rq *this_rq, struct task_struct *task); #endif void (*set_curr_task) (struct rq *rq); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1611,7 +1611,10 @@ out_activate: out_running: p->state = TASK_RUNNING; - wakeup_balance_rt(rq, p); +#ifdef CONFIG_SMP + if (p->sched_class->task_wake_up) + p->sched_class->task_wake_up(rq, p); +#endif out: task_rq_unlock(rq, &flags); @@ -1734,7 +1737,10 @@ void fastcall wake_up_new_task(struct ta inc_nr_running(p, rq); } check_preempt_curr(rq, p); - wakeup_balance_rt(rq, p); +#ifdef CONFIG_SMP + if (p->sched_class->task_wake_up) + p->sched_class->task_wake_up(rq, p); +#endif task_rq_unlock(rq, &flags); } @@ -1855,7 +1861,10 @@ static void finish_task_switch(struct rq prev_state = prev->state; finish_arch_switch(prev); finish_lock_switch(rq, prev); - schedule_tail_balance_rt(rq); +#ifdef CONFIG_SMP + if (current->sched_class->post_schedule) + current->sched_class->post_schedule(rq); +#endif fire_sched_in_preempt_notifiers(current); if (mm) @@ -3624,7 +3633,10 @@ need_resched_nonpreemptible: switch_count = &prev->nvcsw; } - schedule_balance_rt(rq, prev); +#ifdef CONFIG_SMP + if (prev->sched_class->pre_schedule) + prev->sched_class->pre_schedule(rq, prev); +#endif if (unlikely(!rq->nr_running)) idle_balance(cpu, rq); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -685,14 +685,14 @@ static int pull_rt_task(struct rq *this_ return ret; } -static void schedule_balance_rt(struct rq *rq, struct task_struct *prev) +static void pre_schedule_rt(struct rq *rq, struct task_struct *prev) { /* Try to pull RT tasks here if we lower this rq's prio */ if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio) pull_rt_task(rq); } -static void schedule_tail_balance_rt(struct rq *rq) +static void post_schedule_rt(struct rq *rq) { /* * If we have more than one rt_task queued, then @@ -709,10 +709,9 @@ static void schedule_tail_balance_rt(str } -static void wakeup_balance_rt(struct rq *rq, struct task_struct *p) +static void task_wake_up_rt(struct rq *rq, struct task_struct *p) { - if (unlikely(rt_task(p)) && - !task_running(rq, p) && + if (!task_running(rq, p) && (p->prio >= rq->rt.highest_prio) && rq->rt.overloaded) push_rt_tasks(rq); @@ -774,11 +773,6 @@ static void leave_domain_rt(struct rq *r if (rq->rt.overloaded) rt_clear_overload(rq); } - -#else /* CONFIG_SMP */ -# define schedule_tail_balance_rt(rq) do { } while (0) -# define schedule_balance_rt(rq, prev) do { } while (0) -# define wakeup_balance_rt(rq, p) do { } while (0) #endif /* 
CONFIG_SMP */ static void task_tick_rt(struct rq *rq, struct task_struct *p) @@ -834,6 +828,9 @@ const struct sched_class rt_sched_class .set_cpus_allowed = set_cpus_allowed_rt, .join_domain = join_domain_rt, .leave_domain = leave_domain_rt, + .pre_schedule = pre_schedule_rt, + .post_schedule = post_schedule_rt, + .task_wake_up = task_wake_up_rt, #endif .set_curr_task = set_curr_task_rt, �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0042-sched-RT-balance-add-new-methods-to-sched_class.patch����������������������������������0000664�0000764�0000764�00000023504�11041673261�023511� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 687f375de014506368559a0332a3c5163a17c0f5 Mon Sep 17 00:00:00 2001 From: Steven Rostedt <rostedt@goodmis.org> Date: Tue, 11 Dec 2007 10:02:48 +0100 Subject: [PATCH] sched: RT-balance, add new methods to sched_class Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. 
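The duplicated "reschedule or preempt?" blocks in rt_mutex_setprio() and sched_setscheduler() collapse into one small helper. The sketch below mirrors the check_class_changed() added by the patch, with stand-in types and extra comments; switched_from is the only optional method, while switched_to and prio_changed are called unconditionally and are therefore provided by every class:

struct rq;
struct task_struct;

struct sched_class {
        void (*switched_from)(struct rq *rq, struct task_struct *p, int running);
        void (*switched_to)(struct rq *rq, struct task_struct *p, int running);
        void (*prio_changed)(struct rq *rq, struct task_struct *p,
                             int oldprio, int running);
};

struct task_struct {
        const struct sched_class *sched_class;
};

static void check_class_changed(struct rq *rq, struct task_struct *p,
                                const struct sched_class *prev_class,
                                int oldprio, int running)
{
        if (prev_class != p->sched_class) {
                /*
                 * Policy change: give the old class a chance to clean up
                 * (rt may want to pull tasks), then let the new class
                 * react (rt may push, fair just checks preemption).
                 */
                if (prev_class->switched_from)
                        prev_class->switched_from(rq, p, running);
                p->sched_class->switched_to(rq, p, running);
        } else {
                /* Same class, only the priority moved. */
                p->sched_class->prio_changed(rq, p, oldprio, running);
        }
}

Both callers now just record prev_class before making the change and call this helper after the task has been requeued.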
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 7 +++ kernel/sched.c | 38 ++++++++------------ kernel/sched_fair.c | 39 +++++++++++++++++++++ kernel/sched_idletask.c | 31 ++++++++++++++++ kernel/sched_rt.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 182 insertions(+), 22 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -855,6 +855,13 @@ struct sched_class { void (*join_domain)(struct rq *rq); void (*leave_domain)(struct rq *rq); + + void (*switched_from) (struct rq *this_rq, struct task_struct *task, + int running); + void (*switched_to) (struct rq *this_rq, struct task_struct *task, + int running); + void (*prio_changed) (struct rq *this_rq, struct task_struct *task, + int oldprio, int running); }; struct load_weight { Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1138,6 +1138,18 @@ static inline void __set_task_cpu(struct #endif } +static inline void check_class_changed(struct rq *rq, struct task_struct *p, + const struct sched_class *prev_class, + int oldprio, int running) +{ + if (prev_class != p->sched_class) { + if (prev_class->switched_from) + prev_class->switched_from(rq, p, running); + p->sched_class->switched_to(rq, p, running); + } else + p->sched_class->prio_changed(rq, p, oldprio, running); +} + #ifdef CONFIG_SMP /* @@ -4003,6 +4015,7 @@ void rt_mutex_setprio(struct task_struct unsigned long flags; int oldprio, on_rq, running; struct rq *rq; + const struct sched_class *prev_class = p->sched_class; BUG_ON(prio < 0 || prio > MAX_PRIO); @@ -4028,17 +4041,7 @@ void rt_mutex_setprio(struct task_struct p->sched_class->set_curr_task(rq); if (on_rq) { enqueue_task(rq, p, 0); - /* - * Reschedule if we are currently running on this runqueue and - * our priority decreased, or if we are not currently running on - * this runqueue and our priority is higher than the current's - */ - if (running) { - if (p->prio > oldprio) - resched_task(rq->curr); - } else { - check_preempt_curr(rq, p); - } + check_class_changed(rq, p, prev_class, oldprio, running); } task_rq_unlock(rq, &flags); } @@ -4241,6 +4244,7 @@ int sched_setscheduler(struct task_struc { int retval, oldprio, oldpolicy = -1, on_rq, running; unsigned long flags; + const struct sched_class *prev_class = p->sched_class; struct rq *rq; /* may grab non-irq protected spin_locks */ @@ -4334,17 +4338,7 @@ recheck: p->sched_class->set_curr_task(rq); if (on_rq) { activate_task(rq, p, 0); - /* - * Reschedule if we are currently running on this runqueue and - * our priority decreased, or if we are not currently running on - * this runqueue and our priority is higher than the current's - */ - if (running) { - if (p->prio > oldprio) - resched_task(rq->curr); - } else { - check_preempt_curr(rq, p); - } + check_class_changed(rq, p, prev_class, oldprio, running); } __task_rq_unlock(rq); spin_unlock_irqrestore(&p->pi_lock, flags); Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -1235,6 +1235,42 @@ static void task_new_fair(struct rq *rq, resched_task(rq->curr); } +/* + * Priority of the task has changed. 
Check to see if we preempt + * the current task. + */ +static void prio_changed_fair(struct rq *rq, struct task_struct *p, + int oldprio, int running) +{ + /* + * Reschedule if we are currently running on this runqueue and + * our priority decreased, or if we are not currently running on + * this runqueue and our priority is higher than the current's + */ + if (running) { + if (p->prio > oldprio) + resched_task(rq->curr); + } else + check_preempt_curr(rq, p); +} + +/* + * We switched to the sched_fair class. + */ +static void switched_to_fair(struct rq *rq, struct task_struct *p, + int running) +{ + /* + * We were most likely switched from sched_rt, so + * kick off the schedule if running, otherwise just see + * if we can still preempt the current task. + */ + if (running) + resched_task(rq->curr); + else + check_preempt_curr(rq, p); +} + /* Account for a task changing its policy or group. * * This routine is mostly called to set cfs_rq->curr field when a task @@ -1273,6 +1309,9 @@ static const struct sched_class fair_sch .set_curr_task = set_curr_task_fair, .task_tick = task_tick_fair, .task_new = task_new_fair, + + .prio_changed = prio_changed_fair, + .switched_to = switched_to_fair, }; #ifdef CONFIG_SCHED_DEBUG Index: linux-2.6.24.7/kernel/sched_idletask.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_idletask.c +++ linux-2.6.24.7/kernel/sched_idletask.c @@ -69,6 +69,33 @@ static void set_curr_task_idle(struct rq { } +static void switched_to_idle(struct rq *rq, struct task_struct *p, + int running) +{ + /* Can this actually happen?? */ + if (running) + resched_task(rq->curr); + else + check_preempt_curr(rq, p); +} + +static void prio_changed_idle(struct rq *rq, struct task_struct *p, + int oldprio, int running) +{ + /* This can happen for hot plug CPUS */ + + /* + * Reschedule if we are currently running on this runqueue and + * our priority decreased, or if we are not currently running on + * this runqueue and our priority is higher than the current's + */ + if (running) { + if (p->prio > oldprio) + resched_task(rq->curr); + } else + check_preempt_curr(rq, p); +} + /* * Simple, special scheduling class for the per-CPU idle tasks: */ @@ -94,5 +121,9 @@ const struct sched_class idle_sched_clas .set_curr_task = set_curr_task_idle, .task_tick = task_tick_idle, + + .prio_changed = prio_changed_idle, + .switched_to = switched_to_idle, + /* no .task_new for idle tasks */ }; Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -773,7 +773,92 @@ static void leave_domain_rt(struct rq *r if (rq->rt.overloaded) rt_clear_overload(rq); } + +/* + * When switch from the rt queue, we bring ourselves to a position + * that we might want to pull RT tasks from other runqueues. + */ +static void switched_from_rt(struct rq *rq, struct task_struct *p, + int running) +{ + /* + * If there are other RT tasks then we will reschedule + * and the scheduling of the other RT tasks will handle + * the balancing. But if we are the last RT task + * we may need to handle the pulling of RT tasks + * now. + */ + if (!rq->rt.rt_nr_running) + pull_rt_task(rq); +} +#endif /* CONFIG_SMP */ + +/* + * When switching a task to RT, we may overload the runqueue + * with RT tasks. In this case we try to push them off to + * other runqueues. 
+ */ +static void switched_to_rt(struct rq *rq, struct task_struct *p, + int running) +{ + int check_resched = 1; + + /* + * If we are already running, then there's nothing + * that needs to be done. But if we are not running + * we may need to preempt the current running task. + * If that current running task is also an RT task + * then see if we can move to another run queue. + */ + if (!running) { +#ifdef CONFIG_SMP + if (rq->rt.overloaded && push_rt_task(rq) && + /* Don't resched if we changed runqueues */ + rq != task_rq(p)) + check_resched = 0; #endif /* CONFIG_SMP */ + if (check_resched && p->prio < rq->curr->prio) + resched_task(rq->curr); + } +} + +/* + * Priority of the task has changed. This may cause + * us to initiate a push or pull. + */ +static void prio_changed_rt(struct rq *rq, struct task_struct *p, + int oldprio, int running) +{ + if (running) { +#ifdef CONFIG_SMP + /* + * If our priority decreases while running, we + * may need to pull tasks to this runqueue. + */ + if (oldprio < p->prio) + pull_rt_task(rq); + /* + * If there's a higher priority task waiting to run + * then reschedule. + */ + if (p->prio > rq->rt.highest_prio) + resched_task(p); +#else + /* For UP simply resched on drop of prio */ + if (oldprio < p->prio) + resched_task(p); +#endif /* CONFIG_SMP */ + } else { + /* + * This task is not running, but if it is + * greater than the current running task + * then reschedule. + */ + if (p->prio < rq->curr->prio) + resched_task(rq->curr); + } +} + static void task_tick_rt(struct rq *rq, struct task_struct *p) { @@ -831,8 +916,12 @@ const struct sched_class rt_sched_class .pre_schedule = pre_schedule_rt, .post_schedule = post_schedule_rt, .task_wake_up = task_wake_up_rt, + .switched_from = switched_from_rt, #endif .set_curr_task = set_curr_task_rt, .task_tick = task_tick_rt, + + .prio_changed = prio_changed_rt, + .switched_to = switched_to_rt, }; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/0043-sched-RT-balance-only-adjust-overload-state-when-c.patch�������������������������������0000664�0000764�0000764�00000002552�11041657735�024230� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 9f6477c42a3f08857b24a29e43fc0664d77deebe Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:48 +0100 Subject: [PATCH] sched: RT-balance, only adjust overload state when changing The overload set/clears were originally idempotent when this logic was first implemented. But that is no longer true due to the addition of the atomic counter and this logic was never updated to work properly with that change. So only adjust the overload state if it is actually changing to avoid getting out of sync. 
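In other words, with a counter backing the overload mask, set/clear may only run on an actual transition of the per-runqueue overloaded flag. A stand-alone toy model of the fixed update (plain ints stand in for the atomic counter, the cpumask and the locking; field names are shortened):

#include <assert.h>

/*
 * Toy model of the fix: rto_count stands in for the root-domain atomic
 * counter, overloaded for rq->rt.overloaded. Set/clear are no longer
 * idempotent, so they may only run when the state really changes.
 */
static int rto_count;

struct toy_rq {
        int rt_nr_migratory;
        int rt_nr_running;
        int overloaded;
};

static void update_rt_migration(struct toy_rq *rq)
{
        if (rq->rt_nr_migratory && rq->rt_nr_running > 1) {
                if (!rq->overloaded) {          /* only the 0 -> 1 edge */
                        rto_count++;
                        rq->overloaded = 1;
                }
        } else if (rq->overloaded) {            /* only the 1 -> 0 edge */
                rto_count--;
                rq->overloaded = 0;
        }
}

int main(void)
{
        struct toy_rq rq = { .rt_nr_migratory = 1 };

        rq.rt_nr_running = 2;
        update_rt_migration(&rq);
        update_rt_migration(&rq);       /* repeat must not double-count */
        assert(rto_count == 1);

        rq.rt_nr_running = 1;
        update_rt_migration(&rq);
        update_rt_migration(&rq);       /* repeat must not go negative */
        assert(rto_count == 0);
        return 0;
}

With the old unconditional calls, the repeated updates above would have incremented rto_count a second time and later decremented it once too often, which is exactly the out-of-sync condition described above.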
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -34,9 +34,11 @@ static inline void rt_clear_overload(str static void update_rt_migration(struct rq *rq) { if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) { - rt_set_overload(rq); - rq->rt.overloaded = 1; - } else { + if (!rq->rt.overloaded) { + rt_set_overload(rq); + rq->rt.overloaded = 1; + } + } else if (rq->rt.overloaded) { rt_clear_overload(rq); rq->rt.overloaded = 0; } ������������������������������������������������������������������������������������������������������������������������������������������������������patches/0044-sched-remove-some-old-cpuset-logic.patch�����������������������������������������������0000664�0000764�0000764�00000003643�11041657731�021350� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 4115ceb4075bf6156a26446777eef274de82610e Mon Sep 17 00:00:00 2001 From: Gregory Haskins <ghaskins@novell.com> Date: Tue, 11 Dec 2007 10:02:48 +0100 Subject: [PATCH] sched: remove some old cpuset logic We had support for overlapping cpuset based rto logic in early prototypes that is no longer used, so remove it. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_rt.c | 33 --------------------------------- 1 file changed, 33 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -582,38 +582,6 @@ static int pull_rt_task(struct rq *this_ continue; src_rq = cpu_rq(cpu); - if (unlikely(src_rq->rt.rt_nr_running <= 1)) { - /* - * It is possible that overlapping cpusets - * will miss clearing a non overloaded runqueue. - * Clear it now. - */ - if (double_lock_balance(this_rq, src_rq)) { - /* unlocked our runqueue lock */ - struct task_struct *old_next = next; - - next = pick_next_task_rt(this_rq); - if (next != old_next) - ret = 1; - } - if (likely(src_rq->rt.rt_nr_running <= 1)) { - /* - * Small chance that this_rq->curr changed - * but it's really harmless here. - */ - rt_clear_overload(this_rq); - } else { - /* - * Heh, the src_rq is now overloaded, since - * we already have the src_rq lock, go straight - * to pulling tasks from it. 
- */ - goto try_pulling; - } - spin_unlock(&src_rq->lock); - continue; - } - /* * We can potentially drop this_rq's lock in * double_lock_balance, and another CPU could @@ -637,7 +605,6 @@ static int pull_rt_task(struct rq *this_ continue; } - try_pulling: p = pick_next_highest_task_rt(src_rq, this_cpu); /* ���������������������������������������������������������������������������������������������patches/sched-use-a-2d-bitmap-search-prio-cpu.patch�������������������������������������������������0000664�0000764�0000764�00000030255�11041657732�021143� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Wed Dec 12 00:28:53 2007 Date: Tue, 11 Dec 2007 17:34:03 -0500 From: Gregory Haskins <ghaskins@novell.com> To: rostedt@goodmis.org Cc: ghaskins@novell.com Subject: [PATCH] sched: Use a 2-d bitmap for searching lowest-pri CPU [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] The current code use a linear algorithm which causes scaling issues on larger SMP machines. This patch replaces that algorithm with a 2-dimensional bitmap to reduce latencies in the wake-up path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> --- kernel/Makefile | 1 kernel/sched.c | 8 ++ kernel/sched_cpupri.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched_cpupri.h | 36 ++++++++++ kernel/sched_rt.c | 85 ++++++------------------ 5 files changed, 243 insertions(+), 61 deletions(-) Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -57,6 +57,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o obj-$(CONFIG_MARKERS) += marker.o +obj-$(CONFIG_SMP) += sched_cpupri.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -69,6 +69,8 @@ #include <asm/tlb.h> #include <asm/irq_regs.h> +#include "sched_cpupri.h" + /* * Scheduler clock - returns current time in nanosec units. * This is default implementation. 
@@ -298,6 +300,9 @@ struct root_domain { */ cpumask_t rto_mask; atomic_t rto_count; +#ifdef CONFIG_SMP + struct cpupri cpupri; +#endif }; static struct root_domain def_root_domain; @@ -5872,6 +5877,9 @@ static void init_rootdomain(struct root_ rd->span = *map; cpus_and(rd->online, rd->span, cpu_online_map); + + cpupri_init(&rd->cpupri); + } static void init_defrootdomain(void) Index: linux-2.6.24.7/kernel/sched_cpupri.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/sched_cpupri.c @@ -0,0 +1,174 @@ +/* + * kernel/sched_cpupri.c + * + * CPU priority management + * + * Copyright (C) 2007 Novell + * + * Author: Gregory Haskins <ghaskins@novell.com> + * + * This code tracks the priority of each CPU so that global migration + * decisions are easy to calculate. Each CPU can be in a state as follows: + * + * (INVALID), IDLE, NORMAL, RT1, ... RT99 + * + * going from the lowest priority to the highest. CPUs in the INVALID state + * are not eligible for routing. The system maintains this state with + * a 2 dimensional bitmap (the first for priority class, the second for cpus + * in that class). Therefore a typical application without affinity + * restrictions can find a suitable CPU with O(1) complexity (e.g. two bit + * searches). For tasks with affinity restrictions, the algorithm has a + * worst case complexity of O(min(102, nr_domcpus)), though the scenario that + * yields the worst case search is fairly contrived. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; version 2 + * of the License. + */ + +#include "sched_cpupri.h" + +/* Convert between a 140 based task->prio, and our 102 based cpupri */ +static int convert_prio(int prio) +{ + int cpupri; + + if (prio == CPUPRI_INVALID) + cpupri = CPUPRI_INVALID; + else if (prio == MAX_PRIO) + cpupri = CPUPRI_IDLE; + else if (prio >= MAX_RT_PRIO) + cpupri = CPUPRI_NORMAL; + else + cpupri = MAX_RT_PRIO - prio + 1; + + return cpupri; +} + +#define for_each_cpupri_active(array, idx) \ + for (idx = find_first_bit(array, CPUPRI_NR_PRIORITIES); \ + idx < CPUPRI_NR_PRIORITIES; \ + idx = find_next_bit(array, CPUPRI_NR_PRIORITIES, idx+1)) + +/** + * cpupri_find - find the best (lowest-pri) CPU in the system + * @cp: The cpupri context + * @p: The task + * @lowest_mask: A mask to fill in with selected CPUs + * + * Note: This function returns the recommended CPUs as calculated during the + * current invokation. By the time the call returns, the CPUs may have in + * fact changed priorities any number of times. While not ideal, it is not + * an issue of correctness since the normal rebalancer logic will correct + * any discrepancies created by racing against the uncertainty of the current + * priority configuration. 
+ * + * Returns: (int)bool - CPUs were found + */ +int cpupri_find(struct cpupri *cp, struct task_struct *p, + cpumask_t *lowest_mask) +{ + int idx = 0; + int task_pri = convert_prio(p->prio); + + for_each_cpupri_active(cp->pri_active, idx) { + struct cpupri_vec *vec = &cp->pri_to_cpu[idx]; + cpumask_t mask; + + if (idx >= task_pri) + break; + + cpus_and(mask, p->cpus_allowed, vec->mask); + + if (cpus_empty(mask)) + continue; + + *lowest_mask = mask; + return 1; + } + + return 0; +} + +/** + * cpupri_set - update the cpu priority setting + * @cp: The cpupri context + * @cpu: The target cpu + * @pri: The priority (INVALID-RT99) to assign to this CPU + * + * Note: Assumes cpu_rq(cpu)->lock is locked + * + * Returns: (void) + */ +void cpupri_set(struct cpupri *cp, int cpu, int newpri) +{ + int *currpri = &cp->cpu_to_pri[cpu]; + int oldpri = *currpri; + unsigned long flags; + + newpri = convert_prio(newpri); + + BUG_ON(newpri >= CPUPRI_NR_PRIORITIES); + + if (newpri == oldpri) + return; + + /* + * If the cpu was currently mapped to a different value, we + * first need to unmap the old value + */ + if (likely(oldpri != CPUPRI_INVALID)) { + struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri]; + + spin_lock_irqsave(&vec->lock, flags); + + vec->count--; + if (!vec->count) + clear_bit(oldpri, cp->pri_active); + cpu_clear(cpu, vec->mask); + + spin_unlock_irqrestore(&vec->lock, flags); + } + + if (likely(newpri != CPUPRI_INVALID)) { + struct cpupri_vec *vec = &cp->pri_to_cpu[newpri]; + + spin_lock_irqsave(&vec->lock, flags); + + cpu_set(cpu, vec->mask); + vec->count++; + if (vec->count == 1) + set_bit(newpri, cp->pri_active); + + spin_unlock_irqrestore(&vec->lock, flags); + } + + *currpri = newpri; +} + +/** + * cpupri_init - initialize the cpupri structure + * @cp: The cpupri context + * + * Returns: (void) + */ +void cpupri_init(struct cpupri *cp) +{ + int i; + + memset(cp, 0, sizeof(*cp)); + + for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) { + struct cpupri_vec *vec = &cp->pri_to_cpu[i]; + + spin_lock_init(&vec->lock); + vec->count = 0; + cpus_clear(vec->mask); + } + + for_each_possible_cpu(i) + cp->cpu_to_pri[i] = CPUPRI_INVALID; +} + + Index: linux-2.6.24.7/kernel/sched_cpupri.h =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/sched_cpupri.h @@ -0,0 +1,36 @@ +#ifndef _LINUX_CPUPRI_H +#define _LINUX_CPUPRI_H + +#include <linux/sched.h> + +#define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO +#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG + +#define CPUPRI_INVALID -1 +#define CPUPRI_IDLE 0 +#define CPUPRI_NORMAL 1 +/* values 2-101 are RT priorities 0-99 */ + +struct cpupri_vec { + spinlock_t lock; + int count; + cpumask_t mask; +}; + +struct cpupri { + struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES]; + long pri_active[CPUPRI_NR_PRI_WORDS]; + int cpu_to_pri[NR_CPUS]; +}; + +#ifdef CONFIG_SMP +int cpupri_find(struct cpupri *cp, + struct task_struct *p, cpumask_t *lowest_mask); +void cpupri_set(struct cpupri *cp, int cpu, int pri); +void cpupri_init(struct cpupri *cp); +#else +#define cpupri_set(cp, cpu, pri) do { } while (0) +#define cpupri_init() do { } while (0) +#endif + +#endif /* _LINUX_CPUPRI_H */ Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -73,8 +73,10 @@ static inline void inc_rt_tasks(struct t WARN_ON(!rt_task(p)); rq->rt.rt_nr_running++; #ifdef CONFIG_SMP - if (p->prio < 
rq->rt.highest_prio) + if (p->prio < rq->rt.highest_prio) { rq->rt.highest_prio = p->prio; + cpupri_set(&rq->rd->cpupri, rq->cpu, p->prio); + } if (p->nr_cpus_allowed > 1) rq->rt.rt_nr_migratory++; @@ -84,6 +86,8 @@ static inline void inc_rt_tasks(struct t static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq) { + int highest_prio = rq->rt.highest_prio; + WARN_ON(!rt_task(p)); WARN_ON(!rq->rt.rt_nr_running); rq->rt.rt_nr_running--; @@ -103,6 +107,9 @@ static inline void dec_rt_tasks(struct t if (p->nr_cpus_allowed > 1) rq->rt.rt_nr_migratory--; + if (rq->rt.highest_prio != highest_prio) + cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio); + update_rt_migration(rq); #endif /* CONFIG_SMP */ } @@ -293,69 +300,17 @@ static DEFINE_PER_CPU(cpumask_t, local_c static int find_lowest_cpus(struct task_struct *task, cpumask_t *lowest_mask) { - int lowest_prio = -1; - int lowest_cpu = -1; - int count = 0; - int cpu; + int count; - cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed); + count = cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask); /* - * Scan each rq for the lowest prio. + * cpupri cannot efficiently tell us how many bits are set, so it only + * returns a boolean. However, the caller of this function will + * special case the value "1", so we want to return a positive integer + * other than one if there are bits to look at */ - for_each_cpu_mask(cpu, *lowest_mask) { - struct rq *rq = cpu_rq(cpu); - - /* We look for lowest RT prio or non-rt CPU */ - if (rq->rt.highest_prio >= MAX_RT_PRIO) { - /* - * if we already found a low RT queue - * and now we found this non-rt queue - * clear the mask and set our bit. - * Otherwise just return the queue as is - * and the count==1 will cause the algorithm - * to use the first bit found. - */ - if (lowest_cpu != -1) { - cpus_clear(*lowest_mask); - cpu_set(rq->cpu, *lowest_mask); - } - return 1; - } - - /* no locking for now */ - if ((rq->rt.highest_prio > task->prio) - && (rq->rt.highest_prio >= lowest_prio)) { - if (rq->rt.highest_prio > lowest_prio) { - /* new low - clear old data */ - lowest_prio = rq->rt.highest_prio; - lowest_cpu = cpu; - count = 0; - } - count++; - } else - cpu_clear(cpu, *lowest_mask); - } - - /* - * Clear out all the set bits that represent - * runqueues that were of higher prio than - * the lowest_prio. - */ - if (lowest_cpu > 0) { - /* - * Perhaps we could add another cpumask op to - * zero out bits. Like cpu_zero_bits(cpumask, nrbits); - * Then that could be optimized to use memset and such. - */ - for_each_cpu_mask(cpu, *lowest_mask) { - if (cpu >= lowest_cpu) - break; - cpu_clear(cpu, *lowest_mask); - } - } - - return count; + return count ? 
2 : 0; } static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask) @@ -379,8 +334,12 @@ static int find_lowest_rq(struct task_st cpumask_t *lowest_mask = &__get_cpu_var(local_cpu_mask); int this_cpu = smp_processor_id(); int cpu = task_cpu(task); - int count = find_lowest_cpus(task, lowest_mask); + int count; + + if (task->nr_cpus_allowed == 1) + return -1; /* No other targets possible */ + count = find_lowest_cpus(task, lowest_mask); if (!count) return -1; /* No targets found */ @@ -734,6 +693,8 @@ static void join_domain_rt(struct rq *rq { if (rq->rt.overloaded) rt_set_overload(rq); + + cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio); } /* Assumes rq->lock is held */ @@ -741,6 +702,8 @@ static void leave_domain_rt(struct rq *r { if (rq->rt.overloaded) rt_clear_overload(rq); + + cpupri_set(&rq->rd->cpupri, rq->cpu, CPUPRI_INVALID); } /* ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/remove-unused-var-warning.patch�������������������������������������������������������������0000664�0000764�0000764�00000001260�11041657732�017312� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Remy Bohmer <linux@bohmer.net> Subject: fix warning on dec_rt_tasks on UP Signed-off-by: Remy Bohmer <linux@bohmer.net> --- kernel/sched_rt.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -86,8 +86,9 @@ static inline void inc_rt_tasks(struct t static inline void dec_rt_tasks(struct task_struct *p, struct rq *rq) { +#ifdef CONFIG_SMP int highest_prio = rq->rt.highest_prio; - +#endif WARN_ON(!rt_task(p)); WARN_ON(!rq->rt.rt_nr_running); rq->rt.rt_nr_running--; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/markers-upstream.patch����������������������������������������������������������������������0000664�0000764�0000764�00000132377�11041657735�015606� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/immediate.h | 97 ++++++ include/linux/marker.h | 117 +++++-- include/linux/module.h | 30 + kernel/marker.c | 709 ++++++++++++++++++++++++++++++++++------------ kernel/module.c | 91 +++++ 5 files changed, 817 insertions(+), 227 deletions(-) Index: linux-2.6.24.7/include/linux/immediate.h 
=================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/immediate.h @@ -0,0 +1,97 @@ +#ifndef _LINUX_IMMEDIATE_H +#define _LINUX_IMMEDIATE_H + +/* + * Immediate values, can be updated at runtime and save cache lines. + * + * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> + * + * This file is released under the GPLv2. + * See the file COPYING for more details. + */ + +#ifdef CONFIG_IMMEDIATE + +#include <asm/immediate.h> + +/** + * imv_set - set immediate variable (with locking) + * @name: immediate value name + * @i: required value + * + * Sets the value of @name, taking the module_mutex if required by + * the architecture. + */ +#define imv_set(name, i) \ + do { \ + name##__imv = (i); \ + core_imv_update(); \ + module_imv_update(); \ + } while (0) + +/* + * Internal update functions. + */ +extern void core_imv_update(void); +extern void imv_update_range(struct __imv *begin, struct __imv *end); +extern void imv_unref_core_init(void); +extern void imv_unref(struct __imv *begin, struct __imv *end, void *start, + unsigned long size); +extern int _is_imv_cond_end(unsigned long *begin, unsigned long *end, + unsigned long addr1, unsigned long addr2); +extern int is_imv_cond_end(unsigned long addr1, unsigned long addr2); + +#else + +/* + * Generic immediate values: a simple, standard, memory load. + */ + +/** + * imv_read - read immediate variable + * @name: immediate value name + * + * Reads the value of @name. + */ +#define imv_read(name) _imv_read(name) + +/** + * imv_cond - read immediate variable use as condition for if() + * @name: immediate value name + * + * Reads the value of @name. + */ +#define imv_cond(name) _imv_read(name) +#define imv_cond_end() + +/** + * imv_set - set immediate variable (with locking) + * @name: immediate value name + * @i: required value + * + * Sets the value of @name, taking the module_mutex if required by + * the architecture. + */ +#define imv_set(name, i) (name##__imv = (i)) + +static inline void core_imv_update(void) { } +static inline void imv_unref_core_init(void) { } + +#endif + +#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv +#define DEFINE_IMV(type, name) __typeof__(type) name##__imv + +#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv) +#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv) + +/** + * _imv_read - Read immediate value with standard memory load. + * @name: immediate value name + * + * Force a data read of the immediate value instead of the immediate value + * based mechanism. Useful for __init and __exit section data read. + */ +#define _imv_read(name) (name##__imv) + +#endif Index: linux-2.6.24.7/include/linux/marker.h =================================================================== --- linux-2.6.24.7.orig/include/linux/marker.h +++ linux-2.6.24.7/include/linux/marker.h @@ -12,6 +12,7 @@ * See the file COPYING for more details. */ +#include <linux/immediate.h> #include <linux/types.h> struct module; @@ -19,25 +20,35 @@ struct marker; /** * marker_probe_func - Type of a marker probe function - * @mdata: pointer of type struct marker - * @private_data: caller site private data + * @probe_private: probe private data + * @call_private: call site private data * @fmt: format string - * @...: variable argument list + * @args: variable argument list pointer. Use a pointer to overcome C's + * inability to pass this around as a pointer in a portable manner in + * the callee otherwise. * * Type of marker probe functions. 
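For orientation, a minimal usage sketch of the immediate-values interface declared above. The flag my_trace_on and the helpers slow_path()/set_tracing() are made-up names for illustration only; they are not added by this patch, only the imv_*() calls come from include/linux/immediate.h.

#include <linux/compiler.h>
#include <linux/immediate.h>

DEFINE_IMV(char, my_trace_on);	/* defines my_trace_on__imv, rarely written */

static void slow_path(void)	/* hypothetical out-of-line work */
{
}

static void fast_path(void)
{
	/*
	 * With CONFIG_IMMEDIATE the read is patched into the instruction
	 * stream and the disabled case costs no extra data cache line;
	 * otherwise this falls back to a plain load of my_trace_on__imv.
	 */
	if (unlikely(imv_read(my_trace_on)))
		slow_path();
}

static void set_tracing(int on)
{
	/* Patches every imv_read(my_trace_on) site; may take module_mutex. */
	imv_set(my_trace_on, on);
}
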
They receive the mdata and need to parse the * format string to recover the variable argument list. */ -typedef void marker_probe_func(const struct marker *mdata, - void *private_data, const char *fmt, ...); +typedef void marker_probe_func(void *probe_private, void *call_private, + const char *fmt, va_list *args); + +struct marker_probe_closure { + marker_probe_func *func; /* Callback */ + void *probe_private; /* Private probe data */ +}; struct marker { const char *name; /* Marker name */ const char *format; /* Marker format string, describing the * variable argument list. */ - char state; /* Marker state. */ - marker_probe_func *call;/* Probe handler function pointer */ - void *private; /* Private probe data */ + DEFINE_IMV(char, state);/* Immediate value state. */ + char ptype; /* probe type : 0 : single, 1 : multi */ + /* Probe wrapper */ + void (*call)(const struct marker *mdata, void *call_private, ...); + struct marker_probe_closure single; + struct marker_probe_closure *multi; } __attribute__((aligned(8))); #ifdef CONFIG_MARKERS @@ -48,51 +59,73 @@ struct marker { * Make sure the alignment of the structure in the __markers section will * not add unwanted padding between the beginning of the section and the * structure. Force alignment to the same alignment as the section start. + * + * The "generic" argument controls which marker enabling mechanism must be used. + * If generic is true, a variable read is used. + * If generic is false, immediate values are used. */ -#define __trace_mark(name, call_data, format, args...) \ +#define __trace_mark(generic, name, call_private, format, args...) \ do { \ - static const char __mstrtab_name_##name[] \ + static const char __mstrtab_##name[] \ __attribute__((section("__markers_strings"))) \ - = #name; \ - static const char __mstrtab_format_##name[] \ - __attribute__((section("__markers_strings"))) \ - = format; \ + = #name "\0" format; \ static struct marker __mark_##name \ __attribute__((section("__markers"), aligned(8))) = \ - { __mstrtab_name_##name, __mstrtab_format_##name, \ - 0, __mark_empty_function, NULL }; \ + { __mstrtab_##name, &__mstrtab_##name[sizeof(#name)], \ + 0, 0, marker_probe_cb, \ + { __mark_empty_function, NULL}, NULL }; \ __mark_check_format(format, ## args); \ - if (unlikely(__mark_##name.state)) { \ - preempt_disable(); \ - (*__mark_##name.call) \ - (&__mark_##name, call_data, \ - format, ## args); \ - preempt_enable(); \ + if (!generic) { \ + if (unlikely(imv_cond(__mark_##name.state))) { \ + imv_cond_end(); \ + (*__mark_##name.call) \ + (&__mark_##name, call_private, \ + ## args); \ + } else \ + imv_cond_end(); \ + } else { \ + if (unlikely(_imv_read(__mark_##name.state))) \ + (*__mark_##name.call) \ + (&__mark_##name, call_private, \ + ## args); \ } \ } while (0) extern void marker_update_probe_range(struct marker *begin, - struct marker *end, struct module *probe_module, int *refcount); + struct marker *end); #else /* !CONFIG_MARKERS */ -#define __trace_mark(name, call_data, format, args...) \ +#define __trace_mark(generic, name, call_private, format, args...) \ __mark_check_format(format, ## args) static inline void marker_update_probe_range(struct marker *begin, - struct marker *end, struct module *probe_module, int *refcount) + struct marker *end) { } #endif /* CONFIG_MARKERS */ /** - * trace_mark - Marker + * trace_mark - Marker using code patching * @name: marker name, not quoted. * @format: format string * @args...: variable argument list * - * Places a marker. 
+ * Places a marker using optimized code patching technique (imv_read()) + * to be enabled when immediate values are present. */ #define trace_mark(name, format, args...) \ - __trace_mark(name, NULL, format, ## args) + __trace_mark(0, name, NULL, format, ## args) -#define MARK_MAX_FORMAT_LEN 1024 +/** + * _trace_mark - Marker using variable read + * @name: marker name, not quoted. + * @format: format string + * @args...: variable argument list + * + * Places a marker using a standard memory read (_imv_read()) to be + * enabled. Should be used for markers in code paths where instruction + * modification based enabling is not welcome. (__init and __exit functions, + * lockdep, some traps, printk). + */ +#define _trace_mark(name, format, args...) \ + __trace_mark(1, name, NULL, format, ## args) /** * MARK_NOARGS - Format string for a marker with no argument. @@ -100,30 +133,42 @@ static inline void marker_update_probe_r #define MARK_NOARGS " " /* To be used for string format validity checking with gcc */ -static inline void __printf(1, 2) __mark_check_format(const char *fmt, ...) +static inline void __printf(1, 2) ___mark_check_format(const char *fmt, ...) { } +#define __mark_check_format(format, args...) \ + do { \ + if (0) \ + ___mark_check_format(format, ## args); \ + } while (0) + extern marker_probe_func __mark_empty_function; +extern void marker_probe_cb(const struct marker *mdata, + void *call_private, ...); +extern void marker_probe_cb_noarg(const struct marker *mdata, + void *call_private, ...); + /* * Connect a probe to a marker. * private data pointer must be a valid allocated memory address, or NULL. */ extern int marker_probe_register(const char *name, const char *format, - marker_probe_func *probe, void *private); + marker_probe_func *probe, void *probe_private); /* * Returns the private data given to marker_probe_register. */ -extern void *marker_probe_unregister(const char *name); +extern int marker_probe_unregister(const char *name, + marker_probe_func *probe, void *probe_private); /* * Unregister a marker by providing the registered private data. */ -extern void *marker_probe_unregister_private_data(void *private); +extern int marker_probe_unregister_private_data(marker_probe_func *probe, + void *probe_private); -extern int marker_arm(const char *name); -extern int marker_disarm(const char *name); -extern void *marker_get_private_data(const char *name); +extern void *marker_get_private_data(const char *name, marker_probe_func *probe, + int num); #endif Index: linux-2.6.24.7/include/linux/module.h =================================================================== --- linux-2.6.24.7.orig/include/linux/module.h +++ linux-2.6.24.7/include/linux/module.h @@ -15,6 +15,7 @@ #include <linux/stringify.h> #include <linux/kobject.h> #include <linux/moduleparam.h> +#include <linux/immediate.h> #include <linux/marker.h> #include <asm/local.h> @@ -355,6 +356,12 @@ struct module /* The command line arguments (may be mangled). People like keeping pointers to this stuff */ char *args; +#ifdef CONFIG_IMMEDIATE + struct __imv *immediate; + unsigned int num_immediate; + unsigned long *immediate_cond_end; + unsigned int num_immediate_cond_end; +#endif #ifdef CONFIG_MARKERS struct marker *markers; unsigned int num_markers; @@ -462,7 +469,7 @@ int unregister_module_notifier(struct no extern void print_modules(void); -extern void module_update_markers(struct module *probe_module, int *refcount); +extern void module_update_markers(void); #else /* !CONFIG_MODULES... 
*/ #define EXPORT_SYMBOL(sym) @@ -563,13 +570,30 @@ static inline void print_modules(void) { } -static inline void module_update_markers(struct module *probe_module, - int *refcount) +static inline void module_update_markers(void) { } #endif /* CONFIG_MODULES */ +#if defined(CONFIG_MODULES) && defined(CONFIG_IMMEDIATE) +extern void _module_imv_update(void); +extern void module_imv_update(void); +extern int is_imv_cond_end_module(unsigned long addr1, unsigned long addr2); +#else +static inline void _module_imv_update(void) +{ +} +static inline void module_imv_update(void) +{ +} +static inline int is_imv_cond_end_module(unsigned long addr1, + unsigned long addr2) +{ + return 0; +} +#endif + struct device_driver; #ifdef CONFIG_SYSFS struct module; Index: linux-2.6.24.7/kernel/marker.c =================================================================== --- linux-2.6.24.7.orig/kernel/marker.c +++ linux-2.6.24.7/kernel/marker.c @@ -23,39 +23,48 @@ #include <linux/rcupdate.h> #include <linux/marker.h> #include <linux/err.h> +#include <linux/slab.h> +#include <linux/immediate.h> extern struct marker __start___markers[]; extern struct marker __stop___markers[]; +/* Set to 1 to enable marker debug output */ +static const int marker_debug; + /* * markers_mutex nests inside module_mutex. Markers mutex protects the builtin - * and module markers, the hash table and deferred_sync. + * and module markers and the hash table. */ static DEFINE_MUTEX(markers_mutex); /* - * Marker deferred synchronization. - * Upon marker probe_unregister, we delay call to synchronize_sched() to - * accelerate mass unregistration (only when there is no more reference to a - * given module do we call synchronize_sched()). However, we need to make sure - * every critical region has ended before we re-arm a marker that has been - * unregistered and then registered back with a different probe data. - */ -static int deferred_sync; - -/* * Marker hash table, containing the active markers. * Protected by module_mutex. */ #define MARKER_HASH_BITS 6 #define MARKER_TABLE_SIZE (1 << MARKER_HASH_BITS) +/* + * Note about RCU : + * It is used to make sure every handler has finished using its private data + * between two consecutive operation (add or remove) on a given marker. It is + * also used to delay the free of multiple probes array until a quiescent state + * is reached. + * marker entries modifications are protected by the markers_mutex. + */ struct marker_entry { struct hlist_node hlist; char *format; - marker_probe_func *probe; - void *private; + /* Probe wrapper */ + void (*call)(const struct marker *mdata, void *call_private, ...); + struct marker_probe_closure single; + struct marker_probe_closure *multi; int refcount; /* Number of times armed. 0 if disarmed. */ + struct rcu_head rcu; + void *oldptr; + unsigned char rcu_pending:1; + unsigned char ptype:1; char name[0]; /* Contains name'\0'format'\0' */ }; @@ -63,7 +72,8 @@ static struct hlist_head marker_table[MA /** * __mark_empty_function - Empty probe callback - * @mdata: pointer of type const struct marker + * @probe_private: probe private data + * @call_private: call site private data * @fmt: format string * @...: variable argument list * @@ -72,13 +82,265 @@ static struct hlist_head marker_table[MA * though the function pointer change and the marker enabling are two distinct * operations that modifies the execution flow of preemptible code. */ -void __mark_empty_function(const struct marker *mdata, void *private, - const char *fmt, ...) 
+void __mark_empty_function(void *probe_private, void *call_private, + const char *fmt, va_list *args) { } EXPORT_SYMBOL_GPL(__mark_empty_function); /* + * marker_probe_cb Callback that prepares the variable argument list for probes. + * @mdata: pointer of type struct marker + * @call_private: caller site private data + * @...: Variable argument list. + * + * Since we do not use "typical" pointer based RCU in the 1 argument case, we + * need to put a full smp_rmb() in this branch. This is why we do not use + * rcu_dereference() for the pointer read. + */ +void marker_probe_cb(const struct marker *mdata, void *call_private, ...) +{ + va_list args; + char ptype; + + /* + * preempt_disable does two things : disabling preemption to make sure + * the teardown of the callbacks can be done correctly when they are in + * modules and they insure RCU read coherency. + */ + preempt_disable(); + ptype = mdata->ptype; + if (likely(!ptype)) { + marker_probe_func *func; + /* Must read the ptype before ptr. They are not data dependant, + * so we put an explicit smp_rmb() here. */ + smp_rmb(); + func = mdata->single.func; + /* Must read the ptr before private data. They are not data + * dependant, so we put an explicit smp_rmb() here. */ + smp_rmb(); + va_start(args, call_private); + func(mdata->single.probe_private, call_private, mdata->format, + &args); + va_end(args); + } else { + struct marker_probe_closure *multi; + int i; + /* + * multi points to an array, therefore accessing the array + * depends on reading multi. However, even in this case, + * we must insure that the pointer is read _before_ the array + * data. Same as rcu_dereference, but we need a full smp_rmb() + * in the fast path, so put the explicit barrier here. + */ + smp_read_barrier_depends(); + multi = mdata->multi; + for (i = 0; multi[i].func; i++) { + va_start(args, call_private); + multi[i].func(multi[i].probe_private, call_private, + mdata->format, &args); + va_end(args); + } + } + preempt_enable(); +} +EXPORT_SYMBOL_GPL(marker_probe_cb); + +/* + * marker_probe_cb Callback that does not prepare the variable argument list. + * @mdata: pointer of type struct marker + * @call_private: caller site private data + * @...: Variable argument list. + * + * Should be connected to markers "MARK_NOARGS". + */ +void marker_probe_cb_noarg(const struct marker *mdata, void *call_private, ...) +{ + va_list args; /* not initialized */ + char ptype; + + preempt_disable(); + ptype = mdata->ptype; + if (likely(!ptype)) { + marker_probe_func *func; + /* Must read the ptype before ptr. They are not data dependant, + * so we put an explicit smp_rmb() here. */ + smp_rmb(); + func = mdata->single.func; + /* Must read the ptr before private data. They are not data + * dependant, so we put an explicit smp_rmb() here. */ + smp_rmb(); + func(mdata->single.probe_private, call_private, mdata->format, + &args); + } else { + struct marker_probe_closure *multi; + int i; + /* + * multi points to an array, therefore accessing the array + * depends on reading multi. However, even in this case, + * we must insure that the pointer is read _before_ the array + * data. Same as rcu_dereference, but we need a full smp_rmb() + * in the fast path, so put the explicit barrier here. 
+ */ + smp_read_barrier_depends(); + multi = mdata->multi; + for (i = 0; multi[i].func; i++) + multi[i].func(multi[i].probe_private, call_private, + mdata->format, &args); + } + preempt_enable(); +} +EXPORT_SYMBOL_GPL(marker_probe_cb_noarg); + +static void free_old_closure(struct rcu_head *head) +{ + struct marker_entry *entry = container_of(head, + struct marker_entry, rcu); + kfree(entry->oldptr); + /* Make sure we free the data before setting the pending flag to 0 */ + smp_wmb(); + entry->rcu_pending = 0; +} + +static void debug_print_probes(struct marker_entry *entry) +{ + int i; + + if (!marker_debug) + return; + + if (!entry->ptype) { + printk(KERN_DEBUG "Single probe : %p %p\n", + entry->single.func, + entry->single.probe_private); + } else { + for (i = 0; entry->multi[i].func; i++) + printk(KERN_DEBUG "Multi probe %d : %p %p\n", i, + entry->multi[i].func, + entry->multi[i].probe_private); + } +} + +static struct marker_probe_closure * +marker_entry_add_probe(struct marker_entry *entry, + marker_probe_func *probe, void *probe_private) +{ + int nr_probes = 0; + struct marker_probe_closure *old, *new; + + WARN_ON(!probe); + + debug_print_probes(entry); + old = entry->multi; + if (!entry->ptype) { + if (entry->single.func == probe && + entry->single.probe_private == probe_private) + return ERR_PTR(-EBUSY); + if (entry->single.func == __mark_empty_function) { + /* 0 -> 1 probes */ + entry->single.func = probe; + entry->single.probe_private = probe_private; + entry->refcount = 1; + entry->ptype = 0; + debug_print_probes(entry); + return NULL; + } else { + /* 1 -> 2 probes */ + nr_probes = 1; + old = NULL; + } + } else { + /* (N -> N+1), (N != 0, 1) probes */ + for (nr_probes = 0; old[nr_probes].func; nr_probes++) + if (old[nr_probes].func == probe + && old[nr_probes].probe_private + == probe_private) + return ERR_PTR(-EBUSY); + } + /* + 2 : one for new probe, one for NULL func */ + new = kzalloc((nr_probes + 2) * sizeof(struct marker_probe_closure), + GFP_KERNEL); + if (new == NULL) + return ERR_PTR(-ENOMEM); + if (!old) + new[0] = entry->single; + else + memcpy(new, old, + nr_probes * sizeof(struct marker_probe_closure)); + new[nr_probes].func = probe; + new[nr_probes].probe_private = probe_private; + entry->refcount = nr_probes + 1; + entry->multi = new; + entry->ptype = 1; + debug_print_probes(entry); + return old; +} + +static struct marker_probe_closure * +marker_entry_remove_probe(struct marker_entry *entry, + marker_probe_func *probe, void *probe_private) +{ + int nr_probes = 0, nr_del = 0, i; + struct marker_probe_closure *old, *new; + + old = entry->multi; + + debug_print_probes(entry); + if (!entry->ptype) { + /* 0 -> N is an error */ + WARN_ON(entry->single.func == __mark_empty_function); + /* 1 -> 0 probes */ + WARN_ON(probe && entry->single.func != probe); + WARN_ON(entry->single.probe_private != probe_private); + entry->single.func = __mark_empty_function; + entry->refcount = 0; + entry->ptype = 0; + debug_print_probes(entry); + return NULL; + } else { + /* (N -> M), (N > 1, M >= 0) probes */ + for (nr_probes = 0; old[nr_probes].func; nr_probes++) { + if ((!probe || old[nr_probes].func == probe) + && old[nr_probes].probe_private + == probe_private) + nr_del++; + } + } + + if (nr_probes - nr_del == 0) { + /* N -> 0, (N > 1) */ + entry->single.func = __mark_empty_function; + entry->refcount = 0; + entry->ptype = 0; + } else if (nr_probes - nr_del == 1) { + /* N -> 1, (N > 1) */ + for (i = 0; old[i].func; i++) + if ((probe && old[i].func != probe) || + old[i].probe_private != 
probe_private) + entry->single = old[i]; + entry->refcount = 1; + entry->ptype = 0; + } else { + int j = 0; + /* N -> M, (N > 1, M > 1) */ + /* + 1 for NULL */ + new = kzalloc((nr_probes - nr_del + 1) + * sizeof(struct marker_probe_closure), GFP_KERNEL); + if (new == NULL) + return ERR_PTR(-ENOMEM); + for (i = 0; old[i].func; i++) + if ((probe && old[i].func != probe) || + old[i].probe_private != probe_private) + new[j++] = old[i]; + entry->refcount = nr_probes - nr_del; + entry->ptype = 1; + entry->multi = new; + } + debug_print_probes(entry); + return old; +} + +/* * Get marker if the marker is present in the marker hash table. * Must be called with markers_mutex held. * Returns NULL if not present. @@ -102,8 +364,7 @@ static struct marker_entry *get_marker(c * Add the marker to the marker hash table. Must be called with markers_mutex * held. */ -static int add_marker(const char *name, const char *format, - marker_probe_func *probe, void *private) +static struct marker_entry *add_marker(const char *name, const char *format) { struct hlist_head *head; struct hlist_node *node; @@ -118,9 +379,8 @@ static int add_marker(const char *name, hlist_for_each_entry(e, node, head, hlist) { if (!strcmp(name, e->name)) { printk(KERN_NOTICE - "Marker %s busy, probe %p already installed\n", - name, e->probe); - return -EBUSY; /* Already there */ + "Marker %s busy\n", name); + return ERR_PTR(-EBUSY); /* Already there */ } } /* @@ -130,34 +390,42 @@ static int add_marker(const char *name, e = kmalloc(sizeof(struct marker_entry) + name_len + format_len, GFP_KERNEL); if (!e) - return -ENOMEM; + return ERR_PTR(-ENOMEM); memcpy(&e->name[0], name, name_len); if (format) { e->format = &e->name[name_len]; memcpy(e->format, format, format_len); + if (strcmp(e->format, MARK_NOARGS) == 0) + e->call = marker_probe_cb_noarg; + else + e->call = marker_probe_cb; trace_mark(core_marker_format, "name %s format %s", e->name, e->format); - } else + } else { e->format = NULL; - e->probe = probe; - e->private = private; + e->call = marker_probe_cb; + } + e->single.func = __mark_empty_function; + e->single.probe_private = NULL; + e->multi = NULL; + e->ptype = 0; e->refcount = 0; + e->rcu_pending = 0; hlist_add_head(&e->hlist, head); - return 0; + return e; } /* * Remove the marker from the marker hash table. Must be called with mutex_lock * held. 
*/ -static void *remove_marker(const char *name) +static int remove_marker(const char *name) { struct hlist_head *head; struct hlist_node *node; struct marker_entry *e; int found = 0; size_t len = strlen(name) + 1; - void *private = NULL; u32 hash = jhash(name, len-1, 0); head = &marker_table[hash & ((1 << MARKER_HASH_BITS)-1)]; @@ -167,12 +435,16 @@ static void *remove_marker(const char *n break; } } - if (found) { - private = e->private; - hlist_del(&e->hlist); - kfree(e); - } - return private; + if (!found) + return -ENOENT; + if (e->single.func != __mark_empty_function) + return -EBUSY; + hlist_del(&e->hlist); + /* Make sure the call_rcu has been executed */ + if (e->rcu_pending) + rcu_barrier(); + kfree(e); + return 0; } /* @@ -184,6 +456,7 @@ static int marker_set_format(struct mark size_t name_len = strlen((*entry)->name) + 1; size_t format_len = strlen(format) + 1; + e = kmalloc(sizeof(struct marker_entry) + name_len + format_len, GFP_KERNEL); if (!e) @@ -191,11 +464,20 @@ static int marker_set_format(struct mark memcpy(&e->name[0], (*entry)->name, name_len); e->format = &e->name[name_len]; memcpy(e->format, format, format_len); - e->probe = (*entry)->probe; - e->private = (*entry)->private; + if (strcmp(e->format, MARK_NOARGS) == 0) + e->call = marker_probe_cb_noarg; + else + e->call = marker_probe_cb; + e->single = (*entry)->single; + e->multi = (*entry)->multi; + e->ptype = (*entry)->ptype; e->refcount = (*entry)->refcount; + e->rcu_pending = 0; hlist_add_before(&e->hlist, &(*entry)->hlist); hlist_del(&(*entry)->hlist); + /* Make sure the call_rcu has been executed */ + if ((*entry)->rcu_pending) + rcu_barrier(); kfree(*entry); *entry = e; trace_mark(core_marker_format, "name %s format %s", @@ -206,7 +488,8 @@ static int marker_set_format(struct mark /* * Sets the probe callback corresponding to one marker. */ -static int set_marker(struct marker_entry **entry, struct marker *elem) +static int set_marker(struct marker_entry **entry, struct marker *elem, + int active) { int ret; WARN_ON(strcmp((*entry)->name, elem->name) != 0); @@ -226,26 +509,64 @@ static int set_marker(struct marker_entr if (ret) return ret; } - elem->call = (*entry)->probe; - elem->private = (*entry)->private; - elem->state = 1; + + /* + * probe_cb setup (statically known) is done here. It is + * asynchronous with the rest of execution, therefore we only + * pass from a "safe" callback (with argument) to an "unsafe" + * callback (does not set arguments). + */ + elem->call = (*entry)->call; + /* + * Sanity check : + * We only update the single probe private data when the ptr is + * set to a _non_ single probe! (0 -> 1 and N -> 1, N != 1) + */ + WARN_ON(elem->single.func != __mark_empty_function + && elem->single.probe_private + != (*entry)->single.probe_private && + !elem->ptype); + elem->single.probe_private = (*entry)->single.probe_private; + /* + * Make sure the private data is valid when we update the + * single probe ptr. + */ + smp_wmb(); + elem->single.func = (*entry)->single.func; + /* + * We also make sure that the new probe callbacks array is consistent + * before setting a pointer to it. + */ + rcu_assign_pointer(elem->multi, (*entry)->multi); + /* + * Update the function or multi probe array pointer before setting the + * ptype. + */ + smp_wmb(); + elem->ptype = (*entry)->ptype; + elem->state__imv = active; + return 0; } /* * Disable a marker and its probe callback. 
- * Note: only after a synchronize_sched() issued after setting elem->call to the - * empty function insures that the original callback is not used anymore. This - * insured by preemption disabling around the call site. + * Note: only waiting an RCU period after setting elem->call to the empty + * function insures that the original callback is not used anymore. This insured + * by preempt_disable around the call site. */ static void disable_marker(struct marker *elem) { - elem->state = 0; - elem->call = __mark_empty_function; + /* leave "call" as is. It is known statically. */ + elem->state__imv = 0; + elem->single.func = __mark_empty_function; + /* Update the function before setting the ptype */ + smp_wmb(); + elem->ptype = 0; /* single probe */ /* * Leave the private data and id there, because removal is racy and - * should be done only after a synchronize_sched(). These are never used - * until the next initialization anyway. + * should be done only after an RCU period. These are never used until + * the next initialization anyway. */ } @@ -253,14 +574,11 @@ static void disable_marker(struct marker * marker_update_probe_range - Update a probe range * @begin: beginning of the range * @end: end of the range - * @probe_module: module address of the probe being updated - * @refcount: number of references left to the given probe_module (out) * * Updates the probe callback corresponding to a range of markers. */ void marker_update_probe_range(struct marker *begin, - struct marker *end, struct module *probe_module, - int *refcount) + struct marker *end) { struct marker *iter; struct marker_entry *mark_entry; @@ -268,15 +586,12 @@ void marker_update_probe_range(struct ma mutex_lock(&markers_mutex); for (iter = begin; iter < end; iter++) { mark_entry = get_marker(iter->name); - if (mark_entry && mark_entry->refcount) { - set_marker(&mark_entry, iter); + if (mark_entry) { + set_marker(&mark_entry, iter, + !!mark_entry->refcount); /* * ignore error, continue */ - if (probe_module) - if (probe_module == - __module_text_address((unsigned long)mark_entry->probe)) - (*refcount)++; } else { disable_marker(iter); } @@ -286,23 +601,30 @@ void marker_update_probe_range(struct ma /* * Update probes, removing the faulty probes. - * Issues a synchronize_sched() when no reference to the module passed - * as parameter is found in the probes so the probe module can be - * safely unloaded from now on. + * + * Internal callback only changed before the first probe is connected to it. + * Single probe private data can only be changed on 0 -> 1 and 2 -> 1 + * transitions. All other transitions will leave the old private data valid. + * This makes the non-atomicity of the callback/private data updates valid. + * + * "special case" updates : + * 0 -> 1 callback + * 1 -> 0 callback + * 1 -> 2 callbacks + * 2 -> 1 callbacks + * Other updates all behave the same, just like the 2 -> 3 or 3 -> 2 updates. + * Site effect : marker_set_format may delete the marker entry (creating a + * replacement). */ -static void marker_update_probes(struct module *probe_module) +static void marker_update_probes(void) { - int refcount = 0; - /* Core kernel markers */ - marker_update_probe_range(__start___markers, - __stop___markers, probe_module, &refcount); + marker_update_probe_range(__start___markers, __stop___markers); /* Markers in modules. 
*/ - module_update_markers(probe_module, &refcount); - if (probe_module && refcount == 0) { - synchronize_sched(); - deferred_sync = 0; - } + module_update_markers(); + /* Update immediate values */ + core_imv_update(); + module_imv_update(); } /** @@ -310,33 +632,52 @@ static void marker_update_probes(struct * @name: marker name * @format: format string * @probe: probe handler - * @private: probe private data + * @probe_private: probe private data * * private data must be a valid allocated memory address, or NULL. * Returns 0 if ok, error value on error. + * The probe address must at least be aligned on the architecture pointer size. */ int marker_probe_register(const char *name, const char *format, - marker_probe_func *probe, void *private) + marker_probe_func *probe, void *probe_private) { struct marker_entry *entry; int ret = 0; + struct marker_probe_closure *old; mutex_lock(&markers_mutex); entry = get_marker(name); - if (entry && entry->refcount) { - ret = -EBUSY; - goto end; - } - if (deferred_sync) { - synchronize_sched(); - deferred_sync = 0; + if (!entry) { + entry = add_marker(name, format); + if (IS_ERR(entry)) { + ret = PTR_ERR(entry); + goto end; + } } - ret = add_marker(name, format, probe, private); - if (ret) + /* + * If we detect that a call_rcu is pending for this marker, + * make sure it's executed now. + */ + if (entry->rcu_pending) + rcu_barrier(); + old = marker_entry_add_probe(entry, probe, probe_private); + if (IS_ERR(old)) { + ret = PTR_ERR(old); goto end; + } mutex_unlock(&markers_mutex); - marker_update_probes(NULL); - return ret; + marker_update_probes(); /* may update entry */ + mutex_lock(&markers_mutex); + entry = get_marker(name); + WARN_ON(!entry); + entry->oldptr = old; + entry->rcu_pending = 1; + /* write rcu_pending before calling the RCU callback */ + smp_wmb(); +#ifdef CONFIG_PREEMPT_RCU + synchronize_sched(); /* Until we have the call_rcu_sched() */ +#endif + call_rcu(&entry->rcu, free_old_closure); end: mutex_unlock(&markers_mutex); return ret; @@ -346,171 +687,173 @@ EXPORT_SYMBOL_GPL(marker_probe_register) /** * marker_probe_unregister - Disconnect a probe from a marker * @name: marker name + * @probe: probe function pointer + * @probe_private: probe private data * * Returns the private data given to marker_probe_register, or an ERR_PTR(). + * We do not need to call a synchronize_sched to make sure the probes have + * finished running before doing a module unload, because the module unload + * itself uses stop_machine(), which insures that every preempt disabled section + * have finished. */ -void *marker_probe_unregister(const char *name) +int marker_probe_unregister(const char *name, + marker_probe_func *probe, void *probe_private) { - struct module *probe_module; struct marker_entry *entry; - void *private; + struct marker_probe_closure *old; + int ret = -ENOENT; mutex_lock(&markers_mutex); entry = get_marker(name); - if (!entry) { - private = ERR_PTR(-ENOENT); + if (!entry) goto end; - } - entry->refcount = 0; - /* In what module is the probe handler ? 
*/ - probe_module = __module_text_address((unsigned long)entry->probe); - private = remove_marker(name); - deferred_sync = 1; + if (entry->rcu_pending) + rcu_barrier(); + old = marker_entry_remove_probe(entry, probe, probe_private); mutex_unlock(&markers_mutex); - marker_update_probes(probe_module); - return private; + marker_update_probes(); /* may update entry */ + mutex_lock(&markers_mutex); + entry = get_marker(name); + if (!entry) + goto end; + entry->oldptr = old; + entry->rcu_pending = 1; + /* write rcu_pending before calling the RCU callback */ + smp_wmb(); +#ifdef CONFIG_PREEMPT_RCU + synchronize_sched(); /* Until we have the call_rcu_sched() */ +#endif + call_rcu(&entry->rcu, free_old_closure); + remove_marker(name); /* Ignore busy error message */ + ret = 0; end: mutex_unlock(&markers_mutex); - return private; + return ret; } EXPORT_SYMBOL_GPL(marker_probe_unregister); -/** - * marker_probe_unregister_private_data - Disconnect a probe from a marker - * @private: probe private data - * - * Unregister a marker by providing the registered private data. - * Returns the private data given to marker_probe_register, or an ERR_PTR(). - */ -void *marker_probe_unregister_private_data(void *private) +static struct marker_entry * +get_marker_from_private_data(marker_probe_func *probe, void *probe_private) { - struct module *probe_module; - struct hlist_head *head; - struct hlist_node *node; struct marker_entry *entry; - int found = 0; unsigned int i; + struct hlist_head *head; + struct hlist_node *node; - mutex_lock(&markers_mutex); for (i = 0; i < MARKER_TABLE_SIZE; i++) { head = &marker_table[i]; hlist_for_each_entry(entry, node, head, hlist) { - if (entry->private == private) { - found = 1; - goto iter_end; + if (!entry->ptype) { + if (entry->single.func == probe + && entry->single.probe_private + == probe_private) + return entry; + } else { + struct marker_probe_closure *closure; + closure = entry->multi; + for (i = 0; closure[i].func; i++) { + if (closure[i].func == probe && + closure[i].probe_private + == probe_private) + return entry; + } } } } -iter_end: - if (!found) { - private = ERR_PTR(-ENOENT); - goto end; - } - entry->refcount = 0; - /* In what module is the probe handler ? */ - probe_module = __module_text_address((unsigned long)entry->probe); - private = remove_marker(entry->name); - deferred_sync = 1; - mutex_unlock(&markers_mutex); - marker_update_probes(probe_module); - return private; -end: - mutex_unlock(&markers_mutex); - return private; + return NULL; } -EXPORT_SYMBOL_GPL(marker_probe_unregister_private_data); /** - * marker_arm - Arm a marker - * @name: marker name + * marker_probe_unregister_private_data - Disconnect a probe from a marker + * @probe: probe function + * @probe_private: probe private data * - * Activate a marker. It keeps a reference count of the number of - * arming/disarming done. - * Returns 0 if ok, error value on error. + * Unregister a probe by providing the registered private data. + * Only removes the first marker found in hash table. + * Return 0 on success or error value. + * We do not need to call a synchronize_sched to make sure the probes have + * finished running before doing a module unload, because the module unload + * itself uses stop_machine(), which insures that every preempt disabled section + * have finished. 
*/ -int marker_arm(const char *name) +int marker_probe_unregister_private_data(marker_probe_func *probe, + void *probe_private) { struct marker_entry *entry; int ret = 0; + struct marker_probe_closure *old; mutex_lock(&markers_mutex); - entry = get_marker(name); + entry = get_marker_from_private_data(probe, probe_private); if (!entry) { ret = -ENOENT; goto end; } - /* - * Only need to update probes when refcount passes from 0 to 1. - */ - if (entry->refcount++) - goto end; -end: + if (entry->rcu_pending) + rcu_barrier(); + old = marker_entry_remove_probe(entry, NULL, probe_private); mutex_unlock(&markers_mutex); - marker_update_probes(NULL); - return ret; -} -EXPORT_SYMBOL_GPL(marker_arm); - -/** - * marker_disarm - Disarm a marker - * @name: marker name - * - * Disarm a marker. It keeps a reference count of the number of arming/disarming - * done. - * Returns 0 if ok, error value on error. - */ -int marker_disarm(const char *name) -{ - struct marker_entry *entry; - int ret = 0; - + marker_update_probes(); /* may update entry */ mutex_lock(&markers_mutex); - entry = get_marker(name); - if (!entry) { - ret = -ENOENT; - goto end; - } - /* - * Only permit decrement refcount if higher than 0. - * Do probe update only on 1 -> 0 transition. - */ - if (entry->refcount) { - if (--entry->refcount) - goto end; - } else { - ret = -EPERM; - goto end; - } + entry = get_marker_from_private_data(probe, probe_private); + WARN_ON(!entry); + entry->oldptr = old; + entry->rcu_pending = 1; + /* write rcu_pending before calling the RCU callback */ + smp_wmb(); +#ifdef CONFIG_PREEMPT_RCU + synchronize_sched(); /* Until we have the call_rcu_sched() */ +#endif + call_rcu(&entry->rcu, free_old_closure); + remove_marker(entry->name); /* Ignore busy error message */ end: mutex_unlock(&markers_mutex); - marker_update_probes(NULL); return ret; } -EXPORT_SYMBOL_GPL(marker_disarm); +EXPORT_SYMBOL_GPL(marker_probe_unregister_private_data); /** * marker_get_private_data - Get a marker's probe private data * @name: marker name + * @probe: probe to match + * @num: get the nth matching probe's private data * + * Returns the nth private data pointer (starting from 0) matching, or an + * ERR_PTR. * Returns the private data pointer, or an ERR_PTR. * The private data pointer should _only_ be dereferenced if the caller is the * owner of the data, or its content could vanish. This is mostly used to * confirm that a caller is the owner of a registered probe. 
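To make the reworked interface concrete, a hypothetical subsystem could wire a marker and a probe together as sketched below. The marker name my_subsys_event, its format string and the my_* functions are illustrative examples, not part of this patch; only the API calls come from the code above.

#include <linux/kernel.h>
#include <linux/marker.h>
#include <linux/module.h>

/* Instrumentation site: name is unquoted, format describes the arguments. */
static void my_subsys_do_work(int cpu, const char *what)
{
	trace_mark(my_subsys_event, "cpu %d what %s", cpu, what);
}

/* Probe with the new signature: probe private data, call-site private data,
 * the format string, and a pointer to the va_list. */
static void my_probe(void *probe_private, void *call_private,
		     const char *fmt, va_list *args)
{
	int cpu = va_arg(*args, int);
	const char *what = va_arg(*args, const char *);

	(void)cpu;		/* consume the arguments in format order */
	(void)what;
}

static int __init my_subsys_init(void)
{
	/* Registration also arms the marker; marker_arm()/marker_disarm()
	 * no longer exist after this patch. */
	return marker_probe_register("my_subsys_event", "cpu %d what %s",
				     my_probe, NULL);
}

static void __exit my_subsys_exit(void)
{
	marker_probe_unregister("my_subsys_event", my_probe, NULL);
}

module_init(my_subsys_init);
module_exit(my_subsys_exit);

If the probe carried per-probe state instead of NULL, marker_get_private_data("my_subsys_event", my_probe, 0) would return that pointer for the first matching probe.
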
*/ -void *marker_get_private_data(const char *name) +void *marker_get_private_data(const char *name, marker_probe_func *probe, + int num) { struct hlist_head *head; struct hlist_node *node; struct marker_entry *e; size_t name_len = strlen(name) + 1; u32 hash = jhash(name, name_len-1, 0); - int found = 0; + int i; head = &marker_table[hash & ((1 << MARKER_HASH_BITS)-1)]; hlist_for_each_entry(e, node, head, hlist) { if (!strcmp(name, e->name)) { - found = 1; - return e->private; + if (!e->ptype) { + if (num == 0 && e->single.func == probe) + return e->single.probe_private; + else + break; + } else { + struct marker_probe_closure *closure; + int match = 0; + closure = e->multi; + for (i = 0; closure[i].func; i++) { + if (closure[i].func != probe) + continue; + if (match++ == num) + return closure[i].probe_private; + } + } } } return ERR_PTR(-ENOENT); Index: linux-2.6.24.7/kernel/module.c =================================================================== --- linux-2.6.24.7.orig/kernel/module.c +++ linux-2.6.24.7/kernel/module.c @@ -33,6 +33,7 @@ #include <linux/cpu.h> #include <linux/moduleparam.h> #include <linux/errno.h> +#include <linux/immediate.h> #include <linux/err.h> #include <linux/vermagic.h> #include <linux/notifier.h> @@ -46,6 +47,8 @@ #include <asm/semaphore.h> #include <asm/cacheflush.h> #include <linux/license.h> +#include <asm/sections.h> +#include <linux/marker.h> extern int module_sysfs_initialized; @@ -1675,6 +1678,8 @@ static struct module *load_module(void _ unsigned int unusedcrcindex; unsigned int unusedgplindex; unsigned int unusedgplcrcindex; + unsigned int immediateindex; + unsigned int immediatecondendindex; unsigned int markersindex; unsigned int markersstringsindex; struct module *mod; @@ -1773,6 +1778,9 @@ static struct module *load_module(void _ #ifdef ARCH_UNWIND_SECTION_NAME unwindex = find_sec(hdr, sechdrs, secstrings, ARCH_UNWIND_SECTION_NAME); #endif + immediateindex = find_sec(hdr, sechdrs, secstrings, "__imv"); + immediatecondendindex = find_sec(hdr, sechdrs, secstrings, + "__imv_cond_end"); /* Don't keep modinfo section */ sechdrs[infoindex].sh_flags &= ~(unsigned long)SHF_ALLOC; @@ -1924,6 +1932,16 @@ static struct module *load_module(void _ mod->gpl_future_syms = (void *)sechdrs[gplfutureindex].sh_addr; if (gplfuturecrcindex) mod->gpl_future_crcs = (void *)sechdrs[gplfuturecrcindex].sh_addr; +#ifdef CONFIG_IMMEDIATE + mod->immediate = (void *)sechdrs[immediateindex].sh_addr; + mod->num_immediate = + sechdrs[immediateindex].sh_size / sizeof(*mod->immediate); + mod->immediate_cond_end = + (void *)sechdrs[immediatecondendindex].sh_addr; + mod->num_immediate_cond_end = + sechdrs[immediatecondendindex].sh_size + / sizeof(*mod->immediate_cond_end); +#endif mod->unused_syms = (void *)sechdrs[unusedindex].sh_addr; if (unusedcrcindex) @@ -1991,11 +2009,17 @@ static struct module *load_module(void _ add_kallsyms(mod, sechdrs, symindex, strindex, secstrings); + if (!(mod->taints & TAINT_FORCED_MODULE)) { #ifdef CONFIG_MARKERS - if (!mod->taints) marker_update_probe_range(mod->markers, - mod->markers + mod->num_markers, NULL, NULL); + mod->markers + mod->num_markers); +#endif +#ifdef CONFIG_IMMEDIATE + /* Immediate values must be updated after markers */ + imv_update_range(mod->immediate, + mod->immediate + mod->num_immediate); #endif + } err = module_finalize(hdr, sechdrs, mod); if (err < 0) goto cleanup; @@ -2142,6 +2166,10 @@ sys_init_module(void __user *umod, /* Drop initial reference. 
*/ module_put(mod); unwind_remove_table(mod->unwind_info, 1); +#ifdef CONFIG_IMMEDIATE + imv_unref(mod->immediate, mod->immediate + mod->num_immediate, + mod->module_init, mod->init_size); +#endif module_free(mod, mod->module_init); mod->module_init = NULL; mod->init_size = 0; @@ -2596,7 +2624,7 @@ EXPORT_SYMBOL(struct_module); #endif #ifdef CONFIG_MARKERS -void module_update_markers(struct module *probe_module, int *refcount) +void module_update_markers(void) { struct module *mod; @@ -2604,8 +2632,61 @@ void module_update_markers(struct module list_for_each_entry(mod, &modules, list) if (!mod->taints) marker_update_probe_range(mod->markers, - mod->markers + mod->num_markers, - probe_module, refcount); + mod->markers + mod->num_markers); mutex_unlock(&module_mutex); } #endif + +#ifdef CONFIG_IMMEDIATE +/** + * _module_imv_update - update all immediate values in the kernel + * + * Iterate on the kernel core and modules to update the immediate values. + * Module_mutex must be held be the caller. + */ +void _module_imv_update(void) +{ + struct module *mod; + + list_for_each_entry(mod, &modules, list) { + if (mod->taints) + continue; + imv_update_range(mod->immediate, + mod->immediate + mod->num_immediate); + } +} +EXPORT_SYMBOL_GPL(_module_imv_update); + +/** + * module_imv_update - update all immediate values in the kernel + * + * Iterate on the kernel core and modules to update the immediate values. + * Takes module_mutex. + */ +void module_imv_update(void) +{ + mutex_lock(&module_mutex); + _module_imv_update(); + mutex_unlock(&module_mutex); +} +EXPORT_SYMBOL_GPL(module_imv_update); + +/** + * is_imv_cond_end_module + * + * Check if the two given addresses are located in the immediate value condition + * end table. Addresses should be in the same object. + * The module mutex should be held. 
+ */ +int is_imv_cond_end_module(unsigned long addr1, unsigned long addr2) +{ + struct module *mod = __module_text_address(addr1); + + if (!mod) + return 0; + + return _is_imv_cond_end(mod->immediate_cond_end, + mod->immediate_cond_end + mod->num_immediate_cond_end, + addr1, addr2); +} +#endif �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-upstream.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000676765�11041664224�015413� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- Makefile | 4 arch/x86/Kconfig | 2 arch/x86/kernel/Makefile_32 | 9 arch/x86/kernel/Makefile_64 | 9 arch/x86/kernel/alternative.c | 19 arch/x86/kernel/entry_32.S | 68 arch/x86/kernel/entry_64.S | 104 + arch/x86/kernel/ftrace.c | 160 + arch/x86/kernel/i386_ksyms_32.c | 6 arch/x86/kernel/nmi_32.c | 3 arch/x86/kernel/nmi_64.c | 3 arch/x86/kernel/process_32.c | 3 arch/x86/kernel/process_64.c | 3 arch/x86/kernel/tsc_32.c | 2 arch/x86/kernel/tsc_64.c | 4 arch/x86/kernel/vsyscall_64.c | 2 arch/x86/kernel/x8664_ksyms_64.c | 6 arch/x86/lib/Makefile_32 | 2 arch/x86/lib/thunk_32.S | 47 arch/x86/lib/thunk_64.S | 18 arch/x86/mm/init_32.c | 2 arch/x86/mm/init_64.c | 2 arch/x86/vdso/vclock_gettime.c | 15 arch/x86/vdso/vgetcpu.c | 3 include/asm-x86/alternative_32.h | 2 include/asm-x86/alternative_64.h | 2 include/asm-x86/asm.h | 39 include/asm-x86/irqflags_32.h | 21 include/asm-x86/vsyscall.h | 2 include/linux/ftrace.h | 133 + include/linux/irqflags.h | 9 include/linux/ktime.h | 6 include/linux/linkage.h | 2 include/linux/mmiotrace.h | 85 include/linux/preempt.h | 34 include/linux/sched.h | 16 include/linux/writeback.h | 2 kernel/Makefile | 14 kernel/fork.c | 2 kernel/lockdep.c | 25 kernel/sched.c | 54 kernel/sched_trace.h | 41 kernel/sysctl.c | 11 kernel/trace/Kconfig | 127 + kernel/trace/Makefile | 23 kernel/trace/ftrace.c | 1488 ++++++++++++++++ kernel/trace/trace.c | 3112 ++++++++++++++++++++++++++++++++++ kernel/trace/trace.h | 375 ++++ kernel/trace/trace_functions.c | 78 kernel/trace/trace_irqsoff.c | 486 +++++ kernel/trace/trace_mmiotrace.c | 295 +++ kernel/trace/trace_sched_switch.c | 196 ++ kernel/trace/trace_sched_wakeup.c | 447 ++++ kernel/trace/trace_selftest.c | 563 ++++++ kernel/trace/trace_selftest_dynamic.c | 7 lib/Kconfig.debug | 2 lib/Makefile | 9 mm/page-writeback.c | 10 scripts/Makefile.lib | 3 59 files changed, 8158 insertions(+), 59 deletions(-) Index: linux-2.6.24.7/Makefile =================================================================== --- linux-2.6.24.7.orig/Makefile +++ linux-2.6.24.7/Makefile @@ -520,6 +520,10 @@ KBUILD_CFLAGS += -g KBUILD_AFLAGS += -gdwarf-2 endif +ifdef CONFIG_FTRACE +KBUILD_CFLAGS += -pg +endif + # Force gcc to behave correct even for buggy distributions KBUILD_CFLAGS += $(call cc-option, -fno-stack-protector) Index: linux-2.6.24.7/arch/x86/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/x86/Kconfig +++ linux-2.6.24.7/arch/x86/Kconfig @@ -19,6 +19,8 @@ 
config X86_64 config X86 bool default y + select HAVE_DYNAMIC_FTRACE + select HAVE_FTRACE config GENERIC_TIME bool Index: linux-2.6.24.7/arch/x86/kernel/Makefile_32 =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/Makefile_32 +++ linux-2.6.24.7/arch/x86/kernel/Makefile_32 @@ -10,6 +10,14 @@ obj-y := process_32.o signal_32.o entry_ pci-dma_32.o i386_ksyms_32.o i387_32.o bootflag.o e820_32.o\ quirks.o i8237.o topology.o alternative.o i8253.o tsc_32.o +ifdef CONFIG_FTRACE +# Do not profile debug utilities +CFLAGS_REMOVE_tsc_32.o = -pg +ifdef CONFIG_DYNAMIC_FTRACE +CFLAGS_REMOVE_ftrace.o = -pg +endif +endif + obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += cpu/ obj-y += acpi/ @@ -28,6 +36,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse_32. obj-$(CONFIG_X86_LOCAL_APIC) += apic_32.o nmi_32.o obj-$(CONFIG_X86_IO_APIC) += io_apic_32.o obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o +obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o obj-$(CONFIG_KEXEC) += machine_kexec_32.o relocate_kernel_32.o crash.o obj-$(CONFIG_CRASH_DUMP) += crash_dump_32.o obj-$(CONFIG_X86_NUMAQ) += numaq_32.o Index: linux-2.6.24.7/arch/x86/kernel/Makefile_64 =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/Makefile_64 +++ linux-2.6.24.7/arch/x86/kernel/Makefile_64 @@ -13,6 +13,14 @@ obj-y := process_64.o signal_64.o entry_ pci-dma_64.o pci-nommu_64.o alternative.o hpet.o tsc_64.o bugs_64.o \ i8253.o +ifdef CONFIG_FTRACE +# Do not profile debug utilities +CFLAGS_REMOVE_tsc_64.o = -pg +ifdef CONFIG_DYNAMIC_FTRACE +CFLAGS_REMOVE_ftrace.o = -pg +endif +endif + obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += cpu/ obj-y += acpi/ @@ -22,6 +30,7 @@ obj-$(CONFIG_X86_CPUID) += cpuid.o obj-$(CONFIG_SMP) += smp_64.o smpboot_64.o trampoline_64.o tsc_sync.o obj-y += apic_64.o nmi_64.o obj-y += io_apic_64.o mpparse_64.o genapic_64.o genapic_flat_64.o +obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o obj-$(CONFIG_KEXEC) += machine_kexec_64.o relocate_kernel_64.o crash.o obj-$(CONFIG_CRASH_DUMP) += crash_dump_64.o obj-$(CONFIG_PM) += suspend_64.o Index: linux-2.6.24.7/arch/x86/kernel/alternative.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/alternative.c +++ linux-2.6.24.7/arch/x86/kernel/alternative.c @@ -65,7 +65,8 @@ __setup("noreplace-paravirt", setup_nore get them easily into strings. 
*/ asm("\t.section .rodata, \"a\"\nintelnops: " GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6 - GENERIC_NOP7 GENERIC_NOP8); + GENERIC_NOP7 GENERIC_NOP8 + "\t.previous"); extern const unsigned char intelnops[]; static const unsigned char *const intel_nops[ASM_NOP_MAX+1] = { NULL, @@ -83,7 +84,8 @@ static const unsigned char *const intel_ #ifdef K8_NOP1 asm("\t.section .rodata, \"a\"\nk8nops: " K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6 - K8_NOP7 K8_NOP8); + K8_NOP7 K8_NOP8 + "\t.previous"); extern const unsigned char k8nops[]; static const unsigned char *const k8_nops[ASM_NOP_MAX+1] = { NULL, @@ -101,7 +103,8 @@ static const unsigned char *const k8_nop #ifdef K7_NOP1 asm("\t.section .rodata, \"a\"\nk7nops: " K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6 - K7_NOP7 K7_NOP8); + K7_NOP7 K7_NOP8 + "\t.previous"); extern const unsigned char k7nops[]; static const unsigned char *const k7_nops[ASM_NOP_MAX+1] = { NULL, @@ -119,7 +122,8 @@ static const unsigned char *const k7_nop #ifdef P6_NOP1 asm("\t.section .rodata, \"a\"\np6nops: " P6_NOP1 P6_NOP2 P6_NOP3 P6_NOP4 P6_NOP5 P6_NOP6 - P6_NOP7 P6_NOP8); + P6_NOP7 P6_NOP8 + "\t.previous"); extern const unsigned char p6nops[]; static const unsigned char *const p6_nops[ASM_NOP_MAX+1] = { NULL, @@ -137,7 +141,7 @@ static const unsigned char *const p6_nop #ifdef CONFIG_X86_64 extern char __vsyscall_0; -static inline const unsigned char*const * find_nop_table(void) +const unsigned char *const *find_nop_table(void) { return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL || boot_cpu_data.x86 < 6 ? k8_nops : p6_nops; @@ -156,7 +160,7 @@ static const struct nop { { -1, NULL } }; -static const unsigned char*const * find_nop_table(void) +const unsigned char *const *find_nop_table(void) { const unsigned char *const *noptable = intel_nops; int i; @@ -173,7 +177,7 @@ static const unsigned char*const * find_ #endif /* CONFIG_X86_64 */ /* Use this to add nops to a buffer, then text_poke the whole buffer. */ -static void add_nops(void *insns, unsigned int len) +void add_nops(void *insns, unsigned int len) { const unsigned char *const *noptable = find_nop_table(); @@ -186,6 +190,7 @@ static void add_nops(void *insns, unsign len -= noplen; } } +EXPORT_SYMBOL_GPL(add_nops); extern struct alt_instr __alt_instructions[], __alt_instructions_end[]; extern u8 *__smp_locks[], *__smp_locks_end[]; Index: linux-2.6.24.7/arch/x86/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_32.S +++ linux-2.6.24.7/arch/x86/kernel/entry_32.S @@ -1110,6 +1110,74 @@ ENDPROC(xen_failsafe_callback) #endif /* CONFIG_XEN */ +#ifdef CONFIG_FTRACE +#ifdef CONFIG_DYNAMIC_FTRACE + +ENTRY(mcount) + pushl %eax + pushl %ecx + pushl %edx + movl 0xc(%esp), %eax + +.globl mcount_call +mcount_call: + call ftrace_stub + + popl %edx + popl %ecx + popl %eax + + ret +END(mcount) + +ENTRY(ftrace_caller) + pushl %eax + pushl %ecx + pushl %edx + movl 0xc(%esp), %eax + movl 0x4(%ebp), %edx + +.globl ftrace_call +ftrace_call: + call ftrace_stub + + popl %edx + popl %ecx + popl %eax + +.globl ftrace_stub +ftrace_stub: + ret +END(ftrace_caller) + +#else /* ! 
CONFIG_DYNAMIC_FTRACE */ + +ENTRY(mcount) + cmpl $ftrace_stub, ftrace_trace_function + jnz trace +.globl ftrace_stub +ftrace_stub: + ret + + /* taken from glibc */ +trace: + pushl %eax + pushl %ecx + pushl %edx + movl 0xc(%esp), %eax + movl 0x4(%ebp), %edx + + call *ftrace_trace_function + + popl %edx + popl %ecx + popl %eax + + jmp ftrace_stub +END(mcount) +#endif /* CONFIG_DYNAMIC_FTRACE */ +#endif /* CONFIG_FTRACE */ + .section .rodata,"a" #include "syscall_table_32.S" Index: linux-2.6.24.7/arch/x86/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_64.S +++ linux-2.6.24.7/arch/x86/kernel/entry_64.S @@ -53,6 +53,110 @@ .code64 +#ifdef CONFIG_FTRACE +#ifdef CONFIG_DYNAMIC_FTRACE +ENTRY(mcount) + + subq $0x38, %rsp + movq %rax, (%rsp) + movq %rcx, 8(%rsp) + movq %rdx, 16(%rsp) + movq %rsi, 24(%rsp) + movq %rdi, 32(%rsp) + movq %r8, 40(%rsp) + movq %r9, 48(%rsp) + + movq 0x38(%rsp), %rdi + +.globl mcount_call +mcount_call: + call ftrace_stub + + movq 48(%rsp), %r9 + movq 40(%rsp), %r8 + movq 32(%rsp), %rdi + movq 24(%rsp), %rsi + movq 16(%rsp), %rdx + movq 8(%rsp), %rcx + movq (%rsp), %rax + addq $0x38, %rsp + + retq +END(mcount) + +ENTRY(ftrace_caller) + + /* taken from glibc */ + subq $0x38, %rsp + movq %rax, (%rsp) + movq %rcx, 8(%rsp) + movq %rdx, 16(%rsp) + movq %rsi, 24(%rsp) + movq %rdi, 32(%rsp) + movq %r8, 40(%rsp) + movq %r9, 48(%rsp) + + movq 0x38(%rsp), %rdi + movq 8(%rbp), %rsi + +.globl ftrace_call +ftrace_call: + call ftrace_stub + + movq 48(%rsp), %r9 + movq 40(%rsp), %r8 + movq 32(%rsp), %rdi + movq 24(%rsp), %rsi + movq 16(%rsp), %rdx + movq 8(%rsp), %rcx + movq (%rsp), %rax + addq $0x38, %rsp + +.globl ftrace_stub +ftrace_stub: + retq +END(ftrace_caller) + +#else /* ! CONFIG_DYNAMIC_FTRACE */ +ENTRY(mcount) + cmpq $ftrace_stub, ftrace_trace_function + jnz trace +.globl ftrace_stub +ftrace_stub: + retq + +trace: + /* taken from glibc */ + subq $0x38, %rsp + movq %rax, (%rsp) + movq %rcx, 8(%rsp) + movq %rdx, 16(%rsp) + movq %rsi, 24(%rsp) + movq %rdi, 32(%rsp) + movq %r8, 40(%rsp) + movq %r9, 48(%rsp) + + movq 0x38(%rsp), %rdi + movq 8(%rbp), %rsi + + call *ftrace_trace_function + + movq 48(%rsp), %r9 + movq 40(%rsp), %r8 + movq 32(%rsp), %rdi + movq 24(%rsp), %rsi + movq 16(%rsp), %rdx + movq 8(%rsp), %rcx + movq (%rsp), %rax + addq $0x38, %rsp + + jmp ftrace_stub +END(mcount) +#endif /* CONFIG_DYNAMIC_FTRACE */ +#endif /* CONFIG_FTRACE */ + +#define HARDNMI_MASK 0x40000000 + #ifndef CONFIG_PREEMPT #define retint_kernel retint_restore_args #endif Index: linux-2.6.24.7/arch/x86/kernel/ftrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/x86/kernel/ftrace.c @@ -0,0 +1,160 @@ +/* + * Code for replacing ftrace calls with jumps. + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * + * Thanks goes to Ingo Molnar, for suggesting the idea. + * Mathieu Desnoyers, for suggesting postponing the modifications. + * Arjan van de Ven, for keeping me straight, and explaining to me + * the dangers of modifying code on the run. 
+ */ + +#include <linux/spinlock.h> +#include <linux/hardirq.h> +#include <linux/ftrace.h> +#include <linux/percpu.h> +#include <linux/init.h> +#include <linux/list.h> + +#include <asm/alternative.h> +#include <asm/asm.h> + +#define CALL_BACK 5 + +/* Long is fine, even if it is only 4 bytes ;-) */ +static long *ftrace_nop; + +union ftrace_code_union { + char code[5]; + struct { + char e8; + int offset; + } __attribute__((packed)); +}; + +notrace int ftrace_ip_converted(unsigned long ip) +{ + unsigned long save; + + ip -= CALL_BACK; + save = *(long *)ip; + + return save == *ftrace_nop; +} + +static int notrace ftrace_calc_offset(long ip, long addr) +{ + return (int)(addr - ip); +} + +notrace unsigned char *ftrace_nop_replace(void) +{ + return (char *)ftrace_nop; +} + +notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) +{ + static union ftrace_code_union calc; + + calc.e8 = 0xe8; + calc.offset = ftrace_calc_offset(ip, addr); + + /* + * No locking needed, this must be called via kstop_machine + * which in essence is like running on a uniprocessor machine. + */ + return calc.code; +} + +notrace int +ftrace_modify_code(unsigned long ip, unsigned char *old_code, + unsigned char *new_code) +{ + unsigned replaced; + unsigned old = *(unsigned *)old_code; /* 4 bytes */ + unsigned new = *(unsigned *)new_code; /* 4 bytes */ + unsigned char newch = new_code[4]; + int faulted = 0; + + /* move the IP back to the start of the call */ + ip -= CALL_BACK; + + /* + * Note: Due to modules and __init, code can + * disappear and change, we need to protect against faulting + * as well as code changing. + * + * No real locking needed, this code is run through + * kstop_machine. + */ + asm volatile ( + "1: lock\n" + " cmpxchg %3, (%2)\n" + " jnz 2f\n" + " movb %b4, 4(%2)\n" + "2:\n" + ".section .fixup, \"ax\"\n" + "3: movl $1, %0\n" + " jmp 2b\n" + ".previous\n" + _ASM_EXTABLE(1b, 3b) + : "=r"(faulted), "=a"(replaced) + : "r"(ip), "r"(new), "r"(newch), + "0"(faulted), "a"(old) + : "memory"); + sync_core(); + + if (replaced != old && replaced != new) + faulted = 2; + + return faulted; +} + +notrace int ftrace_update_ftrace_func(ftrace_func_t func) +{ + unsigned long ip = (unsigned long)(&ftrace_call); + unsigned char old[5], *new; + int ret; + + ip += CALL_BACK; + + memcpy(old, &ftrace_call, 5); + new = ftrace_call_replace(ip, (unsigned long)func); + ret = ftrace_modify_code(ip, old, new); + + return ret; +} + +notrace int ftrace_mcount_set(unsigned long *data) +{ + unsigned long ip = (long)(&mcount_call); + unsigned long *addr = data; + unsigned char old[5], *new; + + /* ip is at the location, but modify code will subtact this */ + ip += CALL_BACK; + + /* + * Replace the mcount stub with a pointer to the + * ip recorder function. 
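To show how the arch helpers above are meant to be combined, here is an illustrative sketch of toggling a single recorded mcount call site. toggle_call_site() and rec_ip are hypothetical; the real bookkeeping of call sites lives in the generic kernel/trace/ftrace.c added later in this patch, and ftrace_caller is assumed to be declared via linux/ftrace.h.

#include <linux/ftrace.h>

/* Illustrative only: like the helpers above, this assumes it runs under
 * kstop_machine so no other CPU executes the bytes being rewritten. */
static int toggle_call_site(unsigned long rec_ip, int enable)
{
	unsigned char *nop  = ftrace_nop_replace();
	unsigned char *call = ftrace_call_replace(rec_ip,
						  (unsigned long)ftrace_caller);

	/*
	 * ftrace_modify_code() re-checks the old bytes with a cmpxchg and is
	 * covered by an exception fixup, so a site that faulted or changed
	 * underneath us reports failure instead of being silently corrupted.
	 */
	if (enable)
		return ftrace_modify_code(rec_ip, nop, call);

	return ftrace_modify_code(rec_ip, call, nop);
}
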
+ */ + memcpy(old, &mcount_call, 5); + new = ftrace_call_replace(ip, *addr); + *addr = ftrace_modify_code(ip, old, new); + + return 0; +} + +int __init ftrace_dyn_arch_init(void *data) +{ + const unsigned char *const *noptable = find_nop_table(); + + /* This is running in kstop_machine */ + + ftrace_mcount_set(data); + + ftrace_nop = (unsigned long *)noptable[CALL_BACK]; + + return 0; +} + Index: linux-2.6.24.7/arch/x86/kernel/i386_ksyms_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i386_ksyms_32.c +++ linux-2.6.24.7/arch/x86/kernel/i386_ksyms_32.c @@ -1,9 +1,15 @@ +#include <linux/ftrace.h> #include <linux/module.h> #include <asm/semaphore.h> #include <asm/checksum.h> #include <asm/desc.h> #include <asm/pgtable.h> +#ifdef CONFIG_FTRACE +/* mcount is defined in assembly */ +EXPORT_SYMBOL(mcount); +#endif + EXPORT_SYMBOL(__down_failed); EXPORT_SYMBOL(__down_failed_interruptible); EXPORT_SYMBOL(__down_failed_trylock); Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -318,7 +318,8 @@ EXPORT_SYMBOL(touch_nmi_watchdog); extern void die_nmi(struct pt_regs *, const char *msg); -__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +notrace __kprobes int +nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { /* Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -314,7 +314,8 @@ void touch_nmi_watchdog(void) touch_softlockup_watchdog(); } -int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) +notrace int __kprobes +nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { int sum; int touched = 0; Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -195,7 +195,10 @@ void cpu_idle(void) play_dead(); __get_cpu_var(irq_stat).idle_timestamp = jiffies; + /* Don't trace irqs off for idle */ + stop_critical_timings(); idle(); + start_critical_timings(); } tick_nohz_restart_sched_tick(); preempt_enable_no_resched(); Index: linux-2.6.24.7/arch/x86/kernel/process_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_64.c +++ linux-2.6.24.7/arch/x86/kernel/process_64.c @@ -232,7 +232,10 @@ void cpu_idle (void) */ local_irq_disable(); enter_idle(); + /* Don't trace irqs off for idle */ + stop_critical_timings(); idle(); + start_critical_timings(); /* In many cases the interrupt that ended idle has already called exit_idle. But some idle loops can be woken up without interrupt. */ Index: linux-2.6.24.7/arch/x86/kernel/tsc_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/tsc_32.c +++ linux-2.6.24.7/arch/x86/kernel/tsc_32.c @@ -92,7 +92,7 @@ static inline void set_cyc2ns_scale(unsi /* * Scheduler clock - returns current time in nanosec units. 
*/ -unsigned long long native_sched_clock(void) +unsigned long long notrace native_sched_clock(void) { unsigned long long this_offset; Index: linux-2.6.24.7/arch/x86/kernel/tsc_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/tsc_64.c +++ linux-2.6.24.7/arch/x86/kernel/tsc_64.c @@ -25,12 +25,12 @@ static inline void set_cyc2ns_scale(unsi cyc2ns_scale = (NSEC_PER_MSEC << NS_SCALE) / khz; } -static unsigned long long cycles_2_ns(unsigned long long cyc) +static unsigned long long notrace cycles_2_ns(unsigned long long cyc) { return (cyc * cyc2ns_scale) >> NS_SCALE; } -unsigned long long sched_clock(void) +unsigned long long notrace sched_clock(void) { unsigned long a = 0; Index: linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c @@ -42,7 +42,7 @@ #include <asm/topology.h> #include <asm/vgtod.h> -#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) +#define __vsyscall(nr) __attribute__ ((unused,__section__(".vsyscall_" #nr))) notrace #define __syscall_clobber "r11","rcx","memory" /* Index: linux-2.6.24.7/arch/x86/kernel/x8664_ksyms_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/x8664_ksyms_64.c +++ linux-2.6.24.7/arch/x86/kernel/x8664_ksyms_64.c @@ -1,6 +1,7 @@ /* Exports for assembly files. All C exports should go in the respective C files. */ +#include <linux/ftrace.h> #include <linux/module.h> #include <linux/smp.h> @@ -9,6 +10,11 @@ #include <asm/uaccess.h> #include <asm/pgtable.h> +#ifdef CONFIG_FTRACE +/* mcount is defined in assembly */ +EXPORT_SYMBOL(mcount); +#endif + EXPORT_SYMBOL(kernel_thread); EXPORT_SYMBOL(__down_failed); Index: linux-2.6.24.7/arch/x86/lib/Makefile_32 =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/Makefile_32 +++ linux-2.6.24.7/arch/x86/lib/Makefile_32 @@ -4,7 +4,7 @@ lib-y = checksum_32.o delay_32.o usercopy_32.o getuser_32.o putuser_32.o memcpy_32.o strstr_32.o \ - bitops_32.o semaphore_32.o string_32.o + bitops_32.o semaphore_32.o string_32.o thunk_32.o lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o Index: linux-2.6.24.7/arch/x86/lib/thunk_32.S =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/x86/lib/thunk_32.S @@ -0,0 +1,47 @@ +/* + * Trampoline to trace irqs off. (otherwise CALLER_ADDR1 might crash) + * Copyright 2008 by Steven Rostedt, Red Hat, Inc + * (inspired by Andi Kleen's thunk_64.S) + * Subject to the GNU public license, v.2. No warranty of any kind. 
+ */ + + #include <linux/linkage.h> + +#define ARCH_TRACE_IRQS_ON \ + pushl %eax; \ + pushl %ecx; \ + pushl %edx; \ + call trace_hardirqs_on; \ + popl %edx; \ + popl %ecx; \ + popl %eax; + +#define ARCH_TRACE_IRQS_OFF \ + pushl %eax; \ + pushl %ecx; \ + pushl %edx; \ + call trace_hardirqs_off; \ + popl %edx; \ + popl %ecx; \ + popl %eax; + +#ifdef CONFIG_TRACE_IRQFLAGS + /* put return address in eax (arg1) */ + .macro thunk_ra name,func + .globl \name +\name: + pushl %eax + pushl %ecx + pushl %edx + /* Place EIP in the arg1 */ + movl 3*4(%esp), %eax + call \func + popl %edx + popl %ecx + popl %eax + ret + .endm + + thunk_ra trace_hardirqs_on_thunk,trace_hardirqs_on_caller + thunk_ra trace_hardirqs_off_thunk,trace_hardirqs_off_caller +#endif Index: linux-2.6.24.7/arch/x86/lib/thunk_64.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/thunk_64.S +++ linux-2.6.24.7/arch/x86/lib/thunk_64.S @@ -47,8 +47,22 @@ thunk __up_wakeup,__up #ifdef CONFIG_TRACE_IRQFLAGS - thunk trace_hardirqs_on_thunk,trace_hardirqs_on - thunk trace_hardirqs_off_thunk,trace_hardirqs_off + /* put return address in rdi (arg1) */ + .macro thunk_ra name,func + .globl \name +\name: + CFI_STARTPROC + SAVE_ARGS + /* SAVE_ARGS pushs 9 elements */ + /* the next element would be the rip */ + movq 9*8(%rsp), %rdi + call \func + jmp restore + CFI_ENDPROC + .endm + + thunk_ra trace_hardirqs_on_thunk,trace_hardirqs_on_caller + thunk_ra trace_hardirqs_off_thunk,trace_hardirqs_off_caller #endif #ifdef CONFIG_DEBUG_LOCK_ALLOC Index: linux-2.6.24.7/arch/x86/mm/init_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/init_32.c +++ linux-2.6.24.7/arch/x86/mm/init_32.c @@ -795,7 +795,7 @@ void mark_rodata_ro(void) unsigned long start = PFN_ALIGN(_text); unsigned long size = PFN_ALIGN(_etext) - start; -#ifndef CONFIG_KPROBES +#if !defined(CONFIG_KPROBES) && !defined(CONFIG_DYNAMIC_FTRACE) #ifdef CONFIG_HOTPLUG_CPU /* It must still be possible to apply SMP alternatives. 
*/ if (num_possible_cpus() <= 1) Index: linux-2.6.24.7/arch/x86/mm/init_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/init_64.c +++ linux-2.6.24.7/arch/x86/mm/init_64.c @@ -600,7 +600,7 @@ void mark_rodata_ro(void) start = (unsigned long)_etext; #endif -#ifdef CONFIG_KPROBES +#if defined(CONFIG_KPROBES) || defined(CONFIG_DYNAMIC_FTRACE) start = (unsigned long)__start_rodata; #endif Index: linux-2.6.24.7/arch/x86/vdso/vclock_gettime.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/vdso/vclock_gettime.c +++ linux-2.6.24.7/arch/x86/vdso/vclock_gettime.c @@ -24,7 +24,7 @@ #define gtod vdso_vsyscall_gtod_data -static long vdso_fallback_gettime(long clock, struct timespec *ts) +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; asm("syscall" : "=a" (ret) : @@ -32,7 +32,7 @@ static long vdso_fallback_gettime(long c return ret; } -static inline long vgetns(void) +notrace static inline long vgetns(void) { long v; cycles_t (*vread)(void); @@ -41,7 +41,7 @@ static inline long vgetns(void) return (v * gtod->clock.mult) >> gtod->clock.shift; } -static noinline int do_realtime(struct timespec *ts) +notrace static noinline int do_realtime(struct timespec *ts) { unsigned long seq, ns; do { @@ -55,7 +55,8 @@ static noinline int do_realtime(struct t } /* Copy of the version in kernel/time.c which we cannot directly access */ -static void vset_normalized_timespec(struct timespec *ts, long sec, long nsec) +notrace static void +vset_normalized_timespec(struct timespec *ts, long sec, long nsec) { while (nsec >= NSEC_PER_SEC) { nsec -= NSEC_PER_SEC; @@ -69,7 +70,7 @@ static void vset_normalized_timespec(str ts->tv_nsec = nsec; } -static noinline int do_monotonic(struct timespec *ts) +notrace static noinline int do_monotonic(struct timespec *ts) { unsigned long seq, ns, secs; do { @@ -83,7 +84,7 @@ static noinline int do_monotonic(struct return 0; } -int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) +notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) { if (likely(gtod->sysctl_enabled && gtod->clock.vread)) switch (clock) { @@ -97,7 +98,7 @@ int __vdso_clock_gettime(clockid_t clock int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime"))); -int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) +notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { long ret; if (likely(gtod->sysctl_enabled && gtod->clock.vread)) { Index: linux-2.6.24.7/arch/x86/vdso/vgetcpu.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/vdso/vgetcpu.c +++ linux-2.6.24.7/arch/x86/vdso/vgetcpu.c @@ -13,7 +13,8 @@ #include <asm/vgtod.h> #include "vextern.h" -long __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) +notrace long +__vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused) { unsigned int dummy, p; Index: linux-2.6.24.7/include/asm-x86/alternative_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/alternative_32.h +++ linux-2.6.24.7/include/asm-x86/alternative_32.h @@ -151,4 +151,6 @@ apply_paravirt(struct paravirt_patch_sit extern void text_poke(void *addr, unsigned char *opcode, int len); +const unsigned char *const *find_nop_table(void); + #endif /* _I386_ALTERNATIVE_H */ Index: linux-2.6.24.7/include/asm-x86/alternative_64.h 
=================================================================== --- linux-2.6.24.7.orig/include/asm-x86/alternative_64.h +++ linux-2.6.24.7/include/asm-x86/alternative_64.h @@ -156,4 +156,6 @@ apply_paravirt(struct paravirt_patch *st extern void text_poke(void *addr, unsigned char *opcode, int len); +const unsigned char *const *find_nop_table(void); + #endif /* _X86_64_ALTERNATIVE_H */ Index: linux-2.6.24.7/include/asm-x86/asm.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-x86/asm.h @@ -0,0 +1,39 @@ +#ifndef _ASM_X86_ASM_H +#define _ASM_X86_ASM_H + +#ifdef CONFIG_X86_32 +/* 32 bits */ + +# define _ASM_PTR " .long " +# define _ASM_ALIGN " .balign 4 " +# define _ASM_MOV_UL " movl " + +# define _ASM_INC " incl " +# define _ASM_DEC " decl " +# define _ASM_ADD " addl " +# define _ASM_SUB " subl " +# define _ASM_XADD " xaddl " + +#else +/* 64 bits */ + +# define _ASM_PTR " .quad " +# define _ASM_ALIGN " .balign 8 " +# define _ASM_MOV_UL " movq " + +# define _ASM_INC " incq " +# define _ASM_DEC " decq " +# define _ASM_ADD " addq " +# define _ASM_SUB " subq " +# define _ASM_XADD " xaddq " + +#endif /* CONFIG_X86_32 */ + +/* Exception table entry */ +# define _ASM_EXTABLE(from,to) \ + " .section __ex_table,\"a\"\n" \ + _ASM_ALIGN "\n" \ + _ASM_PTR #from "," #to "\n" \ + " .previous\n" + +#endif /* _ASM_X86_ASM_H */ Index: linux-2.6.24.7/include/asm-x86/irqflags_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/irqflags_32.h +++ linux-2.6.24.7/include/asm-x86/irqflags_32.h @@ -157,25 +157,8 @@ static inline void trace_hardirqs_fixup( * C function, so save all the C-clobbered registers: */ #ifdef CONFIG_TRACE_IRQFLAGS - -# define TRACE_IRQS_ON \ - pushl %eax; \ - pushl %ecx; \ - pushl %edx; \ - call trace_hardirqs_on; \ - popl %edx; \ - popl %ecx; \ - popl %eax; - -# define TRACE_IRQS_OFF \ - pushl %eax; \ - pushl %ecx; \ - pushl %edx; \ - call trace_hardirqs_off; \ - popl %edx; \ - popl %ecx; \ - popl %eax; - +# define TRACE_IRQS_ON call trace_hardirqs_on_thunk; +# define TRACE_IRQS_OFF call trace_hardirqs_off_thunk; #else # define TRACE_IRQS_ON # define TRACE_IRQS_OFF Index: linux-2.6.24.7/include/asm-x86/vsyscall.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/vsyscall.h +++ linux-2.6.24.7/include/asm-x86/vsyscall.h @@ -24,7 +24,7 @@ enum vsyscall_num { ((unused, __section__ (".vsyscall_gtod_data"),aligned(16))) #define __section_vsyscall_clock __attribute__ \ ((unused, __section__ (".vsyscall_clock"),aligned(16))) -#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn"))) +#define __vsyscall_fn __attribute__ ((unused,__section__(".vsyscall_fn"))) notrace #define VGETCPU_RDTSCP 1 #define VGETCPU_LSL 2 Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/ftrace.h @@ -0,0 +1,133 @@ +#ifndef _LINUX_FTRACE_H +#define _LINUX_FTRACE_H + +#ifdef CONFIG_FTRACE + +#include <linux/linkage.h> +#include <linux/fs.h> + +extern int ftrace_enabled; +extern int +ftrace_enable_sysctl(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos); + +typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip); + +struct ftrace_ops { + ftrace_func_t func; + struct ftrace_ops *next; +}; + +/* + * The ftrace_ops must be a static and should also + 
* be read_mostly. These functions do modify read_mostly variables + * so use them sparely. Never free an ftrace_op or modify the + * next pointer after it has been registered. Even after unregistering + * it, the next pointer may still be used internally. + */ +int register_ftrace_function(struct ftrace_ops *ops); +int unregister_ftrace_function(struct ftrace_ops *ops); +void clear_ftrace_function(void); + +extern void ftrace_stub(unsigned long a0, unsigned long a1); +extern void mcount(void); + +#else /* !CONFIG_FTRACE */ +# define register_ftrace_function(ops) do { } while (0) +# define unregister_ftrace_function(ops) do { } while (0) +# define clear_ftrace_function(ops) do { } while (0) +#endif /* CONFIG_FTRACE */ + +#ifdef CONFIG_DYNAMIC_FTRACE +# define FTRACE_HASHBITS 10 +# define FTRACE_HASHSIZE (1<<FTRACE_HASHBITS) + +enum { + FTRACE_FL_FREE = (1 << 0), + FTRACE_FL_FAILED = (1 << 1), + FTRACE_FL_FILTER = (1 << 2), + FTRACE_FL_ENABLED = (1 << 3), + FTRACE_FL_NOTRACE = (1 << 4), +}; + +struct dyn_ftrace { + struct hlist_node node; + unsigned long ip; + unsigned long flags; +}; + +int ftrace_force_update(void); +void ftrace_set_filter(unsigned char *buf, int len, int reset); + +/* defined in arch */ +extern int ftrace_ip_converted(unsigned long ip); +extern unsigned char *ftrace_nop_replace(void); +extern unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr); +extern int ftrace_dyn_arch_init(void *data); +extern int ftrace_mcount_set(unsigned long *data); +extern int ftrace_modify_code(unsigned long ip, unsigned char *old_code, + unsigned char *new_code); +extern int ftrace_update_ftrace_func(ftrace_func_t func); +extern void ftrace_caller(void); +extern void ftrace_call(void); +extern void mcount_call(void); +#else +# define ftrace_force_update() ({ 0; }) +# define ftrace_set_filter(buf, len, reset) do { } while (0) +#endif + +/* totally disable ftrace - can not re-enable after this */ +void ftrace_kill(void); + +static inline void tracer_disable(void) +{ +#ifdef CONFIG_FTRACE + ftrace_enabled = 0; +#endif +} + +#ifdef CONFIG_FRAME_POINTER +/* TODO: need to fix this for ARM */ +# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0)) +# define CALLER_ADDR1 ((unsigned long)__builtin_return_address(1)) +# define CALLER_ADDR2 ((unsigned long)__builtin_return_address(2)) +# define CALLER_ADDR3 ((unsigned long)__builtin_return_address(3)) +# define CALLER_ADDR4 ((unsigned long)__builtin_return_address(4)) +# define CALLER_ADDR5 ((unsigned long)__builtin_return_address(5)) +# define CALLER_ADDR6 ((unsigned long)__builtin_return_address(6)) +#else +# define CALLER_ADDR0 ((unsigned long)__builtin_return_address(0)) +# define CALLER_ADDR1 0UL +# define CALLER_ADDR2 0UL +# define CALLER_ADDR3 0UL +# define CALLER_ADDR4 0UL +# define CALLER_ADDR5 0UL +# define CALLER_ADDR6 0UL +#endif + +#ifdef CONFIG_IRQSOFF_TRACER + extern void time_hardirqs_on(unsigned long a0, unsigned long a1); + extern void time_hardirqs_off(unsigned long a0, unsigned long a1); +#else +# define time_hardirqs_on(a0, a1) do { } while (0) +# define time_hardirqs_off(a0, a1) do { } while (0) +#endif + +#ifdef CONFIG_PREEMPT_TRACER + extern void trace_preempt_on(unsigned long a0, unsigned long a1); + extern void trace_preempt_off(unsigned long a0, unsigned long a1); +#else +# define trace_preempt_on(a0, a1) do { } while (0) +# define trace_preempt_off(a0, a1) do { } while (0) +#endif + +#ifdef CONFIG_TRACING +extern void +ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3); 
+#else +static inline void +ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3) { } +#endif + +#endif /* _LINUX_FTRACE_H */ Index: linux-2.6.24.7/include/linux/irqflags.h =================================================================== --- linux-2.6.24.7.orig/include/linux/irqflags.h +++ linux-2.6.24.7/include/linux/irqflags.h @@ -41,6 +41,15 @@ # define INIT_TRACE_IRQFLAGS #endif +#if defined(CONFIG_IRQSOFF_TRACER) || \ + defined(CONFIG_PREEMPT_TRACER) + extern void stop_critical_timings(void); + extern void start_critical_timings(void); +#else +# define stop_critical_timings() do { } while (0) +# define start_critical_timings() do { } while (0) +#endif + #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT #include <asm/irqflags.h> Index: linux-2.6.24.7/include/linux/ktime.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ktime.h +++ linux-2.6.24.7/include/linux/ktime.h @@ -326,4 +326,10 @@ extern void ktime_get_ts(struct timespec /* Get the real (wall-) time in timespec format: */ #define ktime_get_real_ts(ts) getnstimeofday(ts) +static inline ktime_t ns_to_ktime(u64 ns) +{ + static const ktime_t ktime_zero = { .tv64 = 0 }; + return ktime_add_ns(ktime_zero, ns); +} + #endif Index: linux-2.6.24.7/include/linux/linkage.h =================================================================== --- linux-2.6.24.7.orig/include/linux/linkage.h +++ linux-2.6.24.7/include/linux/linkage.h @@ -3,6 +3,8 @@ #include <asm/linkage.h> +#define notrace __attribute__((no_instrument_function)) + #ifdef __cplusplus #define CPP_ASMLINKAGE extern "C" #else Index: linux-2.6.24.7/include/linux/mmiotrace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/mmiotrace.h @@ -0,0 +1,85 @@ +#ifndef MMIOTRACE_H +#define MMIOTRACE_H + +#include <linux/types.h> +#include <linux/list.h> + +struct kmmio_probe; +struct pt_regs; + +typedef void (*kmmio_pre_handler_t)(struct kmmio_probe *, + struct pt_regs *, unsigned long addr); +typedef void (*kmmio_post_handler_t)(struct kmmio_probe *, + unsigned long condition, struct pt_regs *); + +struct kmmio_probe { + struct list_head list; /* kmmio internal list */ + unsigned long addr; /* start location of the probe point */ + unsigned long len; /* length of the probe region */ + kmmio_pre_handler_t pre_handler; /* Called before addr is executed. */ + kmmio_post_handler_t post_handler; /* Called after addr is executed */ + void *private; +}; + +/* kmmio is active by some kmmio_probes? */ +static inline int is_kmmio_active(void) +{ + extern unsigned int kmmio_count; + return kmmio_count; +} + +extern int register_kmmio_probe(struct kmmio_probe *p); +extern void unregister_kmmio_probe(struct kmmio_probe *p); + +/* Called from page fault handler. 
*/ +extern int kmmio_handler(struct pt_regs *regs, unsigned long addr); + +/* Called from ioremap.c */ +#ifdef CONFIG_MMIOTRACE +extern void mmiotrace_ioremap(resource_size_t offset, unsigned long size, + void __iomem *addr); +extern void mmiotrace_iounmap(volatile void __iomem *addr); +#else +static inline void mmiotrace_ioremap(resource_size_t offset, + unsigned long size, void __iomem *addr) +{ +} + +static inline void mmiotrace_iounmap(volatile void __iomem *addr) +{ +} +#endif /* CONFIG_MMIOTRACE_HOOKS */ + +enum mm_io_opcode { + MMIO_READ = 0x1, /* struct mmiotrace_rw */ + MMIO_WRITE = 0x2, /* struct mmiotrace_rw */ + MMIO_PROBE = 0x3, /* struct mmiotrace_map */ + MMIO_UNPROBE = 0x4, /* struct mmiotrace_map */ + MMIO_MARKER = 0x5, /* raw char data */ + MMIO_UNKNOWN_OP = 0x6, /* struct mmiotrace_rw */ +}; + +struct mmiotrace_rw { + resource_size_t phys; /* PCI address of register */ + unsigned long value; + unsigned long pc; /* optional program counter */ + int map_id; + unsigned char opcode; /* one of MMIO_{READ,WRITE,UNKNOWN_OP} */ + unsigned char width; /* size of register access in bytes */ +}; + +struct mmiotrace_map { + resource_size_t phys; /* base address in PCI space */ + unsigned long virt; /* base virtual address */ + unsigned long len; /* mapping size */ + int map_id; + unsigned char opcode; /* MMIO_PROBE or MMIO_UNPROBE */ +}; + +/* in kernel/trace/trace_mmiotrace.c */ +extern void enable_mmiotrace(void); +extern void disable_mmiotrace(void); +extern void mmio_trace_rw(struct mmiotrace_rw *rw); +extern void mmio_trace_mapping(struct mmiotrace_map *map); + +#endif /* MMIOTRACE_H */ Index: linux-2.6.24.7/include/linux/preempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/preempt.h +++ linux-2.6.24.7/include/linux/preempt.h @@ -10,7 +10,7 @@ #include <linux/linkage.h> #include <linux/list.h> -#ifdef CONFIG_DEBUG_PREEMPT +#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) extern void fastcall add_preempt_count(int val); extern void fastcall sub_preempt_count(int val); #else @@ -52,6 +52,34 @@ do { \ preempt_check_resched(); \ } while (0) +/* For debugging and tracer internals only! 
*/ +#define add_preempt_count_notrace(val) \ + do { preempt_count() += (val); } while (0) +#define sub_preempt_count_notrace(val) \ + do { preempt_count() -= (val); } while (0) +#define inc_preempt_count_notrace() add_preempt_count_notrace(1) +#define dec_preempt_count_notrace() sub_preempt_count_notrace(1) + +#define preempt_disable_notrace() \ +do { \ + inc_preempt_count_notrace(); \ + barrier(); \ +} while (0) + +#define preempt_enable_no_resched_notrace() \ +do { \ + barrier(); \ + dec_preempt_count_notrace(); \ +} while (0) + +/* preempt_check_resched is OK to trace */ +#define preempt_enable_notrace() \ +do { \ + preempt_enable_no_resched_notrace(); \ + barrier(); \ + preempt_check_resched(); \ +} while (0) + #else #define preempt_disable() do { } while (0) @@ -59,6 +87,10 @@ do { \ #define preempt_enable() do { } while (0) #define preempt_check_resched() do { } while (0) +#define preempt_disable_notrace() do { } while (0) +#define preempt_enable_no_resched_notrace() do { } while (0) +#define preempt_enable_notrace() do { } while (0) + #endif #ifdef CONFIG_PREEMPT_NOTIFIERS Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -220,6 +220,8 @@ extern void sched_init_smp(void); extern void init_idle(struct task_struct *idle, int cpu); extern void init_idle_bootup_task(struct task_struct *idle); +extern int runqueue_is_locked(void); + extern cpumask_t nohz_cpu_mask; #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ) extern int select_nohz_load_balancer(int cpu); @@ -1944,6 +1946,18 @@ static inline void arch_pick_mmap_layout } #endif +#ifdef CONFIG_TRACING +extern void +__trace_special(void *__tr, void *__data, + unsigned long arg1, unsigned long arg2, unsigned long arg3); +#else +static inline void +__trace_special(void *__tr, void *__data, + unsigned long arg1, unsigned long arg2, unsigned long arg3) +{ +} +#endif + extern long sched_setaffinity(pid_t pid, cpumask_t new_mask); extern long sched_getaffinity(pid_t pid, cpumask_t *mask); @@ -2009,6 +2023,8 @@ static inline void migration_init(void) } #endif +#define TASK_STATE_TO_CHAR_STR "RSDTtZX" + #endif /* __KERNEL__ */ #endif Index: linux-2.6.24.7/include/linux/writeback.h =================================================================== --- linux-2.6.24.7.orig/include/linux/writeback.h +++ linux-2.6.24.7/include/linux/writeback.h @@ -103,6 +103,8 @@ extern int dirty_expire_interval; extern int block_dump; extern int laptop_mode; +extern unsigned long determine_dirtyable_memory(void); + extern int dirty_ratio_handler(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -11,6 +11,18 @@ obj-y = sched.o fork.o exec_domain.o hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ utsname.o notifier.o +CFLAGS_REMOVE_sched.o = -pg -mno-spe + +ifdef CONFIG_FTRACE +# Do not trace debug files and internal ftrace files +CFLAGS_REMOVE_lockdep.o = -pg +CFLAGS_REMOVE_lockdep_proc.o = -pg +CFLAGS_REMOVE_mutex-debug.o = -pg +CFLAGS_REMOVE_rtmutex-debug.o = -pg +CFLAGS_REMOVE_cgroup-debug.o = -pg +CFLAGS_REMOVE_sched_clock.o = -pg +endif + obj-$(CONFIG_SYSCTL) += sysctl_check.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += time/ @@ -57,6 +69,8 @@ obj-$(CONFIG_SYSCTL) += 
utsname_sysctl.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o obj-$(CONFIG_MARKERS) += marker.o +obj-$(CONFIG_FTRACE) += trace/ +obj-$(CONFIG_TRACING) += trace/ obj-$(CONFIG_SMP) += sched_cpupri.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1010,7 +1010,7 @@ static struct task_struct *copy_process( rt_mutex_init_task(p); -#ifdef CONFIG_TRACE_IRQFLAGS +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_LOCKDEP) DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled); #endif Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -39,6 +39,7 @@ #include <linux/irqflags.h> #include <linux/utsname.h> #include <linux/hash.h> +#include <linux/ftrace.h> #include <asm/sections.h> @@ -81,6 +82,8 @@ static int graph_lock(void) __raw_spin_unlock(&lockdep_lock); return 0; } + /* prevent any recursions within lockdep from causing deadlocks */ + current->lockdep_recursion++; return 1; } @@ -89,6 +92,7 @@ static inline int graph_unlock(void) if (debug_locks && !__raw_spin_is_locked(&lockdep_lock)) return DEBUG_LOCKS_WARN_ON(1); + current->lockdep_recursion--; __raw_spin_unlock(&lockdep_lock); return 0; } @@ -978,7 +982,7 @@ check_noncircular(struct lock_class *sou return 1; } -#ifdef CONFIG_TRACE_IRQFLAGS +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_PROVE_LOCKING) /* * Forwards and backwards subgraph searching, for the purposes of * proving that two subgraphs can be connected by a new dependency @@ -1676,7 +1680,7 @@ valid_state(struct task_struct *curr, st static int mark_lock(struct task_struct *curr, struct held_lock *this, enum lock_usage_bit new_bit); -#ifdef CONFIG_TRACE_IRQFLAGS +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_PROVE_LOCKING) /* * print irq inversion bug: @@ -2009,11 +2013,12 @@ void early_boot_irqs_on(void) /* * Hardirqs will be enabled: */ -void trace_hardirqs_on(void) +void trace_hardirqs_on_caller(unsigned long a0) { struct task_struct *curr = current; unsigned long ip; + time_hardirqs_on(CALLER_ADDR0, a0); if (unlikely(!debug_locks || current->lockdep_recursion)) return; @@ -2051,16 +2056,23 @@ void trace_hardirqs_on(void) curr->hardirq_enable_event = ++curr->irq_events; debug_atomic_inc(&hardirqs_on_events); } +EXPORT_SYMBOL(trace_hardirqs_on_caller); +void trace_hardirqs_on(void) +{ + trace_hardirqs_on_caller(CALLER_ADDR0); +} EXPORT_SYMBOL(trace_hardirqs_on); /* * Hardirqs were disabled: */ -void trace_hardirqs_off(void) +void trace_hardirqs_off_caller(unsigned long a0) { struct task_struct *curr = current; + time_hardirqs_off(CALLER_ADDR0, a0); + if (unlikely(!debug_locks || current->lockdep_recursion)) return; @@ -2078,7 +2090,12 @@ void trace_hardirqs_off(void) } else debug_atomic_inc(&redundant_hardirqs_off); } +EXPORT_SYMBOL(trace_hardirqs_off_caller); +void trace_hardirqs_off(void) +{ + trace_hardirqs_off_caller(CALLER_ADDR0); +} EXPORT_SYMBOL(trace_hardirqs_off); /* Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -65,6 +65,7 @@ #include <linux/reciprocal_div.h> #include <linux/unistd.h> #include <linux/pagemap.h> +#include 
<linux/ftrace.h> #include <asm/tlb.h> #include <asm/irq_regs.h> @@ -421,6 +422,8 @@ static inline int cpu_of(struct rq *rq) #endif } +#include "sched_trace.h" + /* * Update the per-runqueue clock, as finegrained as the platform can give * us, but without assuming monotonicity, etc.: @@ -492,6 +495,24 @@ static void update_rq_clock(struct rq *r # define const_debug static const #endif +/** + * runqueue_is_locked + * + * Returns true if the current cpu runqueue is locked. + * This interface allows printk to be called with the runqueue lock + * held and know whether or not it is OK to wake up the klogd. + */ +int runqueue_is_locked(void) +{ + int cpu = get_cpu(); + struct rq *rq = cpu_rq(cpu); + int ret; + + ret = spin_is_locked(&rq->lock); + put_cpu(); + return ret; +} + /* * Debugging: various feature bits */ @@ -522,7 +543,7 @@ const_debug unsigned int sysctl_sched_nr * For kernel-internal use: high-speed (but slightly incorrect) per-cpu * clock constructed from sched_clock(): */ -unsigned long long cpu_clock(int cpu) +unsigned long long notrace cpu_clock(int cpu) { unsigned long long now; unsigned long flags; @@ -1287,6 +1308,7 @@ void wait_task_inactive(struct task_stru * just go back and repeat. */ rq = task_rq_lock(p, &flags); + trace_kernel_sched_wait(p); running = task_running(rq, p); on_rq = p->se.on_rq; task_rq_unlock(rq, &flags); @@ -1627,6 +1649,7 @@ out_activate: success = 1; out_running: + trace_kernel_sched_wakeup(rq, p); p->state = TASK_RUNNING; #ifdef CONFIG_SMP if (p->sched_class->task_wake_up) @@ -1753,6 +1776,7 @@ void fastcall wake_up_new_task(struct ta p->sched_class->task_new(rq, p); inc_nr_running(p, rq); } + trace_kernel_sched_wakeup_new(rq, p); check_preempt_curr(rq, p); #ifdef CONFIG_SMP if (p->sched_class->task_wake_up) @@ -1925,6 +1949,8 @@ context_switch(struct rq *rq, struct tas struct mm_struct *mm, *oldmm; prepare_task_switch(rq, prev, next); + + trace_kernel_sched_switch(rq, prev, next); mm = next->mm; oldmm = prev->active_mm; /* @@ -2157,6 +2183,7 @@ static void sched_migrate_task(struct ta || unlikely(cpu_is_offline(dest_cpu))) goto out; + trace_kernel_sched_migrate_task(p, cpu_of(rq), dest_cpu); /* force the process onto the specified CPU */ if (migrate_task(p, dest_cpu, &req)) { /* Need to wait for migration thread (might exit: take ref). */ @@ -3495,26 +3522,44 @@ void scheduler_tick(void) #endif } -#if defined(CONFIG_PREEMPT) && defined(CONFIG_DEBUG_PREEMPT) +#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \ + defined(CONFIG_PREEMPT_TRACER)) + +static inline unsigned long get_parent_ip(unsigned long addr) +{ + if (in_lock_functions(addr)) { + addr = CALLER_ADDR2; + if (in_lock_functions(addr)) + addr = CALLER_ADDR3; + } + return addr; +} void fastcall add_preempt_count(int val) { +#ifdef CONFIG_DEBUG_PREEMPT /* * Underflow? */ if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0))) return; +#endif preempt_count() += val; +#ifdef CONFIG_DEBUG_PREEMPT /* * Spinlock count overflowing soon? */ DEBUG_LOCKS_WARN_ON((preempt_count() & PREEMPT_MASK) >= PREEMPT_MASK - 10); +#endif + if (preempt_count() == val) + trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); } EXPORT_SYMBOL(add_preempt_count); void fastcall sub_preempt_count(int val) { +#ifdef CONFIG_DEBUG_PREEMPT /* * Underflow? 
*/ @@ -3526,7 +3571,10 @@ void fastcall sub_preempt_count(int val) if (DEBUG_LOCKS_WARN_ON((val < PREEMPT_MASK) && !(preempt_count() & PREEMPT_MASK))) return; +#endif + if (preempt_count() == val) + trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); preempt_count() -= val; } EXPORT_SYMBOL(sub_preempt_count); @@ -4869,7 +4917,7 @@ out_unlock: return retval; } -static const char stat_nam[] = "RSDTtZX"; +static const char stat_nam[] = TASK_STATE_TO_CHAR_STR; static void show_task(struct task_struct *p) { Index: linux-2.6.24.7/kernel/sched_trace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/sched_trace.h @@ -0,0 +1,41 @@ +#include <linux/marker.h> + +static inline void trace_kernel_sched_wait(struct task_struct *p) +{ + trace_mark(kernel_sched_wait_task, "pid %d state %ld", + p->pid, p->state); +} + +static inline +void trace_kernel_sched_wakeup(struct rq *rq, struct task_struct *p) +{ + trace_mark(kernel_sched_wakeup, + "pid %d state %ld ## rq %p task %p rq->curr %p", + p->pid, p->state, rq, p, rq->curr); +} + +static inline +void trace_kernel_sched_wakeup_new(struct rq *rq, struct task_struct *p) +{ + trace_mark(kernel_sched_wakeup_new, + "pid %d state %ld ## rq %p task %p rq->curr %p", + p->pid, p->state, rq, p, rq->curr); +} + +static inline void trace_kernel_sched_switch(struct rq *rq, + struct task_struct *prev, struct task_struct *next) +{ + trace_mark(kernel_sched_schedule, + "prev_pid %d next_pid %d prev_state %ld " + "## rq %p prev %p next %p", + prev->pid, next->pid, prev->state, + rq, prev, next); +} + +static inline void +trace_kernel_sched_migrate_task(struct task_struct *p, int src, int dst) +{ + trace_mark(kernel_sched_migrate_task, + "pid %d state %ld dest_cpu %d", + p->pid, p->state, dst); +} Index: linux-2.6.24.7/kernel/sysctl.c =================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -46,6 +46,7 @@ #include <linux/nfs_fs.h> #include <linux/acpi.h> #include <linux/reboot.h> +#include <linux/ftrace.h> #include <asm/uaccess.h> #include <asm/processor.h> @@ -470,6 +471,16 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, +#ifdef CONFIG_FTRACE + { + .ctl_name = CTL_UNNUMBERED, + .procname = "ftrace_enabled", + .data = &ftrace_enabled, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &ftrace_enable_sysctl, + }, +#endif #ifdef CONFIG_KMOD { .ctl_name = KERN_MODPROBE, Index: linux-2.6.24.7/kernel/trace/Kconfig =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/Kconfig @@ -0,0 +1,127 @@ +# +# Architectures that offer an FTRACE implementation should select HAVE_FTRACE: +# +config HAVE_FTRACE + bool + +config HAVE_DYNAMIC_FTRACE + bool + +config TRACER_MAX_TRACE + bool + +config TRACING + bool + select DEBUG_FS + select STACKTRACE + +config FTRACE + bool "Kernel Function Tracer" + depends on HAVE_FTRACE + select FRAME_POINTER + select TRACING + select CONTEXT_SWITCH_TRACER + help + Enable the kernel to trace every kernel function. This is done + by using a compiler feature to insert a small, 5-byte No-Operation + instruction to the beginning of every kernel function, which NOP + sequence is then dynamically patched into a tracer call when + tracing is enabled by the administrator. 
If it's runtime disabled + (the bootup default), then the overhead of the instructions is very + small and not measurable even in micro-benchmarks. + +config IRQSOFF_TRACER + bool "Interrupts-off Latency Tracer" + default n + depends on TRACE_IRQFLAGS_SUPPORT + depends on GENERIC_TIME + depends on HAVE_FTRACE + select TRACE_IRQFLAGS + select TRACING + select TRACER_MAX_TRACE + help + This option measures the time spent in irqs-off critical + sections, with microsecond accuracy. + + The default measurement method is a maximum search, which is + disabled by default and can be runtime (re-)started + via: + + echo 0 > /debugfs/tracing/tracing_max_latency + + (Note that kernel size and overhead increases with this option + enabled. This option and the preempt-off timing option can be + used together or separately.) + +config PREEMPT_TRACER + bool "Preemption-off Latency Tracer" + default n + depends on GENERIC_TIME + depends on PREEMPT + depends on HAVE_FTRACE + select TRACING + select TRACER_MAX_TRACE + help + This option measures the time spent in preemption off critical + sections, with microsecond accuracy. + + The default measurement method is a maximum search, which is + disabled by default and can be runtime (re-)started + via: + + echo 0 > /debugfs/tracing/tracing_max_latency + + (Note that kernel size and overhead increases with this option + enabled. This option and the irqs-off timing option can be + used together or separately.) + +config SCHED_TRACER + bool "Scheduling Latency Tracer" + depends on HAVE_FTRACE + select TRACING + select CONTEXT_SWITCH_TRACER + select TRACER_MAX_TRACE + help + This tracer tracks the latency of the highest priority task + to be scheduled in, starting from the point it has woken up. + +config CONTEXT_SWITCH_TRACER + bool "Trace process context switches" + depends on HAVE_FTRACE + select TRACING + select MARKERS + help + This tracer gets called from the context switch and records + all switching of tasks. + +config DYNAMIC_FTRACE + bool "enable/disable ftrace tracepoints dynamically" + depends on FTRACE + depends on HAVE_DYNAMIC_FTRACE + default y + help + This option will modify all the calls to ftrace dynamically + (will patch them out of the binary image and replaces them + with a No-Op instruction) as they are called. A table is + created to dynamically enable them again. + + This way a CONFIG_FTRACE kernel is slightly larger, but otherwise + has native performance as long as no tracing is active. + + The changes to the code are done by a kernel thread that + wakes up once a second and checks to see if any ftrace calls + were made. If so, it runs stop_machine (stops all CPUS) + and modifies the code to jump over the call to ftrace. + +config FTRACE_SELFTEST + bool + +config FTRACE_STARTUP_TEST + bool "Perform a startup test on ftrace" + depends on TRACING + select FTRACE_SELFTEST + help + This option performs a series of startup tests on ftrace. On bootup + a series of tests are made to verify that the tracer is + functioning properly. It will do tests on all the configured + tracers of ftrace. 
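The DYNAMIC_FTRACE help text above summarizes the record-then-patch cycle implemented earlier in this patch (ftrace_record_ip(), __ftrace_update_code() and the ftraced kernel thread): each traced function first falls through mcount, its call-site address is stashed in a hash table, and a thread that wakes once a second later converts every recorded site in one stop_machine pass. The standalone C program below is only an illustrative sketch of that flow under simplified assumptions — it runs in userspace, uses a fake patch_nop() in place of ftrace_modify_code(), and omits the locking, per-cpu recursion guards and stop_machine machinery; the names record_ip, convert_recorded, dyn_site and pool are invented for the example and are not part of the patch.

#include <stdio.h>

#define HASHBITS   4
#define HASHSIZE   (1 << HASHBITS)
#define MAX_SITES  64

struct dyn_site {
	unsigned long ip;      /* recorded call-site address */
	int converted;         /* already patched to a NOP?  */
	struct dyn_site *next; /* hash-bucket chaining       */
};

static struct dyn_site pool[MAX_SITES];
static unsigned int pool_used;
static struct dyn_site *hash[HASHSIZE];

static unsigned int hash_ip(unsigned long ip)
{
	return (unsigned int)((ip >> 4) & (HASHSIZE - 1));
}

/* Stand-in for ftrace_modify_code(): here we only report the change. */
static void patch_nop(unsigned long ip)
{
	printf("patching call at %#lx to NOP\n", ip);
}

/* Called from the (simulated) mcount stub: remember this call site. */
static void record_ip(unsigned long ip)
{
	unsigned int key = hash_ip(ip);
	struct dyn_site *s;

	for (s = hash[key]; s; s = s->next)
		if (s->ip == ip)
			return;		/* already recorded */

	if (pool_used == MAX_SITES)
		return;			/* table full, drop it */

	s = &pool[pool_used++];
	s->ip = ip;
	s->converted = 0;
	s->next = hash[key];
	hash[key] = s;
}

/* What the once-a-second daemon does while all CPUs are stopped:
 * walk the recorded sites and convert each one exactly once. */
static void convert_recorded(void)
{
	unsigned int i;

	for (i = 0; i < pool_used; i++) {
		if (!pool[i].converted) {
			patch_nop(pool[i].ip);
			pool[i].converted = 1;
		}
	}
}

int main(void)
{
	/* pretend three mcount hits came in, one of them a duplicate */
	record_ip(0xc0100010UL);
	record_ip(0xc0100450UL);
	record_ip(0xc0100010UL);	/* duplicate hit, ignored */

	convert_recorded();		/* one batch pass patches both sites */
	return 0;
}

In the patch itself the conversion step is ftrace_code_disable() run via stop_machine_run(), and when a tracer is registered ftrace_replace_code() swaps the NOP back into a call to ftrace_caller, which is the asymmetry the "slightly larger, but otherwise has native performance" wording in the help text refers to.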
Index: linux-2.6.24.7/kernel/trace/Makefile =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/Makefile @@ -0,0 +1,23 @@ + +# Do not instrument the tracer itself: + +ifdef CONFIG_FTRACE +ORIG_CFLAGS := $(KBUILD_CFLAGS) +KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS)) + +# selftest needs instrumentation +CFLAGS_trace_selftest_dynamic.o = -pg +obj-y += trace_selftest_dynamic.o +endif + +obj-$(CONFIG_FTRACE) += libftrace.o + +obj-$(CONFIG_TRACING) += trace.o +obj-$(CONFIG_CONTEXT_SWITCH_TRACER) += trace_sched_switch.o +obj-$(CONFIG_FTRACE) += trace_functions.o +obj-$(CONFIG_IRQSOFF_TRACER) += trace_irqsoff.o +obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o +obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o +obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o + +libftrace-y := ftrace.o Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -0,0 +1,1488 @@ +/* + * Infrastructure for profiling code inserted by 'gcc -pg'. + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * Copyright (C) 2004-2008 Ingo Molnar <mingo@redhat.com> + * + * Originally ported from the -rt patch by: + * Copyright (C) 2007 Arnaldo Carvalho de Melo <acme@redhat.com> + * + * Based on code in the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ + +#include <linux/stop_machine.h> +#include <linux/clocksource.h> +#include <linux/kallsyms.h> +#include <linux/seq_file.h> +#include <linux/debugfs.h> +#include <linux/hardirq.h> +#include <linux/kthread.h> +#include <linux/uaccess.h> +#include <linux/ftrace.h> +#include <linux/sysctl.h> +#include <linux/ctype.h> +#include <linux/hash.h> +#include <linux/list.h> + +#include "trace.h" + +/* ftrace_enabled is a method to turn ftrace on or off */ +int ftrace_enabled __read_mostly; +static int last_ftrace_enabled; + +/* + * ftrace_disabled is set when an anomaly is discovered. + * ftrace_disabled is much stronger than ftrace_enabled. + */ +static int ftrace_disabled __read_mostly; + +static DEFINE_SPINLOCK(ftrace_lock); +static DEFINE_MUTEX(ftrace_sysctl_lock); + +static struct ftrace_ops ftrace_list_end __read_mostly = +{ + .func = ftrace_stub, +}; + +static struct ftrace_ops *ftrace_list __read_mostly = &ftrace_list_end; +ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub; + +void ftrace_list_func(unsigned long ip, unsigned long parent_ip) +{ + struct ftrace_ops *op = ftrace_list; + + /* in case someone actually ports this to alpha! */ + read_barrier_depends(); + + while (op != &ftrace_list_end) { + /* silly alpha */ + read_barrier_depends(); + op->func(ip, parent_ip); + op = op->next; + }; +} + +/** + * clear_ftrace_function - reset the ftrace function + * + * This NULLs the ftrace function and in essence stops + * tracing. There may be lag + */ +void clear_ftrace_function(void) +{ + ftrace_trace_function = ftrace_stub; +} + +static int __register_ftrace_function(struct ftrace_ops *ops) +{ + /* Should never be called by interrupts */ + spin_lock(&ftrace_lock); + + ops->next = ftrace_list; + /* + * We are entering ops into the ftrace_list but another + * CPU might be walking that list. We need to make sure + * the ops->next pointer is valid before another CPU sees + * the ops pointer included into the ftrace_list. 
+ */ + smp_wmb(); + ftrace_list = ops; + + if (ftrace_enabled) { + /* + * For one func, simply call it directly. + * For more than one func, call the chain. + */ + if (ops->next == &ftrace_list_end) + ftrace_trace_function = ops->func; + else + ftrace_trace_function = ftrace_list_func; + } + + spin_unlock(&ftrace_lock); + + return 0; +} + +static int __unregister_ftrace_function(struct ftrace_ops *ops) +{ + struct ftrace_ops **p; + int ret = 0; + + spin_lock(&ftrace_lock); + + /* + * If we are removing the last function, then simply point + * to the ftrace_stub. + */ + if (ftrace_list == ops && ops->next == &ftrace_list_end) { + ftrace_trace_function = ftrace_stub; + ftrace_list = &ftrace_list_end; + goto out; + } + + for (p = &ftrace_list; *p != &ftrace_list_end; p = &(*p)->next) + if (*p == ops) + break; + + if (*p != ops) { + ret = -1; + goto out; + } + + *p = (*p)->next; + + if (ftrace_enabled) { + /* If we only have one func left, then call that directly */ + if (ftrace_list == &ftrace_list_end || + ftrace_list->next == &ftrace_list_end) + ftrace_trace_function = ftrace_list->func; + } + + out: + spin_unlock(&ftrace_lock); + + return ret; +} + +#ifdef CONFIG_DYNAMIC_FTRACE + +static struct task_struct *ftraced_task; +static DECLARE_WAIT_QUEUE_HEAD(ftraced_waiters); +static unsigned long ftraced_iteration_counter; + +enum { + FTRACE_ENABLE_CALLS = (1 << 0), + FTRACE_DISABLE_CALLS = (1 << 1), + FTRACE_UPDATE_TRACE_FUNC = (1 << 2), + FTRACE_ENABLE_MCOUNT = (1 << 3), + FTRACE_DISABLE_MCOUNT = (1 << 4), +}; + +static int ftrace_filtered; + +static struct hlist_head ftrace_hash[FTRACE_HASHSIZE]; + +static DEFINE_PER_CPU(int, ftrace_shutdown_disable_cpu); + +static DEFINE_SPINLOCK(ftrace_shutdown_lock); +static DEFINE_MUTEX(ftraced_lock); +static DEFINE_MUTEX(ftrace_regex_lock); + +struct ftrace_page { + struct ftrace_page *next; + unsigned long index; + struct dyn_ftrace records[]; +}; + +#define ENTRIES_PER_PAGE \ + ((PAGE_SIZE - sizeof(struct ftrace_page)) / sizeof(struct dyn_ftrace)) + +/* estimate from running different kernels */ +#define NR_TO_INIT 10000 + +static struct ftrace_page *ftrace_pages_start; +static struct ftrace_page *ftrace_pages; + +static int ftraced_trigger; +static int ftraced_suspend; + +static int ftrace_record_suspend; + +static struct dyn_ftrace *ftrace_free_records; + +static inline int +ftrace_ip_in_hash(unsigned long ip, unsigned long key) +{ + struct dyn_ftrace *p; + struct hlist_node *t; + int found = 0; + + hlist_for_each_entry(p, t, &ftrace_hash[key], node) { + if (p->ip == ip) { + found = 1; + break; + } + } + + return found; +} + +static inline void +ftrace_add_hash(struct dyn_ftrace *node, unsigned long key) +{ + hlist_add_head(&node->node, &ftrace_hash[key]); +} + +static void ftrace_free_rec(struct dyn_ftrace *rec) +{ + /* no locking, only called from kstop_machine */ + + rec->ip = (unsigned long)ftrace_free_records; + ftrace_free_records = rec; + rec->flags |= FTRACE_FL_FREE; +} + +static struct dyn_ftrace *ftrace_alloc_dyn_node(unsigned long ip) +{ + struct dyn_ftrace *rec; + + /* First check for freed records */ + if (ftrace_free_records) { + rec = ftrace_free_records; + + if (unlikely(!(rec->flags & FTRACE_FL_FREE))) { + WARN_ON_ONCE(1); + ftrace_free_records = NULL; + ftrace_disabled = 1; + ftrace_enabled = 0; + return NULL; + } + + ftrace_free_records = (void *)rec->ip; + memset(rec, 0, sizeof(*rec)); + return rec; + } + + if (ftrace_pages->index == ENTRIES_PER_PAGE) { + if (!ftrace_pages->next) + return NULL; + ftrace_pages = 
ftrace_pages->next; + } + + return &ftrace_pages->records[ftrace_pages->index++]; +} + +static void +ftrace_record_ip(unsigned long ip) +{ + struct dyn_ftrace *node; + unsigned long flags; + unsigned long key; + int resched; + int atomic; + int cpu; + + if (!ftrace_enabled || ftrace_disabled) + return; + + resched = need_resched(); + preempt_disable_notrace(); + + /* + * We simply need to protect against recursion. + * Use the the raw version of smp_processor_id and not + * __get_cpu_var which can call debug hooks that can + * cause a recursive crash here. + */ + cpu = raw_smp_processor_id(); + per_cpu(ftrace_shutdown_disable_cpu, cpu)++; + if (per_cpu(ftrace_shutdown_disable_cpu, cpu) != 1) + goto out; + + if (unlikely(ftrace_record_suspend)) + goto out; + + key = hash_long(ip, FTRACE_HASHBITS); + + WARN_ON_ONCE(key >= FTRACE_HASHSIZE); + + if (ftrace_ip_in_hash(ip, key)) + goto out; + + atomic = irqs_disabled(); + + spin_lock_irqsave(&ftrace_shutdown_lock, flags); + + /* This ip may have hit the hash before the lock */ + if (ftrace_ip_in_hash(ip, key)) + goto out_unlock; + + /* + * There's a slight race that the ftraced will update the + * hash and reset here. If it is already converted, skip it. + */ + if (ftrace_ip_converted(ip)) + goto out_unlock; + + node = ftrace_alloc_dyn_node(ip); + if (!node) + goto out_unlock; + + node->ip = ip; + + ftrace_add_hash(node, key); + + ftraced_trigger = 1; + + out_unlock: + spin_unlock_irqrestore(&ftrace_shutdown_lock, flags); + out: + per_cpu(ftrace_shutdown_disable_cpu, cpu)--; + + /* prevent recursion with scheduler */ + if (resched) + preempt_enable_no_resched_notrace(); + else + preempt_enable_notrace(); +} + +#define FTRACE_ADDR ((long)(ftrace_caller)) +#define MCOUNT_ADDR ((long)(mcount)) + +static void +__ftrace_replace_code(struct dyn_ftrace *rec, + unsigned char *old, unsigned char *new, int enable) +{ + unsigned long ip, fl; + int failed; + + ip = rec->ip; + + if (ftrace_filtered && enable) { + /* + * If filtering is on: + * + * If this record is set to be filtered and + * is enabled then do nothing. + * + * If this record is set to be filtered and + * it is not enabled, enable it. + * + * If this record is not set to be filtered + * and it is not enabled do nothing. + * + * If this record is set not to trace then + * do nothing. + * + * If this record is not set to be filtered and + * it is enabled, disable it. + */ + fl = rec->flags & (FTRACE_FL_FILTER | FTRACE_FL_ENABLED); + + if ((fl == (FTRACE_FL_FILTER | FTRACE_FL_ENABLED)) || + (fl == 0) || (rec->flags & FTRACE_FL_NOTRACE)) + return; + + /* + * If it is enabled disable it, + * otherwise enable it! + */ + if (fl == FTRACE_FL_ENABLED) { + /* swap new and old */ + new = old; + old = ftrace_call_replace(ip, FTRACE_ADDR); + rec->flags &= ~FTRACE_FL_ENABLED; + } else { + new = ftrace_call_replace(ip, FTRACE_ADDR); + rec->flags |= FTRACE_FL_ENABLED; + } + } else { + + if (enable) { + /* + * If this record is set not to trace and is + * not enabled, do nothing. 
+ */ + fl = rec->flags & (FTRACE_FL_NOTRACE | FTRACE_FL_ENABLED); + if (fl == FTRACE_FL_NOTRACE) + return; + + new = ftrace_call_replace(ip, FTRACE_ADDR); + } else + old = ftrace_call_replace(ip, FTRACE_ADDR); + + if (enable) { + if (rec->flags & FTRACE_FL_ENABLED) + return; + rec->flags |= FTRACE_FL_ENABLED; + } else { + if (!(rec->flags & FTRACE_FL_ENABLED)) + return; + rec->flags &= ~FTRACE_FL_ENABLED; + } + } + + failed = ftrace_modify_code(ip, old, new); + if (failed) { + unsigned long key; + /* It is possible that the function hasn't been converted yet */ + key = hash_long(ip, FTRACE_HASHBITS); + if (!ftrace_ip_in_hash(ip, key)) { + rec->flags |= FTRACE_FL_FAILED; + ftrace_free_rec(rec); + } + + } +} + +static void ftrace_replace_code(int enable) +{ + unsigned char *new = NULL, *old = NULL; + struct dyn_ftrace *rec; + struct ftrace_page *pg; + int i; + + if (enable) + old = ftrace_nop_replace(); + else + new = ftrace_nop_replace(); + + for (pg = ftrace_pages_start; pg; pg = pg->next) { + for (i = 0; i < pg->index; i++) { + rec = &pg->records[i]; + + /* don't modify code that has already faulted */ + if (rec->flags & FTRACE_FL_FAILED) + continue; + + __ftrace_replace_code(rec, old, new, enable); + } + } +} + +static void ftrace_shutdown_replenish(void) +{ + if (ftrace_pages->next) + return; + + /* allocate another page */ + ftrace_pages->next = (void *)get_zeroed_page(GFP_KERNEL); +} + +static void +ftrace_code_disable(struct dyn_ftrace *rec) +{ + unsigned long ip; + unsigned char *nop, *call; + int failed; + + ip = rec->ip; + + nop = ftrace_nop_replace(); + call = ftrace_call_replace(ip, MCOUNT_ADDR); + + failed = ftrace_modify_code(ip, call, nop); + if (failed) { + rec->flags |= FTRACE_FL_FAILED; + ftrace_free_rec(rec); + } +} + +static int __ftrace_modify_code(void *data) +{ + unsigned long addr; + int *command = data; + + if (*command & FTRACE_ENABLE_CALLS) + ftrace_replace_code(1); + else if (*command & FTRACE_DISABLE_CALLS) + ftrace_replace_code(0); + + if (*command & FTRACE_UPDATE_TRACE_FUNC) + ftrace_update_ftrace_func(ftrace_trace_function); + + if (*command & FTRACE_ENABLE_MCOUNT) { + addr = (unsigned long)ftrace_record_ip; + ftrace_mcount_set(&addr); + } else if (*command & FTRACE_DISABLE_MCOUNT) { + addr = (unsigned long)ftrace_stub; + ftrace_mcount_set(&addr); + } + + return 0; +} + +static void ftrace_run_update_code(int command) +{ + stop_machine_run(__ftrace_modify_code, &command, NR_CPUS); +} + +static ftrace_func_t saved_ftrace_func; + +static void ftrace_startup(void) +{ + int command = 0; + + if (unlikely(ftrace_disabled)) + return; + + mutex_lock(&ftraced_lock); + ftraced_suspend++; + if (ftraced_suspend == 1) + command |= FTRACE_ENABLE_CALLS; + + if (saved_ftrace_func != ftrace_trace_function) { + saved_ftrace_func = ftrace_trace_function; + command |= FTRACE_UPDATE_TRACE_FUNC; + } + + if (!command || !ftrace_enabled) + goto out; + + ftrace_run_update_code(command); + out: + mutex_unlock(&ftraced_lock); +} + +static void ftrace_shutdown(void) +{ + int command = 0; + + if (unlikely(ftrace_disabled)) + return; + + mutex_lock(&ftraced_lock); + ftraced_suspend--; + if (!ftraced_suspend) + command |= FTRACE_DISABLE_CALLS; + + if (saved_ftrace_func != ftrace_trace_function) { + saved_ftrace_func = ftrace_trace_function; + command |= FTRACE_UPDATE_TRACE_FUNC; + } + + if (!command || !ftrace_enabled) + goto out; + + ftrace_run_update_code(command); + out: + mutex_unlock(&ftraced_lock); +} + +static void ftrace_startup_sysctl(void) +{ + int command = FTRACE_ENABLE_MCOUNT; 
+ + if (unlikely(ftrace_disabled)) + return; + + mutex_lock(&ftraced_lock); + /* Force update next time */ + saved_ftrace_func = NULL; + /* ftraced_suspend is true if we want ftrace running */ + if (ftraced_suspend) + command |= FTRACE_ENABLE_CALLS; + + ftrace_run_update_code(command); + mutex_unlock(&ftraced_lock); +} + +static void ftrace_shutdown_sysctl(void) +{ + int command = FTRACE_DISABLE_MCOUNT; + + if (unlikely(ftrace_disabled)) + return; + + mutex_lock(&ftraced_lock); + /* ftraced_suspend is true if ftrace is running */ + if (ftraced_suspend) + command |= FTRACE_DISABLE_CALLS; + + ftrace_run_update_code(command); + mutex_unlock(&ftraced_lock); +} + +static cycle_t ftrace_update_time; +static unsigned long ftrace_update_cnt; +unsigned long ftrace_update_tot_cnt; + +static int __ftrace_update_code(void *ignore) +{ + struct dyn_ftrace *p; + struct hlist_head head; + struct hlist_node *t; + int save_ftrace_enabled; + cycle_t start, stop; + int i; + + /* Don't be recording funcs now */ + save_ftrace_enabled = ftrace_enabled; + ftrace_enabled = 0; + + start = ftrace_now(raw_smp_processor_id()); + ftrace_update_cnt = 0; + + /* No locks needed, the machine is stopped! */ + for (i = 0; i < FTRACE_HASHSIZE; i++) { + if (hlist_empty(&ftrace_hash[i])) + continue; + + head = ftrace_hash[i]; + INIT_HLIST_HEAD(&ftrace_hash[i]); + + /* all CPUS are stopped, we are safe to modify code */ + hlist_for_each_entry(p, t, &head, node) { + ftrace_code_disable(p); + ftrace_update_cnt++; + } + + } + + stop = ftrace_now(raw_smp_processor_id()); + ftrace_update_time = stop - start; + ftrace_update_tot_cnt += ftrace_update_cnt; + + ftrace_enabled = save_ftrace_enabled; + + return 0; +} + +static void ftrace_update_code(void) +{ + if (unlikely(ftrace_disabled)) + return; + + stop_machine_run(__ftrace_update_code, NULL, NR_CPUS); +} + +static int ftraced(void *ignore) +{ + unsigned long usecs; + + while (!kthread_should_stop()) { + + set_current_state(TASK_INTERRUPTIBLE); + + /* check once a second */ + schedule_timeout(HZ); + + if (unlikely(ftrace_disabled)) + continue; + + mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftraced_lock); + if (ftrace_enabled && ftraced_trigger && !ftraced_suspend) { + ftrace_record_suspend++; + ftrace_update_code(); + usecs = nsecs_to_usecs(ftrace_update_time); + if (ftrace_update_tot_cnt > 100000) { + ftrace_update_tot_cnt = 0; + pr_info("hm, dftrace overflow: %lu change%s" + " (%lu total) in %lu usec%s\n", + ftrace_update_cnt, + ftrace_update_cnt != 1 ? "s" : "", + ftrace_update_tot_cnt, + usecs, usecs != 1 ? "s" : ""); + ftrace_disabled = 1; + WARN_ON_ONCE(1); + } + ftraced_trigger = 0; + ftrace_record_suspend--; + } + ftraced_iteration_counter++; + mutex_unlock(&ftraced_lock); + mutex_unlock(&ftrace_sysctl_lock); + + wake_up_interruptible(&ftraced_waiters); + + ftrace_shutdown_replenish(); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +static int __init ftrace_dyn_table_alloc(void) +{ + struct ftrace_page *pg; + int cnt; + int i; + + /* allocate a few pages */ + ftrace_pages_start = (void *)get_zeroed_page(GFP_KERNEL); + if (!ftrace_pages_start) + return -1; + + /* + * Allocate a few more pages. + * + * TODO: have some parser search vmlinux before + * final linking to find all calls to ftrace. + * Then we can: + * a) know how many pages to allocate. + * and/or + * b) set up the table then. + * + * The dynamic code is still necessary for + * modules. 
+ */ + + pg = ftrace_pages = ftrace_pages_start; + + cnt = NR_TO_INIT / ENTRIES_PER_PAGE; + + for (i = 0; i < cnt; i++) { + pg->next = (void *)get_zeroed_page(GFP_KERNEL); + + /* If we fail, we'll try later anyway */ + if (!pg->next) + break; + + pg = pg->next; + } + + return 0; +} + +enum { + FTRACE_ITER_FILTER = (1 << 0), + FTRACE_ITER_CONT = (1 << 1), + FTRACE_ITER_NOTRACE = (1 << 2), +}; + +#define FTRACE_BUFF_MAX (KSYM_SYMBOL_LEN+4) /* room for wildcards */ + +struct ftrace_iterator { + loff_t pos; + struct ftrace_page *pg; + unsigned idx; + unsigned flags; + unsigned char buffer[FTRACE_BUFF_MAX+1]; + unsigned buffer_idx; + unsigned filtered; +}; + +static void * +t_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct ftrace_iterator *iter = m->private; + struct dyn_ftrace *rec = NULL; + + (*pos)++; + + retry: + if (iter->idx >= iter->pg->index) { + if (iter->pg->next) { + iter->pg = iter->pg->next; + iter->idx = 0; + goto retry; + } + } else { + rec = &iter->pg->records[iter->idx++]; + if ((rec->flags & FTRACE_FL_FAILED) || + ((iter->flags & FTRACE_ITER_FILTER) && + !(rec->flags & FTRACE_FL_FILTER)) || + ((iter->flags & FTRACE_ITER_NOTRACE) && + !(rec->flags & FTRACE_FL_NOTRACE))) { + rec = NULL; + goto retry; + } + } + + iter->pos = *pos; + + return rec; +} + +static void *t_start(struct seq_file *m, loff_t *pos) +{ + struct ftrace_iterator *iter = m->private; + void *p = NULL; + loff_t l = -1; + + if (*pos != iter->pos) { + for (p = t_next(m, p, &l); p && l < *pos; p = t_next(m, p, &l)) + ; + } else { + l = *pos; + p = t_next(m, p, &l); + } + + return p; +} + +static void t_stop(struct seq_file *m, void *p) +{ +} + +static int t_show(struct seq_file *m, void *v) +{ + struct dyn_ftrace *rec = v; + char str[KSYM_SYMBOL_LEN]; + + if (!rec) + return 0; + + kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); + + seq_printf(m, "%s\n", str); + + return 0; +} + +static struct seq_operations show_ftrace_seq_ops = { + .start = t_start, + .next = t_next, + .stop = t_stop, + .show = t_show, +}; + +static int +ftrace_avail_open(struct inode *inode, struct file *file) +{ + struct ftrace_iterator *iter; + int ret; + + if (unlikely(ftrace_disabled)) + return -ENODEV; + + iter = kzalloc(sizeof(*iter), GFP_KERNEL); + if (!iter) + return -ENOMEM; + + iter->pg = ftrace_pages_start; + iter->pos = -1; + + ret = seq_open(file, &show_ftrace_seq_ops); + if (!ret) { + struct seq_file *m = file->private_data; + + m->private = iter; + } else { + kfree(iter); + } + + return ret; +} + +int ftrace_avail_release(struct inode *inode, struct file *file) +{ + struct seq_file *m = (struct seq_file *)file->private_data; + struct ftrace_iterator *iter = m->private; + + seq_release(inode, file); + kfree(iter); + + return 0; +} + +static void ftrace_filter_reset(int enable) +{ + struct ftrace_page *pg; + struct dyn_ftrace *rec; + unsigned long type = enable ? 
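/*
 * Sketch of the page-chained record walk used by the seq_file iterator
 * above (t_next): records live in fixed-size arrays, one array per page,
 * and the pages are singly linked.  The iterator keeps a (page, index)
 * pair and moves to the next page when the current one is exhausted.
 * Illustrative userspace types only.
 */
#include <stdio.h>

#define PER_PAGE 4

struct rec_page {
	struct rec_page *next;
	int index;                 /* number of records used in this page */
	unsigned long records[PER_PAGE];
};

struct iter {
	struct rec_page *pg;
	int idx;
};

static unsigned long *iter_next(struct iter *it)
{
	while (it->pg && it->idx >= it->pg->index) {
		it->pg = it->pg->next; /* page exhausted, move on */
		it->idx = 0;
	}
	if (!it->pg)
		return NULL;
	return &it->pg->records[it->idx++];
}

int main(void)
{
	struct rec_page b = { NULL, 2, { 30, 40 } };
	struct rec_page a = { &b, PER_PAGE, { 1, 2, 3, 4 } };
	struct iter it = { &a, 0 };
	unsigned long *r;

	while ((r = iter_next(&it)))
		printf("%lu\n", *r);   /* 1 2 3 4 30 40 */
	return 0;
}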
FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; + unsigned i; + + /* keep kstop machine from running */ + preempt_disable(); + if (enable) + ftrace_filtered = 0; + pg = ftrace_pages_start; + while (pg) { + for (i = 0; i < pg->index; i++) { + rec = &pg->records[i]; + if (rec->flags & FTRACE_FL_FAILED) + continue; + rec->flags &= ~type; + } + pg = pg->next; + } + preempt_enable(); +} + +static int +ftrace_regex_open(struct inode *inode, struct file *file, int enable) +{ + struct ftrace_iterator *iter; + int ret = 0; + + if (unlikely(ftrace_disabled)) + return -ENODEV; + + iter = kzalloc(sizeof(*iter), GFP_KERNEL); + if (!iter) + return -ENOMEM; + + mutex_lock(&ftrace_regex_lock); + if ((file->f_mode & FMODE_WRITE) && + !(file->f_flags & O_APPEND)) + ftrace_filter_reset(enable); + + if (file->f_mode & FMODE_READ) { + iter->pg = ftrace_pages_start; + iter->pos = -1; + iter->flags = enable ? FTRACE_ITER_FILTER : + FTRACE_ITER_NOTRACE; + + ret = seq_open(file, &show_ftrace_seq_ops); + if (!ret) { + struct seq_file *m = file->private_data; + m->private = iter; + } else + kfree(iter); + } else + file->private_data = iter; + mutex_unlock(&ftrace_regex_lock); + + return ret; +} + +static int +ftrace_filter_open(struct inode *inode, struct file *file) +{ + return ftrace_regex_open(inode, file, 1); +} + +static int +ftrace_notrace_open(struct inode *inode, struct file *file) +{ + return ftrace_regex_open(inode, file, 0); +} + +static ssize_t +ftrace_regex_read(struct file *file, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + if (file->f_mode & FMODE_READ) + return seq_read(file, ubuf, cnt, ppos); + else + return -EPERM; +} + +static loff_t +ftrace_regex_lseek(struct file *file, loff_t offset, int origin) +{ + loff_t ret; + + if (file->f_mode & FMODE_READ) + ret = seq_lseek(file, offset, origin); + else + file->f_pos = ret = 1; + + return ret; +} + +enum { + MATCH_FULL, + MATCH_FRONT_ONLY, + MATCH_MIDDLE_ONLY, + MATCH_END_ONLY, +}; + +static void +ftrace_match(unsigned char *buff, int len, int enable) +{ + char str[KSYM_SYMBOL_LEN]; + char *search = NULL; + struct ftrace_page *pg; + struct dyn_ftrace *rec; + int type = MATCH_FULL; + unsigned long flag = enable ? 
FTRACE_FL_FILTER : FTRACE_FL_NOTRACE; + unsigned i, match = 0, search_len = 0; + + for (i = 0; i < len; i++) { + if (buff[i] == '*') { + if (!i) { + search = buff + i + 1; + type = MATCH_END_ONLY; + search_len = len - (i + 1); + } else { + if (type == MATCH_END_ONLY) { + type = MATCH_MIDDLE_ONLY; + } else { + match = i; + type = MATCH_FRONT_ONLY; + } + buff[i] = 0; + break; + } + } + } + + /* keep kstop machine from running */ + preempt_disable(); + if (enable) + ftrace_filtered = 1; + pg = ftrace_pages_start; + while (pg) { + for (i = 0; i < pg->index; i++) { + int matched = 0; + char *ptr; + + rec = &pg->records[i]; + if (rec->flags & FTRACE_FL_FAILED) + continue; + kallsyms_lookup(rec->ip, NULL, NULL, NULL, str); + switch (type) { + case MATCH_FULL: + if (strcmp(str, buff) == 0) + matched = 1; + break; + case MATCH_FRONT_ONLY: + if (memcmp(str, buff, match) == 0) + matched = 1; + break; + case MATCH_MIDDLE_ONLY: + if (strstr(str, search)) + matched = 1; + break; + case MATCH_END_ONLY: + ptr = strstr(str, search); + if (ptr && (ptr[search_len] == 0)) + matched = 1; + break; + } + if (matched) + rec->flags |= flag; + } + pg = pg->next; + } + preempt_enable(); +} + +static ssize_t +ftrace_regex_write(struct file *file, const char __user *ubuf, + size_t cnt, loff_t *ppos, int enable) +{ + struct ftrace_iterator *iter; + char ch; + size_t read = 0; + ssize_t ret; + + if (!cnt || cnt < 0) + return 0; + + mutex_lock(&ftrace_regex_lock); + + if (file->f_mode & FMODE_READ) { + struct seq_file *m = file->private_data; + iter = m->private; + } else + iter = file->private_data; + + if (!*ppos) { + iter->flags &= ~FTRACE_ITER_CONT; + iter->buffer_idx = 0; + } + + ret = get_user(ch, ubuf++); + if (ret) + goto out; + read++; + cnt--; + + if (!(iter->flags & ~FTRACE_ITER_CONT)) { + /* skip white space */ + while (cnt && isspace(ch)) { + ret = get_user(ch, ubuf++); + if (ret) + goto out; + read++; + cnt--; + } + + if (isspace(ch)) { + file->f_pos += read; + ret = read; + goto out; + } + + iter->buffer_idx = 0; + } + + while (cnt && !isspace(ch)) { + if (iter->buffer_idx < FTRACE_BUFF_MAX) + iter->buffer[iter->buffer_idx++] = ch; + else { + ret = -EINVAL; + goto out; + } + ret = get_user(ch, ubuf++); + if (ret) + goto out; + read++; + cnt--; + } + + if (isspace(ch)) { + iter->filtered++; + iter->buffer[iter->buffer_idx] = 0; + ftrace_match(iter->buffer, iter->buffer_idx, enable); + iter->buffer_idx = 0; + } else + iter->flags |= FTRACE_ITER_CONT; + + + file->f_pos += read; + + ret = read; + out: + mutex_unlock(&ftrace_regex_lock); + + return ret; +} + +static ssize_t +ftrace_filter_write(struct file *file, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return ftrace_regex_write(file, ubuf, cnt, ppos, 1); +} + +static ssize_t +ftrace_notrace_write(struct file *file, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return ftrace_regex_write(file, ubuf, cnt, ppos, 0); +} + +static void +ftrace_set_regex(unsigned char *buf, int len, int reset, int enable) +{ + if (unlikely(ftrace_disabled)) + return; + + mutex_lock(&ftrace_regex_lock); + if (reset) + ftrace_filter_reset(enable); + if (buf) + ftrace_match(buf, len, enable); + mutex_unlock(&ftrace_regex_lock); +} + +/** + * ftrace_set_filter - set a function to filter on in ftrace + * @buf - the string that holds the function filter text. + * @len - the length of the string. + * @reset - non zero to reset all filters before applying this filter. + * + * Filters denote which functions should be enabled when tracing is enabled. 
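/*
 * Standalone sketch of the wildcard handling in ftrace_match() above.  A
 * single '*' splits the pattern into four cases: exact match, "foo*"
 * (front), "*foo" (end) and "*foo*" (substring).  Helper names are
 * hypothetical and the walk over the record pages is omitted.
 */
#include <stdio.h>
#include <string.h>

enum match_type { MATCH_FULL, MATCH_FRONT_ONLY, MATCH_MIDDLE_ONLY, MATCH_END_ONLY };

/* Classify @pat in place; *needle is set to the text part of the pattern. */
static enum match_type classify(char *pat, char **needle)
{
	size_t len = strlen(pat);

	*needle = pat;
	if (len > 1 && pat[0] == '*' && pat[len - 1] == '*') {
		pat[len - 1] = '\0';
		*needle = pat + 1;
		return MATCH_MIDDLE_ONLY;          /* "*foo*" */
	}
	if (len && pat[0] == '*') {
		*needle = pat + 1;
		return MATCH_END_ONLY;             /* "*foo"  */
	}
	if (len && pat[len - 1] == '*') {
		pat[len - 1] = '\0';
		return MATCH_FRONT_ONLY;           /* "foo*"  */
	}
	return MATCH_FULL;
}

static int matches(const char *str, const char *needle, enum match_type type)
{
	size_t nlen = strlen(needle);
	const char *p;

	switch (type) {
	case MATCH_FULL:
		return strcmp(str, needle) == 0;
	case MATCH_FRONT_ONLY:
		return strncmp(str, needle, nlen) == 0;
	case MATCH_MIDDLE_ONLY:
		return strstr(str, needle) != NULL;
	case MATCH_END_ONLY:
		p = strstr(str, needle);
		return p && p[nlen] == '\0';
	}
	return 0;
}

int main(void)
{
	char pat[] = "sched_*";
	char *needle;
	enum match_type t = classify(pat, &needle);

	printf("%d\n", matches("sched_clock", needle, t));   /* 1 */
	printf("%d\n", matches("do_fork", needle, t));       /* 0 */
	return 0;
}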
+ * If @buf is NULL and reset is set, all functions will be enabled for tracing. + */ +void ftrace_set_filter(unsigned char *buf, int len, int reset) +{ + ftrace_set_regex(buf, len, reset, 1); +} + +/** + * ftrace_set_notrace - set a function to not trace in ftrace + * @buf - the string that holds the function notrace text. + * @len - the length of the string. + * @reset - non zero to reset all filters before applying this filter. + * + * Notrace Filters denote which functions should not be enabled when tracing + * is enabled. If @buf is NULL and reset is set, all functions will be enabled + * for tracing. + */ +void ftrace_set_notrace(unsigned char *buf, int len, int reset) +{ + ftrace_set_regex(buf, len, reset, 0); +} + +static int +ftrace_regex_release(struct inode *inode, struct file *file, int enable) +{ + struct seq_file *m = (struct seq_file *)file->private_data; + struct ftrace_iterator *iter; + + mutex_lock(&ftrace_regex_lock); + if (file->f_mode & FMODE_READ) { + iter = m->private; + + seq_release(inode, file); + } else + iter = file->private_data; + + if (iter->buffer_idx) { + iter->filtered++; + iter->buffer[iter->buffer_idx] = 0; + ftrace_match(iter->buffer, iter->buffer_idx, enable); + } + + mutex_lock(&ftrace_sysctl_lock); + mutex_lock(&ftraced_lock); + if (iter->filtered && ftraced_suspend && ftrace_enabled) + ftrace_run_update_code(FTRACE_ENABLE_CALLS); + mutex_unlock(&ftraced_lock); + mutex_unlock(&ftrace_sysctl_lock); + + kfree(iter); + mutex_unlock(&ftrace_regex_lock); + return 0; +} + +static int +ftrace_filter_release(struct inode *inode, struct file *file) +{ + return ftrace_regex_release(inode, file, 1); +} + +static int +ftrace_notrace_release(struct inode *inode, struct file *file) +{ + return ftrace_regex_release(inode, file, 0); +} + +static struct file_operations ftrace_avail_fops = { + .open = ftrace_avail_open, + .read = seq_read, + .llseek = seq_lseek, + .release = ftrace_avail_release, +}; + +static struct file_operations ftrace_filter_fops = { + .open = ftrace_filter_open, + .read = ftrace_regex_read, + .write = ftrace_filter_write, + .llseek = ftrace_regex_lseek, + .release = ftrace_filter_release, +}; + +static struct file_operations ftrace_notrace_fops = { + .open = ftrace_notrace_open, + .read = ftrace_regex_read, + .write = ftrace_notrace_write, + .llseek = ftrace_regex_lseek, + .release = ftrace_notrace_release, +}; + +/** + * ftrace_force_update - force an update to all recording ftrace functions + * + * The ftrace dynamic update daemon only wakes up once a second. + * There may be cases where an update needs to be done immediately + * for tests or internal kernel tracing to begin. This function + * wakes the daemon to do an update and will not return until the + * update is complete. 
+ */ +int ftrace_force_update(void) +{ + unsigned long last_counter; + DECLARE_WAITQUEUE(wait, current); + int ret = 0; + + if (unlikely(ftrace_disabled)) + return -ENODEV; + + mutex_lock(&ftraced_lock); + last_counter = ftraced_iteration_counter; + + set_current_state(TASK_INTERRUPTIBLE); + add_wait_queue(&ftraced_waiters, &wait); + + if (unlikely(!ftraced_task)) { + ret = -ENODEV; + goto out; + } + + do { + mutex_unlock(&ftraced_lock); + wake_up_process(ftraced_task); + schedule(); + mutex_lock(&ftraced_lock); + if (signal_pending(current)) { + ret = -EINTR; + break; + } + set_current_state(TASK_INTERRUPTIBLE); + } while (last_counter == ftraced_iteration_counter); + + out: + mutex_unlock(&ftraced_lock); + remove_wait_queue(&ftraced_waiters, &wait); + set_current_state(TASK_RUNNING); + + return ret; +} + +static void ftrace_force_shutdown(void) +{ + struct task_struct *task; + int command = FTRACE_DISABLE_CALLS | FTRACE_UPDATE_TRACE_FUNC; + + mutex_lock(&ftraced_lock); + task = ftraced_task; + ftraced_task = NULL; + ftraced_suspend = -1; + ftrace_run_update_code(command); + mutex_unlock(&ftraced_lock); + + if (task) + kthread_stop(task); +} + +static __init int ftrace_init_debugfs(void) +{ + struct dentry *d_tracer; + struct dentry *entry; + + d_tracer = tracing_init_dentry(); + + entry = debugfs_create_file("available_filter_functions", 0444, + d_tracer, NULL, &ftrace_avail_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'available_filter_functions' entry\n"); + + entry = debugfs_create_file("set_ftrace_filter", 0644, d_tracer, + NULL, &ftrace_filter_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'set_ftrace_filter' entry\n"); + + entry = debugfs_create_file("set_ftrace_notrace", 0644, d_tracer, + NULL, &ftrace_notrace_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'set_ftrace_notrace' entry\n"); + return 0; +} + +fs_initcall(ftrace_init_debugfs); + +static int __init ftrace_dynamic_init(void) +{ + struct task_struct *p; + unsigned long addr; + int ret; + + addr = (unsigned long)ftrace_record_ip; + + stop_machine_run(ftrace_dyn_arch_init, &addr, NR_CPUS); + + /* ftrace_dyn_arch_init places the return code in addr */ + if (addr) { + ret = (int)addr; + goto failed; + } + + ret = ftrace_dyn_table_alloc(); + if (ret) + goto failed; + + p = kthread_run(ftraced, NULL, "ftraced"); + if (IS_ERR(p)) { + ret = -1; + goto failed; + } + + last_ftrace_enabled = ftrace_enabled = 1; + ftraced_task = p; + + return 0; + + failed: + ftrace_disabled = 1; + return ret; +} + +core_initcall(ftrace_dynamic_init); +#else +# define ftrace_startup() do { } while (0) +# define ftrace_shutdown() do { } while (0) +# define ftrace_startup_sysctl() do { } while (0) +# define ftrace_shutdown_sysctl() do { } while (0) +# define ftrace_force_shutdown() do { } while (0) +#endif /* CONFIG_DYNAMIC_FTRACE */ + +/** + * ftrace_kill - totally shutdown ftrace + * + * This is a safety measure. If something was detected that seems + * wrong, calling this function will keep ftrace from doing + * any more modifications, and updates. + * used when something went wrong. + */ +void ftrace_kill(void) +{ + mutex_lock(&ftrace_sysctl_lock); + ftrace_disabled = 1; + ftrace_enabled = 0; + + clear_ftrace_function(); + mutex_unlock(&ftrace_sysctl_lock); + + /* Try to totally disable ftrace */ + ftrace_force_shutdown(); +} + +/** + * register_ftrace_function - register a function for profiling + * @ops - ops structure that holds the function for profiling. 
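/*
 * Sketch of the ftrace_force_update() handshake above, using a condition
 * variable in place of the kernel wait queue: the caller notes the current
 * generation, pokes the worker, and sleeps until the generation moves on,
 * which proves one full update pass has completed.  Illustrative userspace
 * code with hypothetical names.
 */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t moved = PTHREAD_COND_INITIALIZER;
static unsigned long generation;

static void *worker(void *unused)
{
	(void)unused;
	sleep(1);                       /* pretend to do one update pass */
	pthread_mutex_lock(&lock);
	generation++;
	pthread_cond_broadcast(&moved);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Block until the worker has completed at least one full pass. */
static void force_update(void)
{
	unsigned long last;

	pthread_mutex_lock(&lock);
	last = generation;
	while (generation == last)
		pthread_cond_wait(&moved, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	force_update();
	pthread_join(tid, NULL);
	return 0;
}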
+ * + * Register a function to be called by all functions in the + * kernel. + * + * Note: @ops->func and all the functions it calls must be labeled + * with "notrace", otherwise it will go into a + * recursive loop. + */ +int register_ftrace_function(struct ftrace_ops *ops) +{ + int ret; + + if (unlikely(ftrace_disabled)) + return -1; + + mutex_lock(&ftrace_sysctl_lock); + ret = __register_ftrace_function(ops); + ftrace_startup(); + mutex_unlock(&ftrace_sysctl_lock); + + return ret; +} + +/** + * unregister_ftrace_function - unresgister a function for profiling. + * @ops - ops structure that holds the function to unregister + * + * Unregister a function that was added to be called by ftrace profiling. + */ +int unregister_ftrace_function(struct ftrace_ops *ops) +{ + int ret; + + mutex_lock(&ftrace_sysctl_lock); + ret = __unregister_ftrace_function(ops); + ftrace_shutdown(); + mutex_unlock(&ftrace_sysctl_lock); + + return ret; +} + +int +ftrace_enable_sysctl(struct ctl_table *table, int write, + struct file *file, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret; + + if (unlikely(ftrace_disabled)) + return -ENODEV; + + mutex_lock(&ftrace_sysctl_lock); + + ret = proc_dointvec(table, write, file, buffer, lenp, ppos); + + if (ret || !write || (last_ftrace_enabled == ftrace_enabled)) + goto out; + + last_ftrace_enabled = ftrace_enabled; + + if (ftrace_enabled) { + + ftrace_startup_sysctl(); + + /* we are starting ftrace again */ + if (ftrace_list != &ftrace_list_end) { + if (ftrace_list->next == &ftrace_list_end) + ftrace_trace_function = ftrace_list->func; + else + ftrace_trace_function = ftrace_list_func; + } + + } else { + /* stopping ftrace calls (just send to ftrace_stub) */ + ftrace_trace_function = ftrace_stub; + + ftrace_shutdown_sysctl(); + } + + out: + mutex_unlock(&ftrace_sysctl_lock); + return ret; +} Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace.c @@ -0,0 +1,3112 @@ +/* + * ring buffer based function tracer + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * Copyright (C) 2008 Ingo Molnar <mingo@redhat.com> + * + * Originally taken from the RT patch by: + * Arnaldo Carvalho de Melo <acme@redhat.com> + * + * Based on code from the latency_tracer, that is: + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include <linux/utsrelease.h> +#include <linux/kallsyms.h> +#include <linux/seq_file.h> +#include <linux/debugfs.h> +#include <linux/pagemap.h> +#include <linux/hardirq.h> +#include <linux/linkage.h> +#include <linux/uaccess.h> +#include <linux/ftrace.h> +#include <linux/module.h> +#include <linux/percpu.h> +#include <linux/ctype.h> +#include <linux/init.h> +#include <linux/poll.h> +#include <linux/gfp.h> +#include <linux/fs.h> +#include <linux/writeback.h> + +#include <linux/stacktrace.h> + +#include "trace.h" + +unsigned long __read_mostly tracing_max_latency = (cycle_t)ULONG_MAX; +unsigned long __read_mostly tracing_thresh; + +static unsigned long __read_mostly tracing_nr_buffers; +static cpumask_t __read_mostly tracing_buffer_mask; + +#define for_each_cpu_mask_nr(cpu, mask) for_each_cpu_mask(cpu, mask) +#define for_each_tracing_cpu(cpu) \ + for_each_cpu_mask_nr(cpu, tracing_buffer_mask) + +/* dummy trace to disable tracing */ +static struct tracer no_tracer __read_mostly = { + .name = "none", +}; + +static int trace_alloc_page(void); +static int trace_free_page(void); + +static int 
tracing_disabled = 1; + +static unsigned long tracing_pages_allocated; + +long +ns2usecs(cycle_t nsec) +{ + nsec += 500; + do_div(nsec, 1000); + return nsec; +} + +cycle_t ftrace_now(int cpu) +{ +// return cpu_clock(cpu); + return sched_clock(); +} + +/* + * The global_trace is the descriptor that holds the tracing + * buffers for the live tracing. For each CPU, it contains + * a link list of pages that will store trace entries. The + * page descriptor of the pages in the memory is used to hold + * the link list by linking the lru item in the page descriptor + * to each of the pages in the buffer per CPU. + * + * For each active CPU there is a data field that holds the + * pages for the buffer for that CPU. Each CPU has the same number + * of pages allocated for its buffer. + */ +static struct trace_array global_trace; + +static DEFINE_PER_CPU(struct trace_array_cpu, global_trace_cpu); + +/* + * The max_tr is used to snapshot the global_trace when a maximum + * latency is reached. Some tracers will use this to store a maximum + * trace while it continues examining live traces. + * + * The buffers for the max_tr are set up the same as the global_trace. + * When a snapshot is taken, the link list of the max_tr is swapped + * with the link list of the global_trace and the buffers are reset for + * the global_trace so the tracing can continue. + */ +static struct trace_array max_tr; + +static DEFINE_PER_CPU(struct trace_array_cpu, max_data); + +/* tracer_enabled is used to toggle activation of a tracer */ +static int tracer_enabled = 1; + +/* + * trace_nr_entries is the number of entries that is allocated + * for a buffer. Note, the number of entries is always rounded + * to ENTRIES_PER_PAGE. + */ +static unsigned long trace_nr_entries = 65536UL; + +/* trace_types holds a link list of available tracers. */ +static struct tracer *trace_types __read_mostly; + +/* current_trace points to the tracer that is currently active */ +static struct tracer *current_trace __read_mostly; + +/* + * max_tracer_type_len is used to simplify the allocating of + * buffers to read userspace tracer names. We keep track of + * the longest tracer name registered. + */ +static int max_tracer_type_len; + +/* + * trace_types_lock is used to protect the trace_types list. + * This lock is also used to keep user access serialized. + * Accesses from userspace will grab this lock while userspace + * activities happen inside the kernel. + */ +static DEFINE_MUTEX(trace_types_lock); + +/* trace_wait is a waitqueue for tasks blocked on trace_poll */ +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); + +/* trace_flags holds iter_ctrl options */ +unsigned long trace_flags = TRACE_ITER_PRINT_PARENT; + +/** + * trace_wake_up - wake up tasks waiting for trace input + * + * Simply wakes up any task that is blocked on the trace_wait + * queue. These is used with trace_poll for tasks polling the trace. 
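/*
 * The ns2usecs() helper above rounds to the nearest microsecond instead of
 * truncating, by adding half of the divisor before dividing.  A plain
 * userspace equivalent, for illustration:
 */
#include <stdio.h>

static unsigned long long ns_to_us(unsigned long long ns)
{
	return (ns + 500) / 1000;   /* 1499 ns -> 1 us, 1500 ns -> 2 us */
}

int main(void)
{
	printf("%llu %llu\n", ns_to_us(1499), ns_to_us(1500));  /* 1 2 */
	return 0;
}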
+ */ +void trace_wake_up(void) +{ + /* + * The runqueue_is_locked() can fail, but this is the best we + * have for now: + */ + if (!(trace_flags & TRACE_ITER_BLOCK) && !runqueue_is_locked()) + wake_up(&trace_wait); +} + +#define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct trace_entry)) + +static int __init set_nr_entries(char *str) +{ + unsigned long nr_entries; + int ret; + + if (!str) + return 0; + ret = strict_strtoul(str, 0, &nr_entries); + /* nr_entries can not be zero */ + if (ret < 0 || nr_entries == 0) + return 0; + trace_nr_entries = nr_entries; + return 1; +} +__setup("trace_entries=", set_nr_entries); + +unsigned long nsecs_to_usecs(unsigned long nsecs) +{ + return nsecs / 1000; +} + +/* + * trace_flag_type is an enumeration that holds different + * states when a trace occurs. These are: + * IRQS_OFF - interrupts were disabled + * NEED_RESCED - reschedule is requested + * HARDIRQ - inside an interrupt handler + * SOFTIRQ - inside a softirq handler + */ +enum trace_flag_type { + TRACE_FLAG_IRQS_OFF = 0x01, + TRACE_FLAG_NEED_RESCHED = 0x02, + TRACE_FLAG_HARDIRQ = 0x04, + TRACE_FLAG_SOFTIRQ = 0x08, +}; + +/* + * TRACE_ITER_SYM_MASK masks the options in trace_flags that + * control the output of kernel symbols. + */ +#define TRACE_ITER_SYM_MASK \ + (TRACE_ITER_PRINT_PARENT|TRACE_ITER_SYM_OFFSET|TRACE_ITER_SYM_ADDR) + +/* These must match the bit postions in trace_iterator_flags */ +static const char *trace_options[] = { + "print-parent", + "sym-offset", + "sym-addr", + "verbose", + "raw", + "hex", + "bin", + "block", + "stacktrace", + "sched-tree", + NULL +}; + +/* + * ftrace_max_lock is used to protect the swapping of buffers + * when taking a max snapshot. The buffers themselves are + * protected by per_cpu spinlocks. But the action of the swap + * needs its own lock. + * + * This is defined as a raw_spinlock_t in order to help + * with performance when lockdep debugging is enabled. + */ +static raw_spinlock_t ftrace_max_lock = + (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + +/* + * Copy the new maximum trace into the separate maximum-trace + * structure. (this way the maximum trace is permanently saved, + * for later retrieval via /debugfs/tracing/latency_trace) + */ +static void +__update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) +{ + struct trace_array_cpu *data = tr->data[cpu]; + + max_tr.cpu = cpu; + max_tr.time_start = data->preempt_timestamp; + + data = max_tr.data[cpu]; + data->saved_latency = tracing_max_latency; + + memcpy(data->comm, tsk->comm, TASK_COMM_LEN); + data->pid = tsk->pid; + data->uid = tsk->uid; + data->nice = tsk->static_prio - 20 - MAX_RT_PRIO; + data->policy = tsk->policy; + data->rt_priority = tsk->rt_priority; + + /* record this tasks comm */ + tracing_record_cmdline(current); +} + +#define CHECK_COND(cond) \ + if (unlikely(cond)) { \ + tracing_disabled = 1; \ + WARN_ON(1); \ + return -1; \ + } + +/** + * check_pages - integrity check of trace buffers + * + * As a safty measure we check to make sure the data pages have not + * been corrupted. + */ +int check_pages(struct trace_array_cpu *data) +{ + struct page *page, *tmp; + + CHECK_COND(data->trace_pages.next->prev != &data->trace_pages); + CHECK_COND(data->trace_pages.prev->next != &data->trace_pages); + + list_for_each_entry_safe(page, tmp, &data->trace_pages, lru) { + CHECK_COND(page->lru.next->prev != &page->lru); + CHECK_COND(page->lru.prev->next != &page->lru); + } + + return 0; +} + +/** + * head_page - page address of the first page in per_cpu buffer. 
+ * + * head_page returns the page address of the first page in + * a per_cpu buffer. This also preforms various consistency + * checks to make sure the buffer has not been corrupted. + */ +void *head_page(struct trace_array_cpu *data) +{ + struct page *page; + + if (list_empty(&data->trace_pages)) + return NULL; + + page = list_entry(data->trace_pages.next, struct page, lru); + BUG_ON(&page->lru == &data->trace_pages); + + return page_address(page); +} + +/** + * trace_seq_printf - sequence printing of trace information + * @s: trace sequence descriptor + * @fmt: printf format string + * + * The tracer may use either sequence operations or its own + * copy to user routines. To simplify formating of a trace + * trace_seq_printf is used to store strings into a special + * buffer (@s). Then the output may be either used by + * the sequencer or pulled into another buffer. + */ +int +trace_seq_printf(struct trace_seq *s, const char *fmt, ...) +{ + int len = (PAGE_SIZE - 1) - s->len; + va_list ap; + int ret; + + if (!len) + return 0; + + va_start(ap, fmt); + ret = vsnprintf(s->buffer + s->len, len, fmt, ap); + va_end(ap); + + /* If we can't write it all, don't bother writing anything */ + if (ret >= len) + return 0; + + s->len += ret; + + return len; +} + +/** + * trace_seq_puts - trace sequence printing of simple string + * @s: trace sequence descriptor + * @str: simple string to record + * + * The tracer may use either the sequence operations or its own + * copy to user routines. This function records a simple string + * into a special buffer (@s) for later retrieval by a sequencer + * or other mechanism. + */ +static int +trace_seq_puts(struct trace_seq *s, const char *str) +{ + int len = strlen(str); + + if (len > ((PAGE_SIZE - 1) - s->len)) + return 0; + + memcpy(s->buffer + s->len, str, len); + s->len += len; + + return len; +} + +static int +trace_seq_putc(struct trace_seq *s, unsigned char c) +{ + if (s->len >= (PAGE_SIZE - 1)) + return 0; + + s->buffer[s->len++] = c; + + return 1; +} + +static int +trace_seq_putmem(struct trace_seq *s, void *mem, size_t len) +{ + if (len > ((PAGE_SIZE - 1) - s->len)) + return 0; + + memcpy(s->buffer + s->len, mem, len); + s->len += len; + + return len; +} + +#define HEX_CHARS 17 +static const char hex2asc[] = "0123456789abcdef"; + +static int +trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len) +{ + unsigned char hex[HEX_CHARS]; + unsigned char *data = mem; + unsigned char byte; + int i, j; + + BUG_ON(len >= HEX_CHARS); + +#ifdef __BIG_ENDIAN + for (i = 0, j = 0; i < len; i++) { +#else + for (i = len-1, j = 0; i >= 0; i--) { +#endif + byte = data[i]; + + hex[j++] = hex2asc[byte & 0x0f]; + hex[j++] = hex2asc[byte >> 4]; + } + hex[j++] = ' '; + + return trace_seq_putmem(s, hex, j); +} + +static void +trace_seq_reset(struct trace_seq *s) +{ + s->len = 0; + s->readpos = 0; +} + +ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, size_t cnt) +{ + int len; + int ret; + + if (s->len <= s->readpos) + return -EBUSY; + + len = s->len - s->readpos; + if (cnt > len) + cnt = len; + ret = copy_to_user(ubuf, s->buffer + s->readpos, cnt); + if (ret) + return -EFAULT; + + s->readpos += len; + return cnt; +} + +static void +trace_print_seq(struct seq_file *m, struct trace_seq *s) +{ + int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len; + + s->buffer[len] = 0; + seq_puts(m, s->buffer); + + trace_seq_reset(s); +} + +/* + * flip the trace buffers between two trace descriptors. 
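/*
 * Userspace sketch of the trace_seq append buffer above: formatted output
 * accumulates in a fixed, page-sized buffer, and an entry that does not fit
 * completely is dropped rather than truncated.  Hypothetical names; the
 * copy-to-user side is omitted.
 */
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define SEQ_SIZE 4096

struct seq_buf {
	char buffer[SEQ_SIZE];
	size_t len;
};

/* Returns the number of bytes appended, or 0 if the entry did not fit. */
static int seq_buf_printf(struct seq_buf *s, const char *fmt, ...)
{
	size_t room = (SEQ_SIZE - 1) - s->len;
	va_list ap;
	int ret;

	if (!room)
		return 0;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, room, fmt, ap);
	va_end(ap);

	if (ret < 0 || (size_t)ret >= room)   /* all or nothing */
		return 0;

	s->len += ret;
	return ret;
}

int main(void)
{
	struct seq_buf s = { .len = 0 };

	seq_buf_printf(&s, "%16s-%-5d ", "bash", 4242);
	seq_buf_printf(&s, "[%02d] %5lu.%06lu: ", 1, 17UL, 123456UL);
	puts(s.buffer);
	return 0;
}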
+ * This usually is the buffers between the global_trace and + * the max_tr to record a snapshot of a current trace. + * + * The ftrace_max_lock must be held. + */ +static void +flip_trace(struct trace_array_cpu *tr1, struct trace_array_cpu *tr2) +{ + struct list_head flip_pages; + + INIT_LIST_HEAD(&flip_pages); + + memcpy(&tr1->trace_head_idx, &tr2->trace_head_idx, + sizeof(struct trace_array_cpu) - + offsetof(struct trace_array_cpu, trace_head_idx)); + + check_pages(tr1); + check_pages(tr2); + list_splice_init(&tr1->trace_pages, &flip_pages); + list_splice_init(&tr2->trace_pages, &tr1->trace_pages); + list_splice_init(&flip_pages, &tr2->trace_pages); + BUG_ON(!list_empty(&flip_pages)); + check_pages(tr1); + check_pages(tr2); +} + +/** + * update_max_tr - snapshot all trace buffers from global_trace to max_tr + * @tr: tracer + * @tsk: the task with the latency + * @cpu: The cpu that initiated the trace. + * + * Flip the buffers between the @tr and the max_tr and record information + * about which task was the cause of this latency. + */ +void +update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu) +{ + struct trace_array_cpu *data; + int i; + + WARN_ON_ONCE(!irqs_disabled()); + __raw_spin_lock(&ftrace_max_lock); + /* clear out all the previous traces */ + for_each_tracing_cpu(i) { + data = tr->data[i]; + flip_trace(max_tr.data[i], data); + tracing_reset(data); + } + + __update_max_tr(tr, tsk, cpu); + __raw_spin_unlock(&ftrace_max_lock); +} + +/** + * update_max_tr_single - only copy one trace over, and reset the rest + * @tr - tracer + * @tsk - task with the latency + * @cpu - the cpu of the buffer to copy. + * + * Flip the trace of a single CPU buffer between the @tr and the max_tr. + */ +void +update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu) +{ + struct trace_array_cpu *data = tr->data[cpu]; + int i; + + WARN_ON_ONCE(!irqs_disabled()); + __raw_spin_lock(&ftrace_max_lock); + for_each_tracing_cpu(i) + tracing_reset(max_tr.data[i]); + + flip_trace(max_tr.data[cpu], data); + tracing_reset(data); + + __update_max_tr(tr, tsk, cpu); + __raw_spin_unlock(&ftrace_max_lock); +} + +/** + * register_tracer - register a tracer with the ftrace system. + * @type - the plugin for the tracer + * + * Register a new plugin tracer. + */ +int register_tracer(struct tracer *type) +{ + struct tracer *t; + int len; + int ret = 0; + + if (!type->name) { + pr_info("Tracer must have a name\n"); + return -1; + } + + mutex_lock(&trace_types_lock); + for (t = trace_types; t; t = t->next) { + if (strcmp(type->name, t->name) == 0) { + /* already found */ + pr_info("Trace %s already registered\n", + type->name); + ret = -1; + goto out; + } + } + +#ifdef CONFIG_FTRACE_STARTUP_TEST + if (type->selftest) { + struct tracer *saved_tracer = current_trace; + struct trace_array_cpu *data; + struct trace_array *tr = &global_trace; + int saved_ctrl = tr->ctrl; + int i; + /* + * Run a selftest on this tracer. + * Here we reset the trace buffer, and set the current + * tracer to be this tracer. The tracer can then run some + * internal tracing to verify that everything is in order. + * If we fail, we do not register this tracer. 
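/*
 * Sketch of the registry pattern used by register_tracer() and
 * unregister_tracer() above: new entries are pushed on the front of a
 * singly linked list, and removal walks a pointer-to-pointer so the head
 * needs no special case.  Illustrative types only.
 */
#include <stdio.h>
#include <string.h>

struct plugin {
	const char *name;
	struct plugin *next;
};

static struct plugin *plugins;

static int plugin_register(struct plugin *p)
{
	struct plugin *t;

	for (t = plugins; t; t = t->next)
		if (strcmp(t->name, p->name) == 0)
			return -1;              /* already registered */

	p->next = plugins;                      /* push front */
	plugins = p;
	return 0;
}

static void plugin_unregister(struct plugin *p)
{
	struct plugin **t;

	for (t = &plugins; *t; t = &(*t)->next) {
		if (*t == p) {
			*t = (*t)->next;        /* unlink, head or not */
			return;
		}
	}
}

int main(void)
{
	struct plugin a = { "function", NULL }, b = { "sched_switch", NULL };

	plugin_register(&a);
	plugin_register(&b);
	plugin_unregister(&a);
	printf("%s\n", plugins->name);          /* sched_switch */
	return 0;
}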
+ */ + for_each_tracing_cpu(i) { + data = tr->data[i]; + if (!head_page(data)) + continue; + tracing_reset(data); + } + current_trace = type; + tr->ctrl = 0; + /* the test is responsible for initializing and enabling */ + pr_info("Testing tracer %s: ", type->name); + ret = type->selftest(type, tr); + /* the test is responsible for resetting too */ + current_trace = saved_tracer; + tr->ctrl = saved_ctrl; + if (ret) { + printk(KERN_CONT "FAILED!\n"); + goto out; + } + /* Only reset on passing, to avoid touching corrupted buffers */ + for_each_tracing_cpu(i) { + data = tr->data[i]; + if (!head_page(data)) + continue; + tracing_reset(data); + } + printk(KERN_CONT "PASSED\n"); + } +#endif + + type->next = trace_types; + trace_types = type; + len = strlen(type->name); + if (len > max_tracer_type_len) + max_tracer_type_len = len; + + out: + mutex_unlock(&trace_types_lock); + + return ret; +} + +void unregister_tracer(struct tracer *type) +{ + struct tracer **t; + int len; + + mutex_lock(&trace_types_lock); + for (t = &trace_types; *t; t = &(*t)->next) { + if (*t == type) + goto found; + } + pr_info("Trace %s not registered\n", type->name); + goto out; + + found: + *t = (*t)->next; + if (strlen(type->name) != max_tracer_type_len) + goto out; + + max_tracer_type_len = 0; + for (t = &trace_types; *t; t = &(*t)->next) { + len = strlen((*t)->name); + if (len > max_tracer_type_len) + max_tracer_type_len = len; + } + out: + mutex_unlock(&trace_types_lock); +} + +void tracing_reset(struct trace_array_cpu *data) +{ + data->trace_idx = 0; + data->overrun = 0; + data->trace_head = data->trace_tail = head_page(data); + data->trace_head_idx = 0; + data->trace_tail_idx = 0; +} + +#define SAVED_CMDLINES 128 +static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1]; +static unsigned map_cmdline_to_pid[SAVED_CMDLINES]; +static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN]; +static int cmdline_idx; +static DEFINE_SPINLOCK(trace_cmdline_lock); + +/* temporary disable recording */ +atomic_t trace_record_cmdline_disabled __read_mostly; + +static void trace_init_cmdlines(void) +{ + memset(&map_pid_to_cmdline, -1, sizeof(map_pid_to_cmdline)); + memset(&map_cmdline_to_pid, -1, sizeof(map_cmdline_to_pid)); + cmdline_idx = 0; +} + +void trace_stop_cmdline_recording(void); + +static void trace_save_cmdline(struct task_struct *tsk) +{ + unsigned map; + unsigned idx; + + if (!tsk->pid || unlikely(tsk->pid > PID_MAX_DEFAULT)) + return; + + /* + * It's not the end of the world if we don't get + * the lock, but we also don't want to spin + * nor do we want to disable interrupts, + * so if we miss here, then better luck next time. 
+ */ + if (!spin_trylock(&trace_cmdline_lock)) + return; + + idx = map_pid_to_cmdline[tsk->pid]; + if (idx >= SAVED_CMDLINES) { + idx = (cmdline_idx + 1) % SAVED_CMDLINES; + + map = map_cmdline_to_pid[idx]; + if (map <= PID_MAX_DEFAULT) + map_pid_to_cmdline[map] = (unsigned)-1; + + map_pid_to_cmdline[tsk->pid] = idx; + + cmdline_idx = idx; + } + + memcpy(&saved_cmdlines[idx], tsk->comm, TASK_COMM_LEN); + + spin_unlock(&trace_cmdline_lock); +} + +static char *trace_find_cmdline(int pid) +{ + char *cmdline = "<...>"; + unsigned map; + + if (!pid) + return "<idle>"; + + if (pid > PID_MAX_DEFAULT) + goto out; + + map = map_pid_to_cmdline[pid]; + if (map >= SAVED_CMDLINES) + goto out; + + cmdline = saved_cmdlines[map]; + + out: + return cmdline; +} + +void tracing_record_cmdline(struct task_struct *tsk) +{ + if (atomic_read(&trace_record_cmdline_disabled)) + return; + + trace_save_cmdline(tsk); +} + +static inline struct list_head * +trace_next_list(struct trace_array_cpu *data, struct list_head *next) +{ + /* + * Roundrobin - but skip the head (which is not a real page): + */ + next = next->next; + if (unlikely(next == &data->trace_pages)) + next = next->next; + BUG_ON(next == &data->trace_pages); + + return next; +} + +static inline void * +trace_next_page(struct trace_array_cpu *data, void *addr) +{ + struct list_head *next; + struct page *page; + + page = virt_to_page(addr); + + next = trace_next_list(data, &page->lru); + page = list_entry(next, struct page, lru); + + return page_address(page); +} + +static inline struct trace_entry * +tracing_get_trace_entry(struct trace_array *tr, struct trace_array_cpu *data) +{ + unsigned long idx, idx_next; + struct trace_entry *entry; + + data->trace_idx++; + idx = data->trace_head_idx; + idx_next = idx + 1; + + BUG_ON(idx * TRACE_ENTRY_SIZE >= PAGE_SIZE); + + entry = data->trace_head + idx * TRACE_ENTRY_SIZE; + + if (unlikely(idx_next >= ENTRIES_PER_PAGE)) { + data->trace_head = trace_next_page(data, data->trace_head); + idx_next = 0; + } + + if (data->trace_head == data->trace_tail && + idx_next == data->trace_tail_idx) { + /* overrun */ + data->overrun++; + data->trace_tail_idx++; + if (data->trace_tail_idx >= ENTRIES_PER_PAGE) { + data->trace_tail = + trace_next_page(data, data->trace_tail); + data->trace_tail_idx = 0; + } + } + + data->trace_head_idx = idx_next; + + return entry; +} + +static inline void +tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags) +{ + struct task_struct *tsk = current; + unsigned long pc; + + pc = preempt_count(); + + entry->preempt_count = pc & 0xff; + entry->pid = (tsk) ? tsk->pid : 0; + entry->t = ftrace_now(raw_smp_processor_id()); + entry->flags = (irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) | + ((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) | + ((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) | + (need_resched() ? 
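/*
 * Simplified model of the head/tail bookkeeping in tracing_get_trace_entry()
 * above, with a flat array standing in for the chain of pages: when the head
 * catches up with the tail the oldest entry is overwritten and an overrun
 * counter is bumped.  Illustrative only.
 */
#include <stdio.h>

#define RING_ENTRIES 4

struct ring {
	unsigned long entries[RING_ENTRIES];
	unsigned int head, tail;
	unsigned long count, overrun;
};

static void ring_write(struct ring *r, unsigned long val)
{
	unsigned int next = (r->head + 1) % RING_ENTRIES;

	if (next == r->tail) {                   /* full: drop the oldest */
		r->tail = (r->tail + 1) % RING_ENTRIES;
		r->overrun++;
	}
	r->entries[r->head] = val;
	r->head = next;
	r->count++;
}

int main(void)
{
	struct ring r = { .head = 0, .tail = 0 };
	unsigned long i;

	for (i = 0; i < 6; i++)
		ring_write(&r, i);
	printf("written=%lu overrun=%lu\n", r.count, r.overrun);  /* 6 3 */
	return 0;
}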
TRACE_FLAG_NEED_RESCHED : 0); +} + +void +trace_function(struct trace_array *tr, struct trace_array_cpu *data, + unsigned long ip, unsigned long parent_ip, unsigned long flags) +{ + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_FN; + entry->fn.ip = ip; + entry->fn.parent_ip = parent_ip; + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); +} + +void +ftrace(struct trace_array *tr, struct trace_array_cpu *data, + unsigned long ip, unsigned long parent_ip, unsigned long flags) +{ + if (likely(!atomic_read(&data->disabled))) + trace_function(tr, data, ip, parent_ip, flags); +} + +#ifdef CONFIG_MMIOTRACE +void __trace_mmiotrace_rw(struct trace_array *tr, struct trace_array_cpu *data, + struct mmiotrace_rw *rw) +{ + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, 0); + entry->type = TRACE_MMIO_RW; + entry->mmiorw = *rw; + + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); + + trace_wake_up(); +} + +void __trace_mmiotrace_map(struct trace_array *tr, struct trace_array_cpu *data, + struct mmiotrace_map *map) +{ + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, 0); + entry->type = TRACE_MMIO_MAP; + entry->mmiomap = *map; + + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); + + trace_wake_up(); +} +#endif + +void __trace_stack(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + int skip) +{ + struct trace_entry *entry; + struct stack_trace trace; + + if (!(trace_flags & TRACE_ITER_STACKTRACE)) + return; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_STACK; + + memset(&entry->stack, 0, sizeof(entry->stack)); + + trace.nr_entries = 0; + trace.max_entries = FTRACE_STACK_ENTRIES; + trace.skip = skip; + trace.entries = entry->stack.caller; + + save_stack_trace(&trace); +} + +void +__trace_special(void *__tr, void *__data, + unsigned long arg1, unsigned long arg2, unsigned long arg3) +{ + struct trace_array_cpu *data = __data; + struct trace_array *tr = __tr; + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, 0); + entry->type = TRACE_SPECIAL; + entry->special.arg1 = arg1; + entry->special.arg2 = arg2; + entry->special.arg3 = arg3; + __trace_stack(tr, data, irq_flags, 4); + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); + + trace_wake_up(); +} + +void +tracing_sched_switch_trace(struct trace_array *tr, + struct trace_array_cpu *data, + struct task_struct *prev, + struct task_struct *next, + unsigned long flags) +{ + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_CTX; + entry->ctx.prev_pid = prev->pid; + entry->ctx.prev_prio = prev->prio; + entry->ctx.prev_state = prev->state; + entry->ctx.next_pid = 
next->pid; + entry->ctx.next_prio = next->prio; + entry->ctx.next_state = next->state; + __trace_stack(tr, data, flags, 5); + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); +} + +void +tracing_sched_wakeup_trace(struct trace_array *tr, + struct trace_array_cpu *data, + struct task_struct *wakee, + struct task_struct *curr, + unsigned long flags) +{ + struct trace_entry *entry; + unsigned long irq_flags; + + raw_local_irq_save(irq_flags); + __raw_spin_lock(&data->lock); + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_WAKE; + entry->ctx.prev_pid = curr->pid; + entry->ctx.prev_prio = curr->prio; + entry->ctx.prev_state = curr->state; + entry->ctx.next_pid = wakee->pid; + entry->ctx.next_prio = wakee->prio; + entry->ctx.next_state = wakee->state; + __trace_stack(tr, data, flags, 6); + __raw_spin_unlock(&data->lock); + raw_local_irq_restore(irq_flags); + + trace_wake_up(); +} + +void +ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3) +{ + struct trace_array *tr = &global_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + + if (tracing_disabled || current_trace == &no_tracer || !tr->ctrl) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) + __trace_special(tr, data, arg1, arg2, arg3); + + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +#ifdef CONFIG_FTRACE +static void +function_trace_call(unsigned long ip, unsigned long parent_ip) +{ + struct trace_array *tr = &global_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + + if (unlikely(!tracer_enabled)) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) + trace_function(tr, data, ip, parent_ip, flags); + + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +static struct ftrace_ops trace_ops __read_mostly = +{ + .func = function_trace_call, +}; + +void tracing_start_function_trace(void) +{ + register_ftrace_function(&trace_ops); +} + +void tracing_stop_function_trace(void) +{ + unregister_ftrace_function(&trace_ops); +} +#endif + +enum trace_file_type { + TRACE_FILE_LAT_FMT = 1, +}; + +static struct trace_entry * +trace_entry_idx(struct trace_array *tr, struct trace_array_cpu *data, + struct trace_iterator *iter, int cpu) +{ + struct page *page; + struct trace_entry *array; + + if (iter->next_idx[cpu] >= tr->entries || + iter->next_idx[cpu] >= data->trace_idx || + (data->trace_head == data->trace_tail && + data->trace_head_idx == data->trace_tail_idx)) + return NULL; + + if (!iter->next_page[cpu]) { + /* Initialize the iterator for this cpu trace buffer */ + WARN_ON(!data->trace_tail); + page = virt_to_page(data->trace_tail); + iter->next_page[cpu] = &page->lru; + iter->next_page_idx[cpu] = data->trace_tail_idx; + } + + page = list_entry(iter->next_page[cpu], struct page, lru); + BUG_ON(&data->trace_pages == &page->lru); + + array = page_address(page); + + WARN_ON(iter->next_page_idx[cpu] >= ENTRIES_PER_PAGE); + return &array[iter->next_page_idx[cpu]]; +} + +static struct trace_entry * +find_next_entry(struct trace_iterator *iter, int *ent_cpu) +{ + struct trace_array *tr = iter->tr; + struct trace_entry *ent, *next = NULL; + int next_cpu = -1; + int cpu; + + for_each_tracing_cpu(cpu) { + if 
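/*
 * Sketch of the re-entrancy guard in function_trace_call() above: a counter
 * is incremented on entry and the event is recorded only when it went
 * 0 -> 1, so a tracer that recurses into itself (or is interrupted by
 * itself) does not record nested events.  C11 atomics stand in for the
 * kernel's per-CPU atomic_t; names are illustrative.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int disabled;
static unsigned long recorded, skipped;

static void record_event(unsigned long ip);

static void trace_hook(unsigned long ip)
{
	if (atomic_fetch_add(&disabled, 1) == 0)   /* 0 -> 1: safe to record */
		record_event(ip);
	else
		skipped++;
	atomic_fetch_sub(&disabled, 1);
}

static void record_event(unsigned long ip)
{
	recorded++;
	if (ip == 0x1000)
		trace_hook(0x2000);   /* simulate the tracer tracing itself */
}

int main(void)
{
	trace_hook(0x1000);
	printf("recorded=%lu skipped=%lu\n", recorded, skipped);  /* 1 1 */
	return 0;
}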
(!head_page(tr->data[cpu])) + continue; + ent = trace_entry_idx(tr, tr->data[cpu], iter, cpu); + /* + * Pick the entry with the smallest timestamp: + */ + if (ent && (!next || ent->t < next->t)) { + next = ent; + next_cpu = cpu; + } + } + + if (ent_cpu) + *ent_cpu = next_cpu; + + return next; +} + +static void trace_iterator_increment(struct trace_iterator *iter) +{ + iter->idx++; + iter->next_idx[iter->cpu]++; + iter->next_page_idx[iter->cpu]++; + + if (iter->next_page_idx[iter->cpu] >= ENTRIES_PER_PAGE) { + struct trace_array_cpu *data = iter->tr->data[iter->cpu]; + + iter->next_page_idx[iter->cpu] = 0; + iter->next_page[iter->cpu] = + trace_next_list(data, iter->next_page[iter->cpu]); + } +} + +static void trace_consume(struct trace_iterator *iter) +{ + struct trace_array_cpu *data = iter->tr->data[iter->cpu]; + + data->trace_tail_idx++; + if (data->trace_tail_idx >= ENTRIES_PER_PAGE) { + data->trace_tail = trace_next_page(data, data->trace_tail); + data->trace_tail_idx = 0; + } + + /* Check if we empty it, then reset the index */ + if (data->trace_head == data->trace_tail && + data->trace_head_idx == data->trace_tail_idx) + data->trace_idx = 0; +} + +static void *find_next_entry_inc(struct trace_iterator *iter) +{ + struct trace_entry *next; + int next_cpu = -1; + + next = find_next_entry(iter, &next_cpu); + + iter->prev_ent = iter->ent; + iter->prev_cpu = iter->cpu; + + iter->ent = next; + iter->cpu = next_cpu; + + if (next) + trace_iterator_increment(iter); + + return next ? iter : NULL; +} + +static void *s_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct trace_iterator *iter = m->private; + void *last_ent = iter->ent; + int i = (int)*pos; + void *ent; + + (*pos)++; + + /* can't go backwards */ + if (iter->idx > i) + return NULL; + + if (iter->idx < 0) + ent = find_next_entry_inc(iter); + else + ent = iter; + + while (ent && iter->idx < i) + ent = find_next_entry_inc(iter); + + iter->pos = *pos; + + if (last_ent && !ent) + seq_puts(m, "\n\nvim:ft=help\n"); + + return ent; +} + +static void *s_start(struct seq_file *m, loff_t *pos) +{ + struct trace_iterator *iter = m->private; + void *p = NULL; + loff_t l = 0; + int i; + + mutex_lock(&trace_types_lock); + + if (!current_trace || current_trace != iter->trace) { + mutex_unlock(&trace_types_lock); + return NULL; + } + + atomic_inc(&trace_record_cmdline_disabled); + + /* let the tracer grab locks here if needed */ + if (current_trace->start) + current_trace->start(iter); + + if (*pos != iter->pos) { + iter->ent = NULL; + iter->cpu = 0; + iter->idx = -1; + iter->prev_ent = NULL; + iter->prev_cpu = -1; + + for_each_tracing_cpu(i) { + iter->next_idx[i] = 0; + iter->next_page[i] = NULL; + } + + for (p = iter; p && l < *pos; p = s_next(m, p, &l)) + ; + + } else { + l = *pos - 1; + p = s_next(m, p, &l); + } + + return p; +} + +static void s_stop(struct seq_file *m, void *p) +{ + struct trace_iterator *iter = m->private; + + atomic_dec(&trace_record_cmdline_disabled); + + /* let the tracer release locks here if needed */ + if (current_trace && current_trace == iter->trace && iter->trace->stop) + iter->trace->stop(iter); + + mutex_unlock(&trace_types_lock); +} + +static int +seq_print_sym_short(struct trace_seq *s, const char *fmt, unsigned long address) +{ +#ifdef CONFIG_KALLSYMS + char str[KSYM_SYMBOL_LEN]; + + kallsyms_lookup(address, NULL, NULL, NULL, str); + + return trace_seq_printf(s, fmt, str); +#endif + return 1; +} + +static int +seq_print_sym_offset(struct trace_seq *s, const char *fmt, + unsigned long address) +{ +#ifdef 
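/*
 * Sketch of find_next_entry() above: each per-CPU buffer is already in
 * timestamp order, so merging them for output just means peeking at the
 * head of every buffer and taking the entry with the smallest timestamp.
 * Plain arrays stand in for the per-CPU page lists, and this version
 * consumes the entry directly instead of using a separate increment step.
 */
#include <stdio.h>

#define NR_CPUS 2

struct entry { unsigned long long t; const char *what; };

static struct entry cpu0[] = { { 10, "A" }, { 40, "D" } };
static struct entry cpu1[] = { { 20, "B" }, { 30, "C" } };

static struct entry *bufs[NR_CPUS] = { cpu0, cpu1 };
static int nr_entries[NR_CPUS] = { 2, 2 };
static int next_idx[NR_CPUS];

/* Return the earliest pending entry, or NULL when all buffers are drained. */
static struct entry *find_next(int *ent_cpu)
{
	struct entry *next = NULL;
	int cpu, next_cpu = -1;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct entry *ent;

		if (next_idx[cpu] >= nr_entries[cpu])
			continue;
		ent = &bufs[cpu][next_idx[cpu]];
		if (!next || ent->t < next->t) {
			next = ent;
			next_cpu = cpu;
		}
	}
	if (next)
		next_idx[next_cpu]++;    /* consume it */
	*ent_cpu = next_cpu;
	return next;
}

int main(void)
{
	struct entry *e;
	int cpu;

	while ((e = find_next(&cpu)))
		printf("cpu%d %llu %s\n", cpu, e->t, e->what);  /* A B C D */
	return 0;
}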
CONFIG_KALLSYMS + char str[KSYM_SYMBOL_LEN]; + + sprint_symbol(str, address); + return trace_seq_printf(s, fmt, str); +#endif + return 1; +} + +#ifndef CONFIG_64BIT +# define IP_FMT "%08lx" +#else +# define IP_FMT "%016lx" +#endif + +static int +seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags) +{ + int ret; + + if (!ip) + return trace_seq_printf(s, "0"); + + if (sym_flags & TRACE_ITER_SYM_OFFSET) + ret = seq_print_sym_offset(s, "%s", ip); + else + ret = seq_print_sym_short(s, "%s", ip); + + if (!ret) + return 0; + + if (sym_flags & TRACE_ITER_SYM_ADDR) + ret = trace_seq_printf(s, " <" IP_FMT ">", ip); + return ret; +} + +static void print_lat_help_header(struct seq_file *m) +{ + seq_puts(m, "# _------=> CPU# \n"); + seq_puts(m, "# / _-----=> irqs-off \n"); + seq_puts(m, "# | / _----=> need-resched \n"); + seq_puts(m, "# || / _---=> hardirq/softirq \n"); + seq_puts(m, "# ||| / _--=> preempt-depth \n"); + seq_puts(m, "# |||| / \n"); + seq_puts(m, "# ||||| delay \n"); + seq_puts(m, "# cmd pid ||||| time | caller \n"); + seq_puts(m, "# \\ / ||||| \\ | / \n"); +} + +static void print_func_help_header(struct seq_file *m) +{ + seq_puts(m, "# TASK-PID CPU# TIMESTAMP FUNCTION\n"); + seq_puts(m, "# | | | | |\n"); +} + + +static void +print_trace_header(struct seq_file *m, struct trace_iterator *iter) +{ + unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); + struct trace_array *tr = iter->tr; + struct trace_array_cpu *data = tr->data[tr->cpu]; + struct tracer *type = current_trace; + unsigned long total = 0; + unsigned long entries = 0; + int cpu; + const char *name = "preemption"; + + if (type) + name = type->name; + + for_each_tracing_cpu(cpu) { + if (head_page(tr->data[cpu])) { + total += tr->data[cpu]->trace_idx; + if (tr->data[cpu]->trace_idx > tr->entries) + entries += tr->entries; + else + entries += tr->data[cpu]->trace_idx; + } + } + + seq_printf(m, "%s latency trace v1.1.5 on %s\n", + name, UTS_RELEASE); + seq_puts(m, "-----------------------------------" + "---------------------------------\n"); + seq_printf(m, " latency: %lu us, #%lu/%lu, CPU#%d |" + " (M:%s VP:%d, KP:%d, SP:%d HP:%d", + nsecs_to_usecs(data->saved_latency), + entries, + total, + tr->cpu, +#if defined(CONFIG_PREEMPT_NONE) + "server", +#elif defined(CONFIG_PREEMPT_VOLUNTARY) + "desktop", +#elif defined(CONFIG_PREEMPT_DESKTOP) + "preempt", +#else + "unknown", +#endif + /* These are reserved for later use */ + 0, 0, 0, 0); +#ifdef CONFIG_SMP + seq_printf(m, " #P:%d)\n", num_online_cpus()); +#else + seq_puts(m, ")\n"); +#endif + seq_puts(m, " -----------------\n"); + seq_printf(m, " | task: %.16s-%d " + "(uid:%d nice:%ld policy:%ld rt_prio:%ld)\n", + data->comm, data->pid, data->uid, data->nice, + data->policy, data->rt_priority); + seq_puts(m, " -----------------\n"); + + if (data->critical_start) { + seq_puts(m, " => started at: "); + seq_print_ip_sym(&iter->seq, data->critical_start, sym_flags); + trace_print_seq(m, &iter->seq); + seq_puts(m, "\n => ended at: "); + seq_print_ip_sym(&iter->seq, data->critical_end, sym_flags); + trace_print_seq(m, &iter->seq); + seq_puts(m, "\n"); + } + + seq_puts(m, "\n"); +} + +static void +lat_print_generic(struct trace_seq *s, struct trace_entry *entry, int cpu) +{ + int hardirq, softirq; + char *comm; + + comm = trace_find_cmdline(entry->pid); + + trace_seq_printf(s, "%8.8s-%-5d ", comm, entry->pid); + trace_seq_printf(s, "%d", cpu); + trace_seq_printf(s, "%c%c", + (entry->flags & TRACE_FLAG_IRQS_OFF) ? 
'd' : '.', + ((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.')); + + hardirq = entry->flags & TRACE_FLAG_HARDIRQ; + softirq = entry->flags & TRACE_FLAG_SOFTIRQ; + if (hardirq && softirq) { + trace_seq_putc(s, 'H'); + } else { + if (hardirq) { + trace_seq_putc(s, 'h'); + } else { + if (softirq) + trace_seq_putc(s, 's'); + else + trace_seq_putc(s, '.'); + } + } + + if (entry->preempt_count) + trace_seq_printf(s, "%x", entry->preempt_count); + else + trace_seq_puts(s, "."); +} + +unsigned long preempt_mark_thresh = 100; + +static void +lat_print_timestamp(struct trace_seq *s, unsigned long long abs_usecs, + unsigned long rel_usecs) +{ + trace_seq_printf(s, " %4lldus", abs_usecs); + if (rel_usecs > preempt_mark_thresh) + trace_seq_puts(s, "!: "); + else if (rel_usecs > 1) + trace_seq_puts(s, "+: "); + else + trace_seq_puts(s, " : "); +} + +static const char state_to_char[] = TASK_STATE_TO_CHAR_STR; + +static int +print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu) +{ + struct trace_seq *s = &iter->seq; + unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); + struct trace_entry *next_entry = find_next_entry(iter, NULL); + unsigned long verbose = (trace_flags & TRACE_ITER_VERBOSE); + struct trace_entry *entry = iter->ent; + unsigned long abs_usecs; + unsigned long rel_usecs; + char *comm; + int S, T; + int i; + unsigned state; + + if (!next_entry) + next_entry = entry; + rel_usecs = ns2usecs(next_entry->t - entry->t); + abs_usecs = ns2usecs(entry->t - iter->tr->time_start); + + if (verbose) { + comm = trace_find_cmdline(entry->pid); + trace_seq_printf(s, "%16s %5d %d %d %08x %08x [%08lx]" + " %ld.%03ldms (+%ld.%03ldms): ", + comm, + entry->pid, cpu, entry->flags, + entry->preempt_count, trace_idx, + ns2usecs(entry->t), + abs_usecs/1000, + abs_usecs % 1000, rel_usecs/1000, + rel_usecs % 1000); + } else { + lat_print_generic(s, entry, cpu); + lat_print_timestamp(s, abs_usecs, rel_usecs); + } + switch (entry->type) { + case TRACE_FN: + seq_print_ip_sym(s, entry->fn.ip, sym_flags); + trace_seq_puts(s, " ("); + seq_print_ip_sym(s, entry->fn.parent_ip, sym_flags); + trace_seq_puts(s, ")\n"); + break; + case TRACE_CTX: + case TRACE_WAKE: + T = entry->ctx.next_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.next_state] : 'X'; + + state = entry->ctx.prev_state ? __ffs(entry->ctx.prev_state) + 1 : 0; + S = state < sizeof(state_to_char) - 1 ? state_to_char[state] : 'X'; + comm = trace_find_cmdline(entry->ctx.next_pid); + trace_seq_printf(s, " %5d:%3d:%c %s %5d:%3d:%c %s\n", + entry->ctx.prev_pid, + entry->ctx.prev_prio, + S, entry->type == TRACE_CTX ? 
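/*
 * Sketch of the flag columns built by lat_print_generic() above: one
 * character each for irqs-off, need-resched, interrupt context and preempt
 * depth.  Flag values are local to this example.
 */
#include <stdio.h>

#define F_IRQS_OFF      0x01
#define F_NEED_RESCHED  0x02
#define F_HARDIRQ       0x04
#define F_SOFTIRQ       0x08

static void print_lat_flags(unsigned int flags, unsigned int preempt_count)
{
	int hardirq = flags & F_HARDIRQ;
	int softirq = flags & F_SOFTIRQ;

	putchar((flags & F_IRQS_OFF) ? 'd' : '.');
	putchar((flags & F_NEED_RESCHED) ? 'N' : '.');

	if (hardirq && softirq)
		putchar('H');           /* hardirq interrupted a softirq */
	else if (hardirq)
		putchar('h');
	else if (softirq)
		putchar('s');
	else
		putchar('.');

	if (preempt_count)
		printf("%x", preempt_count);
	else
		putchar('.');
	putchar('\n');
}

int main(void)
{
	print_lat_flags(F_IRQS_OFF | F_HARDIRQ, 1);   /* "d.h1" */
	print_lat_flags(0, 0);                        /* "...." */
	return 0;
}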
"==>" : " +", + entry->ctx.next_pid, + entry->ctx.next_prio, + T, comm); + break; + case TRACE_SPECIAL: + trace_seq_printf(s, "# %ld %ld %ld\n", + entry->special.arg1, + entry->special.arg2, + entry->special.arg3); + break; + case TRACE_STACK: + for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { + if (i) + trace_seq_puts(s, " <= "); + seq_print_ip_sym(s, entry->stack.caller[i], sym_flags); + } + trace_seq_puts(s, "\n"); + break; + default: + trace_seq_printf(s, "Unknown type %d\n", entry->type); + } + return 1; +} + +static int print_trace_fmt(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK); + struct trace_entry *entry; + unsigned long usec_rem; + unsigned long long t; + unsigned long secs; + char *comm; + int ret; + int S, T; + int i; + + entry = iter->ent; + + comm = trace_find_cmdline(iter->ent->pid); + + t = ns2usecs(entry->t); + usec_rem = do_div(t, 1000000ULL); + secs = (unsigned long)t; + + ret = trace_seq_printf(s, "%16s-%-5d ", comm, entry->pid); + if (!ret) + return 0; + ret = trace_seq_printf(s, "[%02d] ", iter->cpu); + if (!ret) + return 0; + ret = trace_seq_printf(s, "%5lu.%06lu: ", secs, usec_rem); + if (!ret) + return 0; + + switch (entry->type) { + case TRACE_FN: + ret = seq_print_ip_sym(s, entry->fn.ip, sym_flags); + if (!ret) + return 0; + if ((sym_flags & TRACE_ITER_PRINT_PARENT) && + entry->fn.parent_ip) { + ret = trace_seq_printf(s, " <-"); + if (!ret) + return 0; + ret = seq_print_ip_sym(s, entry->fn.parent_ip, + sym_flags); + if (!ret) + return 0; + } + ret = trace_seq_printf(s, "\n"); + if (!ret) + return 0; + break; + case TRACE_CTX: + case TRACE_WAKE: + S = entry->ctx.prev_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.prev_state] : 'X'; + T = entry->ctx.next_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.next_state] : 'X'; + ret = trace_seq_printf(s, " %5d:%3d:%c %s %5d:%3d:%c\n", + entry->ctx.prev_pid, + entry->ctx.prev_prio, + S, + entry->type == TRACE_CTX ? "==>" : " +", + entry->ctx.next_pid, + entry->ctx.next_prio, + T); + if (!ret) + return 0; + break; + case TRACE_SPECIAL: + ret = trace_seq_printf(s, "# %ld %ld %ld\n", + entry->special.arg1, + entry->special.arg2, + entry->special.arg3); + if (!ret) + return 0; + break; + case TRACE_STACK: + for (i = 0; i < FTRACE_STACK_ENTRIES; i++) { + if (i) { + ret = trace_seq_puts(s, " <= "); + if (!ret) + return 0; + } + ret = seq_print_ip_sym(s, entry->stack.caller[i], + sym_flags); + if (!ret) + return 0; + } + ret = trace_seq_puts(s, "\n"); + if (!ret) + return 0; + break; + } + return 1; +} + +static int print_raw_fmt(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + struct trace_entry *entry; + int ret; + int S, T; + + entry = iter->ent; + + ret = trace_seq_printf(s, "%d %d %llu ", + entry->pid, iter->cpu, entry->t); + if (!ret) + return 0; + + switch (entry->type) { + case TRACE_FN: + ret = trace_seq_printf(s, "%x %x\n", + entry->fn.ip, entry->fn.parent_ip); + if (!ret) + return 0; + break; + case TRACE_CTX: + case TRACE_WAKE: + S = entry->ctx.prev_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.prev_state] : 'X'; + T = entry->ctx.next_state < sizeof(state_to_char) ? 
+ state_to_char[entry->ctx.next_state] : 'X'; + if (entry->type == TRACE_WAKE) + S = '+'; + ret = trace_seq_printf(s, "%d %d %c %d %d %c\n", + entry->ctx.prev_pid, + entry->ctx.prev_prio, + S, + entry->ctx.next_pid, + entry->ctx.next_prio, + T); + if (!ret) + return 0; + break; + case TRACE_SPECIAL: + case TRACE_STACK: + ret = trace_seq_printf(s, "# %ld %ld %ld\n", + entry->special.arg1, + entry->special.arg2, + entry->special.arg3); + if (!ret) + return 0; + break; + } + return 1; +} + +#define SEQ_PUT_FIELD_RET(s, x) \ +do { \ + if (!trace_seq_putmem(s, &(x), sizeof(x))) \ + return 0; \ +} while (0) + +#define SEQ_PUT_HEX_FIELD_RET(s, x) \ +do { \ + if (!trace_seq_putmem_hex(s, &(x), sizeof(x))) \ + return 0; \ +} while (0) + +static int print_hex_fmt(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + unsigned char newline = '\n'; + struct trace_entry *entry; + int S, T; + + entry = iter->ent; + + SEQ_PUT_HEX_FIELD_RET(s, entry->pid); + SEQ_PUT_HEX_FIELD_RET(s, iter->cpu); + SEQ_PUT_HEX_FIELD_RET(s, entry->t); + + switch (entry->type) { + case TRACE_FN: + SEQ_PUT_HEX_FIELD_RET(s, entry->fn.ip); + SEQ_PUT_HEX_FIELD_RET(s, entry->fn.parent_ip); + break; + case TRACE_CTX: + case TRACE_WAKE: + S = entry->ctx.prev_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.prev_state] : 'X'; + T = entry->ctx.next_state < sizeof(state_to_char) ? + state_to_char[entry->ctx.next_state] : 'X'; + if (entry->type == TRACE_WAKE) + S = '+'; + SEQ_PUT_HEX_FIELD_RET(s, entry->ctx.prev_pid); + SEQ_PUT_HEX_FIELD_RET(s, entry->ctx.prev_prio); + SEQ_PUT_HEX_FIELD_RET(s, S); + SEQ_PUT_HEX_FIELD_RET(s, entry->ctx.next_pid); + SEQ_PUT_HEX_FIELD_RET(s, entry->ctx.next_prio); + SEQ_PUT_HEX_FIELD_RET(s, entry->fn.parent_ip); + SEQ_PUT_HEX_FIELD_RET(s, T); + break; + case TRACE_SPECIAL: + case TRACE_STACK: + SEQ_PUT_HEX_FIELD_RET(s, entry->special.arg1); + SEQ_PUT_HEX_FIELD_RET(s, entry->special.arg2); + SEQ_PUT_HEX_FIELD_RET(s, entry->special.arg3); + break; + } + SEQ_PUT_FIELD_RET(s, newline); + + return 1; +} + +static int print_bin_fmt(struct trace_iterator *iter) +{ + struct trace_seq *s = &iter->seq; + struct trace_entry *entry; + + entry = iter->ent; + + SEQ_PUT_FIELD_RET(s, entry->pid); + SEQ_PUT_FIELD_RET(s, entry->cpu); + SEQ_PUT_FIELD_RET(s, entry->t); + + switch (entry->type) { + case TRACE_FN: + SEQ_PUT_FIELD_RET(s, entry->fn.ip); + SEQ_PUT_FIELD_RET(s, entry->fn.parent_ip); + break; + case TRACE_CTX: + SEQ_PUT_FIELD_RET(s, entry->ctx.prev_pid); + SEQ_PUT_FIELD_RET(s, entry->ctx.prev_prio); + SEQ_PUT_FIELD_RET(s, entry->ctx.prev_state); + SEQ_PUT_FIELD_RET(s, entry->ctx.next_pid); + SEQ_PUT_FIELD_RET(s, entry->ctx.next_prio); + SEQ_PUT_FIELD_RET(s, entry->ctx.next_state); + break; + case TRACE_SPECIAL: + case TRACE_STACK: + SEQ_PUT_FIELD_RET(s, entry->special.arg1); + SEQ_PUT_FIELD_RET(s, entry->special.arg2); + SEQ_PUT_FIELD_RET(s, entry->special.arg3); + break; + } + return 1; +} + +static int trace_empty(struct trace_iterator *iter) +{ + struct trace_array_cpu *data; + int cpu; + + for_each_tracing_cpu(cpu) { + data = iter->tr->data[cpu]; + + if (head_page(data) && data->trace_idx && + (data->trace_tail != data->trace_head || + data->trace_tail_idx != data->trace_head_idx)) + return 0; + } + return 1; +} + +static int print_trace_line(struct trace_iterator *iter) +{ + if (iter->trace && iter->trace->print_line) + return iter->trace->print_line(iter); + + if (trace_flags & TRACE_ITER_BIN) + return print_bin_fmt(iter); + + if (trace_flags & TRACE_ITER_HEX) + return 
print_hex_fmt(iter); + + if (trace_flags & TRACE_ITER_RAW) + return print_raw_fmt(iter); + + if (iter->iter_flags & TRACE_FILE_LAT_FMT) + return print_lat_fmt(iter, iter->idx, iter->cpu); + + return print_trace_fmt(iter); +} + +static int s_show(struct seq_file *m, void *v) +{ + struct trace_iterator *iter = v; + + if (iter->ent == NULL) { + if (iter->tr) { + seq_printf(m, "# tracer: %s\n", iter->trace->name); + seq_puts(m, "#\n"); + } + if (iter->iter_flags & TRACE_FILE_LAT_FMT) { + /* print nothing if the buffers are empty */ + if (trace_empty(iter)) + return 0; + print_trace_header(m, iter); + if (!(trace_flags & TRACE_ITER_VERBOSE)) + print_lat_help_header(m); + } else { + if (!(trace_flags & TRACE_ITER_VERBOSE)) + print_func_help_header(m); + } + } else { + print_trace_line(iter); + trace_print_seq(m, &iter->seq); + } + + return 0; +} + +static struct seq_operations tracer_seq_ops = { + .start = s_start, + .next = s_next, + .stop = s_stop, + .show = s_show, +}; + +static struct trace_iterator * +__tracing_open(struct inode *inode, struct file *file, int *ret) +{ + struct trace_iterator *iter; + + if (tracing_disabled) { + *ret = -ENODEV; + return NULL; + } + + iter = kzalloc(sizeof(*iter), GFP_KERNEL); + if (!iter) { + *ret = -ENOMEM; + goto out; + } + + mutex_lock(&trace_types_lock); + if (current_trace && current_trace->print_max) + iter->tr = &max_tr; + else + iter->tr = inode->i_private; + iter->trace = current_trace; + iter->pos = -1; + + /* TODO stop tracer */ + *ret = seq_open(file, &tracer_seq_ops); + if (!*ret) { + struct seq_file *m = file->private_data; + m->private = iter; + + /* stop the trace while dumping */ + if (iter->tr->ctrl) + tracer_enabled = 0; + + if (iter->trace && iter->trace->open) + iter->trace->open(iter); + } else { + kfree(iter); + iter = NULL; + } + mutex_unlock(&trace_types_lock); + + out: + return iter; +} + +int tracing_open_generic(struct inode *inode, struct file *filp) +{ + if (tracing_disabled) + return -ENODEV; + + filp->private_data = inode->i_private; + return 0; +} + +int tracing_release(struct inode *inode, struct file *file) +{ + struct seq_file *m = (struct seq_file *)file->private_data; + struct trace_iterator *iter = m->private; + + mutex_lock(&trace_types_lock); + if (iter->trace && iter->trace->close) + iter->trace->close(iter); + + /* reenable tracing if it was previously enabled */ + if (iter->tr->ctrl) + tracer_enabled = 1; + mutex_unlock(&trace_types_lock); + + seq_release(inode, file); + kfree(iter); + return 0; +} + +static int tracing_open(struct inode *inode, struct file *file) +{ + int ret; + + __tracing_open(inode, file, &ret); + + return ret; +} + +static int tracing_lt_open(struct inode *inode, struct file *file) +{ + struct trace_iterator *iter; + int ret; + + iter = __tracing_open(inode, file, &ret); + + if (!ret) + iter->iter_flags |= TRACE_FILE_LAT_FMT; + + return ret; +} + + +static void * +t_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct tracer *t = m->private; + + (*pos)++; + + if (t) + t = t->next; + + m->private = t; + + return t; +} + +static void *t_start(struct seq_file *m, loff_t *pos) +{ + struct tracer *t = m->private; + loff_t l = 0; + + mutex_lock(&trace_types_lock); + for (; t && l < *pos; t = t_next(m, t, &l)) + ; + + return t; +} + +static void t_stop(struct seq_file *m, void *p) +{ + mutex_unlock(&trace_types_lock); +} + +static int t_show(struct seq_file *m, void *v) +{ + struct tracer *t = v; + + if (!t) + return 0; + + seq_printf(m, "%s", t->name); + if (t->next) + seq_putc(m, ' '); + else 
+ seq_putc(m, '\n'); + + return 0; +} + +static struct seq_operations show_traces_seq_ops = { + .start = t_start, + .next = t_next, + .stop = t_stop, + .show = t_show, +}; + +static int show_traces_open(struct inode *inode, struct file *file) +{ + int ret; + + if (tracing_disabled) + return -ENODEV; + + ret = seq_open(file, &show_traces_seq_ops); + if (!ret) { + struct seq_file *m = file->private_data; + m->private = trace_types; + } + + return ret; +} + +static struct file_operations tracing_fops = { + .open = tracing_open, + .read = seq_read, + .llseek = seq_lseek, + .release = tracing_release, +}; + +static struct file_operations tracing_lt_fops = { + .open = tracing_lt_open, + .read = seq_read, + .llseek = seq_lseek, + .release = tracing_release, +}; + +static struct file_operations show_traces_fops = { + .open = show_traces_open, + .read = seq_read, + .release = seq_release, +}; + +/* + * Only trace on a CPU if the bitmask is set: + */ +static cpumask_t tracing_cpumask = CPU_MASK_ALL; + +/* + * When tracing/tracing_cpu_mask is modified then this holds + * the new bitmask we are about to install: + */ +static cpumask_t tracing_cpumask_new; + +/* + * The tracer itself will not take this lock, but still we want + * to provide a consistent cpumask to user-space: + */ +static DEFINE_MUTEX(tracing_cpumask_update_lock); + +/* + * Temporary storage for the character representation of the + * CPU bitmask (and one more byte for the newline): + */ +static char mask_str[NR_CPUS + 1]; + +static ssize_t +tracing_cpumask_read(struct file *filp, char __user *ubuf, + size_t count, loff_t *ppos) +{ + int len; + + mutex_lock(&tracing_cpumask_update_lock); + + len = cpumask_scnprintf(mask_str, count, tracing_cpumask); + if (count - len < 2) { + count = -EINVAL; + goto out_err; + } + len += sprintf(mask_str + len, "\n"); + count = simple_read_from_buffer(ubuf, count, ppos, mask_str, NR_CPUS+1); + +out_err: + mutex_unlock(&tracing_cpumask_update_lock); + + return count; +} + +static ssize_t +tracing_cpumask_write(struct file *filp, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + int err, cpu; + + mutex_lock(&tracing_cpumask_update_lock); + err = cpumask_parse_user(ubuf, count, tracing_cpumask_new); + if (err) + goto err_unlock; + + raw_local_irq_disable(); + __raw_spin_lock(&ftrace_max_lock); + for_each_tracing_cpu(cpu) { + /* + * Increase/decrease the disabled counter if we are + * about to flip a bit in the cpumask: + */ + if (cpu_isset(cpu, tracing_cpumask) && + !cpu_isset(cpu, tracing_cpumask_new)) { + atomic_inc(&global_trace.data[cpu]->disabled); + } + if (!cpu_isset(cpu, tracing_cpumask) && + cpu_isset(cpu, tracing_cpumask_new)) { + atomic_dec(&global_trace.data[cpu]->disabled); + } + } + __raw_spin_unlock(&ftrace_max_lock); + raw_local_irq_enable(); + + tracing_cpumask = tracing_cpumask_new; + + mutex_unlock(&tracing_cpumask_update_lock); + + return count; + +err_unlock: + mutex_unlock(&tracing_cpumask_update_lock); + + return err; +} + +static struct file_operations tracing_cpumask_fops = { + .open = tracing_open_generic, + .read = tracing_cpumask_read, + .write = tracing_cpumask_write, +}; + +static ssize_t +tracing_iter_ctrl_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char *buf; + int r = 0; + int len = 0; + int i; + + /* calulate max size */ + for (i = 0; trace_options[i]; i++) { + len += strlen(trace_options[i]); + len += 3; /* "no" and space */ + } + + /* +2 for \n and \0 */ + buf = kmalloc(len + 2, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + for 
(i = 0; trace_options[i]; i++) { + if (trace_flags & (1 << i)) + r += sprintf(buf + r, "%s ", trace_options[i]); + else + r += sprintf(buf + r, "no%s ", trace_options[i]); + } + + r += sprintf(buf + r, "\n"); + WARN_ON(r >= len + 2); + + r = simple_read_from_buffer(ubuf, cnt, ppos, buf, r); + + kfree(buf); + + return r; +} + +static ssize_t +tracing_iter_ctrl_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[64]; + char *cmp = buf; + int neg = 0; + int i; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + if (strncmp(buf, "no", 2) == 0) { + neg = 1; + cmp += 2; + } + + for (i = 0; trace_options[i]; i++) { + int len = strlen(trace_options[i]); + + if (strncmp(cmp, trace_options[i], len) == 0) { + if (neg) + trace_flags &= ~(1 << i); + else + trace_flags |= (1 << i); + break; + } + } + /* + * If no option could be set, return an error: + */ + if (!trace_options[i]) + return -EINVAL; + + filp->f_pos += cnt; + + return cnt; +} + +static struct file_operations tracing_iter_fops = { + .open = tracing_open_generic, + .read = tracing_iter_ctrl_read, + .write = tracing_iter_ctrl_write, +}; + +static const char readme_msg[] = + "tracing mini-HOWTO:\n\n" + "# mkdir /debug\n" + "# mount -t debugfs nodev /debug\n\n" + "# cat /debug/tracing/available_tracers\n" + "wakeup preemptirqsoff preemptoff irqsoff ftrace sched_switch none\n\n" + "# cat /debug/tracing/current_tracer\n" + "none\n" + "# echo sched_switch > /debug/tracing/current_tracer\n" + "# cat /debug/tracing/current_tracer\n" + "sched_switch\n" + "# cat /debug/tracing/iter_ctrl\n" + "noprint-parent nosym-offset nosym-addr noverbose\n" + "# echo print-parent > /debug/tracing/iter_ctrl\n" + "# echo 1 > /debug/tracing/tracing_enabled\n" + "# cat /debug/tracing/trace > /tmp/trace.txt\n" + "echo 0 > /debug/tracing/tracing_enabled\n" +; + +static ssize_t +tracing_readme_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return simple_read_from_buffer(ubuf, cnt, ppos, + readme_msg, strlen(readme_msg)); +} + +static struct file_operations tracing_readme_fops = { + .open = tracing_open_generic, + .read = tracing_readme_read, +}; + +static ssize_t +tracing_ctrl_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + char buf[64]; + int r; + + r = sprintf(buf, "%ld\n", tr->ctrl); + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +tracing_ctrl_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + char buf[64]; + long val; + int ret; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + ret = strict_strtoul(buf, 10, &val); + if (ret < 0) + return ret; + + val = !!val; + + mutex_lock(&trace_types_lock); + if (tr->ctrl ^ val) { + if (val) + tracer_enabled = 1; + else + tracer_enabled = 0; + + tr->ctrl = val; + + if (current_trace && current_trace->ctrl_update) + current_trace->ctrl_update(tr); + } + mutex_unlock(&trace_types_lock); + + filp->f_pos += cnt; + + return cnt; +} + +static ssize_t +tracing_set_trace_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[max_tracer_type_len+2]; + int r; + + mutex_lock(&trace_types_lock); + if (current_trace) + r = sprintf(buf, "%s\n", current_trace->name); + else + r = sprintf(buf, "\n"); + 
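/*
 * Editor's note (not part of the original patch): the iter_ctrl write
 * handler above takes an option name, optionally prefixed with "no",
 * and sets or clears the matching bit in trace_flags.  A minimal
 * user-space sketch of that parsing is shown below; demo_options,
 * demo_flags and demo_set_option are made-up names used only for
 * illustration.
 */
#if 0	/* illustrative only, never compiled */
#include <stdio.h>
#include <string.h>

static const char *demo_options[] = { "print-parent", "sym-offset", "verbose", NULL };
static unsigned long demo_flags;

static int demo_set_option(const char *cmp)
{
	int neg = 0, i;

	if (strncmp(cmp, "no", 2) == 0) {	/* "no<option>" clears the bit */
		neg = 1;
		cmp += 2;
	}
	for (i = 0; demo_options[i]; i++) {
		if (strcmp(cmp, demo_options[i]) == 0) {
			if (neg)
				demo_flags &= ~(1UL << i);
			else
				demo_flags |= (1UL << i);
			return 0;
		}
	}
	return -1;				/* unknown option */
}

int main(void)
{
	demo_set_option("verbose");		/* sets bit 2 */
	demo_set_option("noverbose");		/* clears it again */
	printf("flags: %#lx\n", demo_flags);
	return 0;
}
#endif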
mutex_unlock(&trace_types_lock); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +tracing_set_trace_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = &global_trace; + struct tracer *t; + char buf[max_tracer_type_len+1]; + int i; + + if (cnt > max_tracer_type_len) + cnt = max_tracer_type_len; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + /* strip ending whitespace. */ + for (i = cnt - 1; i > 0 && isspace(buf[i]); i--) + buf[i] = 0; + + mutex_lock(&trace_types_lock); + for (t = trace_types; t; t = t->next) { + if (strcmp(t->name, buf) == 0) + break; + } + if (!t || t == current_trace) + goto out; + + if (current_trace && current_trace->reset) + current_trace->reset(tr); + + current_trace = t; + if (t->init) + t->init(tr); + + out: + mutex_unlock(&trace_types_lock); + + filp->f_pos += cnt; + + return cnt; +} + +static ssize_t +tracing_max_lat_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + unsigned long *ptr = filp->private_data; + char buf[64]; + int r; + + r = snprintf(buf, sizeof(buf), "%ld\n", + *ptr == (unsigned long)-1 ? -1 : nsecs_to_usecs(*ptr)); + if (r > sizeof(buf)) + r = sizeof(buf); + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +tracing_max_lat_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + long *ptr = filp->private_data; + char buf[64]; + long val; + int ret; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + ret = strict_strtoul(buf, 10, &val); + if (ret < 0) + return ret; + + *ptr = val * 1000; + + return cnt; +} + +static atomic_t tracing_reader; + +static int tracing_open_pipe(struct inode *inode, struct file *filp) +{ + struct trace_iterator *iter; + + if (tracing_disabled) + return -ENODEV; + + /* We only allow for reader of the pipe */ + if (atomic_inc_return(&tracing_reader) != 1) { + atomic_dec(&tracing_reader); + return -EBUSY; + } + + /* create a buffer to store the information to pass to userspace */ + iter = kzalloc(sizeof(*iter), GFP_KERNEL); + if (!iter) + return -ENOMEM; + + mutex_lock(&trace_types_lock); + iter->tr = &global_trace; + iter->trace = current_trace; + filp->private_data = iter; + + if (iter->trace->pipe_open) + iter->trace->pipe_open(iter); + mutex_unlock(&trace_types_lock); + + return 0; +} + +static int tracing_release_pipe(struct inode *inode, struct file *file) +{ + struct trace_iterator *iter = file->private_data; + + kfree(iter); + atomic_dec(&tracing_reader); + + return 0; +} + +static unsigned int +tracing_poll_pipe(struct file *filp, poll_table *poll_table) +{ + struct trace_iterator *iter = filp->private_data; + + if (trace_flags & TRACE_ITER_BLOCK) { + /* + * Always select as readable when in blocking mode + */ + return POLLIN | POLLRDNORM; + } else { + if (!trace_empty(iter)) + return POLLIN | POLLRDNORM; + poll_wait(filp, &trace_wait, poll_table); + if (!trace_empty(iter)) + return POLLIN | POLLRDNORM; + + return 0; + } +} + +/* + * Consumer reader. 
+ */ +static ssize_t +tracing_read_pipe(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_iterator *iter = filp->private_data; + struct trace_array_cpu *data; + static cpumask_t mask; + unsigned long flags; +#ifdef CONFIG_FTRACE + int ftrace_save; +#endif + int cpu; + ssize_t sret; + + /* return any leftover data */ + sret = trace_seq_to_user(&iter->seq, ubuf, cnt); + if (sret != -EBUSY) + return sret; + sret = 0; + + trace_seq_reset(&iter->seq); + + mutex_lock(&trace_types_lock); + if (iter->trace->read) { + sret = iter->trace->read(iter, filp, ubuf, cnt, ppos); + if (sret) + goto out; + } + + while (trace_empty(iter)) { + + if ((filp->f_flags & O_NONBLOCK)) { + sret = -EAGAIN; + goto out; + } + + /* + * This is a make-shift waitqueue. The reason we don't use + * an actual wait queue is because: + * 1) we only ever have one waiter + * 2) the tracing, traces all functions, we don't want + * the overhead of calling wake_up and friends + * (and tracing them too) + * Anyway, this is really very primitive wakeup. + */ + set_current_state(TASK_INTERRUPTIBLE); + iter->tr->waiter = current; + + mutex_unlock(&trace_types_lock); + + /* sleep for 100 msecs, and try again. */ + schedule_timeout(HZ/10); + + mutex_lock(&trace_types_lock); + + iter->tr->waiter = NULL; + + if (signal_pending(current)) { + sret = -EINTR; + goto out; + } + + if (iter->trace != current_trace) + goto out; + + /* + * We block until we read something and tracing is disabled. + * We still block if tracing is disabled, but we have never + * read anything. This allows a user to cat this file, and + * then enable tracing. But after we have read something, + * we give an EOF when tracing is again disabled. + * + * iter->pos will be 0 if we haven't read anything. + */ + if (!tracer_enabled && iter->pos) + break; + + continue; + } + + /* stop when tracing is finished */ + if (trace_empty(iter)) + goto out; + + if (cnt >= PAGE_SIZE) + cnt = PAGE_SIZE - 1; + + /* reset all but tr, trace, and overruns */ + memset(&iter->seq, 0, + sizeof(struct trace_iterator) - + offsetof(struct trace_iterator, seq)); + iter->pos = -1; + + /* + * We need to stop all tracing on all CPUS to read the + * the next buffer. This is a bit expensive, but is + * not done often. We fill all what we can read, + * and then release the locks again. 
+ */ + + cpus_clear(mask); + local_irq_save(flags); +#ifdef CONFIG_FTRACE + ftrace_save = ftrace_enabled; + ftrace_enabled = 0; +#endif + smp_wmb(); + for_each_tracing_cpu(cpu) { + data = iter->tr->data[cpu]; + + if (!head_page(data) || !data->trace_idx) + continue; + + atomic_inc(&data->disabled); + cpu_set(cpu, mask); + } + + for_each_cpu_mask_nr(cpu, mask) { + data = iter->tr->data[cpu]; + __raw_spin_lock(&data->lock); + + if (data->overrun > iter->last_overrun[cpu]) + iter->overrun[cpu] += + data->overrun - iter->last_overrun[cpu]; + iter->last_overrun[cpu] = data->overrun; + } + + while (find_next_entry_inc(iter) != NULL) { + int ret; + int len = iter->seq.len; + + ret = print_trace_line(iter); + if (!ret) { + /* don't print partial lines */ + iter->seq.len = len; + break; + } + + trace_consume(iter); + + if (iter->seq.len >= cnt) + break; + } + + for_each_cpu_mask_nr(cpu, mask) { + data = iter->tr->data[cpu]; + __raw_spin_unlock(&data->lock); + } + + for_each_cpu_mask_nr(cpu, mask) { + data = iter->tr->data[cpu]; + atomic_dec(&data->disabled); + } +#ifdef CONFIG_FTRACE + ftrace_enabled = ftrace_save; +#endif + local_irq_restore(flags); + + /* Now copy what we have to the user */ + sret = trace_seq_to_user(&iter->seq, ubuf, cnt); + if (iter->seq.readpos >= iter->seq.len) + trace_seq_reset(&iter->seq); + if (sret == -EBUSY) + sret = 0; + +out: + mutex_unlock(&trace_types_lock); + + return sret; +} + +static ssize_t +tracing_entries_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + char buf[64]; + int r; + + r = sprintf(buf, "%lu\n", tr->entries); + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +tracing_entries_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + unsigned long val; + char buf[64]; + int i, ret; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + buf[cnt] = 0; + + ret = strict_strtoul(buf, 10, &val); + if (ret < 0) + return ret; + + /* must have at least 1 entry */ + if (!val) + return -EINVAL; + + mutex_lock(&trace_types_lock); + + if (current_trace != &no_tracer) { + cnt = -EBUSY; + pr_info("ftrace: set current_tracer to none" + " before modifying buffer size\n"); + goto out; + } + + if (val > global_trace.entries) { + long pages_requested; + unsigned long freeable_pages; + + /* make sure we have enough memory before mapping */ + pages_requested = + (val + (ENTRIES_PER_PAGE-1)) / ENTRIES_PER_PAGE; + + /* account for each buffer (and max_tr) */ + pages_requested *= tracing_nr_buffers * 2; + + /* Check for overflow */ + if (pages_requested < 0) { + cnt = -ENOMEM; + goto out; + } + + freeable_pages = determine_dirtyable_memory(); + + /* we only allow to request 1/4 of useable memory */ + if (pages_requested > + ((freeable_pages + tracing_pages_allocated) / 4)) { + cnt = -ENOMEM; + goto out; + } + + while (global_trace.entries < val) { + if (trace_alloc_page()) { + cnt = -ENOMEM; + goto out; + } + /* double check that we don't go over the known pages */ + if (tracing_pages_allocated > pages_requested) + break; + } + + } else { + /* include the number of entries in val (inc of page entries) */ + while (global_trace.entries > val + (ENTRIES_PER_PAGE - 1)) + trace_free_page(); + } + + /* check integrity */ + for_each_tracing_cpu(i) + check_pages(global_trace.data[i]); + + filp->f_pos += cnt; + + /* If check pages failed, return ENOMEM */ + if (tracing_disabled) + cnt = -ENOMEM; + out: + 
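/*
 * Editor's note (not part of the original patch): the trace_entries
 * write handler around this point grows the ring buffer a page at a
 * time.  The requested entry count is first converted to pages, one
 * set per CPU buffer plus a second set for max_tr, and the request is
 * refused if it would exceed a quarter of the freeable memory.  A
 * rough sketch of that arithmetic follows; the DEMO_ names and values
 * are made up, and the real code also accounts for pages that are
 * already allocated.
 */
#if 0	/* illustrative only, never compiled */
#define DEMO_ENTRIES_PER_PAGE	(4096 / 64)	/* roughly PAGE_SIZE / TRACE_ENTRY_SIZE */

static long demo_pages_needed(unsigned long entries, int nr_buffers,
			      unsigned long freeable_pages)
{
	long pages = (entries + DEMO_ENTRIES_PER_PAGE - 1) / DEMO_ENTRIES_PER_PAGE;

	pages *= nr_buffers * 2;	/* per-CPU buffer plus its max_tr copy */
	if (pages < 0 || pages > (long)(freeable_pages / 4))
		return -1;		/* refuse oversized requests */
	return pages;
}
#endif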
max_tr.entries = global_trace.entries; + mutex_unlock(&trace_types_lock); + + return cnt; +} + +static struct file_operations tracing_max_lat_fops = { + .open = tracing_open_generic, + .read = tracing_max_lat_read, + .write = tracing_max_lat_write, +}; + +static struct file_operations tracing_ctrl_fops = { + .open = tracing_open_generic, + .read = tracing_ctrl_read, + .write = tracing_ctrl_write, +}; + +static struct file_operations set_tracer_fops = { + .open = tracing_open_generic, + .read = tracing_set_trace_read, + .write = tracing_set_trace_write, +}; + +static struct file_operations tracing_pipe_fops = { + .open = tracing_open_pipe, + .poll = tracing_poll_pipe, + .read = tracing_read_pipe, + .release = tracing_release_pipe, +}; + +static struct file_operations tracing_entries_fops = { + .open = tracing_open_generic, + .read = tracing_entries_read, + .write = tracing_entries_write, +}; + +#ifdef CONFIG_DYNAMIC_FTRACE + +static ssize_t +tracing_read_long(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + unsigned long *p = filp->private_data; + char buf[64]; + int r; + + r = sprintf(buf, "%ld\n", *p); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static struct file_operations tracing_read_long_fops = { + .open = tracing_open_generic, + .read = tracing_read_long, +}; +#endif + +static struct dentry *d_tracer; + +struct dentry *tracing_init_dentry(void) +{ + static int once; + + if (d_tracer) + return d_tracer; + + d_tracer = debugfs_create_dir("tracing", NULL); + + if (!d_tracer && !once) { + once = 1; + pr_warning("Could not create debugfs directory 'tracing'\n"); + return NULL; + } + + return d_tracer; +} + +#ifdef CONFIG_FTRACE_SELFTEST +/* Let selftest have access to static functions in this file */ +#include "trace_selftest.c" +#endif + +static __init void tracer_init_debugfs(void) +{ + struct dentry *d_tracer; + struct dentry *entry; + + d_tracer = tracing_init_dentry(); + + entry = debugfs_create_file("tracing_enabled", 0644, d_tracer, + &global_trace, &tracing_ctrl_fops); + if (!entry) + pr_warning("Could not create debugfs 'tracing_enabled' entry\n"); + + entry = debugfs_create_file("iter_ctrl", 0644, d_tracer, + NULL, &tracing_iter_fops); + if (!entry) + pr_warning("Could not create debugfs 'iter_ctrl' entry\n"); + + entry = debugfs_create_file("tracing_cpumask", 0644, d_tracer, + NULL, &tracing_cpumask_fops); + if (!entry) + pr_warning("Could not create debugfs 'tracing_cpumask' entry\n"); + + entry = debugfs_create_file("latency_trace", 0444, d_tracer, + &global_trace, &tracing_lt_fops); + if (!entry) + pr_warning("Could not create debugfs 'latency_trace' entry\n"); + + entry = debugfs_create_file("trace", 0444, d_tracer, + &global_trace, &tracing_fops); + if (!entry) + pr_warning("Could not create debugfs 'trace' entry\n"); + + entry = debugfs_create_file("available_tracers", 0444, d_tracer, + &global_trace, &show_traces_fops); + if (!entry) + pr_warning("Could not create debugfs 'trace' entry\n"); + + entry = debugfs_create_file("current_tracer", 0444, d_tracer, + &global_trace, &set_tracer_fops); + if (!entry) + pr_warning("Could not create debugfs 'trace' entry\n"); + + entry = debugfs_create_file("tracing_max_latency", 0644, d_tracer, + &tracing_max_latency, + &tracing_max_lat_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'tracing_max_latency' entry\n"); + + entry = debugfs_create_file("tracing_thresh", 0644, d_tracer, + &tracing_thresh, &tracing_max_lat_fops); + if (!entry) + pr_warning("Could not create 
debugfs " + "'tracing_threash' entry\n"); + entry = debugfs_create_file("README", 0644, d_tracer, + NULL, &tracing_readme_fops); + if (!entry) + pr_warning("Could not create debugfs 'README' entry\n"); + + entry = debugfs_create_file("trace_pipe", 0644, d_tracer, + NULL, &tracing_pipe_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'tracing_threash' entry\n"); + + entry = debugfs_create_file("trace_entries", 0644, d_tracer, + &global_trace, &tracing_entries_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'tracing_threash' entry\n"); + +#ifdef CONFIG_DYNAMIC_FTRACE + entry = debugfs_create_file("dyn_ftrace_total_info", 0444, d_tracer, + &ftrace_update_tot_cnt, + &tracing_read_long_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'dyn_ftrace_total_info' entry\n"); +#endif +#ifdef CONFIG_SYSPROF_TRACER + init_tracer_sysprof_debugfs(d_tracer); +#endif +} + +static int trace_alloc_page(void) +{ + struct trace_array_cpu *data; + struct page *page, *tmp; + LIST_HEAD(pages); + void *array; + unsigned pages_allocated = 0; + int i; + + /* first allocate a page for each CPU */ + for_each_tracing_cpu(i) { + array = (void *)__get_free_page(GFP_KERNEL); + if (array == NULL) { + printk(KERN_ERR "tracer: failed to allocate page" + "for trace buffer!\n"); + goto free_pages; + } + + pages_allocated++; + page = virt_to_page(array); + list_add(&page->lru, &pages); + +/* Only allocate if we are actually using the max trace */ +#ifdef CONFIG_TRACER_MAX_TRACE + array = (void *)__get_free_page(GFP_KERNEL); + if (array == NULL) { + printk(KERN_ERR "tracer: failed to allocate page" + "for trace buffer!\n"); + goto free_pages; + } + pages_allocated++; + page = virt_to_page(array); + list_add(&page->lru, &pages); +#endif + } + + /* Now that we successfully allocate a page per CPU, add them */ + for_each_tracing_cpu(i) { + data = global_trace.data[i]; + page = list_entry(pages.next, struct page, lru); + list_del_init(&page->lru); + list_add_tail(&page->lru, &data->trace_pages); + ClearPageLRU(page); + +#ifdef CONFIG_TRACER_MAX_TRACE + data = max_tr.data[i]; + page = list_entry(pages.next, struct page, lru); + list_del_init(&page->lru); + list_add_tail(&page->lru, &data->trace_pages); + SetPageLRU(page); +#endif + } + tracing_pages_allocated += pages_allocated; + global_trace.entries += ENTRIES_PER_PAGE; + + return 0; + + free_pages: + list_for_each_entry_safe(page, tmp, &pages, lru) { + list_del_init(&page->lru); + __free_page(page); + } + return -ENOMEM; +} + +static int trace_free_page(void) +{ + struct trace_array_cpu *data; + struct page *page; + struct list_head *p; + int i; + int ret = 0; + + /* free one page from each buffer */ + for_each_tracing_cpu(i) { + data = global_trace.data[i]; + p = data->trace_pages.next; + if (p == &data->trace_pages) { + /* should never happen */ + WARN_ON(1); + tracing_disabled = 1; + ret = -1; + break; + } + page = list_entry(p, struct page, lru); + ClearPageLRU(page); + list_del(&page->lru); + tracing_pages_allocated--; + tracing_pages_allocated--; + __free_page(page); + + tracing_reset(data); + +#ifdef CONFIG_TRACER_MAX_TRACE + data = max_tr.data[i]; + p = data->trace_pages.next; + if (p == &data->trace_pages) { + /* should never happen */ + WARN_ON(1); + tracing_disabled = 1; + ret = -1; + break; + } + page = list_entry(p, struct page, lru); + ClearPageLRU(page); + list_del(&page->lru); + __free_page(page); + + tracing_reset(data); +#endif + } + global_trace.entries -= ENTRIES_PER_PAGE; + + return ret; +} + +__init static int 
tracer_alloc_buffers(void) +{ + struct trace_array_cpu *data; + void *array; + struct page *page; + int pages = 0; + int ret = -ENOMEM; + int i; + + /* TODO: make the number of buffers hot pluggable with CPUS */ + tracing_nr_buffers = num_possible_cpus(); + tracing_buffer_mask = cpu_possible_map; + + /* Allocate the first page for all buffers */ + for_each_tracing_cpu(i) { + data = global_trace.data[i] = &per_cpu(global_trace_cpu, i); + max_tr.data[i] = &per_cpu(max_data, i); + + array = (void *)__get_free_page(GFP_KERNEL); + if (array == NULL) { + printk(KERN_ERR "tracer: failed to allocate page" + "for trace buffer!\n"); + goto free_buffers; + } + + /* set the array to the list */ + INIT_LIST_HEAD(&data->trace_pages); + page = virt_to_page(array); + list_add(&page->lru, &data->trace_pages); + /* use the LRU flag to differentiate the two buffers */ + ClearPageLRU(page); + + data->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + max_tr.data[i]->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + +/* Only allocate if we are actually using the max trace */ +#ifdef CONFIG_TRACER_MAX_TRACE + array = (void *)__get_free_page(GFP_KERNEL); + if (array == NULL) { + printk(KERN_ERR "tracer: failed to allocate page" + "for trace buffer!\n"); + goto free_buffers; + } + + INIT_LIST_HEAD(&max_tr.data[i]->trace_pages); + page = virt_to_page(array); + list_add(&page->lru, &max_tr.data[i]->trace_pages); + SetPageLRU(page); +#endif + } + + /* + * Since we allocate by orders of pages, we may be able to + * round up a bit. + */ + global_trace.entries = ENTRIES_PER_PAGE; + pages++; + + while (global_trace.entries < trace_nr_entries) { + if (trace_alloc_page()) + break; + pages++; + } + max_tr.entries = global_trace.entries; + + pr_info("tracer: %d pages allocated for %ld", + pages, trace_nr_entries); + pr_info(" entries of %ld bytes\n", (long)TRACE_ENTRY_SIZE); + pr_info(" actual entries %ld\n", global_trace.entries); + + tracer_init_debugfs(); + + trace_init_cmdlines(); + + register_tracer(&no_tracer); + current_trace = &no_tracer; + + /* All seems OK, enable tracing */ + global_trace.ctrl = tracer_enabled; + tracing_disabled = 0; + + return 0; + + free_buffers: + for (i-- ; i >= 0; i--) { + struct page *page, *tmp; + struct trace_array_cpu *data = global_trace.data[i]; + + if (data) { + list_for_each_entry_safe(page, tmp, + &data->trace_pages, lru) { + list_del_init(&page->lru); + __free_page(page); + } + } + +#ifdef CONFIG_TRACER_MAX_TRACE + data = max_tr.data[i]; + if (data) { + list_for_each_entry_safe(page, tmp, + &data->trace_pages, lru) { + list_del_init(&page->lru); + __free_page(page); + } + } +#endif + } + return ret; +} +fs_initcall(tracer_alloc_buffers); Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace.h @@ -0,0 +1,375 @@ +#ifndef _LINUX_KERNEL_TRACE_H +#define _LINUX_KERNEL_TRACE_H + +#include <linux/fs.h> +#include <asm/atomic.h> +#include <linux/sched.h> +#include <linux/clocksource.h> +#include <linux/mmiotrace.h> + +enum trace_type { + __TRACE_FIRST_TYPE = 0, + + TRACE_FN, + TRACE_CTX, + TRACE_WAKE, + TRACE_STACK, + TRACE_SPECIAL, + TRACE_MMIO_RW, + TRACE_MMIO_MAP, + + __TRACE_LAST_TYPE +}; + +/* + * Function trace entry - function address and parent function addres: + */ +struct ftrace_entry { + unsigned long ip; + unsigned long parent_ip; +}; + +/* + * Context switch trace entry - which task (and prio) we switched from/to: + */ +struct ctx_switch_entry { + unsigned int 
prev_pid; + unsigned char prev_prio; + unsigned char prev_state; + unsigned int next_pid; + unsigned char next_prio; + unsigned char next_state; +}; + +/* + * Special (free-form) trace entry: + */ +struct special_entry { + unsigned long arg1; + unsigned long arg2; + unsigned long arg3; +}; + +/* + * Stack-trace entry: + */ + +#define FTRACE_STACK_ENTRIES 8 + +struct stack_entry { + unsigned long caller[FTRACE_STACK_ENTRIES]; +}; + +/* + * The trace entry - the most basic unit of tracing. This is what + * is printed in the end as a single line in the trace output, such as: + * + * bash-15816 [01] 235.197585: idle_cpu <- irq_enter + */ +struct trace_entry { + char type; + char cpu; + char flags; + char preempt_count; + int pid; + cycle_t t; + union { + struct ftrace_entry fn; + struct ctx_switch_entry ctx; + struct special_entry special; + struct stack_entry stack; + struct mmiotrace_rw mmiorw; + struct mmiotrace_map mmiomap; + }; +}; + +#define TRACE_ENTRY_SIZE sizeof(struct trace_entry) + +/* + * The CPU trace array - it consists of thousands of trace entries + * plus some other descriptor data: (for example which task started + * the trace, etc.) + */ +struct trace_array_cpu { + struct list_head trace_pages; + atomic_t disabled; + raw_spinlock_t lock; + struct lock_class_key lock_key; + + /* these fields get copied into max-trace: */ + unsigned trace_head_idx; + unsigned trace_tail_idx; + void *trace_head; /* producer */ + void *trace_tail; /* consumer */ + unsigned long trace_idx; + unsigned long overrun; + unsigned long saved_latency; + unsigned long critical_start; + unsigned long critical_end; + unsigned long critical_sequence; + unsigned long nice; + unsigned long policy; + unsigned long rt_priority; + cycle_t preempt_timestamp; + pid_t pid; + uid_t uid; + char comm[TASK_COMM_LEN]; +}; + +struct trace_iterator; + +/* + * The trace array - an array of per-CPU trace arrays. This is the + * highest level data structure that individual tracers deal with. 
+ * They have on/off state as well: + */ +struct trace_array { + unsigned long entries; + long ctrl; + int cpu; + cycle_t time_start; + struct task_struct *waiter; + struct trace_array_cpu *data[NR_CPUS]; +}; + +/* + * A specific tracer, represented by methods that operate on a trace array: + */ +struct tracer { + const char *name; + void (*init)(struct trace_array *tr); + void (*reset)(struct trace_array *tr); + void (*open)(struct trace_iterator *iter); + void (*pipe_open)(struct trace_iterator *iter); + void (*close)(struct trace_iterator *iter); + void (*start)(struct trace_iterator *iter); + void (*stop)(struct trace_iterator *iter); + ssize_t (*read)(struct trace_iterator *iter, + struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos); + void (*ctrl_update)(struct trace_array *tr); +#ifdef CONFIG_FTRACE_STARTUP_TEST + int (*selftest)(struct tracer *trace, + struct trace_array *tr); +#endif + int (*print_line)(struct trace_iterator *iter); + struct tracer *next; + int print_max; +}; + +struct trace_seq { + unsigned char buffer[PAGE_SIZE]; + unsigned int len; + unsigned int readpos; +}; + +/* + * Trace iterator - used by printout routines who present trace + * results to users and which routines might sleep, etc: + */ +struct trace_iterator { + struct trace_array *tr; + struct tracer *trace; + void *private; + long last_overrun[NR_CPUS]; + long overrun[NR_CPUS]; + + /* The below is zeroed out in pipe_read */ + struct trace_seq seq; + struct trace_entry *ent; + int cpu; + + struct trace_entry *prev_ent; + int prev_cpu; + + unsigned long iter_flags; + loff_t pos; + unsigned long next_idx[NR_CPUS]; + struct list_head *next_page[NR_CPUS]; + unsigned next_page_idx[NR_CPUS]; + long idx; +}; + +void tracing_reset(struct trace_array_cpu *data); +int tracing_open_generic(struct inode *inode, struct file *filp); +struct dentry *tracing_init_dentry(void); +void init_tracer_sysprof_debugfs(struct dentry *d_tracer); + +void ftrace(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long ip, + unsigned long parent_ip, + unsigned long flags); +void tracing_sched_switch_trace(struct trace_array *tr, + struct trace_array_cpu *data, + struct task_struct *prev, + struct task_struct *next, + unsigned long flags); +void tracing_record_cmdline(struct task_struct *tsk); + +void tracing_sched_wakeup_trace(struct trace_array *tr, + struct trace_array_cpu *data, + struct task_struct *wakee, + struct task_struct *cur, + unsigned long flags); +void trace_special(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long arg1, + unsigned long arg2, + unsigned long arg3); +void trace_function(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long ip, + unsigned long parent_ip, + unsigned long flags); + +void tracing_start_function_trace(void); +void tracing_stop_function_trace(void); +void tracing_start_cmdline_record(void); +void tracing_stop_cmdline_record(void); +int register_tracer(struct tracer *type); +void unregister_tracer(struct tracer *type); + +extern unsigned long nsecs_to_usecs(unsigned long nsecs); + +extern unsigned long tracing_max_latency; +extern unsigned long tracing_thresh; + +void update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu); +void update_max_tr_single(struct trace_array *tr, + struct task_struct *tsk, int cpu); + +extern cycle_t ftrace_now(int cpu); + +#ifdef CONFIG_CONTEXT_SWITCH_TRACER +typedef void +(*tracer_switch_func_t)(void *private, + void *__rq, + struct task_struct *prev, + struct task_struct *next); + 
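/*
 * Editor's note (not part of the original patch): tracer_switch_func_t
 * above and struct tracer_switch_ops below form a simple singly linked
 * chain of context-switch callbacks; the code that registers and fires
 * them lives in trace_sched_switch.c, which is only partially shown in
 * this excerpt.  An illustrative walk of such a chain might look like
 * this (demo_call_switch_chain is a made-up name):
 */
#if 0	/* illustrative only, never compiled */
static void demo_call_switch_chain(struct tracer_switch_ops *head,
				   void *rq,
				   struct task_struct *prev,
				   struct task_struct *next)
{
	struct tracer_switch_ops *ops;

	/* Invoke every registered callback with its private data. */
	for (ops = head; ops; ops = ops->next)
		ops->func(ops->private, rq, prev, next);
}
#endif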
+struct tracer_switch_ops { + tracer_switch_func_t func; + void *private; + struct tracer_switch_ops *next; +}; + +#endif /* CONFIG_CONTEXT_SWITCH_TRACER */ + +#ifdef CONFIG_DYNAMIC_FTRACE +extern unsigned long ftrace_update_tot_cnt; +#define DYN_FTRACE_TEST_NAME trace_selftest_dynamic_test_func +extern int DYN_FTRACE_TEST_NAME(void); +#endif + +#ifdef CONFIG_MMIOTRACE +extern void __trace_mmiotrace_rw(struct trace_array *tr, + struct trace_array_cpu *data, + struct mmiotrace_rw *rw); +extern void __trace_mmiotrace_map(struct trace_array *tr, + struct trace_array_cpu *data, + struct mmiotrace_map *map); +#endif + +#ifdef CONFIG_FTRACE_STARTUP_TEST +#ifdef CONFIG_FTRACE +extern int trace_selftest_startup_function(struct tracer *trace, + struct trace_array *tr); +#endif +#ifdef CONFIG_IRQSOFF_TRACER +extern int trace_selftest_startup_irqsoff(struct tracer *trace, + struct trace_array *tr); +#endif +#ifdef CONFIG_PREEMPT_TRACER +extern int trace_selftest_startup_preemptoff(struct tracer *trace, + struct trace_array *tr); +#endif +#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER) +extern int trace_selftest_startup_preemptirqsoff(struct tracer *trace, + struct trace_array *tr); +#endif +#ifdef CONFIG_SCHED_TRACER +extern int trace_selftest_startup_wakeup(struct tracer *trace, + struct trace_array *tr); +#endif +#ifdef CONFIG_CONTEXT_SWITCH_TRACER +extern int trace_selftest_startup_sched_switch(struct tracer *trace, + struct trace_array *tr); +#endif +#ifdef CONFIG_SYSPROF_TRACER +extern int trace_selftest_startup_sysprof(struct tracer *trace, + struct trace_array *tr); +#endif +#endif /* CONFIG_FTRACE_STARTUP_TEST */ + +extern void *head_page(struct trace_array_cpu *data); +extern int trace_seq_printf(struct trace_seq *s, const char *fmt, ...); +extern ssize_t trace_seq_to_user(struct trace_seq *s, char __user *ubuf, + size_t cnt); +extern long ns2usecs(cycle_t nsec); + +extern unsigned long trace_flags; + +/* + * trace_iterator_flags is an enumeration that defines bit + * positions into trace_flags that controls the output. + * + * NOTE: These bits must match the trace_options array in + * trace.c. 
+ */ +enum trace_iterator_flags { + TRACE_ITER_PRINT_PARENT = 0x01, + TRACE_ITER_SYM_OFFSET = 0x02, + TRACE_ITER_SYM_ADDR = 0x04, + TRACE_ITER_VERBOSE = 0x08, + TRACE_ITER_RAW = 0x10, + TRACE_ITER_HEX = 0x20, + TRACE_ITER_BIN = 0x40, + TRACE_ITER_BLOCK = 0x80, + TRACE_ITER_STACKTRACE = 0x100, + TRACE_ITER_SCHED_TREE = 0x200, +}; + +/* COMPAT FOR 2.6.24 */ +#define define_strict_strtoux(type, valtype) \ +static inline int strict_strtou##type(const char *cp, unsigned int base, valtype *res)\ +{ \ + char *tail; \ + valtype val; \ + size_t len; \ + \ + *res = 0; \ + len = strlen(cp); \ + if (len == 0) \ + return -EINVAL; \ + \ + val = simple_strtoul(cp, &tail, base); \ + if ((*tail == '\0') || \ + ((len == (size_t)(tail - cp) + 1) && (*tail == '\n'))) {\ + *res = val; \ + return 0; \ + } \ + \ + return -EINVAL; \ +} \ + +#define define_strict_strtox(type, valtype) \ +static inline int strict_strto##type(const char *cp, unsigned int base, valtype *res) \ +{ \ + int ret; \ + if (*cp == '-') { \ + ret = strict_strtou##type(cp+1, base, res); \ + if (!ret) \ + *res = -(*res); \ + } else \ + ret = strict_strtou##type(cp, base, res); \ + \ + return ret; \ +} \ + +define_strict_strtoux(l, unsigned long) +define_strict_strtox(l, long) +define_strict_strtoux(ll, unsigned long long) +define_strict_strtox(ll, long long) + +#endif /* _LINUX_KERNEL_TRACE_H */ Index: linux-2.6.24.7/kernel/trace/trace_functions.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_functions.c @@ -0,0 +1,78 @@ +/* + * ring buffer based function tracer + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * Copyright (C) 2008 Ingo Molnar <mingo@redhat.com> + * + * Based on code from the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include <linux/debugfs.h> +#include <linux/uaccess.h> +#include <linux/ftrace.h> +#include <linux/fs.h> + +#include "trace.h" + +static void function_reset(struct trace_array *tr) +{ + int cpu; + + tr->time_start = ftrace_now(tr->cpu); + + for_each_online_cpu(cpu) + tracing_reset(tr->data[cpu]); +} + +static void start_function_trace(struct trace_array *tr) +{ + function_reset(tr); + tracing_start_cmdline_record(); + tracing_start_function_trace(); +} + +static void stop_function_trace(struct trace_array *tr) +{ + tracing_stop_function_trace(); + tracing_stop_cmdline_record(); +} + +static void function_trace_init(struct trace_array *tr) +{ + if (tr->ctrl) + start_function_trace(tr); +} + +static void function_trace_reset(struct trace_array *tr) +{ + if (tr->ctrl) + stop_function_trace(tr); +} + +static void function_trace_ctrl_update(struct trace_array *tr) +{ + if (tr->ctrl) + start_function_trace(tr); + else + stop_function_trace(tr); +} + +static struct tracer function_trace __read_mostly = +{ + .name = "ftrace", + .init = function_trace_init, + .reset = function_trace_reset, + .ctrl_update = function_trace_ctrl_update, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_function, +#endif +}; + +static __init int init_function_trace(void) +{ + return register_tracer(&function_trace); +} + +device_initcall(init_function_trace); Index: linux-2.6.24.7/kernel/trace/trace_irqsoff.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_irqsoff.c @@ -0,0 +1,486 @@ +/* + * trace irqs off criticall timings + * + * Copyright (C) 2007-2008 Steven Rostedt 
<srostedt@redhat.com> + * Copyright (C) 2008 Ingo Molnar <mingo@redhat.com> + * + * From code in the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include <linux/kallsyms.h> +#include <linux/debugfs.h> +#include <linux/uaccess.h> +#include <linux/module.h> +#include <linux/ftrace.h> +#include <linux/fs.h> + +#include "trace.h" + +static struct trace_array *irqsoff_trace __read_mostly; +static int tracer_enabled __read_mostly; + +static DEFINE_PER_CPU(int, tracing_cpu); + +static DEFINE_SPINLOCK(max_trace_lock); + +enum { + TRACER_IRQS_OFF = (1 << 1), + TRACER_PREEMPT_OFF = (1 << 2), +}; + +static int trace_type __read_mostly; + +#ifdef CONFIG_PREEMPT_TRACER +static inline int +preempt_trace(void) +{ + return ((trace_type & TRACER_PREEMPT_OFF) && preempt_count()); +} +#else +# define preempt_trace() (0) +#endif + +#ifdef CONFIG_IRQSOFF_TRACER +static inline int +irq_trace(void) +{ + return ((trace_type & TRACER_IRQS_OFF) && + irqs_disabled()); +} +#else +# define irq_trace() (0) +#endif + +/* + * Sequence count - we record it when starting a measurement and + * skip the latency if the sequence has changed - some other section + * did a maximum and could disturb our measurement with serial console + * printouts, etc. Truly coinciding maximum latencies should be rare + * and what happens together happens separately as well, so this doesnt + * decrease the validity of the maximum found: + */ +static __cacheline_aligned_in_smp unsigned long max_sequence; + +#ifdef CONFIG_FTRACE +/* + * irqsoff uses its own tracer function to keep the overhead down: + */ +static void +irqsoff_tracer_call(unsigned long ip, unsigned long parent_ip) +{ + struct trace_array *tr = irqsoff_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + + /* + * Does not matter if we preempt. We test the flags + * afterward, to see if irqs are disabled or not. + * If we preempt and get a false positive, the flags + * test will fail. + */ + cpu = raw_smp_processor_id(); + if (likely(!per_cpu(tracing_cpu, cpu))) + return; + + local_save_flags(flags); + /* slight chance to get a false positive on tracing_cpu */ + if (!irqs_disabled_flags(flags)) + return; + + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) + trace_function(tr, data, ip, parent_ip, flags); + + atomic_dec(&data->disabled); +} + +static struct ftrace_ops trace_ops __read_mostly = +{ + .func = irqsoff_tracer_call, +}; +#endif /* CONFIG_FTRACE */ + +/* + * Should this new latency be reported/recorded? 
+ */ +static int report_latency(cycle_t delta) +{ + if (tracing_thresh) { + if (delta < tracing_thresh) + return 0; + } else { + if (delta <= tracing_max_latency) + return 0; + } + return 1; +} + +static void +check_critical_timing(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long parent_ip, + int cpu) +{ + unsigned long latency, t0, t1; + cycle_t T0, T1, delta; + unsigned long flags; + + /* + * usecs conversion is slow so we try to delay the conversion + * as long as possible: + */ + T0 = data->preempt_timestamp; + T1 = ftrace_now(cpu); + delta = T1-T0; + + local_save_flags(flags); + + if (!report_latency(delta)) + goto out; + + spin_lock_irqsave(&max_trace_lock, flags); + + /* check if we are still the max latency */ + if (!report_latency(delta)) + goto out_unlock; + + trace_function(tr, data, CALLER_ADDR0, parent_ip, flags); + + latency = nsecs_to_usecs(delta); + + if (data->critical_sequence != max_sequence) + goto out_unlock; + + tracing_max_latency = delta; + t0 = nsecs_to_usecs(T0); + t1 = nsecs_to_usecs(T1); + + data->critical_end = parent_ip; + + update_max_tr_single(tr, current, cpu); + + max_sequence++; + +out_unlock: + spin_unlock_irqrestore(&max_trace_lock, flags); + +out: + data->critical_sequence = max_sequence; + data->preempt_timestamp = ftrace_now(cpu); + tracing_reset(data); + trace_function(tr, data, CALLER_ADDR0, parent_ip, flags); +} + +static inline void +start_critical_timing(unsigned long ip, unsigned long parent_ip) +{ + int cpu; + struct trace_array *tr = irqsoff_trace; + struct trace_array_cpu *data; + unsigned long flags; + + if (likely(!tracer_enabled)) + return; + + cpu = raw_smp_processor_id(); + + if (per_cpu(tracing_cpu, cpu)) + return; + + data = tr->data[cpu]; + + if (unlikely(!data) || atomic_read(&data->disabled)) + return; + + atomic_inc(&data->disabled); + + data->critical_sequence = max_sequence; + data->preempt_timestamp = ftrace_now(cpu); + data->critical_start = parent_ip ? : ip; + tracing_reset(data); + + local_save_flags(flags); + + trace_function(tr, data, ip, parent_ip, flags); + + per_cpu(tracing_cpu, cpu) = 1; + + atomic_dec(&data->disabled); +} + +static inline void +stop_critical_timing(unsigned long ip, unsigned long parent_ip) +{ + int cpu; + struct trace_array *tr = irqsoff_trace; + struct trace_array_cpu *data; + unsigned long flags; + + cpu = raw_smp_processor_id(); + /* Always clear the tracing cpu on stopping the trace */ + if (unlikely(per_cpu(tracing_cpu, cpu))) + per_cpu(tracing_cpu, cpu) = 0; + else + return; + + if (!tracer_enabled) + return; + + data = tr->data[cpu]; + + if (unlikely(!data) || unlikely(!head_page(data)) || + !data->critical_start || atomic_read(&data->disabled)) + return; + + atomic_inc(&data->disabled); + + local_save_flags(flags); + trace_function(tr, data, ip, parent_ip, flags); + check_critical_timing(tr, data, parent_ip ? 
: ip, cpu); + data->critical_start = 0; + atomic_dec(&data->disabled); +} + +/* start and stop critical timings used to for stoppage (in idle) */ +void start_critical_timings(void) +{ + if (preempt_trace() || irq_trace()) + start_critical_timing(CALLER_ADDR0, CALLER_ADDR1); +} + +void stop_critical_timings(void) +{ + if (preempt_trace() || irq_trace()) + stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1); +} + +#ifdef CONFIG_IRQSOFF_TRACER +#ifdef CONFIG_PROVE_LOCKING +void time_hardirqs_on(unsigned long a0, unsigned long a1) +{ + if (!preempt_trace() && irq_trace()) + stop_critical_timing(a0, a1); +} + +void time_hardirqs_off(unsigned long a0, unsigned long a1) +{ + if (!preempt_trace() && irq_trace()) + start_critical_timing(a0, a1); +} + +#else /* !CONFIG_PROVE_LOCKING */ + +/* + * Stubs: + */ + +void early_boot_irqs_off(void) +{ +} + +void early_boot_irqs_on(void) +{ +} + +void trace_softirqs_on(unsigned long ip) +{ +} + +void trace_softirqs_off(unsigned long ip) +{ +} + +inline void print_irqtrace_events(struct task_struct *curr) +{ +} + +/* + * We are only interested in hardirq on/off events: + */ +void trace_hardirqs_on(void) +{ + if (!preempt_trace() && irq_trace()) + stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1); +} +EXPORT_SYMBOL(trace_hardirqs_on); + +void trace_hardirqs_off(void) +{ + if (!preempt_trace() && irq_trace()) + start_critical_timing(CALLER_ADDR0, CALLER_ADDR1); +} +EXPORT_SYMBOL(trace_hardirqs_off); + +void trace_hardirqs_on_caller(unsigned long caller_addr) +{ + if (!preempt_trace() && irq_trace()) + stop_critical_timing(CALLER_ADDR0, caller_addr); +} +EXPORT_SYMBOL(trace_hardirqs_on_caller); + +void trace_hardirqs_off_caller(unsigned long caller_addr) +{ + if (!preempt_trace() && irq_trace()) + start_critical_timing(CALLER_ADDR0, caller_addr); +} +EXPORT_SYMBOL(trace_hardirqs_off_caller); + +#endif /* CONFIG_PROVE_LOCKING */ +#endif /* CONFIG_IRQSOFF_TRACER */ + +#ifdef CONFIG_PREEMPT_TRACER +void trace_preempt_on(unsigned long a0, unsigned long a1) +{ + stop_critical_timing(a0, a1); +} + +void trace_preempt_off(unsigned long a0, unsigned long a1) +{ + start_critical_timing(a0, a1); +} +#endif /* CONFIG_PREEMPT_TRACER */ + +static void start_irqsoff_tracer(struct trace_array *tr) +{ + register_ftrace_function(&trace_ops); + tracer_enabled = 1; +} + +static void stop_irqsoff_tracer(struct trace_array *tr) +{ + tracer_enabled = 0; + unregister_ftrace_function(&trace_ops); +} + +static void __irqsoff_tracer_init(struct trace_array *tr) +{ + irqsoff_trace = tr; + /* make sure that the tracer is visible */ + smp_wmb(); + + if (tr->ctrl) + start_irqsoff_tracer(tr); +} + +static void irqsoff_tracer_reset(struct trace_array *tr) +{ + if (tr->ctrl) + stop_irqsoff_tracer(tr); +} + +static void irqsoff_tracer_ctrl_update(struct trace_array *tr) +{ + if (tr->ctrl) + start_irqsoff_tracer(tr); + else + stop_irqsoff_tracer(tr); +} + +static void irqsoff_tracer_open(struct trace_iterator *iter) +{ + /* stop the trace while dumping */ + if (iter->tr->ctrl) + stop_irqsoff_tracer(iter->tr); +} + +static void irqsoff_tracer_close(struct trace_iterator *iter) +{ + if (iter->tr->ctrl) + start_irqsoff_tracer(iter->tr); +} + +#ifdef CONFIG_IRQSOFF_TRACER +static void irqsoff_tracer_init(struct trace_array *tr) +{ + trace_type = TRACER_IRQS_OFF; + + __irqsoff_tracer_init(tr); +} +static struct tracer irqsoff_tracer __read_mostly = +{ + .name = "irqsoff", + .init = irqsoff_tracer_init, + .reset = irqsoff_tracer_reset, + .open = irqsoff_tracer_open, + .close = irqsoff_tracer_close, + 
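/*
 * Editor's note (not part of the original patch): the irqsoff tracer
 * above measures the window between trace_hardirqs_off() and
 * trace_hardirqs_on(), hooks that the irq-flags tracing machinery
 * emits around sections such as the made-up example below.  When the
 * section ends, check_critical_timing() records it if it exceeds
 * tracing_thresh or beats the current tracing_max_latency.
 */
#if 0	/* illustrative only, never compiled; do_something_short() is hypothetical */
static void demo_measured_section(void)
{
	unsigned long flags;

	local_irq_save(flags);		/* -> trace_hardirqs_off(): timing starts */
	do_something_short();		/* hypothetical work with interrupts off  */
	local_irq_restore(flags);	/* -> trace_hardirqs_on(): timing stops    */
}
#endif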
.ctrl_update = irqsoff_tracer_ctrl_update, + .print_max = 1, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_irqsoff, +#endif +}; +# define register_irqsoff(trace) register_tracer(&trace) +#else +# define register_irqsoff(trace) do { } while (0) +#endif + +#ifdef CONFIG_PREEMPT_TRACER +static void preemptoff_tracer_init(struct trace_array *tr) +{ + trace_type = TRACER_PREEMPT_OFF; + + __irqsoff_tracer_init(tr); +} + +static struct tracer preemptoff_tracer __read_mostly = +{ + .name = "preemptoff", + .init = preemptoff_tracer_init, + .reset = irqsoff_tracer_reset, + .open = irqsoff_tracer_open, + .close = irqsoff_tracer_close, + .ctrl_update = irqsoff_tracer_ctrl_update, + .print_max = 1, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_preemptoff, +#endif +}; +# define register_preemptoff(trace) register_tracer(&trace) +#else +# define register_preemptoff(trace) do { } while (0) +#endif + +#if defined(CONFIG_IRQSOFF_TRACER) && \ + defined(CONFIG_PREEMPT_TRACER) + +static void preemptirqsoff_tracer_init(struct trace_array *tr) +{ + trace_type = TRACER_IRQS_OFF | TRACER_PREEMPT_OFF; + + __irqsoff_tracer_init(tr); +} + +static struct tracer preemptirqsoff_tracer __read_mostly = +{ + .name = "preemptirqsoff", + .init = preemptirqsoff_tracer_init, + .reset = irqsoff_tracer_reset, + .open = irqsoff_tracer_open, + .close = irqsoff_tracer_close, + .ctrl_update = irqsoff_tracer_ctrl_update, + .print_max = 1, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_preemptirqsoff, +#endif +}; + +# define register_preemptirqsoff(trace) register_tracer(&trace) +#else +# define register_preemptirqsoff(trace) do { } while (0) +#endif + +__init static int init_irqsoff_tracer(void) +{ + register_irqsoff(irqsoff_tracer); + register_preemptoff(preemptoff_tracer); + register_preemptirqsoff(preemptirqsoff_tracer); + + return 0; +} +device_initcall(init_irqsoff_tracer); Index: linux-2.6.24.7/kernel/trace/trace_mmiotrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_mmiotrace.c @@ -0,0 +1,295 @@ +/* + * Memory mapped I/O tracing + * + * Copyright (C) 2008 Pekka Paalanen <pq@iki.fi> + */ + +#define DEBUG 1 + +#include <linux/kernel.h> +#include <linux/mmiotrace.h> +#include <linux/pci.h> + +#include "trace.h" + +struct header_iter { + struct pci_dev *dev; +}; + +static struct trace_array *mmio_trace_array; +static bool overrun_detected; + +static void mmio_reset_data(struct trace_array *tr) +{ + int cpu; + + overrun_detected = false; + tr->time_start = ftrace_now(tr->cpu); + + for_each_online_cpu(cpu) + tracing_reset(tr->data[cpu]); +} + +static void mmio_trace_init(struct trace_array *tr) +{ + pr_debug("in %s\n", __func__); + mmio_trace_array = tr; + if (tr->ctrl) { + mmio_reset_data(tr); + enable_mmiotrace(); + } +} + +static void mmio_trace_reset(struct trace_array *tr) +{ + pr_debug("in %s\n", __func__); + if (tr->ctrl) + disable_mmiotrace(); + mmio_reset_data(tr); + mmio_trace_array = NULL; +} + +static void mmio_trace_ctrl_update(struct trace_array *tr) +{ + pr_debug("in %s\n", __func__); + if (tr->ctrl) { + mmio_reset_data(tr); + enable_mmiotrace(); + } else { + disable_mmiotrace(); + } +} + +static int mmio_print_pcidev(struct trace_seq *s, const struct pci_dev *dev) +{ + int ret = 0; + int i; + resource_size_t start, end; + const struct pci_driver *drv = pci_dev_driver(dev); + + /* XXX: incomplete checks for trace_seq_printf() return value */ + ret += trace_seq_printf(s, "PCIDEV 
%02x%02x %04x%04x %x", + dev->bus->number, dev->devfn, + dev->vendor, dev->device, dev->irq); + /* + * XXX: is pci_resource_to_user() appropriate, since we are + * supposed to interpret the __ioremap() phys_addr argument based on + * these printed values? + */ + for (i = 0; i < 7; i++) { + pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); + ret += trace_seq_printf(s, " %llx", + (unsigned long long)(start | + (dev->resource[i].flags & PCI_REGION_FLAG_MASK))); + } + for (i = 0; i < 7; i++) { + pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); + ret += trace_seq_printf(s, " %llx", + dev->resource[i].start < dev->resource[i].end ? + (unsigned long long)(end - start) + 1 : 0); + } + if (drv) + ret += trace_seq_printf(s, " %s\n", drv->name); + else + ret += trace_seq_printf(s, " \n"); + return ret; +} + +static void destroy_header_iter(struct header_iter *hiter) +{ + if (!hiter) + return; + pci_dev_put(hiter->dev); + kfree(hiter); +} + +static void mmio_pipe_open(struct trace_iterator *iter) +{ + struct header_iter *hiter; + struct trace_seq *s = &iter->seq; + + trace_seq_printf(s, "VERSION 20070824\n"); + + hiter = kzalloc(sizeof(*hiter), GFP_KERNEL); + if (!hiter) + return; + + hiter->dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, NULL); + iter->private = hiter; +} + +/* XXX: This is not called when the pipe is closed! */ +static void mmio_close(struct trace_iterator *iter) +{ + struct header_iter *hiter = iter->private; + destroy_header_iter(hiter); + iter->private = NULL; +} + +static unsigned long count_overruns(struct trace_iterator *iter) +{ + int cpu; + unsigned long cnt = 0; + for_each_online_cpu(cpu) { + cnt += iter->overrun[cpu]; + iter->overrun[cpu] = 0; + } + return cnt; +} + +static ssize_t mmio_read(struct trace_iterator *iter, struct file *filp, + char __user *ubuf, size_t cnt, loff_t *ppos) +{ + ssize_t ret; + struct header_iter *hiter = iter->private; + struct trace_seq *s = &iter->seq; + unsigned long n; + + n = count_overruns(iter); + if (n) { + /* XXX: This is later than where events were lost. */ + trace_seq_printf(s, "MARK 0.000000 Lost %lu events.\n", n); + if (!overrun_detected) + pr_warning("mmiotrace has lost events.\n"); + overrun_detected = true; + goto print_out; + } + + if (!hiter) + return 0; + + mmio_print_pcidev(s, hiter->dev); + hiter->dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, hiter->dev); + + if (!hiter->dev) { + destroy_header_iter(hiter); + iter->private = NULL; + } + +print_out: + ret = trace_seq_to_user(s, ubuf, cnt); + return (ret == -EBUSY) ? 
0 : ret; +} + +static int mmio_print_rw(struct trace_iterator *iter) +{ + struct trace_entry *entry = iter->ent; + struct mmiotrace_rw *rw = &entry->mmiorw; + struct trace_seq *s = &iter->seq; + unsigned long long t = ns2usecs(entry->t); + unsigned long usec_rem = do_div(t, 1000000ULL); + unsigned secs = (unsigned long)t; + int ret = 1; + + switch (entry->mmiorw.opcode) { + case MMIO_READ: + ret = trace_seq_printf(s, + "R %d %lu.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", + rw->width, secs, usec_rem, rw->map_id, + (unsigned long long)rw->phys, + rw->value, rw->pc, 0); + break; + case MMIO_WRITE: + ret = trace_seq_printf(s, + "W %d %lu.%06lu %d 0x%llx 0x%lx 0x%lx %d\n", + rw->width, secs, usec_rem, rw->map_id, + (unsigned long long)rw->phys, + rw->value, rw->pc, 0); + break; + case MMIO_UNKNOWN_OP: + ret = trace_seq_printf(s, + "UNKNOWN %lu.%06lu %d 0x%llx %02x,%02x,%02x 0x%lx %d\n", + secs, usec_rem, rw->map_id, + (unsigned long long)rw->phys, + (rw->value >> 16) & 0xff, (rw->value >> 8) & 0xff, + (rw->value >> 0) & 0xff, rw->pc, 0); + break; + default: + ret = trace_seq_printf(s, "rw what?\n"); + break; + } + if (ret) + return 1; + return 0; +} + +static int mmio_print_map(struct trace_iterator *iter) +{ + struct trace_entry *entry = iter->ent; + struct mmiotrace_map *m = &entry->mmiomap; + struct trace_seq *s = &iter->seq; + unsigned long long t = ns2usecs(entry->t); + unsigned long usec_rem = do_div(t, 1000000ULL); + unsigned secs = (unsigned long)t; + int ret = 1; + + switch (entry->mmiorw.opcode) { + case MMIO_PROBE: + ret = trace_seq_printf(s, + "MAP %lu.%06lu %d 0x%llx 0x%lx 0x%lx 0x%lx %d\n", + secs, usec_rem, m->map_id, + (unsigned long long)m->phys, m->virt, m->len, + 0UL, 0); + break; + case MMIO_UNPROBE: + ret = trace_seq_printf(s, + "UNMAP %lu.%06lu %d 0x%lx %d\n", + secs, usec_rem, m->map_id, 0UL, 0); + break; + default: + ret = trace_seq_printf(s, "map what?\n"); + break; + } + if (ret) + return 1; + return 0; +} + +/* return 0 to abort printing without consuming current entry in pipe mode */ +static int mmio_print_line(struct trace_iterator *iter) +{ + switch (iter->ent->type) { + case TRACE_MMIO_RW: + return mmio_print_rw(iter); + case TRACE_MMIO_MAP: + return mmio_print_map(iter); + default: + return 1; /* ignore unknown entries */ + } +} + +static struct tracer mmio_tracer __read_mostly = +{ + .name = "mmiotrace", + .init = mmio_trace_init, + .reset = mmio_trace_reset, + .pipe_open = mmio_pipe_open, + .close = mmio_close, + .read = mmio_read, + .ctrl_update = mmio_trace_ctrl_update, + .print_line = mmio_print_line, +}; + +__init static int init_mmio_trace(void) +{ + return register_tracer(&mmio_tracer); +} +device_initcall(init_mmio_trace); + +void mmio_trace_rw(struct mmiotrace_rw *rw) +{ + struct trace_array *tr = mmio_trace_array; + struct trace_array_cpu *data = tr->data[smp_processor_id()]; + __trace_mmiotrace_rw(tr, data, rw); +} + +void mmio_trace_mapping(struct mmiotrace_map *map) +{ + struct trace_array *tr = mmio_trace_array; + struct trace_array_cpu *data; + + preempt_disable(); + data = tr->data[smp_processor_id()]; + __trace_mmiotrace_map(tr, data, map); + preempt_enable(); +} Index: linux-2.6.24.7/kernel/trace/trace_sched_switch.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_sched_switch.c @@ -0,0 +1,196 @@ +/* + * trace context switch + * + * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com> + * + */ +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/debugfs.h> 
+#include <linux/kallsyms.h> +#include <linux/uaccess.h> +#include <linux/marker.h> +#include <linux/ftrace.h> + +#include "trace.h" + +static struct trace_array *ctx_trace; +static int __read_mostly tracer_enabled; +static atomic_t sched_ref; + +static void +sched_switch_func(void *private, void *__rq, struct task_struct *prev, + struct task_struct *next) +{ + struct trace_array **ptr = private; + struct trace_array *tr = *ptr; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int cpu; + + tracing_record_cmdline(prev); + tracing_record_cmdline(next); + + if (!tracer_enabled) + return; + + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + + if (likely(disabled == 1)) + tracing_sched_switch_trace(tr, data, prev, next, flags); + + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +static notrace void +sched_switch_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct task_struct *prev; + struct task_struct *next; + struct rq *__rq; + + if (!atomic_read(&sched_ref)) + return; + + /* skip prev_pid %d next_pid %d prev_state %ld */ + (void)va_arg(*args, int); + (void)va_arg(*args, int); + (void)va_arg(*args, long); + __rq = va_arg(*args, typeof(__rq)); + prev = va_arg(*args, typeof(prev)); + next = va_arg(*args, typeof(next)); + + /* + * If tracer_switch_func only points to the local + * switch func, it still needs the ptr passed to it. + */ + sched_switch_func(probe_data, __rq, prev, next); +} + +static void sched_switch_reset(struct trace_array *tr) +{ + int cpu; + + tr->time_start = ftrace_now(tr->cpu); + + for_each_online_cpu(cpu) + tracing_reset(tr->data[cpu]); +} + +static int tracing_sched_register(void) +{ + int ret; + + ret = marker_probe_register("kernel_sched_schedule", + "prev_pid %d next_pid %d prev_state %ld " + "## rq %p prev %p next %p", + sched_switch_callback, + &ctx_trace); + if (ret) + pr_info("sched trace: Couldn't add marker" + " probe to kernel_sched_schedule\n"); + + return ret; +} + +static void tracing_sched_unregister(void) +{ + marker_probe_unregister("kernel_sched_schedule", + sched_switch_callback, + &ctx_trace); +} + +void tracing_start_sched_switch(void) +{ + long ref; + + ref = atomic_inc_return(&sched_ref); + if (ref == 1) + tracing_sched_register(); +} + +void tracing_stop_sched_switch(void) +{ + long ref; + + ref = atomic_dec_and_test(&sched_ref); + if (ref) + tracing_sched_unregister(); +} + +void tracing_start_cmdline_record(void) +{ + tracing_start_sched_switch(); +} + +void tracing_stop_cmdline_record(void) +{ + tracing_stop_sched_switch(); +} + +static void start_sched_trace(struct trace_array *tr) +{ + sched_switch_reset(tr); + tracer_enabled = 1; + tracing_start_cmdline_record(); +} + +static void stop_sched_trace(struct trace_array *tr) +{ + tracing_stop_cmdline_record(); + tracer_enabled = 0; +} + +static void sched_switch_trace_init(struct trace_array *tr) +{ + ctx_trace = tr; + + if (tr->ctrl) + start_sched_trace(tr); +} + +static void sched_switch_trace_reset(struct trace_array *tr) +{ + if (tr->ctrl) + stop_sched_trace(tr); +} + +static void sched_switch_trace_ctrl_update(struct trace_array *tr) +{ + /* When starting a new trace, reset the buffers */ + if (tr->ctrl) + start_sched_trace(tr); + else + stop_sched_trace(tr); +} + +static struct tracer sched_switch_trace __read_mostly = +{ + .name = "sched_switch", + .init = sched_switch_trace_init, + .reset = sched_switch_trace_reset, + .ctrl_update 
= sched_switch_trace_ctrl_update, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_sched_switch, +#endif +}; + +__init static int init_sched_switch_trace(void) +{ + int ret = 0; + + if (atomic_read(&sched_ref)) + ret = tracing_sched_register(); + if (ret) { + pr_info("error registering scheduler trace\n"); + return ret; + } + return register_tracer(&sched_switch_trace); +} +device_initcall(init_sched_switch_trace); Index: linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c @@ -0,0 +1,447 @@ +/* + * trace task wakeup timings + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * Copyright (C) 2008 Ingo Molnar <mingo@redhat.com> + * + * Based on code from the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/debugfs.h> +#include <linux/kallsyms.h> +#include <linux/uaccess.h> +#include <linux/ftrace.h> +#include <linux/marker.h> + +#include "trace.h" + +static struct trace_array *wakeup_trace; +static int __read_mostly tracer_enabled; + +static struct task_struct *wakeup_task; +static int wakeup_cpu; +static unsigned wakeup_prio = -1; + +static DEFINE_SPINLOCK(wakeup_lock); + +static void __wakeup_reset(struct trace_array *tr); + +#ifdef CONFIG_FTRACE +/* + * irqsoff uses its own tracer function to keep the overhead down: + */ +static void +wakeup_tracer_call(unsigned long ip, unsigned long parent_ip) +{ + struct trace_array *tr = wakeup_trace; + struct trace_array_cpu *data; + unsigned long flags; + long disabled; + int resched; + int cpu; + + if (likely(!wakeup_task)) + return; + + resched = need_resched(); + preempt_disable_notrace(); + + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + disabled = atomic_inc_return(&data->disabled); + if (unlikely(disabled != 1)) + goto out; + + spin_lock_irqsave(&wakeup_lock, flags); + + if (unlikely(!wakeup_task)) + goto unlock; + + /* + * The task can't disappear because it needs to + * wake up first, and we have the wakeup_lock. + */ + if (task_cpu(wakeup_task) != cpu) + goto unlock; + + trace_function(tr, data, ip, parent_ip, flags); + + unlock: + spin_unlock_irqrestore(&wakeup_lock, flags); + + out: + atomic_dec(&data->disabled); + + /* + * To prevent recursion from the scheduler, if the + * resched flag was set before we entered, then + * don't reschedule. + */ + if (resched) + preempt_enable_no_resched_notrace(); + else + preempt_enable_notrace(); +} + +static struct ftrace_ops trace_ops __read_mostly = +{ + .func = wakeup_tracer_call, +}; +#endif /* CONFIG_FTRACE */ + +/* + * Should this new latency be reported/recorded? + */ +static int report_latency(cycle_t delta) +{ + if (tracing_thresh) { + if (delta < tracing_thresh) + return 0; + } else { + if (delta <= tracing_max_latency) + return 0; + } + return 1; +} + +static void notrace +wakeup_sched_switch(void *private, void *rq, struct task_struct *prev, + struct task_struct *next) +{ + unsigned long latency = 0, t0 = 0, t1 = 0; + struct trace_array **ptr = private; + struct trace_array *tr = *ptr; + struct trace_array_cpu *data; + cycle_t T0, T1, delta; + unsigned long flags; + long disabled; + int cpu; + + if (unlikely(!tracer_enabled)) + return; + + /* + * When we start a new trace, we set wakeup_task to NULL + * and then set tracer_enabled = 1. 
We want to make sure + * that another CPU does not see the tracer_enabled = 1 + * and the wakeup_task with an older task, that might + * actually be the same as next. + */ + smp_rmb(); + + if (next != wakeup_task) + return; + + /* The task we are waiting for is waking up */ + data = tr->data[wakeup_cpu]; + + /* disable local data, not wakeup_cpu data */ + cpu = raw_smp_processor_id(); + disabled = atomic_inc_return(&tr->data[cpu]->disabled); + if (likely(disabled != 1)) + goto out; + + spin_lock_irqsave(&wakeup_lock, flags); + + /* We could race with grabbing wakeup_lock */ + if (unlikely(!tracer_enabled || next != wakeup_task)) + goto out_unlock; + + trace_function(tr, data, CALLER_ADDR1, CALLER_ADDR2, flags); + + /* + * usecs conversion is slow so we try to delay the conversion + * as long as possible: + */ + T0 = data->preempt_timestamp; + T1 = ftrace_now(cpu); + delta = T1-T0; + + if (!report_latency(delta)) + goto out_unlock; + + latency = nsecs_to_usecs(delta); + + tracing_max_latency = delta; + t0 = nsecs_to_usecs(T0); + t1 = nsecs_to_usecs(T1); + + update_max_tr(tr, wakeup_task, wakeup_cpu); + +out_unlock: + __wakeup_reset(tr); + spin_unlock_irqrestore(&wakeup_lock, flags); +out: + atomic_dec(&tr->data[cpu]->disabled); +} + +static notrace void +sched_switch_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct task_struct *prev; + struct task_struct *next; + struct rq *__rq; + + /* skip prev_pid %d next_pid %d prev_state %ld */ + (void)va_arg(*args, int); + (void)va_arg(*args, int); + (void)va_arg(*args, long); + __rq = va_arg(*args, typeof(__rq)); + prev = va_arg(*args, typeof(prev)); + next = va_arg(*args, typeof(next)); + + tracing_record_cmdline(prev); + + /* + * If tracer_switch_func only points to the local + * switch func, it still needs the ptr passed to it. + */ + wakeup_sched_switch(probe_data, __rq, prev, next); +} + +static void __wakeup_reset(struct trace_array *tr) +{ + struct trace_array_cpu *data; + int cpu; + + assert_spin_locked(&wakeup_lock); + + for_each_possible_cpu(cpu) { + data = tr->data[cpu]; + tracing_reset(data); + } + + wakeup_cpu = -1; + wakeup_prio = -1; + + if (wakeup_task) + put_task_struct(wakeup_task); + + wakeup_task = NULL; +} + +static void wakeup_reset(struct trace_array *tr) +{ + unsigned long flags; + + spin_lock_irqsave(&wakeup_lock, flags); + __wakeup_reset(tr); + spin_unlock_irqrestore(&wakeup_lock, flags); +} + +static void +wakeup_check_start(struct trace_array *tr, struct task_struct *p, + struct task_struct *curr) +{ + int cpu = smp_processor_id(); + unsigned long flags; + long disabled; + + if (likely(!rt_task(p)) || + p->prio >= wakeup_prio || + p->prio >= curr->prio) + return; + + disabled = atomic_inc_return(&tr->data[cpu]->disabled); + if (unlikely(disabled != 1)) + goto out; + + /* interrupts should be off from try_to_wake_up */ + spin_lock(&wakeup_lock); + + /* check for races. 
*/ + if (!tracer_enabled || p->prio >= wakeup_prio) + goto out_locked; + + /* reset the trace */ + __wakeup_reset(tr); + + wakeup_cpu = task_cpu(p); + wakeup_prio = p->prio; + + wakeup_task = p; + get_task_struct(wakeup_task); + + local_save_flags(flags); + + tr->data[wakeup_cpu]->preempt_timestamp = ftrace_now(cpu); + trace_function(tr, tr->data[wakeup_cpu], + CALLER_ADDR1, CALLER_ADDR2, flags); + +out_locked: + spin_unlock(&wakeup_lock); +out: + atomic_dec(&tr->data[cpu]->disabled); +} + +static notrace void +wake_up_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array **ptr = probe_data; + struct trace_array *tr = *ptr; + struct task_struct *curr; + struct task_struct *task; + struct rq *__rq; + + if (likely(!tracer_enabled)) + return; + + /* Skip pid %d state %ld */ + (void)va_arg(*args, int); + (void)va_arg(*args, long); + /* now get the meat: "rq %p task %p rq->curr %p" */ + __rq = va_arg(*args, typeof(__rq)); + task = va_arg(*args, typeof(task)); + curr = va_arg(*args, typeof(curr)); + + tracing_record_cmdline(task); + tracing_record_cmdline(curr); + + wakeup_check_start(tr, task, curr); +} + +static void start_wakeup_tracer(struct trace_array *tr) +{ + int ret; + + ret = marker_probe_register("kernel_sched_wakeup", + "pid %d state %ld ## rq %p task %p rq->curr %p", + wake_up_callback, + &wakeup_trace); + if (ret) { + pr_info("wakeup trace: Couldn't add marker" + " probe to kernel_sched_wakeup\n"); + return; + } + + ret = marker_probe_register("kernel_sched_wakeup_new", + "pid %d state %ld ## rq %p task %p rq->curr %p", + wake_up_callback, + &wakeup_trace); + if (ret) { + pr_info("wakeup trace: Couldn't add marker" + " probe to kernel_sched_wakeup_new\n"); + goto fail_deprobe; + } + + ret = marker_probe_register("kernel_sched_schedule", + "prev_pid %d next_pid %d prev_state %ld " + "## rq %p prev %p next %p", + sched_switch_callback, + &wakeup_trace); + if (ret) { + pr_info("sched trace: Couldn't add marker" + " probe to kernel_sched_schedule\n"); + goto fail_deprobe_wake_new; + } + + wakeup_reset(tr); + + /* + * Don't let the tracer_enabled = 1 show up before + * the wakeup_task is reset. This may be overkill since + * wakeup_reset does a spin_unlock after setting the + * wakeup_task to NULL, but I want to be safe. + * This is a slow path anyway. 
+ */ + smp_wmb(); + + tracer_enabled = 1; + register_ftrace_function(&trace_ops); + + return; +fail_deprobe_wake_new: + marker_probe_unregister("kernel_sched_wakeup_new", + wake_up_callback, + &wakeup_trace); +fail_deprobe: + marker_probe_unregister("kernel_sched_wakeup", + wake_up_callback, + &wakeup_trace); +} + +static void stop_wakeup_tracer(struct trace_array *tr) +{ + tracer_enabled = 0; + unregister_ftrace_function(&trace_ops); + marker_probe_unregister("kernel_sched_schedule", + sched_switch_callback, + &wakeup_trace); + marker_probe_unregister("kernel_sched_wakeup_new", + wake_up_callback, + &wakeup_trace); + marker_probe_unregister("kernel_sched_wakeup", + wake_up_callback, + &wakeup_trace); +} + +static void wakeup_tracer_init(struct trace_array *tr) +{ + wakeup_trace = tr; + + if (tr->ctrl) + start_wakeup_tracer(tr); +} + +static void wakeup_tracer_reset(struct trace_array *tr) +{ + if (tr->ctrl) { + stop_wakeup_tracer(tr); + /* make sure we put back any tasks we are tracing */ + wakeup_reset(tr); + } +} + +static void wakeup_tracer_ctrl_update(struct trace_array *tr) +{ + if (tr->ctrl) + start_wakeup_tracer(tr); + else + stop_wakeup_tracer(tr); +} + +static void wakeup_tracer_open(struct trace_iterator *iter) +{ + /* stop the trace while dumping */ + if (iter->tr->ctrl) + stop_wakeup_tracer(iter->tr); +} + +static void wakeup_tracer_close(struct trace_iterator *iter) +{ + /* forget about any processes we were recording */ + if (iter->tr->ctrl) + start_wakeup_tracer(iter->tr); +} + +static struct tracer wakeup_tracer __read_mostly = +{ + .name = "wakeup", + .init = wakeup_tracer_init, + .reset = wakeup_tracer_reset, + .open = wakeup_tracer_open, + .close = wakeup_tracer_close, + .ctrl_update = wakeup_tracer_ctrl_update, + .print_max = 1, +#ifdef CONFIG_FTRACE_SELFTEST + .selftest = trace_selftest_startup_wakeup, +#endif +}; + +__init static int init_wakeup_tracer(void) +{ + int ret; + + ret = register_tracer(&wakeup_tracer); + if (ret) + return ret; + + return 0; +} +device_initcall(init_wakeup_tracer); Index: linux-2.6.24.7/kernel/trace/trace_selftest.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_selftest.c @@ -0,0 +1,563 @@ +/* Include in trace.c */ + +#include <linux/kthread.h> +#include <linux/delay.h> + +static inline int trace_valid_entry(struct trace_entry *entry) +{ + switch (entry->type) { + case TRACE_FN: + case TRACE_CTX: + case TRACE_WAKE: + case TRACE_STACK: + case TRACE_SPECIAL: + return 1; + } + return 0; +} + +static int +trace_test_buffer_cpu(struct trace_array *tr, struct trace_array_cpu *data) +{ + struct trace_entry *entries; + struct page *page; + int idx = 0; + int i; + + BUG_ON(list_empty(&data->trace_pages)); + page = list_entry(data->trace_pages.next, struct page, lru); + entries = page_address(page); + + check_pages(data); + if (head_page(data) != entries) + goto failed; + + /* + * The starting trace buffer always has valid elements, + * if any element exists. + */ + entries = head_page(data); + + for (i = 0; i < tr->entries; i++) { + + if (i < data->trace_idx && !trace_valid_entry(&entries[idx])) { + printk(KERN_CONT ".. invalid entry %d ", + entries[idx].type); + goto failed; + } + + idx++; + if (idx >= ENTRIES_PER_PAGE) { + page = virt_to_page(entries); + if (page->lru.next == &data->trace_pages) { + if (i != tr->entries - 1) { + printk(KERN_CONT ".. 
entries buffer mismatch"); + goto failed; + } + } else { + page = list_entry(page->lru.next, struct page, lru); + entries = page_address(page); + } + idx = 0; + } + } + + page = virt_to_page(entries); + if (page->lru.next != &data->trace_pages) { + printk(KERN_CONT ".. too many entries"); + goto failed; + } + + return 0; + + failed: + /* disable tracing */ + tracing_disabled = 1; + printk(KERN_CONT ".. corrupted trace buffer .. "); + return -1; +} + +/* + * Test the trace buffer to see if all the elements + * are still sane. + */ +static int trace_test_buffer(struct trace_array *tr, unsigned long *count) +{ + unsigned long flags, cnt = 0; + int cpu, ret = 0; + + /* Don't allow flipping of max traces now */ + raw_local_irq_save(flags); + __raw_spin_lock(&ftrace_max_lock); + for_each_possible_cpu(cpu) { + if (!head_page(tr->data[cpu])) + continue; + + cnt += tr->data[cpu]->trace_idx; + + ret = trace_test_buffer_cpu(tr, tr->data[cpu]); + if (ret) + break; + } + __raw_spin_unlock(&ftrace_max_lock); + raw_local_irq_restore(flags); + + if (count) + *count = cnt; + + return ret; +} + +#ifdef CONFIG_FTRACE + +#ifdef CONFIG_DYNAMIC_FTRACE + +#define __STR(x) #x +#define STR(x) __STR(x) + +/* Test dynamic code modification and ftrace filters */ +int trace_selftest_startup_dynamic_tracing(struct tracer *trace, + struct trace_array *tr, + int (*func)(void)) +{ + unsigned long count; + int ret; + int save_ftrace_enabled = ftrace_enabled; + int save_tracer_enabled = tracer_enabled; + char *func_name; + + /* The ftrace test PASSED */ + printk(KERN_CONT "PASSED\n"); + pr_info("Testing dynamic ftrace: "); + + /* enable tracing, and record the filter function */ + ftrace_enabled = 1; + tracer_enabled = 1; + + /* passed in by parameter to fool gcc from optimizing */ + func(); + + /* update the records */ + ret = ftrace_force_update(); + if (ret) { + printk(KERN_CONT ".. ftraced failed .. "); + return ret; + } + + /* + * Some archs *cough*PowerPC*cough* add charachters to the + * start of the function names. We simply put a '*' to + * accomodate them. + */ + func_name = "*" STR(DYN_FTRACE_TEST_NAME); + + /* filter only on our function */ + ftrace_set_filter(func_name, strlen(func_name), 1); + + /* enable tracing */ + tr->ctrl = 1; + trace->init(tr); + /* Sleep for a 1/10 of a second */ + msleep(100); + + /* we should have nothing in the buffer */ + ret = trace_test_buffer(tr, &count); + if (ret) + goto out; + + if (count) { + ret = -1; + printk(KERN_CONT ".. filter did not filter .. "); + goto out; + } + + /* call our function again */ + func(); + + /* sleep again */ + msleep(100); + + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + ftrace_enabled = 0; + + /* check the trace buffer */ + ret = trace_test_buffer(tr, &count); + trace->reset(tr); + + /* we should only have one item */ + if (!ret && count != 1) { + printk(KERN_CONT ".. filter failed count=%ld ..", count); + ret = -1; + goto out; + } + out: + ftrace_enabled = save_ftrace_enabled; + tracer_enabled = save_tracer_enabled; + + /* Enable tracing on all functions again */ + ftrace_set_filter(NULL, 0, 1); + + return ret; +} +#else +# define trace_selftest_startup_dynamic_tracing(trace, tr, func) ({ 0; }) +#endif /* CONFIG_DYNAMIC_FTRACE */ +/* + * Simple verification test of ftrace function tracer. + * Enable ftrace, sleep 1/10 second, and then read the trace + * buffer to see if all is in order. 
+ */ +int +trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr) +{ + unsigned long count; + int ret; + int save_ftrace_enabled = ftrace_enabled; + int save_tracer_enabled = tracer_enabled; + + /* make sure msleep has been recorded */ + msleep(1); + + /* force the recorded functions to be traced */ + ret = ftrace_force_update(); + if (ret) { + printk(KERN_CONT ".. ftraced failed .. "); + return ret; + } + + /* start the tracing */ + ftrace_enabled = 1; + tracer_enabled = 1; + + tr->ctrl = 1; + trace->init(tr); + /* Sleep for a 1/10 of a second */ + msleep(100); + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + ftrace_enabled = 0; + + /* check the trace buffer */ + ret = trace_test_buffer(tr, &count); + trace->reset(tr); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + goto out; + } + + ret = trace_selftest_startup_dynamic_tracing(trace, tr, + DYN_FTRACE_TEST_NAME); + + out: + ftrace_enabled = save_ftrace_enabled; + tracer_enabled = save_tracer_enabled; + + /* kill ftrace totally if we failed */ + if (ret) + ftrace_kill(); + + return ret; +} +#endif /* CONFIG_FTRACE */ + +#ifdef CONFIG_IRQSOFF_TRACER +int +trace_selftest_startup_irqsoff(struct tracer *trace, struct trace_array *tr) +{ + unsigned long save_max = tracing_max_latency; + unsigned long count; + int ret; + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + /* reset the max latency */ + tracing_max_latency = 0; + /* disable interrupts for a bit */ + local_irq_disable(); + udelay(100); + local_irq_enable(); + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check both trace buffers */ + ret = trace_test_buffer(tr, NULL); + if (!ret) + ret = trace_test_buffer(&max_tr, &count); + trace->reset(tr); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + } + + tracing_max_latency = save_max; + + return ret; +} +#endif /* CONFIG_IRQSOFF_TRACER */ + +#ifdef CONFIG_PREEMPT_TRACER +int +trace_selftest_startup_preemptoff(struct tracer *trace, struct trace_array *tr) +{ + unsigned long save_max = tracing_max_latency; + unsigned long count; + int ret; + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + /* reset the max latency */ + tracing_max_latency = 0; + /* disable preemption for a bit */ + preempt_disable(); + udelay(100); + preempt_enable(); + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check both trace buffers */ + ret = trace_test_buffer(tr, NULL); + if (!ret) + ret = trace_test_buffer(&max_tr, &count); + trace->reset(tr); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + } + + tracing_max_latency = save_max; + + return ret; +} +#endif /* CONFIG_PREEMPT_TRACER */ + +#if defined(CONFIG_IRQSOFF_TRACER) && defined(CONFIG_PREEMPT_TRACER) +int +trace_selftest_startup_preemptirqsoff(struct tracer *trace, struct trace_array *tr) +{ + unsigned long save_max = tracing_max_latency; + unsigned long count; + int ret; + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + + /* reset the max latency */ + tracing_max_latency = 0; + + /* disable preemption and interrupts for a bit */ + preempt_disable(); + local_irq_disable(); + udelay(100); + preempt_enable(); + /* reverse the order of preempt vs irqs */ + local_irq_enable(); + + /* stop the tracing. 
*/ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check both trace buffers */ + ret = trace_test_buffer(tr, NULL); + if (ret) + goto out; + + ret = trace_test_buffer(&max_tr, &count); + if (ret) + goto out; + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + goto out; + } + + /* do the test by disabling interrupts first this time */ + tracing_max_latency = 0; + tr->ctrl = 1; + trace->ctrl_update(tr); + preempt_disable(); + local_irq_disable(); + udelay(100); + preempt_enable(); + /* reverse the order of preempt vs irqs */ + local_irq_enable(); + + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check both trace buffers */ + ret = trace_test_buffer(tr, NULL); + if (ret) + goto out; + + ret = trace_test_buffer(&max_tr, &count); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + goto out; + } + + out: + trace->reset(tr); + tracing_max_latency = save_max; + + return ret; +} +#endif /* CONFIG_IRQSOFF_TRACER && CONFIG_PREEMPT_TRACER */ + +#ifdef CONFIG_SCHED_TRACER +static int trace_wakeup_test_thread(void *data) +{ + /* Make this a RT thread, doesn't need to be too high */ + struct sched_param param = { .sched_priority = 5 }; + struct completion *x = data; + + sched_setscheduler(current, SCHED_FIFO, ¶m); + + /* Make it know we have a new prio */ + complete(x); + + /* now go to sleep and let the test wake us up */ + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + + /* we are awake, now wait to disappear */ + while (!kthread_should_stop()) { + /* + * This is an RT task, do short sleeps to let + * others run. + */ + msleep(100); + } + + return 0; +} + +int +trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr) +{ + unsigned long save_max = tracing_max_latency; + struct task_struct *p; + struct completion isrt; + unsigned long count; + int ret; + + init_completion(&isrt); + + /* create a high prio thread */ + p = kthread_run(trace_wakeup_test_thread, &isrt, "ftrace-test"); + if (IS_ERR(p)) { + printk(KERN_CONT "Failed to create ftrace wakeup test thread "); + return -1; + } + + /* make sure the thread is running at an RT prio */ + wait_for_completion(&isrt); + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + /* reset the max latency */ + tracing_max_latency = 0; + + /* sleep to let the RT thread sleep too */ + msleep(100); + + /* + * Yes this is slightly racy. It is possible that for some + * strange reason that the RT thread we created, did not + * call schedule for 100ms after doing the completion, + * and we do a wakeup on a task that already is awake. + * But that is extremely unlikely, and the worst thing that + * happens in such a case, is that we disable tracing. + * Honestly, if this race does happen something is horrible + * wrong with the system. + */ + + wake_up_process(p); + + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check both trace buffers */ + ret = trace_test_buffer(tr, NULL); + if (!ret) + ret = trace_test_buffer(&max_tr, &count); + + + trace->reset(tr); + + tracing_max_latency = save_max; + + /* kill the thread */ + kthread_stop(p); + + if (!ret && !count) { + printk(KERN_CONT ".. 
no entries found .."); + ret = -1; + } + + return ret; +} +#endif /* CONFIG_SCHED_TRACER */ + +#ifdef CONFIG_CONTEXT_SWITCH_TRACER +int +trace_selftest_startup_sched_switch(struct tracer *trace, struct trace_array *tr) +{ + unsigned long count; + int ret; + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + /* Sleep for a 1/10 of a second */ + msleep(100); + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check the trace buffer */ + ret = trace_test_buffer(tr, &count); + trace->reset(tr); + + if (!ret && !count) { + printk(KERN_CONT ".. no entries found .."); + ret = -1; + } + + return ret; +} +#endif /* CONFIG_CONTEXT_SWITCH_TRACER */ + +#ifdef CONFIG_SYSPROF_TRACER +int +trace_selftest_startup_sysprof(struct tracer *trace, struct trace_array *tr) +{ + unsigned long count; + int ret; + + /* start the tracing */ + tr->ctrl = 1; + trace->init(tr); + /* Sleep for a 1/10 of a second */ + msleep(100); + /* stop the tracing. */ + tr->ctrl = 0; + trace->ctrl_update(tr); + /* check the trace buffer */ + ret = trace_test_buffer(tr, &count); + trace->reset(tr); + + return ret; +} +#endif /* CONFIG_SYSPROF_TRACER */ Index: linux-2.6.24.7/kernel/trace/trace_selftest_dynamic.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_selftest_dynamic.c @@ -0,0 +1,7 @@ +#include "trace.h" + +int DYN_FTRACE_TEST_NAME(void) +{ + /* used to call mcount */ + return 0; +} Index: linux-2.6.24.7/lib/Kconfig.debug =================================================================== --- linux-2.6.24.7.orig/lib/Kconfig.debug +++ linux-2.6.24.7/lib/Kconfig.debug @@ -517,4 +517,6 @@ config FAULT_INJECTION_STACKTRACE_FILTER help Provide stacktrace filter for fault-injection capabilities +source kernel/trace/Kconfig + source "samples/Kconfig" Index: linux-2.6.24.7/lib/Makefile =================================================================== --- linux-2.6.24.7.orig/lib/Makefile +++ linux-2.6.24.7/lib/Makefile @@ -8,6 +8,15 @@ lib-y := ctype.o string.o vsprintf.o cmd sha1.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o prio_heap.o +ifdef CONFIG_FTRACE +# Do not profile string.o, since it may be used in early boot or vdso +CFLAGS_REMOVE_string.o = -pg +# Also do not profile any debug utilities +CFLAGS_REMOVE_spinlock_debug.o = -pg +CFLAGS_REMOVE_list_debug.o = -pg +CFLAGS_REMOVE_debugobjects.o = -pg +endif + lib-$(CONFIG_MMU) += ioremap.o lib-$(CONFIG_SMP) += cpumask.o Index: linux-2.6.24.7/mm/page-writeback.c =================================================================== --- linux-2.6.24.7.orig/mm/page-writeback.c +++ linux-2.6.24.7/mm/page-writeback.c @@ -120,8 +120,6 @@ static void background_writeout(unsigned static struct prop_descriptor vm_completions; static struct prop_descriptor vm_dirties; -static unsigned long determine_dirtyable_memory(void); - /* * couple the period to the dirty_ratio: * @@ -280,7 +278,13 @@ static unsigned long highmem_dirtyable_m #endif } -static unsigned long determine_dirtyable_memory(void) +/** + * determine_dirtyable_memory - amount of memory that may be used + * + * Returns the numebr of pages that can currently be freed and used + * by the kernel for direct mappings. 
+ */ +unsigned long determine_dirtyable_memory(void) { unsigned long x; Index: linux-2.6.24.7/scripts/Makefile.lib =================================================================== --- linux-2.6.24.7.orig/scripts/Makefile.lib +++ linux-2.6.24.7/scripts/Makefile.lib @@ -90,7 +90,8 @@ basename_flags = -D"KBUILD_BASENAME=KBUI modname_flags = $(if $(filter 1,$(words $(modname))),\ -D"KBUILD_MODNAME=KBUILD_STR($(call name-fix,$(modname)))") -_c_flags = $(KBUILD_CFLAGS) $(ccflags-y) $(CFLAGS_$(basetarget).o) +orig_c_flags = $(KBUILD_CFLAGS) $(ccflags-y) $(CFLAGS_$(basetarget).o) +_c_flags = $(filter-out $(CFLAGS_REMOVE_$(basetarget).o), $(orig_c_flags)) _a_flags = $(KBUILD_AFLAGS) $(asflags-y) $(AFLAGS_$(basetarget).o) _cpp_flags = $(KBUILD_CPPFLAGS) $(cppflags-y) $(CPPFLAGS_$(@F)) �����������patches/ftrace-disable-daemon.patch�����������������������������������������������������������������0000664�0000764�0000764�00000017072�11041657730�016377� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/ftrace.h | 6 + kernel/trace/ftrace.c | 157 ++++++++++++++++++++++++++++++++++--------------- 2 files changed, 116 insertions(+), 47 deletions(-) Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -72,9 +72,15 @@ extern int ftrace_update_ftrace_func(ftr extern void ftrace_caller(void); extern void ftrace_call(void); extern void mcount_call(void); + +void ftrace_disable_daemon(void); +void ftrace_enable_daemon(void); + #else # define ftrace_force_update() ({ 0; }) # define ftrace_set_filter(buf, len, reset) do { } while (0) +# define ftrace_disable_daemon() do { } while (0) +# define ftrace_enable_daemon() do { } while (0) #endif /* totally disable ftrace - can not re-enable after this */ Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -151,8 +151,6 @@ static int __unregister_ftrace_function( #ifdef CONFIG_DYNAMIC_FTRACE static struct task_struct *ftraced_task; -static DECLARE_WAIT_QUEUE_HEAD(ftraced_waiters); -static unsigned long ftraced_iteration_counter; enum { FTRACE_ENABLE_CALLS = (1 << 0), @@ -189,6 +187,7 @@ static struct ftrace_page *ftrace_pages; static int ftraced_trigger; static int ftraced_suspend; +static int ftraced_stop; static int ftrace_record_suspend; @@ -472,14 +471,21 @@ ftrace_code_disable(struct dyn_ftrace *r } } +static int __ftrace_update_code(void *ignore); + static int __ftrace_modify_code(void *data) { unsigned long addr; int *command = data; - if (*command & FTRACE_ENABLE_CALLS) + if (*command & FTRACE_ENABLE_CALLS) { + /* + * Update any recorded ips now that we have the + * machine stopped + */ + __ftrace_update_code(NULL); ftrace_replace_code(1); - else if (*command & FTRACE_DISABLE_CALLS) + } else if (*command & FTRACE_DISABLE_CALLS) ftrace_replace_code(0); if (*command & FTRACE_UPDATE_TRACE_FUNC) @@ -501,6 +507,25 @@ static void ftrace_run_update_code(int c stop_machine_run(__ftrace_modify_code, &command, NR_CPUS); } +void ftrace_disable_daemon(void) +{ 
+ /* Stop the daemon from calling kstop_machine */ + mutex_lock(&ftraced_lock); + ftraced_stop = 1; + mutex_unlock(&ftraced_lock); + + ftrace_force_update(); +} + +void ftrace_enable_daemon(void) +{ + mutex_lock(&ftraced_lock); + ftraced_stop = 0; + mutex_unlock(&ftraced_lock); + + ftrace_force_update(); +} + static ftrace_func_t saved_ftrace_func; static void ftrace_startup(void) @@ -601,6 +626,7 @@ static int __ftrace_update_code(void *ig int i; /* Don't be recording funcs now */ + ftrace_record_suspend++; save_ftrace_enabled = ftrace_enabled; ftrace_enabled = 0; @@ -626,18 +652,23 @@ static int __ftrace_update_code(void *ig stop = ftrace_now(raw_smp_processor_id()); ftrace_update_time = stop - start; ftrace_update_tot_cnt += ftrace_update_cnt; + ftraced_trigger = 0; ftrace_enabled = save_ftrace_enabled; + ftrace_record_suspend--; return 0; } -static void ftrace_update_code(void) +static int ftrace_update_code(void) { - if (unlikely(ftrace_disabled)) - return; + if (unlikely(ftrace_disabled) || + !ftrace_enabled || !ftraced_trigger) + return 0; stop_machine_run(__ftrace_update_code, NULL, NR_CPUS); + + return 1; } static int ftraced(void *ignore) @@ -656,14 +687,13 @@ static int ftraced(void *ignore) mutex_lock(&ftrace_sysctl_lock); mutex_lock(&ftraced_lock); - if (ftrace_enabled && ftraced_trigger && !ftraced_suspend) { - ftrace_record_suspend++; - ftrace_update_code(); + if (!ftraced_suspend && !ftraced_stop && + ftrace_update_code()) { usecs = nsecs_to_usecs(ftrace_update_time); if (ftrace_update_tot_cnt > 100000) { ftrace_update_tot_cnt = 0; pr_info("hm, dftrace overflow: %lu change%s" - " (%lu total) in %lu usec%s\n", + " (%lu total) in %lu usec%s\n", ftrace_update_cnt, ftrace_update_cnt != 1 ? "s" : "", ftrace_update_tot_cnt, @@ -671,15 +701,10 @@ static int ftraced(void *ignore) ftrace_disabled = 1; WARN_ON_ONCE(1); } - ftraced_trigger = 0; - ftrace_record_suspend--; } - ftraced_iteration_counter++; mutex_unlock(&ftraced_lock); mutex_unlock(&ftrace_sysctl_lock); - wake_up_interruptible(&ftraced_waiters); - ftrace_shutdown_replenish(); } __set_current_state(TASK_RUNNING); @@ -1217,6 +1242,55 @@ ftrace_notrace_release(struct inode *ino return ftrace_regex_release(inode, file, 0); } +static ssize_t +ftraced_read(struct file *filp, char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + /* don't worry about races */ + char *buf = ftraced_stop ? 
"disabled\n" : "enabled\n"; + int r = strlen(buf); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); +} + +static ssize_t +ftraced_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + char buf[64]; + long val; + int ret; + + if (cnt >= sizeof(buf)) + return -EINVAL; + + if (copy_from_user(&buf, ubuf, cnt)) + return -EFAULT; + + if (strncmp(buf, "enable", 6) == 0) + val = 1; + else if (strncmp(buf, "disable", 7) == 0) + val = 0; + else { + buf[cnt] = 0; + + ret = strict_strtoul(buf, 10, &val); + if (ret < 0) + return ret; + + val = !!val; + } + + if (val) + ftrace_enable_daemon(); + else + ftrace_disable_daemon(); + + filp->f_pos += cnt; + + return cnt; +} + static struct file_operations ftrace_avail_fops = { .open = ftrace_avail_open, .read = seq_read, @@ -1240,51 +1314,34 @@ static struct file_operations ftrace_not .release = ftrace_notrace_release, }; +static struct file_operations ftraced_fops = { + .open = tracing_open_generic, + .read = ftraced_read, + .write = ftraced_write, +}; + /** * ftrace_force_update - force an update to all recording ftrace functions - * - * The ftrace dynamic update daemon only wakes up once a second. - * There may be cases where an update needs to be done immediately - * for tests or internal kernel tracing to begin. This function - * wakes the daemon to do an update and will not return until the - * update is complete. */ int ftrace_force_update(void) { - unsigned long last_counter; - DECLARE_WAITQUEUE(wait, current); int ret = 0; if (unlikely(ftrace_disabled)) return -ENODEV; + mutex_lock(&ftrace_sysctl_lock); mutex_lock(&ftraced_lock); - last_counter = ftraced_iteration_counter; - - set_current_state(TASK_INTERRUPTIBLE); - add_wait_queue(&ftraced_waiters, &wait); - if (unlikely(!ftraced_task)) { - ret = -ENODEV; - goto out; - } - - do { - mutex_unlock(&ftraced_lock); - wake_up_process(ftraced_task); - schedule(); - mutex_lock(&ftraced_lock); - if (signal_pending(current)) { - ret = -EINTR; - break; - } - set_current_state(TASK_INTERRUPTIBLE); - } while (last_counter == ftraced_iteration_counter); + /* + * If ftraced_trigger is not set, then there is nothing + * to update. 
+ */ + if (ftraced_trigger && !ftrace_update_code()) + ret = -EBUSY; - out: mutex_unlock(&ftraced_lock); - remove_wait_queue(&ftraced_waiters, &wait); - set_current_state(TASK_RUNNING); + mutex_unlock(&ftrace_sysctl_lock); return ret; } @@ -1329,6 +1386,12 @@ static __init int ftrace_init_debugfs(vo if (!entry) pr_warning("Could not create debugfs " "'set_ftrace_notrace' entry\n"); + + entry = debugfs_create_file("ftraced_enabled", 0644, d_tracer, + NULL, &ftraced_fops); + if (!entry) + pr_warning("Could not create debugfs " + "'ftraced_enabled' entry\n"); return 0; } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-safe-traversal-hlist.patch�����������������������������������������������������������0000664�0000764�0000764�00000002777�11041657734�017605� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From sagar.abhishek@gmail.com Tue May 27 11:53:35 2008 Date: Sat, 24 May 2008 23:45:02 +0530 From: Abhishek Sagar <sagar.abhishek@gmail.com> To: rostedt@goodmis.org Cc: Ingo Molnar <mingo@elte.hu>, LKML <linux-kernel@vger.kernel.org> Subject: [PATCH] ftrace: safe traversal of ftrace_hash hlist Hi Steven, I noticed that concurrent instances of ftrace_record_ip() have a race between ftrace_hash list traversal during ftrace_ip_in_hash() (before acquiring ftrace_shutdown_lock) and ftrace_add_hash().If it's so then this should fix it. Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> --- Accommodate traversal of ftrace_hash hlist with concurrent insertions. 
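The following is an illustration for readers of this note only, not part of the patch: a minimal, self-contained sketch of the lockless-reader/locked-writer pattern the fix relies on, written against the 2.6.24-era list API (the same four-argument hlist_for_each_entry_rcu() used in the hunk below). The names my_node, my_hash and my_hash_lock are invented for the sketch; the real code uses dyn_ftrace, ftrace_hash and ftrace_shutdown_lock.

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

struct my_node {
	unsigned long ip;
	struct hlist_node node;
};

static struct hlist_head my_hash[1 << 4];
static DEFINE_SPINLOCK(my_hash_lock);

/* lockless reader: may run concurrently with my_add_hash() */
static int my_ip_in_hash(unsigned long ip, unsigned long key)
{
	struct my_node *p;
	struct hlist_node *t;

	hlist_for_each_entry_rcu(p, t, &my_hash[key], node) {
		if (p->ip == ip)
			return 1;
	}
	return 0;
}

/* writer: hlist_add_head_rcu() publishes the fully initialised entry
 * with an ordered pointer update, so a concurrent reader never walks
 * through a half-visible node. */
static void my_add_hash(struct my_node *p, unsigned long key)
{
	spin_lock(&my_hash_lock);
	hlist_add_head_rcu(&p->node, &my_hash[key]);
	spin_unlock(&my_hash_lock);
}

This mirrors the situation described in the changelog: ftrace_ip_in_hash() walks the bucket before ftrace_shutdown_lock is taken, so only the traversal/insertion pair needs to become RCU-aware.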
--- kernel/trace/ftrace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -200,7 +200,7 @@ ftrace_ip_in_hash(unsigned long ip, unsi struct hlist_node *t; int found = 0; - hlist_for_each_entry(p, t, &ftrace_hash[key], node) { + hlist_for_each_entry_rcu(p, t, &ftrace_hash[key], node) { if (p->ip == ip) { found = 1; break; @@ -213,7 +213,7 @@ ftrace_ip_in_hash(unsigned long ip, unsi static inline void ftrace_add_hash(struct dyn_ftrace *node, unsigned long key) { - hlist_add_head(&node->node, &ftrace_hash[key]); + hlist_add_head_rcu(&node->node, &ftrace_hash[key]); } static void ftrace_free_rec(struct dyn_ftrace *rec) �patches/ftrace-update-cnt-stat-fix.patch������������������������������������������������������������0000664�0000764�0000764�00000003145�11041657734�017334� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From sagar.abhishek@gmail.com Tue May 27 11:54:47 2008 Date: Sun, 25 May 2008 00:10:04 +0530 From: Abhishek Sagar <sagar.abhishek@gmail.com> To: Ingo Molnar <mingo@elte.hu>, rostedt@goodmis.org Cc: LKML <linux-kernel@vger.kernel.org> Subject: [PATCH] ftrace: fix updating of ftrace_update_cnt Hi Ingo/Steven, Ftrace currently maintains an update count which includes false updates, i.e, updates which failed. If anything, such failures should be tracked by some separate variable, but this patch provides a minimal fix. 
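As an aside for this changelog (illustration only, not part of the patch): the rule the fix establishes is that ftrace_code_disable() now reports whether the mcount call site was actually rewritten, and the caller counts only those successes. The wrapper below is hypothetical and exists purely to show the intended accounting; the real change lives in __ftrace_update_code() in the hunk that follows.

/* hypothetical helper, for illustration of the accounting only */
static int __ftrace_update_one(struct dyn_ftrace *rec)
{
	/* on failure the record is flagged FTRACE_FL_FAILED and freed */
	if (!ftrace_code_disable(rec))
		return 0;
	ftrace_update_cnt++;	/* count successful updates only */
	return 1;
}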
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com> --- fix updating of ftrace_update_cnt --- kernel/trace/ftrace.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -452,7 +452,7 @@ static void ftrace_shutdown_replenish(vo ftrace_pages->next = (void *)get_zeroed_page(GFP_KERNEL); } -static void +static int ftrace_code_disable(struct dyn_ftrace *rec) { unsigned long ip; @@ -468,7 +468,9 @@ ftrace_code_disable(struct dyn_ftrace *r if (failed) { rec->flags |= FTRACE_FL_FAILED; ftrace_free_rec(rec); + return 0; } + return 1; } static int __ftrace_update_code(void *ignore); @@ -643,8 +645,8 @@ static int __ftrace_update_code(void *ig /* all CPUS are stopped, we are safe to modify code */ hlist_for_each_entry(p, t, &head, node) { - ftrace_code_disable(p); - ftrace_update_cnt++; + if (ftrace_code_disable(p)) + ftrace_update_cnt++; } } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-function-record-nop.patch������������������������������������������������������������0000664�0000764�0000764�00000002437�11041657731�017426� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: define function trace nop When CONFIG_FTRACE is not enabled, the tracing_start_functon_trace and tracing_stop_function_Trace should be nops. 
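A short illustration for this changelog (not part of the patch): the stubs use the conventional do { } while (0) form so they keep single-statement semantics and avoid empty-statement warnings, which lets call sites read the same whether the tracer is built in or not. The caller below is hypothetical.

#ifdef CONFIG_FTRACE
void tracing_start_function_trace(void);
void tracing_stop_function_trace(void);
#else
/* compiled out, but still behaves like one C statement at the call site */
# define tracing_start_function_trace() do { } while (0)
# define tracing_stop_function_trace() do { } while (0)
#endif

/* hypothetical caller: remains valid C with CONFIG_FTRACE disabled */
static void example_ctrl_update(int on)
{
	if (on)
		tracing_start_function_trace();
	else
		tracing_stop_function_trace();
}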
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace.h | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -223,8 +223,6 @@ void trace_function(struct trace_array * unsigned long parent_ip, unsigned long flags); -void tracing_start_function_trace(void); -void tracing_stop_function_trace(void); void tracing_start_cmdline_record(void); void tracing_stop_cmdline_record(void); int register_tracer(struct tracer *type); @@ -241,6 +239,14 @@ void update_max_tr_single(struct trace_a extern cycle_t ftrace_now(int cpu); +#ifdef CONFIG_FTRACE +void tracing_start_function_trace(void); +void tracing_stop_function_trace(void); +#else +# define tracing_start_function_trace() do { } while (0) +# define tracing_stop_function_trace() do { } while (0) +#endif + #ifdef CONFIG_CONTEXT_SWITCH_TRACER typedef void (*tracer_switch_func_t)(void *private, ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/tracer-add-event-markers.patch��������������������������������������������������������������0000664�0000764�0000764�00000021334�11041673257�017056� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Add markers to various events This patch adds markers to various events in the kernel. (interrupts, task activation and hrtimers) Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/kernel/apic_32.c | 3 ++ arch/x86/kernel/irq_32.c | 3 ++ arch/x86/kernel/irq_64.c | 4 +++ arch/x86/kernel/traps_32.c | 4 +++ arch/x86/kernel/traps_64.c | 4 +++ arch/x86/mm/fault_32.c | 4 +++ arch/x86/mm/fault_64.c | 4 +++ include/linux/ftrace.h | 49 +++++++++++++++++++++++++++++++++++++++++++++ kernel/hrtimer.c | 6 +++++ kernel/sched.c | 7 ++++++ 10 files changed, 88 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/apic_32.c +++ linux-2.6.24.7/arch/x86/kernel/apic_32.c @@ -45,6 +45,8 @@ #include "io_ports.h" +#include <linux/ftrace.h> + /* * Sanity check */ @@ -581,6 +583,7 @@ void fastcall smp_apic_timer_interrupt(s { struct pt_regs *old_regs = set_irq_regs(regs); + ftrace_event_irq(-1, user_mode(regs), regs->eip); /* * NOTE! We'd better ACK the irq immediately, * because timer handling can be slow. Index: linux-2.6.24.7/arch/x86/kernel/irq_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/irq_32.c +++ linux-2.6.24.7/arch/x86/kernel/irq_32.c @@ -16,6 +16,8 @@ #include <linux/cpu.h> #include <linux/delay.h> +#include <linux/ftrace.h> + #include <asm/apic.h> #include <asm/uaccess.h> @@ -85,6 +87,7 @@ fastcall unsigned int do_IRQ(struct pt_r old_regs = set_irq_regs(regs); irq_enter(); + ftrace_event_irq(irq, user_mode(regs), regs->eip); #ifdef CONFIG_DEBUG_STACKOVERFLOW /* Debugging check for stack overflow: is there less than 1KB free? 
*/ { Index: linux-2.6.24.7/arch/x86/kernel/irq_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/irq_64.c +++ linux-2.6.24.7/arch/x86/kernel/irq_64.c @@ -18,6 +18,8 @@ #include <asm/idle.h> #include <asm/smp.h> +#include <linux/ftrace.h> + atomic_t irq_err_count; #ifdef CONFIG_DEBUG_STACKOVERFLOW @@ -149,6 +151,8 @@ asmlinkage unsigned int do_IRQ(struct pt irq_enter(); irq = __get_cpu_var(vector_irq)[vector]; + ftrace_event_irq(irq, user_mode(regs), regs->rip); + #ifdef CONFIG_DEBUG_STACKOVERFLOW stack_overflow_check(regs); #endif Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -30,6 +30,8 @@ #include <linux/nmi.h> #include <linux/bug.h> +#include <linux/ftrace.h> + #ifdef CONFIG_EISA #include <linux/ioport.h> #include <linux/eisa.h> @@ -769,6 +771,8 @@ fastcall __kprobes void do_nmi(struct pt nmi_enter(); + ftrace_event_irq(-1, user_mode(regs), regs->eip); + cpu = smp_processor_id(); ++nmi_count(cpu); Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -33,6 +33,8 @@ #include <linux/kdebug.h> #include <linux/utsname.h> +#include <linux/ftrace.h> + #if defined(CONFIG_EDAC) #include <linux/edac.h> #endif @@ -782,6 +784,8 @@ asmlinkage __kprobes void default_do_nmi cpu = smp_processor_id(); + ftrace_event_irq(-1, user_mode(regs), regs->rip); + /* Only the BSP gets external NMIs from the system. */ if (!cpu) reason = get_nmi_reason(); Index: linux-2.6.24.7/arch/x86/mm/fault_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/fault_32.c +++ linux-2.6.24.7/arch/x86/mm/fault_32.c @@ -27,6 +27,8 @@ #include <linux/kdebug.h> #include <linux/kprobes.h> +#include <linux/ftrace.h> + #include <asm/system.h> #include <asm/desc.h> #include <asm/segment.h> @@ -311,6 +313,8 @@ fastcall void __kprobes do_page_fault(st /* get the address */ address = read_cr2(); + ftrace_event_fault(regs->eip, error_code, address); + tsk = current; si_code = SEGV_MAPERR; Index: linux-2.6.24.7/arch/x86/mm/fault_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/fault_64.c +++ linux-2.6.24.7/arch/x86/mm/fault_64.c @@ -27,6 +27,8 @@ #include <linux/kdebug.h> #include <linux/kprobes.h> +#include <linux/ftrace.h> + #include <asm/system.h> #include <asm/pgalloc.h> #include <asm/smp.h> @@ -316,6 +318,8 @@ asmlinkage void __kprobes do_page_fault( /* get the address */ address = read_cr2(); + ftrace_event_fault(regs->rip, error_code, address); + info.si_code = SEGV_MAPERR; Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -4,6 +4,7 @@ #ifdef CONFIG_FTRACE #include <linux/linkage.h> +#include <linux/ktime.h> #include <linux/fs.h> extern int ftrace_enabled; @@ -136,4 +137,52 @@ static inline void ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3) { } #endif +#ifdef CONFIG_EVENT_TRACER +#include <linux/marker.h> + +static inline void ftrace_event_irq(int irq, int user, unsigned long ip) +{ + trace_mark(ftrace_event_irq, "%d %d %ld", irq, user, ip); +} + 
+static inline void ftrace_event_fault(unsigned long ip, unsigned long error, + unsigned long addr) +{ + trace_mark(ftrace_event_fault, "%ld %ld %ld", ip, error, addr); +} + +static inline void ftrace_event_timer_set(void *p1, void *p2) +{ + trace_mark(ftrace_event_timer_set, "%p %p", p1, p2); +} + +static inline void ftrace_event_timer_triggered(void *p1, void *p2) +{ + trace_mark(ftrace_event_timer_triggered, "%p %p", p1, p2); +} + +static inline void ftrace_event_timestamp(ktime_t *time) +{ + trace_mark(ftrace_event_hrtimer, "%p", time); +} + +static inline void ftrace_event_task_activate(struct task_struct *p, int cpu) +{ + trace_mark(ftrace_event_task_activate, "%p %d", p, cpu); +} + +static inline void ftrace_event_task_deactivate(struct task_struct *p, int cpu) +{ + trace_mark(ftrace_event_task_deactivate, "%p %d", p, cpu); +} +#else +# define ftrace_event_irq(irq, user, ip) do { } while (0) +# define ftrace_event_fault(ip, error, addr) do { } while (0) +# define ftrace_event_timer_set(p1, p2) do { } while (0) +# define ftrace_event_timer_triggered(p1, p2) do { } while (0) +# define ftrace_event_timestamp(now) do { } while (0) +# define ftrace_event_task_activate(p, cpu) do { } while (0) +# define ftrace_event_task_deactivate(p, cpu) do { } while (0) +#endif /* CONFIG_TRACE_EVENTS */ + #endif /* _LINUX_FTRACE_H */ Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -44,6 +44,8 @@ #include <linux/seq_file.h> #include <linux/err.h> +#include <linux/ftrace.h> + #include <asm/uaccess.h> /** @@ -742,6 +744,7 @@ static void enqueue_hrtimer(struct hrtim struct hrtimer *entry; int leftmost = 1; + ftrace_event_timer_set(&timer->expires, timer); /* * Find the right place in the rbtree: */ @@ -1094,6 +1097,7 @@ void hrtimer_interrupt(struct clock_even retry: now = ktime_get(); + ftrace_event_timestamp(&now); expires_next.tv64 = KTIME_MAX; @@ -1122,6 +1126,8 @@ void hrtimer_interrupt(struct clock_even break; } + ftrace_event_timer_triggered(&timer->expires, timer); + /* Move softirq callbacks to the pending list */ if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) { __remove_hrtimer(timer, base, Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -91,6 +91,11 @@ unsigned long long __attribute__((weak)) #define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20) #define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio) +#define __PRIO(prio) \ + ((prio) <= 99 ? 
199 - (prio) : (prio) - 120) + +#define PRIO(p) __PRIO((p)->prio) + /* * 'User priority' is the nice value converted to something we * can work with better when scaling various scheduler parameters, @@ -1119,6 +1124,7 @@ static void activate_task(struct rq *rq, if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; + ftrace_event_task_activate(p, cpu_of(rq)); enqueue_task(rq, p, wakeup); inc_nr_running(p, rq); } @@ -1131,6 +1137,7 @@ static void deactivate_task(struct rq *r if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; + ftrace_event_task_deactivate(p, cpu_of(rq)); dequeue_task(rq, p, sleep); dec_nr_running(p, rq); }

patches/tracer-event-trace.patch

Add event tracer.

This patch adds an event tracer that hooks into various events in the kernel. Although it can be used separately, it is mainly meant to help the other tracers (wakeup and preempt-off) see various events in their traces without having to enable the heavy mcount hooks.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/Kconfig | 9
 kernel/trace/Makefile | 1
 kernel/trace/trace.c | 220 ++++++++++++++++
 kernel/trace/trace.h | 98 +++++++
 kernel/trace/trace_events.c | 566 ++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_selftest.c | 7
 6 files changed, 901 insertions(+)

Index: linux-2.6.24.7/kernel/trace/Kconfig =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Kconfig +++ linux-2.6.24.7/kernel/trace/Kconfig @@ -85,6 +85,15 @@ config SCHED_TRACER This tracer tracks the latency of the highest priority task to be scheduled in, starting from the point it has woken up. +config EVENT_TRACER + bool "trace kernel events" + depends on DEBUG_KERNEL + select CONTEXT_SWITCH_TRACER + help + This option activates the event tracer of the latency_tracer. + It activates markers throughout the kernel for tracing. + This option has a fairly low overhead when enabled.
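For reference, the markers this option arms can also be consumed directly. The sketch below is not part of this patch; it registers a probe on the ftrace_event_irq marker added earlier, using the same marker_probe_register() call and probe signature that trace_events.c uses further down. The module and function names are hypothetical.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/marker.h>

/* Probe signature matches the marker API used by this series. */
static void probe_irq(void *probe_data, void *call_data,
		      const char *format, va_list *args)
{
	int irq  = va_arg(*args, int);		/* -1 for NMIs */
	int user = va_arg(*args, int);		/* user_mode(regs) */
	unsigned long ip = va_arg(*args, unsigned long);

	/* consume the event; a real tracer would record it in its buffer */
	(void)irq; (void)user; (void)ip;
}

static int __init probe_irq_init(void)
{
	/* format string must match the one at the trace_mark() site */
	return marker_probe_register("ftrace_event_irq", "%d %d %ld",
				     probe_irq, NULL);
}

static void __exit probe_irq_exit(void)
{
	marker_probe_unregister("ftrace_event_irq", probe_irq, NULL);
}

module_init(probe_irq_init);
module_exit(probe_irq_exit);
MODULE_LICENSE("GPL");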
+ config CONTEXT_SWITCH_TRACER bool "Trace process context switches" depends on HAVE_FTRACE Index: linux-2.6.24.7/kernel/trace/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Makefile +++ linux-2.6.24.7/kernel/trace/Makefile @@ -19,5 +19,6 @@ obj-$(CONFIG_IRQSOFF_TRACER) += trace_ir obj-$(CONFIG_PREEMPT_TRACER) += trace_irqsoff.o obj-$(CONFIG_SCHED_TRACER) += trace_sched_wakeup.o obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o +obj-$(CONFIG_EVENT_TRACER) += trace_events.o libftrace-y := ftrace.o Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -1006,6 +1006,126 @@ ftrace_special(unsigned long arg1, unsig local_irq_restore(flags); } +void tracing_event_irq(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + int irq, int usermode, + unsigned long retip) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_IRQ; + entry->irq.ip = ip; + entry->irq.irq = irq; + entry->irq.ret_ip = retip; + entry->irq.usermode = usermode; +} + +void tracing_event_fault(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long retip, + unsigned long error_code, + unsigned long address) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_FAULT; + entry->fault.ip = ip; + entry->fault.ret_ip = retip; + entry->fault.errorcode = error_code; + entry->fault.address = address; +} + +void tracing_event_timer_set(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expires, void *timer) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_TIMER_SET; + entry->timer.ip = ip; + entry->timer.expire = *expires; + entry->timer.timer = timer; +} + +void tracing_event_timer_triggered(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expired, void *timer) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_TIMER_TRIG; + entry->timer.ip = ip; + entry->timer.expire = *expired; + entry->timer.timer = timer; +} + +void tracing_event_timestamp(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *now) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_TIMESTAMP; + entry->timestamp.ip = ip; + entry->timestamp.now = *now; +} + +void tracing_event_task_activate(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + struct task_struct *p, + int task_cpu) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_TASK_ACT; + entry->task.ip = ip; + entry->task.pid = p->pid; + entry->task.prio = p->prio; + entry->task.cpu = task_cpu; +} + +void tracing_event_task_deactivate(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned 
long ip, + struct task_struct *p, + int task_cpu) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_TASK_DEACT; + entry->task.ip = ip; + entry->task.pid = p->pid; + entry->task.prio = p->prio; + entry->task.cpu = task_cpu; +} + #ifdef CONFIG_FTRACE static void function_trace_call(unsigned long ip, unsigned long parent_ip) @@ -1511,6 +1631,55 @@ print_lat_fmt(struct trace_iterator *ite } trace_seq_puts(s, "\n"); break; + case TRACE_IRQ: + seq_print_ip_sym(s, entry->irq.ip, sym_flags); + if (entry->irq.irq >= 0) + trace_seq_printf(s, " %d ", entry->irq.irq); + if (entry->irq.usermode) + trace_seq_puts(s, " (usermode)\n "); + else { + trace_seq_puts(s, " ("); + seq_print_ip_sym(s, entry->irq.ret_ip, sym_flags); + trace_seq_puts(s, ")\n"); + } + break; + case TRACE_FAULT: + seq_print_ip_sym(s, entry->fault.ip, sym_flags); + trace_seq_printf(s, " %lx ", entry->fault.errorcode); + trace_seq_puts(s, " ("); + seq_print_ip_sym(s, entry->fault.ret_ip, sym_flags); + trace_seq_puts(s, ")"); + trace_seq_printf(s, " [%lx]\n", entry->fault.address); + break; + case TRACE_TIMER_SET: + seq_print_ip_sym(s, entry->timer.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%p)\n", + entry->timer.expire, entry->timer.timer); + break; + case TRACE_TIMER_TRIG: + seq_print_ip_sym(s, entry->timer.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%p)\n", + entry->timer.expire, entry->timer.timer); + break; + case TRACE_TIMESTAMP: + seq_print_ip_sym(s, entry->timestamp.ip, sym_flags); + trace_seq_printf(s, " (%Ld)\n", + entry->timestamp.now.tv64); + break; + case TRACE_TASK_ACT: + seq_print_ip_sym(s, entry->task.ip, sym_flags); + comm = trace_find_cmdline(entry->task.pid); + trace_seq_printf(s, " %s %d %d [%d]\n", + comm, entry->task.pid, + entry->task.prio, entry->task.cpu); + break; + case TRACE_TASK_DEACT: + seq_print_ip_sym(s, entry->task.ip, sym_flags); + comm = trace_find_cmdline(entry->task.pid); + trace_seq_printf(s, " %s %d %d [%d]\n", + comm, entry->task.pid, + entry->task.prio, entry->task.cpu); + break; default: trace_seq_printf(s, "Unknown type %d\n", entry->type); } @@ -1608,6 +1777,57 @@ static int print_trace_fmt(struct trace_ if (!ret) return 0; break; + case TRACE_IRQ: + seq_print_ip_sym(s, entry->irq.ip, sym_flags); + if (entry->irq.irq >= 0) + trace_seq_printf(s, " %d ", entry->irq.irq); + if (entry->irq.usermode) + trace_seq_puts(s, " (usermode)\n "); + else { + trace_seq_puts(s, " ("); + seq_print_ip_sym(s, entry->irq.ret_ip, sym_flags); + trace_seq_puts(s, ")\n"); + } + break; + case TRACE_FAULT: + seq_print_ip_sym(s, entry->fault.ip, sym_flags); + trace_seq_printf(s, " %lx ", entry->fault.errorcode); + trace_seq_puts(s, " ("); + seq_print_ip_sym(s, entry->fault.ret_ip, sym_flags); + trace_seq_puts(s, ")"); + trace_seq_printf(s, " [%lx]\n", entry->fault.address); + break; + case TRACE_TIMER_SET: + seq_print_ip_sym(s, entry->timer.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%p)\n", + entry->timer.expire, entry->timer.timer); + break; + case TRACE_TIMER_TRIG: + seq_print_ip_sym(s, entry->timer.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%p)\n", + entry->timer.expire, entry->timer.timer); + break; + case TRACE_TIMESTAMP: + seq_print_ip_sym(s, entry->timestamp.ip, sym_flags); + trace_seq_printf(s, " (%Ld)\n", + entry->timestamp.now.tv64); + break; + case TRACE_TASK_ACT: + seq_print_ip_sym(s, entry->task.ip, sym_flags); + comm = trace_find_cmdline(entry->task.pid); + trace_seq_printf(s, " %s %d %d [%d]\n", 
+ comm, entry->task.pid, + entry->task.prio, entry->task.cpu); + break; + case TRACE_TASK_DEACT: + seq_print_ip_sym(s, entry->task.ip, sym_flags); + comm = trace_find_cmdline(entry->task.pid); + trace_seq_printf(s, " %s %d %d [%d]\n", + comm, entry->task.pid, + entry->task.prio, entry->task.cpu); + break; + default: + trace_seq_printf(s, "Unknown type %d\n", entry->type); } return 1; } Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -17,6 +17,13 @@ enum trace_type { TRACE_SPECIAL, TRACE_MMIO_RW, TRACE_MMIO_MAP, + TRACE_IRQ, + TRACE_FAULT, + TRACE_TIMER_SET, + TRACE_TIMER_TRIG, + TRACE_TIMESTAMP, + TRACE_TASK_ACT, + TRACE_TASK_DEACT, __TRACE_LAST_TYPE }; @@ -50,6 +57,45 @@ struct special_entry { unsigned long arg3; }; +struct irq_entry { + unsigned long ip; + unsigned long ret_ip; + unsigned irq; + unsigned usermode; +}; + +struct fault_entry { + unsigned long ip; + unsigned long ret_ip; + unsigned long errorcode; + unsigned long address; +}; + +struct timer_entry { + unsigned long ip; + ktime_t expire; + void *timer; +}; + +struct timestamp_entry { + unsigned long ip; + ktime_t now; +}; + +struct task_entry { + unsigned long ip; + pid_t pid; + unsigned prio; + int cpu; +}; + +struct wakeup_entry { + unsigned long ip; + pid_t pid; + unsigned prio; + unsigned curr_prio; +}; + /* * Stack-trace entry: */ @@ -80,6 +126,12 @@ struct trace_entry { struct stack_entry stack; struct mmiotrace_rw mmiorw; struct mmiotrace_map mmiomap; + struct irq_entry irq; + struct fault_entry fault; + struct timer_entry timer; + struct timestamp_entry timestamp; + struct task_entry task; + struct wakeup_entry wakeup; }; }; @@ -222,6 +274,52 @@ void trace_function(struct trace_array * unsigned long ip, unsigned long parent_ip, unsigned long flags); +void tracing_event_irq(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + int irq, int usermode, + unsigned long retip); +void tracing_event_fault(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long retip, + unsigned long error_code, + unsigned long address); +void tracing_event_timer_set(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expires, void *timer); +void tracing_event_timer_triggered(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expired, void *timer); +void tracing_event_timestamp(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *now); +void tracing_event_task_activate(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + struct task_struct *p, + int cpu); +void tracing_event_task_deactivate(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + struct task_struct *p, + int cpu); +void tracing_event_wakeup(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + pid_t pid, int prio, + int curr_prio); void tracing_start_cmdline_record(void); void tracing_stop_cmdline_record(void); Index: linux-2.6.24.7/kernel/trace/trace_events.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_events.c @@ -0,0 +1,566 @@ +/* + * 
trace task events + * + * Copyright (C) 2007 Steven Rostedt <srostedt@redhat.com> + * + * Based on code from the latency_tracer, that is: + * + * Copyright (C) 2004-2006 Ingo Molnar + * Copyright (C) 2004 William Lee Irwin III + */ +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/debugfs.h> +#include <linux/kallsyms.h> +#include <linux/uaccess.h> +#include <linux/ftrace.h> + +#include "trace.h" + +static struct trace_array __read_mostly *events_trace; +static int __read_mostly tracer_enabled; +static atomic_t event_ref; + +static void event_reset(struct trace_array *tr) +{ + struct trace_array_cpu *data; + int cpu; + + for_each_possible_cpu(cpu) { + data = tr->data[cpu]; + tracing_reset(data); + } + + tr->time_start = ftrace_now(raw_smp_processor_id()); +} + +#define getarg(arg, ap) arg = va_arg(ap, typeof(arg)); + +static void +event_irq_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long ip, flags; + int irq, user, cpu; + long disable; + + if (!tracer_enabled) + return; + + getarg(irq, *args); + getarg(user, *args); + getarg(ip, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_irq(tr, data, flags, CALLER_ADDR1, irq, user, ip); + + out: + atomic_dec(&data->disabled); +} + +static void +event_fault_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long ip, flags, error, addr; + long disable; + int cpu; + + if (!tracer_enabled) + return; + + getarg(ip, *args); + getarg(error, *args); + getarg(addr, *args); + + preempt_disable_notrace(); + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_fault(tr, data, flags, CALLER_ADDR1, ip, error, addr); + + out: + atomic_dec(&data->disabled); + preempt_enable_notrace(); +} + +static void +event_timer_set_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + ktime_t *expires; + void *timer; + long disable; + int cpu; + + if (!tracer_enabled) + return; + + getarg(expires, *args); + getarg(timer, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_timer_set(tr, data, flags, CALLER_ADDR1, expires, timer); + + out: + atomic_dec(&data->disabled); +} + +static void +event_timer_triggered_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + ktime_t *expired; + void *timer; + long disable; + int cpu; + + if (!tracer_enabled) + return; + + getarg(expired, *args); + getarg(timer, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_timer_triggered(tr, data, flags, 
CALLER_ADDR1, expired, timer); + + out: + atomic_dec(&data->disabled); +} + +static void +event_hrtimer_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + ktime_t *now; + long disable; + int cpu; + + if (!tracer_enabled) + return; + + getarg(now, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_timestamp(tr, data, flags, CALLER_ADDR1, now); + + out: + atomic_dec(&data->disabled); +} + +static void +event_task_activate_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + struct task_struct *p; + long disable; + int cpu, rqcpu; + + if (!tracer_enabled) + return; + + getarg(p, *args); + getarg(rqcpu, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_task_activate(tr, data, flags, CALLER_ADDR1, p, rqcpu); + + out: + atomic_dec(&data->disabled); +} + +static void +event_task_deactivate_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + struct task_struct *p; + long disable; + int cpu, rqcpu; + + if (!tracer_enabled) + return; + + getarg(p, *args); + getarg(rqcpu, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_task_deactivate(tr, data, flags, CALLER_ADDR1, p, rqcpu); + + out: + atomic_dec(&data->disabled); +} + +static void +event_wakeup_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + struct task_struct *wakee, *curr; + long disable, ignore2; + void *ignore3; + int ignore1; + int cpu; + + if (!tracer_enabled) + return; + + getarg(ignore1, *args); + getarg(ignore2, *args); + getarg(ignore3, *args); + + getarg(wakee, *args); + getarg(curr, *args); + + /* interrupts should be disabled */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (unlikely(disable != 1)) + goto out; + + local_save_flags(flags); + /* record process's command line */ + tracing_record_cmdline(wakee); + tracing_record_cmdline(curr); + + tracing_sched_wakeup_trace(tr, data, wakee, curr, flags); + + out: + atomic_dec(&data->disabled); +} +static void +event_ctx_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + struct task_struct *prev; + struct task_struct *next; + long disable, ignore2; + void *ignore3; + int ignore1; + int cpu; + + if (!tracer_enabled) + return; + + /* skip prev_pid %d next_pid %d prev_state %ld */ + getarg(ignore1, *args); + getarg(ignore1, *args); + getarg(ignore2, *args); + getarg(ignore3, *args); + + prev = va_arg(*args, typeof(prev)); + 
next = va_arg(*args, typeof(next)); + + tracing_record_cmdline(prev); + tracing_record_cmdline(next); + + /* interrupts should be disabled */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + disable = atomic_inc_return(&data->disabled); + + if (likely(disable != 1)) + goto out; + + local_save_flags(flags); + tracing_sched_switch_trace(tr, data, prev, next, flags); + out: + atomic_dec(&data->disabled); +} + +static int event_register_marker(const char *name, const char *format, + marker_probe_func *probe, void *data) +{ + int ret; + + ret = marker_probe_register(name, format, probe, data); + if (ret) { + pr_info("event trace: Couldn't add marker" + " probe to %s\n", name); + return ret; + } + + return 0; +} + +static void event_tracer_register(struct trace_array *tr) +{ + int ret; + + ret = event_register_marker("ftrace_event_irq", "%d %d %ld", + event_irq_callback, tr); + if (ret) + return; + + ret = event_register_marker("ftrace_event_fault", "%ld %ld %ld", + event_fault_callback, tr); + if (ret) + goto out1; + + ret = event_register_marker("ftrace_event_timer_set", "%p %p", + event_timer_set_callback, tr); + if (ret) + goto out2; + + ret = event_register_marker("ftrace_event_timer_triggered", "%p %p", + event_timer_triggered_callback, tr); + if (ret) + goto out3; + + ret = event_register_marker("ftrace_event_hrtimer", "%p", + event_hrtimer_callback, tr); + if (ret) + goto out4; + + ret = event_register_marker("ftrace_event_task_activate", "%p %d", + event_task_activate_callback, tr); + if (ret) + goto out5; + + ret = event_register_marker("ftrace_event_task_deactivate", "%p %d", + event_task_deactivate_callback, tr); + if (ret) + goto out6; + + ret = event_register_marker("kernel_sched_wakeup", + "pid %d state %ld ## rq %p task %p rq->curr %p", + event_wakeup_callback, tr); + if (ret) + goto out7; + + ret = event_register_marker("kernel_sched_wakeup_new", + "pid %d state %ld ## rq %p task %p rq->curr %p", + event_wakeup_callback, tr); + if (ret) + goto out8; + + ret = event_register_marker("kernel_sched_schedule", + "prev_pid %d next_pid %d prev_state %ld " + "## rq %p prev %p next %p", + event_ctx_callback, tr); + if (ret) + goto out9; + + return; + + out9: + marker_probe_unregister("kernel_sched_wakeup_new", + event_wakeup_callback, tr); + out8: + marker_probe_unregister("kernel_sched_wakeup", + event_wakeup_callback, tr); + out7: + marker_probe_unregister("ftrace_event_task_deactivate", + event_task_deactivate_callback, tr); + out6: + marker_probe_unregister("ftrace_event_task_activate", + event_task_activate_callback, tr); + out5: + marker_probe_unregister("ftrace_event_hrtimer", + event_hrtimer_callback, tr); + out4: + marker_probe_unregister("ftrace_event_timer_triggered", + event_timer_triggered_callback, tr); + out3: + marker_probe_unregister("ftrace_event_timer_set", + event_timer_set_callback, tr); + out2: + marker_probe_unregister("ftrace_event_fault", + event_fault_callback, tr); + out1: + marker_probe_unregister("ftrace_event_irq", + event_irq_callback, tr); +} + +static void event_tracer_unregister(struct trace_array *tr) +{ + marker_probe_unregister("kernel_sched_schedule", + event_ctx_callback, tr); + marker_probe_unregister("kernel_sched_wakeup_new", + event_wakeup_callback, tr); + marker_probe_unregister("kernel_sched_wakeup", + event_wakeup_callback, tr); + marker_probe_unregister("ftrace_event_task_deactivate", + event_task_deactivate_callback, tr); + marker_probe_unregister("ftrace_event_task_activate", + event_task_activate_callback, tr); + 
marker_probe_unregister("ftrace_event_hrtimer", + event_hrtimer_callback, tr); + marker_probe_unregister("ftrace_event_timer_triggered", + event_timer_triggered_callback, tr); + marker_probe_unregister("ftrace_event_timer_set", + event_timer_set_callback, tr); + marker_probe_unregister("ftrace_event_fault", + event_fault_callback, tr); + marker_probe_unregister("ftrace_event_irq", + event_irq_callback, tr); +} + +void trace_event_register(struct trace_array *tr) +{ + long ref; + + ref = atomic_inc_return(&event_ref); + if (ref == 1) + event_tracer_register(tr); +} + +void trace_event_unregister(struct trace_array *tr) +{ + long ref; + + ref = atomic_dec_and_test(&event_ref); + if (ref) + event_tracer_unregister(tr); +} + +static void start_event_trace(struct trace_array *tr) +{ + event_reset(tr); + trace_event_register(tr); + tracing_start_function_trace(); + tracer_enabled = 1; +} + +static void stop_event_trace(struct trace_array *tr) +{ + tracer_enabled = 0; + tracing_stop_function_trace(); + trace_event_unregister(tr); +} + +static void event_trace_init(struct trace_array *tr) +{ + events_trace = tr; + + if (tr->ctrl) + start_event_trace(tr); +} + +static void event_trace_reset(struct trace_array *tr) +{ + if (tr->ctrl) + stop_event_trace(tr); +} + +static void event_trace_ctrl_update(struct trace_array *tr) +{ + if (tr->ctrl) + start_event_trace(tr); + else + stop_event_trace(tr); +} + +static void event_trace_open(struct trace_iterator *iter) +{ + /* stop the trace while dumping */ + if (iter->tr->ctrl) + tracer_enabled = 0; +} + +static void event_trace_close(struct trace_iterator *iter) +{ + if (iter->tr->ctrl) + tracer_enabled = 1; +} + +static struct tracer event_trace __read_mostly = +{ + .name = "events", + .init = event_trace_init, + .reset = event_trace_reset, + .open = event_trace_open, + .close = event_trace_close, + .ctrl_update = event_trace_ctrl_update, +}; + +__init static int init_event_trace(void) +{ + int ret; + + ret = register_tracer(&event_trace); + if (ret) + return ret; + + return 0; +} + +device_initcall(init_event_trace); Index: linux-2.6.24.7/kernel/trace/trace_selftest.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_selftest.c +++ linux-2.6.24.7/kernel/trace/trace_selftest.c @@ -11,6 +11,13 @@ static inline int trace_valid_entry(stru case TRACE_WAKE: case TRACE_STACK: case TRACE_SPECIAL: + case TRACE_IRQ: + case TRACE_FAULT: + case TRACE_TIMER_SET: + case TRACE_TIMER_TRIG: + case TRACE_TIMESTAMP: + case TRACE_TASK_ACT: + case TRACE_TASK_DEACT: return 1; } return 0; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/trace-histograms.patch����������������������������������������������������������������������0000664�0000764�0000764�00000056530�11041657732�015551� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Critical latency timings histogram This patch adds 
hooks into the latency tracer to give us histograms of interrupts off, preemption off, and wakeup timings.

This code is based on work done by Yi Yang <yyang@ch.mvista.com>, but was heavily modified to work with the new tracer and cleaned up by Steven Rostedt <srostedt@redhat.com>.

This adds the following to /debugfs/tracing:

latency_hist/ - root dir for histograms.

Under latency_hist there is (depending on what's configured):

interrupt_off_latency/ - latency histograms of interrupts off.
preempt_interrupts_off_latency/ - latency histograms of preemption and/or interrupts off.
preempt_off_latency/ - latency histograms of preemption off.
wakeup_latency/ - latency histograms of wakeup timings.

Under each of the above is a file labeled:

CPU# - one for each possible CPU, where # is the CPU number.
reset - writing into this file will reset the histogram back to zeros and start again.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/Kconfig | 22 +
 kernel/trace/Makefile | 4
 kernel/trace/trace_hist.c | 638 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_hist.h | 39 ++
 kernel/trace/trace_irqsoff.c | 19 +
 5 files changed, 722 insertions(+)

Index: linux-2.6.24.7/kernel/trace/Kconfig =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Kconfig +++ linux-2.6.24.7/kernel/trace/Kconfig @@ -134,3 +134,25 @@ config FTRACE_STARTUP_TEST a series of tests are made to verify that the tracer is functioning properly. It will do tests on all the configured tracers of ftrace. + +config INTERRUPT_OFF_HIST + bool "Interrupts off critical timings histogram" + depends on IRQSOFF_TRACER + help + This option uses the infrastructure of the critical + irqs off timings to create a histogram of latencies. + +config PREEMPT_OFF_HIST + bool "Preempt off critical timings histogram" + depends on PREEMPT_TRACER + help + This option uses the infrastructure of the critical + preemption off timings to create a histogram of latencies. + +config WAKEUP_LATENCY_HIST + bool "Wakeup latency histogram" + select TRACING + select MARKERS + help + This option uses the infrastructure of the wakeup tracer + to create a histogram of latencies. Index: linux-2.6.24.7/kernel/trace/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Makefile +++ linux-2.6.24.7/kernel/trace/Makefile @@ -21,4 +21,8 @@ obj-$(CONFIG_SCHED_TRACER) += trace_sche obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o obj-$(CONFIG_EVENT_TRACER) += trace_events.o +obj-$(CONFIG_INTERRUPT_OFF_HIST) += trace_hist.o +obj-$(CONFIG_PREEMPT_OFF_HIST) += trace_hist.o +obj-$(CONFIG_WAKEUP_LATENCY_HIST) += trace_hist.o + libftrace-y := ftrace.o Index: linux-2.6.24.7/kernel/trace/trace_hist.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_hist.c @@ -0,0 +1,638 @@ +/* + * kernel/trace/trace_hist.c + * + * Add support for histograms of preemption-off latency and + * interrupt-off latency and wakeup latency, it depends on + * Real-Time Preemption Support. + * + * Copyright (C) 2005 MontaVista Software, Inc. + * Yi Yang <yyang@ch.mvista.com> + * + * Converted to work with the new latency tracer. + * Copyright (C) 2008 Red Hat, Inc.
+ * Steven Rostedt <srostedt@redhat.com> + * + */ +#include <linux/module.h> +#include <linux/debugfs.h> +#include <linux/seq_file.h> +#include <linux/percpu.h> +#include <linux/spinlock.h> +#include <linux/marker.h> +#include <asm/atomic.h> +#include <asm/div64.h> +#include <asm/uaccess.h> + +#include "trace.h" +#include "trace_hist.h" + +enum { + INTERRUPT_LATENCY = 0, + PREEMPT_LATENCY, + PREEMPT_INTERRUPT_LATENCY, + WAKEUP_LATENCY, +}; + +#define MAX_ENTRY_NUM 10240 + +struct hist_data { + atomic_t hist_mode; /* 0 log, 1 don't log */ + unsigned long min_lat; + unsigned long avg_lat; + unsigned long max_lat; + unsigned long long beyond_hist_bound_samples; + unsigned long long accumulate_lat; + unsigned long long total_samples; + unsigned long long hist_array[MAX_ENTRY_NUM]; +}; + +static char *latency_hist_dir_root = "latency_hist"; + +#ifdef CONFIG_INTERRUPT_OFF_HIST +static DEFINE_PER_CPU(struct hist_data, interrupt_off_hist); +static char *interrupt_off_hist_dir = "interrupt_off_latency"; +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST +static DEFINE_PER_CPU(struct hist_data, preempt_off_hist); +static char *preempt_off_hist_dir = "preempt_off_latency"; +#endif + +#if defined(CONFIG_PREEMPT_OFF_HIST) && defined(CONFIG_INTERRUPT_OFF_HIST) +static DEFINE_PER_CPU(struct hist_data, preempt_irqs_off_hist); +static char *preempt_irqs_off_hist_dir = "preempt_interrupts_off_latency"; +#endif + +#ifdef CONFIG_WAKEUP_LATENCY_HIST +static DEFINE_PER_CPU(struct hist_data, wakeup_latency_hist); +static char *wakeup_latency_hist_dir = "wakeup_latency"; +#endif + +static inline u64 u64_div(u64 x, u64 y) +{ + do_div(x, y); + return x; +} + +void notrace latency_hist(int latency_type, int cpu, unsigned long latency) +{ + struct hist_data *my_hist; + + if ((cpu < 0) || (cpu >= NR_CPUS) || (latency_type < INTERRUPT_LATENCY) + || (latency_type > WAKEUP_LATENCY) || (latency < 0)) + return; + + switch (latency_type) { +#ifdef CONFIG_INTERRUPT_OFF_HIST + case INTERRUPT_LATENCY: + my_hist = &per_cpu(interrupt_off_hist, cpu); + break; +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST + case PREEMPT_LATENCY: + my_hist = &per_cpu(preempt_off_hist, cpu); + break; +#endif + +#if defined(CONFIG_PREEMPT_OFF_HIST) && defined(CONFIG_INTERRUPT_OFF_HIST) + case PREEMPT_INTERRUPT_LATENCY: + my_hist = &per_cpu(preempt_irqs_off_hist, cpu); + break; +#endif + +#ifdef CONFIG_WAKEUP_LATENCY_HIST + case WAKEUP_LATENCY: + my_hist = &per_cpu(wakeup_latency_hist, cpu); + break; +#endif + default: + return; + } + + if (atomic_read(&my_hist->hist_mode) == 0) + return; + + if (latency >= MAX_ENTRY_NUM) + my_hist->beyond_hist_bound_samples++; + else + my_hist->hist_array[latency]++; + + if (latency < my_hist->min_lat) + my_hist->min_lat = latency; + else if (latency > my_hist->max_lat) + my_hist->max_lat = latency; + + my_hist->total_samples++; + my_hist->accumulate_lat += latency; + my_hist->avg_lat = (unsigned long) u64_div(my_hist->accumulate_lat, + my_hist->total_samples); + return; +} + +static void *l_start(struct seq_file *m, loff_t *pos) +{ + loff_t *index_ptr = kmalloc(sizeof(loff_t), GFP_KERNEL); + loff_t index = *pos; + struct hist_data *my_hist = m->private; + + if (!index_ptr) + return NULL; + + if (index == 0) { + atomic_dec(&my_hist->hist_mode); + seq_printf(m, "#Minimum latency: %lu microseconds.\n" + "#Average latency: %lu microseconds.\n" + "#Maximum latency: %lu microseconds.\n" + "#Total samples: %llu\n" + "#There are %llu samples greater or equal" + " than %d microseconds\n" + "#usecs\t%16s\n" + , my_hist->min_lat + , 
my_hist->avg_lat + , my_hist->max_lat + , my_hist->total_samples + , my_hist->beyond_hist_bound_samples + , MAX_ENTRY_NUM, "samples"); + } + if (index >= MAX_ENTRY_NUM) + return NULL; + + *index_ptr = index; + return index_ptr; +} + +static void *l_next(struct seq_file *m, void *p, loff_t *pos) +{ + loff_t *index_ptr = p; + struct hist_data *my_hist = m->private; + + if (++*pos >= MAX_ENTRY_NUM) { + atomic_inc(&my_hist->hist_mode); + return NULL; + } + *index_ptr = *pos; + return index_ptr; +} + +static void l_stop(struct seq_file *m, void *p) +{ + kfree(p); +} + +static int l_show(struct seq_file *m, void *p) +{ + int index = *(loff_t *) p; + struct hist_data *my_hist = m->private; + + seq_printf(m, "%5d\t%16llu\n", index, my_hist->hist_array[index]); + return 0; +} + +static struct seq_operations latency_hist_seq_op = { + .start = l_start, + .next = l_next, + .stop = l_stop, + .show = l_show +}; + +static int latency_hist_open(struct inode *inode, struct file *file) +{ + int ret; + + ret = seq_open(file, &latency_hist_seq_op); + if (!ret) { + struct seq_file *seq = file->private_data; + seq->private = inode->i_private; + } + return ret; +} + +static struct file_operations latency_hist_fops = { + .open = latency_hist_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +static void hist_reset(struct hist_data *hist) +{ + atomic_dec(&hist->hist_mode); + + memset(hist->hist_array, 0, sizeof(hist->hist_array)); + hist->beyond_hist_bound_samples = 0UL; + hist->min_lat = 0xFFFFFFFFUL; + hist->max_lat = 0UL; + hist->total_samples = 0UL; + hist->accumulate_lat = 0UL; + hist->avg_lat = 0UL; + + atomic_inc(&hist->hist_mode); +} + +ssize_t latency_hist_reset(struct file *file, const char __user *a, + size_t size, loff_t *off) +{ + int cpu; + struct hist_data *hist; + int latency_type = (long)file->private_data; + + switch (latency_type) { + +#ifdef CONFIG_WAKEUP_LATENCY_HIST + case WAKEUP_LATENCY: + for_each_online_cpu(cpu) { + hist = &per_cpu(wakeup_latency_hist, cpu); + hist_reset(hist); + } + break; +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST + case PREEMPT_LATENCY: + for_each_online_cpu(cpu) { + hist = &per_cpu(preempt_off_hist, cpu); + hist_reset(hist); + } + break; +#endif + +#ifdef CONFIG_INTERRUPT_OFF_HIST + case INTERRUPT_LATENCY: + for_each_online_cpu(cpu) { + hist = &per_cpu(interrupt_off_hist, cpu); + hist_reset(hist); + } + break; +#endif + +#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST) + case PREEMPT_INTERRUPT_LATENCY: + for_each_online_cpu(cpu) { + hist = &per_cpu(preempt_irqs_off_hist, cpu); + hist_reset(hist); + } + break; +#endif + } + + return size; +} + +static struct file_operations latency_hist_reset_fops = { + .open = tracing_open_generic, + .write = latency_hist_reset, +}; + +#if defined(CONFIG_INTERRUPT_OFF_HIST) || defined(CONFIG_PREEMPT_OFF_HIST) +#ifdef CONFIG_INTERRUPT_OFF_HIST +static DEFINE_PER_CPU(cycles_t, hist_irqsoff_start); +static DEFINE_PER_CPU(int, hist_irqsoff_tracing); +#endif +#ifdef CONFIG_PREEMPT_OFF_HIST +static DEFINE_PER_CPU(cycles_t, hist_preemptoff_start); +static DEFINE_PER_CPU(int, hist_preemptoff_tracing); +#endif +#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST) +static DEFINE_PER_CPU(cycles_t, hist_preemptirqsoff_start); +static DEFINE_PER_CPU(int, hist_preemptirqsoff_tracing); +#endif + +notrace void tracing_hist_preempt_start(void) +{ + cycle_t uninitialized_var(start); + int start_set = 0; + int cpu; + + if (!preempt_count() && !irqs_disabled()) + return; + + /* 
cpu is only used if we are in atomic */ + cpu = raw_smp_processor_id(); + +#ifdef CONFIG_INTERRUPT_OFF_HIST + if (irqs_disabled() && + !per_cpu(hist_irqsoff_tracing, cpu)) { + per_cpu(hist_irqsoff_tracing, cpu) = 1; + start_set++; + start = ftrace_now(cpu); + per_cpu(hist_irqsoff_start, cpu) = start; + } +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST + if (preempt_count() && + !per_cpu(hist_preemptoff_tracing, cpu)) { + per_cpu(hist_preemptoff_tracing, cpu) = 1; + if (1 || !(start_set++)) + start = ftrace_now(cpu); + per_cpu(hist_preemptoff_start, cpu) = start; + + } +#endif + +#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST) + if (!per_cpu(hist_preemptirqsoff_tracing, cpu)) { + per_cpu(hist_preemptirqsoff_tracing, cpu) = 1; + if (1 || !(start_set)) + start = ftrace_now(cpu); + per_cpu(hist_preemptirqsoff_start, cpu) = start; + } +#endif +} + +notrace void tracing_hist_preempt_stop(int irqs_on) +{ + long latency; + cycle_t start; + cycle_t uninitialized_var(stop); + int stop_set = 0; + int cpu; + + /* irqs_on == TRACE_STOP if we must stop tracing. */ + + /* cpu is only used if we are in atomic */ + cpu = raw_smp_processor_id(); + +#ifdef CONFIG_INTERRUPT_OFF_HIST + if (irqs_on && + per_cpu(hist_irqsoff_tracing, cpu)) { + stop = ftrace_now(cpu); + stop_set++; + start = per_cpu(hist_irqsoff_start, cpu); + latency = (long)nsecs_to_usecs(stop - start); + if (latency > 1000000) { + printk("%d: latency = %ld (%lu)\n", __LINE__, latency, latency); + printk("%d: start=%Ld stop=%Ld\n", __LINE__, start, stop); + } + barrier(); + per_cpu(hist_irqsoff_tracing, cpu) = 0; + latency_hist(INTERRUPT_LATENCY, cpu, latency); + } +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST + if ((!irqs_on || irqs_on == TRACE_STOP) && + per_cpu(hist_preemptoff_tracing, cpu)) { + WARN_ON(!preempt_count()); + if (1 || !(stop_set++)) + stop = ftrace_now(cpu); + start = per_cpu(hist_preemptoff_start, cpu); + latency = (long)nsecs_to_usecs(stop - start); + if (latency > 1000000) { + printk("%d: latency = %ld (%lu)\n", __LINE__, latency, latency); + printk("%d: start=%Ld stop=%Ld\n", __LINE__, start, stop); + } + barrier(); + per_cpu(hist_preemptoff_tracing, cpu) = 0; + latency_hist(PREEMPT_LATENCY, cpu, latency); + } +#endif + +#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST) + if (((!irqs_on && !irqs_disabled()) || + (irqs_on && !preempt_count()) || + (irqs_on == TRACE_STOP)) && + per_cpu(hist_preemptirqsoff_tracing, cpu)) { + WARN_ON(!preempt_count() && !irqs_disabled()); + if (1 || !stop_set) + stop = ftrace_now(cpu); + start = per_cpu(hist_preemptirqsoff_start, cpu); + latency = (long)nsecs_to_usecs(stop - start); + if (latency > 1000000) { + printk("%d: latency = %ld (%lu)\n", __LINE__, latency, latency); + printk("%d: start=%Ld stop=%Ld\n", __LINE__, start, stop); + } + barrier(); + per_cpu(hist_preemptirqsoff_tracing, cpu) = 0; + latency_hist(PREEMPT_INTERRUPT_LATENCY, cpu, latency); + } +#endif +} +#endif + +#ifdef CONFIG_WAKEUP_LATENCY_HIST +int tracing_wakeup_hist __read_mostly = 1; + +static unsigned wakeup_prio = (unsigned)-1 ; +static struct task_struct *wakeup_task; +static cycle_t wakeup_start; +static DEFINE_SPINLOCK(wakeup_lock); + +notrace void tracing_hist_wakeup_start(struct task_struct *p, + struct task_struct *curr) +{ + unsigned long flags; + + if (likely(!rt_task(p)) || + p->prio >= wakeup_prio || + p->prio >= curr->prio) + return; + + spin_lock_irqsave(&wakeup_lock, flags); + if (wakeup_task) + put_task_struct(wakeup_task); + + get_task_struct(p); + wakeup_task = 
p; + wakeup_prio = p->prio; + wakeup_start = ftrace_now(raw_smp_processor_id()); + spin_unlock_irqrestore(&wakeup_lock, flags); +} + +notrace void tracing_hist_wakeup_stop(struct task_struct *next) +{ + unsigned long flags; + long latency; + cycle_t stop; + + if (next != wakeup_task) + return; + + stop = ftrace_now(raw_smp_processor_id()); + + spin_lock_irqsave(&wakeup_lock, flags); + if (wakeup_task != next) + goto out; + + latency = (long)nsecs_to_usecs(stop - wakeup_start); + + latency_hist(WAKEUP_LATENCY, smp_processor_id(), latency); + + put_task_struct(wakeup_task); + wakeup_task = NULL; + wakeup_prio = (unsigned)-1; + out: + spin_unlock_irqrestore(&wakeup_lock, flags); + +} + +static void +sched_switch_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct task_struct *prev; + struct task_struct *next; + struct rq *__rq; + + /* skip prev_pid %d next_pid %d prev_state %ld */ + (void)va_arg(*args, int); + (void)va_arg(*args, int); + (void)va_arg(*args, long); + __rq = va_arg(*args, typeof(__rq)); + prev = va_arg(*args, typeof(prev)); + next = va_arg(*args, typeof(next)); + + tracing_hist_wakeup_stop(next); +} + +static void +wake_up_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct task_struct *curr; + struct task_struct *task; + struct rq *__rq; + + /* Skip pid %d state %ld */ + (void)va_arg(*args, int); + (void)va_arg(*args, long); + /* now get the meat: "rq %p task %p rq->curr %p" */ + __rq = va_arg(*args, typeof(__rq)); + task = va_arg(*args, typeof(task)); + curr = va_arg(*args, typeof(curr)); + + tracing_hist_wakeup_start(task, curr); +} + +#endif + +static __init int latency_hist_init(void) +{ + struct dentry *latency_hist_root = NULL; + struct dentry *dentry; + struct dentry *entry; + int i = 0, len = 0; + struct hist_data *my_hist; + char name[64]; + + dentry = tracing_init_dentry(); + + latency_hist_root = + debugfs_create_dir(latency_hist_dir_root, dentry); + +#ifdef CONFIG_INTERRUPT_OFF_HIST + dentry = debugfs_create_dir(interrupt_off_hist_dir, + latency_hist_root); + for_each_possible_cpu(i) { + len = sprintf(name, "CPU%d", i); + name[len] = '\0'; + entry = debugfs_create_file(name, 0444, dentry, + &per_cpu(interrupt_off_hist, i), + &latency_hist_fops); + my_hist = &per_cpu(interrupt_off_hist, i); + atomic_set(&my_hist->hist_mode, 1); + my_hist->min_lat = 0xFFFFFFFFUL; + } + entry = debugfs_create_file("reset", 0444, dentry, + (void *)INTERRUPT_LATENCY, + &latency_hist_reset_fops); +#endif + +#ifdef CONFIG_PREEMPT_OFF_HIST + dentry = debugfs_create_dir(preempt_off_hist_dir, + latency_hist_root); + for_each_possible_cpu(i) { + len = sprintf(name, "CPU%d", i); + name[len] = '\0'; + entry = debugfs_create_file(name, 0444, dentry, + &per_cpu(preempt_off_hist, i), + &latency_hist_fops); + my_hist = &per_cpu(preempt_off_hist, i); + atomic_set(&my_hist->hist_mode, 1); + my_hist->min_lat = 0xFFFFFFFFUL; + } + entry = debugfs_create_file("reset", 0444, dentry, + (void *)PREEMPT_LATENCY, + &latency_hist_reset_fops); +#endif + +#if defined(CONFIG_INTERRUPT_OFF_HIST) && defined(CONFIG_PREEMPT_OFF_HIST) + dentry = debugfs_create_dir(preempt_irqs_off_hist_dir, + latency_hist_root); + for_each_possible_cpu(i) { + len = sprintf(name, "CPU%d", i); + name[len] = '\0'; + entry = debugfs_create_file(name, 0444, dentry, + &per_cpu(preempt_off_hist, i), + &latency_hist_fops); + my_hist = &per_cpu(preempt_irqs_off_hist, i); + atomic_set(&my_hist->hist_mode, 1); + my_hist->min_lat = 0xFFFFFFFFUL; + } + entry = 
debugfs_create_file("reset", 0444, dentry, + (void *)PREEMPT_INTERRUPT_LATENCY, + &latency_hist_reset_fops); +#endif + +#ifdef CONFIG_WAKEUP_LATENCY_HIST + + i = marker_probe_register("kernel_sched_wakeup", + "pid %d state %ld ## rq %p task %p rq->curr %p", + wake_up_callback, NULL); + if (i) { + pr_info("wakeup hist: Couldn't add marker" + " probe to kernel_sched_wakeup\n"); + goto out_wake; + } + + i = marker_probe_register("kernel_sched_wakeup_new", + "pid %d state %ld ## rq %p task %p rq->curr %p", + wake_up_callback, NULL); + if (i) { + pr_info("wakeup hist: Couldn't add marker" + " probe to kernel_sched_wakeup_new\n"); + goto fail_deprobe; + } + + i = marker_probe_register("kernel_sched_schedule", + "prev_pid %d next_pid %d prev_state %ld " + "## rq %p prev %p next %p", + sched_switch_callback, NULL); + if (i) { + pr_info("wakeup hist: Couldn't add marker" + " probe to kernel_sched_schedule\n"); + goto fail_deprobe_wake_new; + } + + dentry = debugfs_create_dir(wakeup_latency_hist_dir, + latency_hist_root); + for_each_possible_cpu(i) { + len = sprintf(name, "CPU%d", i); + name[len] = '\0'; + entry = debugfs_create_file(name, 0444, dentry, + &per_cpu(wakeup_latency_hist, i), + &latency_hist_fops); + my_hist = &per_cpu(wakeup_latency_hist, i); + atomic_set(&my_hist->hist_mode, 1); + my_hist->min_lat = 0xFFFFFFFFUL; + } + entry = debugfs_create_file("reset", 0444, dentry, + (void *)WAKEUP_LATENCY, + &latency_hist_reset_fops); + + goto out_wake; + +fail_deprobe_wake_new: + marker_probe_unregister("kernel_sched_wakeup_new", + wake_up_callback, NULL); +fail_deprobe: + marker_probe_unregister("kernel_sched_wakeup", + wake_up_callback, NULL); + out_wake: +#endif + return 0; + +} + +__initcall(latency_hist_init); Index: linux-2.6.24.7/kernel/trace/trace_hist.h =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/trace_hist.h @@ -0,0 +1,39 @@ +/* + * kernel/trace/trace_hist.h + * + * Add support for histograms of preemption-off latency and + * interrupt-off latency and wakeup latency, it depends on + * Real-Time Preemption Support. + * + * Copyright (C) 2005 MontaVista Software, Inc. + * Yi Yang <yyang@ch.mvista.com> + * + * Converted to work with the new latency tracer. + * Copyright (C) 2008 Red Hat, Inc. 
+ * Steven Rostedt <srostedt@redhat.com> + * + */ +#ifndef _LIB_TRACING_TRACER_HIST_H_ +#define _LIB_TRACING_TRACER_HIST_H_ + +#if defined(CONFIG_INTERRUPT_OFF_HIST) || defined(CONFIG_PREEMPT_OFF_HIST) +# define TRACE_STOP 2 +void tracing_hist_preempt_start(void); +void tracing_hist_preempt_stop(int irqs_on); +#else +# define tracing_hist_preempt_start() do { } while (0) +# define tracing_hist_preempt_stop(irqs_off) do { } while (0) +#endif + +#ifdef CONFIG_WAKEUP_LATENCY_HIST +void tracing_hist_wakeup_start(struct task_struct *p, + struct task_struct *curr); +void tracing_hist_wakeup_stop(struct task_struct *next); +extern int tracing_wakeup_hist; +#else +# define tracing_hist_wakeup_start(p, curr) do { } while (0) +# define tracing_hist_wakeup_stop(next) do { } while (0) +# define tracing_wakeup_hist 0 +#endif + +#endif /* ifndef _LIB_TRACING_TRACER_HIST_H_ */ Index: linux-2.6.24.7/kernel/trace/trace_irqsoff.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_irqsoff.c +++ linux-2.6.24.7/kernel/trace/trace_irqsoff.c @@ -17,6 +17,7 @@ #include <linux/fs.h> #include "trace.h" +#include "trace_hist.h" static struct trace_array *irqsoff_trace __read_mostly; static int tracer_enabled __read_mostly; @@ -252,10 +253,14 @@ void start_critical_timings(void) { if (preempt_trace() || irq_trace()) start_critical_timing(CALLER_ADDR0, CALLER_ADDR1); + + tracing_hist_preempt_start(); } void stop_critical_timings(void) { + tracing_hist_preempt_stop(TRACE_STOP); + if (preempt_trace() || irq_trace()) stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1); } @@ -264,6 +269,8 @@ void stop_critical_timings(void) #ifdef CONFIG_PROVE_LOCKING void time_hardirqs_on(unsigned long a0, unsigned long a1) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(a0, a1); } @@ -272,6 +279,8 @@ void time_hardirqs_off(unsigned long a0, { if (!preempt_trace() && irq_trace()) start_critical_timing(a0, a1); + + tracing_hist_preempt_start(); } #else /* !CONFIG_PROVE_LOCKING */ @@ -305,6 +314,8 @@ inline void print_irqtrace_events(struct */ void trace_hardirqs_on(void) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(CALLER_ADDR0, CALLER_ADDR1); } @@ -314,11 +325,15 @@ void trace_hardirqs_off(void) { if (!preempt_trace() && irq_trace()) start_critical_timing(CALLER_ADDR0, CALLER_ADDR1); + + tracing_hist_preempt_start(); } EXPORT_SYMBOL(trace_hardirqs_off); void trace_hardirqs_on_caller(unsigned long caller_addr) { + tracing_hist_preempt_stop(1); + if (!preempt_trace() && irq_trace()) stop_critical_timing(CALLER_ADDR0, caller_addr); } @@ -328,6 +343,8 @@ void trace_hardirqs_off_caller(unsigned { if (!preempt_trace() && irq_trace()) start_critical_timing(CALLER_ADDR0, caller_addr); + + tracing_hist_preempt_start(); } EXPORT_SYMBOL(trace_hardirqs_off_caller); @@ -337,12 +354,14 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller) #ifdef CONFIG_PREEMPT_TRACER void trace_preempt_on(unsigned long a0, unsigned long a1) { + tracing_hist_preempt_stop(0); stop_critical_timing(a0, a1); } void trace_preempt_off(unsigned long a0, unsigned long a1) { start_critical_timing(a0, a1); + tracing_hist_preempt_start(); } #endif /* CONFIG_PREEMPT_TRACER */ 
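To make the debugfs layout described in the changelog concrete, the short userspace sketch below (not part of any patch in this series) dumps one histogram and then pokes its reset file. The paths assume debugfs is mounted at /debugfs as in the changelog text, and wakeup_latency/CPU0 is just an example choice.

#include <stdio.h>

int main(void)
{
	const char *hist  = "/debugfs/tracing/latency_hist/wakeup_latency/CPU0";
	const char *reset = "/debugfs/tracing/latency_hist/wakeup_latency/reset";
	char line[256];
	FILE *f;

	f = fopen(hist, "r");
	if (!f) {
		perror(hist);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* header comments plus "usecs  samples" rows */
	fclose(f);

	f = fopen(reset, "w");
	if (!f) {
		perror(reset);
		return 1;
	}
	fputs("1\n", f);		/* any write resets this latency type's histograms */
	fclose(f);
	return 0;
}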
patches/trace_hist-divzero.patch

Date: Tue, 27 May 2008 03:21:25 +0200
From: Carsten Emde <c.emde@osadl.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Subject: trace_hist.c: divide-by-zero problem (2)

Steven,

do we really need to continuously calculate the average latency and spend lots of time in the division function? I don't think so. It is probably sufficient to calculate the average latency only when we display it. What do you think?

Carsten.

Dividing the 64-bit accumulated latency by the 64-bit sample count to calculate the average latency may be time-consuming on a 32-bit system. We therefore no longer calculate the average whenever a new latency value is added, but only when we display the histogram data.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
---
 kernel/trace/trace_hist.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

Index: linux-2.6.24.7/kernel/trace/trace_hist.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_hist.c +++ linux-2.6.24.7/kernel/trace/trace_hist.c @@ -17,7 +17,6 @@ #include <linux/debugfs.h> #include <linux/seq_file.h> #include <linux/percpu.h> -#include <linux/spinlock.h> #include <linux/marker.h> #include <asm/atomic.h> #include <asm/div64.h> @@ -68,15 +67,10 @@ static DEFINE_PER_CPU(struct hist_data, static char *wakeup_latency_hist_dir = "wakeup_latency"; #endif -static inline u64 u64_div(u64 x, u64 y) -{ - do_div(x, y); - return x; -} - void notrace latency_hist(int latency_type, int cpu, unsigned long latency) { struct hist_data *my_hist; + unsigned long long total_samples; if ((cpu < 0) || (cpu >= NR_CPUS) || (latency_type < INTERRUPT_LATENCY) || (latency_type > WAKEUP_LATENCY) || (latency < 0)) @@ -123,10 +117,11 @@ void notrace latency_hist(int latency_ty else if (latency > my_hist->max_lat) my_hist->max_lat = latency; - my_hist->total_samples++; + total_samples = my_hist->total_samples++; my_hist->accumulate_lat += latency; - my_hist->avg_lat = (unsigned long) u64_div(my_hist->accumulate_lat, - my_hist->total_samples); + if (likely(total_samples)) + my_hist->avg_lat = (unsigned long) + div64_64(my_hist->accumulate_lat, total_samples); return; } @@ -220,11 +215,11 @@ static void hist_reset(struct hist_data atomic_dec(&hist->hist_mode); memset(hist->hist_array, 0, sizeof(hist->hist_array)); - hist->beyond_hist_bound_samples = 0UL; + hist->beyond_hist_bound_samples = 0ULL; hist->min_lat = 0xFFFFFFFFUL; hist->max_lat = 0UL; - hist->total_samples = 0UL; - hist->accumulate_lat = 0UL; + hist->total_samples = 0ULL; + hist->accumulate_lat = 0ULL; hist->avg_lat = 0UL; atomic_inc(&hist->hist_mode);
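As an aside, the direction Carsten suggests in the mail, computing the average only when the histogram is displayed, could look roughly like the sketch below. This is not what the patch above ends up doing (it keeps updating avg_lat on every sample, only guarding the division); the helper name is hypothetical and the fields come from struct hist_data in trace-histograms.patch.

/*
 * Sketch only: derive the average when the histogram is read, so that
 * latency_hist() can drop the division from the hot path entirely.
 * This would be called from l_start() when the header line is printed.
 */
static unsigned long hist_avg_lat(struct hist_data *my_hist)
{
	if (!my_hist->total_samples)
		return 0;

	return (unsigned long)div64_64(my_hist->accumulate_lat,
				       my_hist->total_samples);
}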
���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/event-tracer-syscall-x86_64.patch�����������������������������������������������������������0000664�0000764�0000764�00000007615�11041657731�017277� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Add hooks to x86 to track syscalls for event trace. This code was taken from the work by Ingo Molnar. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/ia32/ia32entry.S | 9 +++++++ arch/x86/kernel/entry_64.S | 8 ++++++- include/asm-x86/calling.h | 50 ++++++++++++++++++++++++++++++++++++++++++++ include/asm-x86/unistd_64.h | 2 + 4 files changed, 67 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/x86/ia32/ia32entry.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/ia32/ia32entry.S +++ linux-2.6.24.7/arch/x86/ia32/ia32entry.S @@ -132,7 +132,9 @@ sysenter_do_call: cmpl $(IA32_NR_syscalls-1),%eax ja ia32_badsys IA32_ARG_FIXUP 1 + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) GET_THREAD_INFO(%r10) cli @@ -237,7 +239,9 @@ cstar_do_call: cmpl $IA32_NR_syscalls-1,%eax ja ia32_badsys IA32_ARG_FIXUP 1 + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) GET_THREAD_INFO(%r10) cli @@ -328,8 +332,10 @@ ia32_do_syscall: cmpl $(IA32_NR_syscalls-1),%eax ja ia32_badsys IA32_ARG_FIXUP + TRACE_SYS_IA32_CALL call *ia32_sys_call_table(,%rax,8) # xxx: rip relative ia32_sysret: + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) jmp int_ret_from_sys_call @@ -400,7 +406,7 @@ END(ia32_ptregs_common) .section .rodata,"a" .align 8 -ia32_sys_call_table: +ENTRY(ia32_sys_call_table) .quad sys_restart_syscall .quad sys_exit .quad stub32_fork @@ -726,4 +732,5 @@ ia32_sys_call_table: .quad compat_sys_timerfd .quad sys_eventfd .quad sys32_fallocate +.globl ia32_syscall_end ia32_syscall_end: Index: linux-2.6.24.7/arch/x86/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_64.S +++ linux-2.6.24.7/arch/x86/kernel/entry_64.S @@ -338,7 +338,10 @@ ENTRY(system_call) cmpq $__NR_syscall_max,%rax ja badsys movq %r10,%rcx + TRACE_SYS_CALL call *sys_call_table(,%rax,8) # XXX: rip relative +system_call_ret: + TRACE_SYS_RET movq %rax,RAX-ARGOFFSET(%rsp) /* * Syscall return path ending with SYSRET (fast path) @@ -408,7 +411,7 @@ badsys: jmp ret_from_sys_call /* Do syscall tracing */ -tracesys: +tracesys: SAVE_REST movq $-ENOSYS,RAX(%rsp) FIXUP_TOP_OF_STACK %rdi @@ -421,7 +424,10 @@ tracesys: cmova %rcx,%rax ja 1f movq %r10,%rcx /* fixup for C */ + TRACE_SYS_CALL call *sys_call_table(,%rax,8) +traceret: + TRACE_SYS_RET 1: movq %rax,RAX-ARGOFFSET(%rsp) /* Use IRET because user could have changed frame */ Index: linux-2.6.24.7/include/asm-x86/calling.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/calling.h +++ linux-2.6.24.7/include/asm-x86/calling.h @@ -160,3 +160,53 @@ .macro icebp .byte 0xf1 .endm + +/* + * latency-tracing helpers: + */ + + 
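/*
 * The mov sequences in the macros below re-order the syscall registers
 * into the C calling convention: the syscall number in %rax becomes the
 * first C argument and the original %rdi/%rsi/%rdx follow it, so
 * sys_call()/sys_ia32_call() (expected to be provided by
 * trace-events-handle-syscalls.patch later in the series) receive the
 * syscall number and its first three arguments, while TRACE_SYS_RET
 * hands sys_ret() the return value from %rax.
 */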
.macro TRACE_SYS_CALL + +#ifdef CONFIG_EVENT_TRACER + SAVE_ARGS + + mov %rdx, %rcx + mov %rsi, %rdx + mov %rdi, %rsi + mov %rax, %rdi + + call sys_call + + RESTORE_ARGS +#endif + .endm + + + .macro TRACE_SYS_IA32_CALL + +#ifdef CONFIG_EVENT_TRACER + SAVE_ARGS + + mov %rdx, %rcx + mov %rsi, %rdx + mov %rdi, %rsi + mov %rax, %rdi + + call sys_ia32_call + + RESTORE_ARGS +#endif + .endm + + .macro TRACE_SYS_RET + +#ifdef CONFIG_EVENT_TRACER + SAVE_ARGS + + mov %rax, %rdi + + call sys_ret + + RESTORE_ARGS +#endif + .endm Index: linux-2.6.24.7/include/asm-x86/unistd_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/unistd_64.h +++ linux-2.6.24.7/include/asm-x86/unistd_64.h @@ -11,6 +11,8 @@ * Note: holes are not allowed. */ +#define NR_syscalls (__NR_syscall_max+1) + /* at least 8 syscall per cacheline */ #define __NR_read 0 __SYSCALL(__NR_read, sys_read) �������������������������������������������������������������������������������������������������������������������patches/event-tracer-syscall-i386.patch�������������������������������������������������������������0000664�0000764�0000764�00000002757�11041657732�017035� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������hooks into i386 to track event calls. This code was taken from the work by Ingo Molnar. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/kernel/entry_32.S | 15 +++++++++++++++ 1 file changed, 15 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_32.S +++ linux-2.6.24.7/arch/x86/kernel/entry_32.S @@ -330,6 +330,11 @@ sysenter_past_esp: pushl %eax CFI_ADJUST_CFA_OFFSET 4 SAVE_ALL +#ifdef CONFIG_EVENT_TRACE + pushl %edx; pushl %ecx; pushl %ebx; pushl %eax + call sys_call + popl %eax; popl %ebx; popl %ecx; popl %edx +#endif GET_THREAD_INFO(%ebp) /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ @@ -345,6 +350,11 @@ sysenter_past_esp: movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx jne syscall_exit_work +#ifdef CONFIG_EVENT_TRACE + pushl %eax + call sys_ret + popl %eax +#endif /* if something modifies registers it must also disable sysexit */ movl PT_EIP(%esp), %edx movl PT_OLDESP(%esp), %ecx @@ -368,6 +378,11 @@ ENTRY(system_call) pushl %eax # save orig_eax CFI_ADJUST_CFA_OFFSET 4 SAVE_ALL +#ifdef CONFIG_EVENT_TRACE + pushl %edx; pushl %ecx; pushl %ebx; pushl %eax + call sys_call + popl %eax; popl %ebx; popl %ecx; popl %edx +#endif GET_THREAD_INFO(%ebp) # system call tracing in operation / emulation /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */ �����������������patches/trace-events-handle-syscalls.patch����������������������������������������������������������0000664�0000764�0000764�00000021374�11041657735�017762� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Add 
syscall tracing in event trace. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace.c | 104 ++++++++++++++++++++++++++++++++++++++++++ kernel/trace/trace.h | 30 ++++++++++++ kernel/trace/trace_events.c | 92 +++++++++++++++++++++++++++++++++++++ kernel/trace/trace_selftest.c | 2 4 files changed, 228 insertions(+) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -31,6 +31,9 @@ #include <linux/stacktrace.h> +#include <asm/asm-offsets.h> +#include <asm/unistd.h> + #include "trace.h" unsigned long __read_mostly tracing_max_latency = (cycle_t)ULONG_MAX; @@ -1126,6 +1129,42 @@ void tracing_event_task_deactivate(struc entry->task.cpu = task_cpu; } +void tracing_event_syscall(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long nr, + unsigned long p1, + unsigned long p2, + unsigned long p3) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_SYSCALL; + entry->syscall.ip = ip; + entry->syscall.nr = nr; + entry->syscall.p1 = p1; + entry->syscall.p2 = p2; + entry->syscall.p3 = p3; +} + +void tracing_event_sysret(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long ret) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_SYSRET; + entry->sysret.ip = ip; + entry->sysret.ret = ret; +} + #ifdef CONFIG_FTRACE static void function_trace_call(unsigned long ip, unsigned long parent_ip) @@ -1559,6 +1598,13 @@ lat_print_timestamp(struct trace_seq *s, static const char state_to_char[] = TASK_STATE_TO_CHAR_STR; +extern unsigned long sys_call_table[NR_syscalls]; + +#if defined(CONFIG_COMPAT) && defined(CONFIG_X86) +extern unsigned long ia32_sys_call_table[], ia32_syscall_end[]; +# define IA32_NR_syscalls (ia32_syscall_end - ia32_sys_call_table) +#endif + static int print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu) { @@ -1569,6 +1615,7 @@ print_lat_fmt(struct trace_iterator *ite struct trace_entry *entry = iter->ent; unsigned long abs_usecs; unsigned long rel_usecs; + unsigned long nr; char *comm; int S, T; int i; @@ -1680,6 +1727,34 @@ print_lat_fmt(struct trace_iterator *ite comm, entry->task.pid, entry->task.prio, entry->task.cpu); break; + case TRACE_SYSCALL: + seq_print_ip_sym(s, entry->syscall.ip, sym_flags); + nr = entry->syscall.nr; + trace_seq_putc(s, ' '); +#if defined(CONFIG_COMPAT) && defined(CONFIG_X86) + if (nr & 0x80000000) { + nr &= ~0x80000000; + if (nr < IA32_NR_syscalls) + seq_print_ip_sym(s, ia32_sys_call_table[nr], 0); + else + trace_seq_printf(s, "<badsys(%lu)>", nr); + } else +#endif + if (nr < NR_syscalls) + seq_print_ip_sym(s, sys_call_table[nr], 0); + else + trace_seq_printf(s, "<badsys(%lu)>", nr); + + trace_seq_printf(s, " (%lx %lx %lx)\n", + entry->syscall.p1, + entry->syscall.p2, + entry->syscall.p3); + break; + case TRACE_SYSRET: + seq_print_ip_sym(s, entry->sysret.ip, sym_flags); + trace_seq_printf(s, " < (%ld)\n", + entry->sysret.ret); + break; default: trace_seq_printf(s, "Unknown type %d\n", entry->type); } @@ -1694,6 +1769,7 @@ static int print_trace_fmt(struct trace_ unsigned long usec_rem; unsigned long long t; unsigned long secs; + long nr; char *comm; int ret; int S, T; @@ 
-1826,6 +1902,34 @@ static int print_trace_fmt(struct trace_ comm, entry->task.pid, entry->task.prio, entry->task.cpu); break; + case TRACE_SYSCALL: + seq_print_ip_sym(s, entry->syscall.ip, sym_flags); + nr = entry->syscall.nr; + trace_seq_putc(s, ' '); +#if defined(CONFIG_COMPAT) && defined(CONFIG_X86) + if (nr & 0x80000000) { + nr &= ~0x80000000; + if (nr < IA32_NR_syscalls) + seq_print_ip_sym(s, ia32_sys_call_table[nr], 0); + else + trace_seq_printf(s, "<badsys(%lu)>", nr); + } else +#endif + if (nr < NR_syscalls) + seq_print_ip_sym(s, sys_call_table[nr], 0); + else + trace_seq_printf(s, "<badsys(%lu)>", nr); + + trace_seq_printf(s, " (%lx %lx %lx)\n", + entry->syscall.p1, + entry->syscall.p2, + entry->syscall.p3); + break; + case TRACE_SYSRET: + seq_print_ip_sym(s, entry->sysret.ip, sym_flags); + trace_seq_printf(s, "< (%ld)\n", + entry->sysret.ret); + break; default: trace_seq_printf(s, "Unknown type %d\n", entry->type); } Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -24,6 +24,8 @@ enum trace_type { TRACE_TIMESTAMP, TRACE_TASK_ACT, TRACE_TASK_DEACT, + TRACE_SYSCALL, + TRACE_SYSRET, __TRACE_LAST_TYPE }; @@ -96,6 +98,19 @@ struct wakeup_entry { unsigned curr_prio; }; +struct syscall_entry { + unsigned long ip; + unsigned long nr; + unsigned long p1; + unsigned long p2; + unsigned long p3; +}; + +struct sysret_entry { + unsigned long ip; + unsigned long ret; +}; + /* * Stack-trace entry: */ @@ -132,6 +147,8 @@ struct trace_entry { struct timestamp_entry timestamp; struct task_entry task; struct wakeup_entry wakeup; + struct syscall_entry syscall; + struct sysret_entry sysret; }; }; @@ -320,6 +337,19 @@ void tracing_event_wakeup(struct trace_a unsigned long ip, pid_t pid, int prio, int curr_prio); +void tracing_event_syscall(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long nr, + unsigned long p1, + unsigned long p2, + unsigned long p3); +void tracing_event_sysret(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + unsigned long ret); void tracing_start_cmdline_record(void); void tracing_stop_cmdline_record(void); Index: linux-2.6.24.7/kernel/trace/trace_events.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_events.c +++ linux-2.6.24.7/kernel/trace/trace_events.c @@ -34,6 +34,98 @@ static void event_reset(struct trace_arr tr->time_start = ftrace_now(raw_smp_processor_id()); } +/* HACK */ +void notrace +sys_call(unsigned long nr, unsigned long p1, unsigned long p2, unsigned long p3) +{ + struct trace_array *tr; + struct trace_array_cpu *data; + unsigned long flags; + unsigned long ip; + int cpu; + + if (!tracer_enabled) + return; + + tr = events_trace; + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + ip = CALLER_ADDR0; + + tracing_event_syscall(tr, data, flags, ip, nr, p1, p2, p3); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + +#if defined(CONFIG_COMPAT) && defined(CONFIG_X86) +void notrace +sys_ia32_call(unsigned long nr, unsigned long p1, unsigned long p2, + unsigned long p3) +{ + struct trace_array *tr; + struct trace_array_cpu *data; + unsigned long flags; + unsigned long ip; + int cpu; + + if (!tracer_enabled) + 
return; + + tr = events_trace; + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + ip = CALLER_ADDR0; + tracing_event_syscall(tr, data, flags, ip, nr | 0x80000000, p1, p2, p3); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} +#endif + +void notrace +sys_ret(unsigned long ret) +{ + struct trace_array *tr; + struct trace_array_cpu *data; + unsigned long flags; + unsigned long ip; + int cpu; + + if (!tracer_enabled) + return; + + tr = events_trace; + local_irq_save(flags); + cpu = raw_smp_processor_id(); + data = tr->data[cpu]; + + atomic_inc(&data->disabled); + if (atomic_read(&data->disabled) != 1) + goto out; + + ip = CALLER_ADDR0; + tracing_event_sysret(tr, data, flags, ip, ret); + + out: + atomic_dec(&data->disabled); + local_irq_restore(flags); +} + #define getarg(arg, ap) arg = va_arg(ap, typeof(arg)); static void Index: linux-2.6.24.7/kernel/trace/trace_selftest.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_selftest.c +++ linux-2.6.24.7/kernel/trace/trace_selftest.c @@ -18,6 +18,8 @@ static inline int trace_valid_entry(stru case TRACE_TIMESTAMP: case TRACE_TASK_ACT: case TRACE_TASK_DEACT: + case TRACE_SYSCALL: + case TRACE_SYSRET: return 1; } return 0; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-trace.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000014563�11041657734�015047� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Track preempt disable nesting This code was largly influenced by work from Ingo Molnar. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/arm/kernel/traps.c | 1 + arch/x86/kernel/traps_32.c | 1 + arch/x86/kernel/traps_64.c | 1 + include/linux/preempt.h | 3 ++- include/linux/sched.h | 13 +++++++++++++ kernel/sched.c | 14 +++++++++++++- kernel/trace/Kconfig | 7 +++++++ kernel/trace/Makefile | 2 ++ kernel/trace/preempt-trace.c | 30 ++++++++++++++++++++++++++++++ 9 files changed, 70 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/traps.c +++ linux-2.6.24.7/arch/arm/kernel/traps.c @@ -354,6 +354,7 @@ asmlinkage void do_unexp_fiq (struct pt_ { printk("Hmm. 
Unexpected FIQ received, but trying to continue\n"); printk("You may have a hardware problem...\n"); + print_preempt_trace(current); } /* Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -239,6 +239,7 @@ show_trace_log_lvl(struct task_struct *t { dump_trace(task, regs, stack, &print_trace_ops, log_lvl); printk("%s =======================\n", log_lvl); + print_preempt_trace(task); } void show_trace(struct task_struct *task, struct pt_regs *regs, Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -352,6 +352,7 @@ show_trace(struct task_struct *tsk, stru printk("\nCall Trace:\n"); dump_trace(tsk, regs, stack, &print_trace_ops, NULL); printk("\n"); + print_preempt_trace(tsk); } static void Index: linux-2.6.24.7/include/linux/preempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/preempt.h +++ linux-2.6.24.7/include/linux/preempt.h @@ -10,7 +10,8 @@ #include <linux/linkage.h> #include <linux/list.h> -#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) +#if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) || \ + defined(CONFIG_PREEMPT_TRACE) extern void fastcall add_preempt_count(int val); extern void fastcall sub_preempt_count(int val); #else Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1134,6 +1134,13 @@ struct task_struct { unsigned int lockdep_recursion; #endif +#define MAX_PREEMPT_TRACE 25 + +#ifdef CONFIG_PREEMPT_TRACE + unsigned long preempt_trace_eip[MAX_PREEMPT_TRACE]; + unsigned long preempt_trace_parent_eip[MAX_PREEMPT_TRACE]; +#endif + /* journalling filesystem info */ void *journal_info; @@ -2015,6 +2022,12 @@ static inline void inc_syscw(struct task } #endif +#ifdef CONFIG_PREEMPT_TRACE +void print_preempt_trace(struct task_struct *tsk); +#else +# define print_preempt_trace(tsk) do { } while (0) +#endif + #ifdef CONFIG_SMP void migration_init(void); #else Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3544,6 +3544,9 @@ static inline unsigned long get_parent_i void fastcall add_preempt_count(int val) { + unsigned long eip = CALLER_ADDR0; + unsigned long parent_eip = get_parent_ip(CALLER_ADDR1); + #ifdef CONFIG_DEBUG_PREEMPT /* * Underflow? @@ -3552,6 +3555,15 @@ void fastcall add_preempt_count(int val) return; #endif preempt_count() += val; +#ifdef CONFIG_PREEMPT_TRACE + if (val <= 10) { + unsigned int idx = preempt_count() & PREEMPT_MASK; + if (idx < MAX_PREEMPT_TRACE) { + current->preempt_trace_eip[idx] = eip; + current->preempt_trace_parent_eip[idx] = parent_eip; + } + } +#endif #ifdef CONFIG_DEBUG_PREEMPT /* * Spinlock count overflowing soon? 
@@ -3560,7 +3572,7 @@ void fastcall add_preempt_count(int val) PREEMPT_MASK - 10); #endif if (preempt_count() == val) - trace_preempt_off(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1)); + trace_preempt_off(eip, parent_eip); } EXPORT_SYMBOL(add_preempt_count); Index: linux-2.6.24.7/kernel/trace/Kconfig =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Kconfig +++ linux-2.6.24.7/kernel/trace/Kconfig @@ -156,3 +156,10 @@ config WAKEUP_LATENCY_HIST help This option uses the infrastructure of the wakeup tracer to create a histogram of latencies. + +config PREEMPT_TRACE + bool "Keep a record of preempt disabled spots" + depends on DEBUG_KERNEL + select TRACING + help + Keeps a record of the last 25 preempt disabled locations. Index: linux-2.6.24.7/kernel/trace/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Makefile +++ linux-2.6.24.7/kernel/trace/Makefile @@ -25,4 +25,6 @@ obj-$(CONFIG_INTERRUPT_OFF_HIST) += trac obj-$(CONFIG_PREEMPT_OFF_HIST) += trace_hist.o obj-$(CONFIG_WAKEUP_LATENCY_HIST) += trace_hist.o +obj-$(CONFIG_PREEMPT_TRACE) += preempt-trace.o + libftrace-y := ftrace.o Index: linux-2.6.24.7/kernel/trace/preempt-trace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/trace/preempt-trace.c @@ -0,0 +1,30 @@ +#include <linux/sched.h> +#include <linux/hardirq.h> +#include <linux/kallsyms.h> + +void print_preempt_trace(struct task_struct *task) +{ + unsigned int count; + unsigned int i, lim; + + if (!task) + task = current; + + count = task_thread_info(task)->preempt_count; + lim = count & PREEMPT_MASK; + + if (lim >= MAX_PREEMPT_TRACE) + lim = MAX_PREEMPT_TRACE-1; + printk("---------------------------\n"); + printk("| preempt count: %08x ]\n", count); + printk("| %d-level deep critical section nesting:\n", lim); + printk("----------------------------------------\n"); + for (i = 1; i <= lim; i++) { + printk(".. [<%08lx>] .... ", task->preempt_trace_eip[i]); + print_symbol("%s\n", task->preempt_trace_eip[i]); + printk(".....[<%08lx>] .. 
( <= ", + task->preempt_trace_parent_eip[i]); + print_symbol("%s)\n", task->preempt_trace_parent_eip[i]); + } + printk("\n"); +} ���������������������������������������������������������������������������������������������������������������������������������������������patches/trace-add-event-markers-arm.patch�����������������������������������������������������������0000664�0000764�0000764�00000001376�11041657735�017460� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/kernel/irq.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/arch/arm/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/irq.c +++ linux-2.6.24.7/arch/arm/kernel/irq.c @@ -37,6 +37,8 @@ #include <linux/kallsyms.h> #include <linux/proc_fs.h> +#include <linux/ftrace.h> + #include <asm/system.h> #include <asm/mach/time.h> @@ -113,6 +115,8 @@ asmlinkage void __exception asm_do_IRQ(u struct pt_regs *old_regs = set_irq_regs(regs); struct irq_desc *desc = irq_desc + irq; + ftrace_event_irq(irq, user_mode(regs), instruction_pointer(regs)); + /* * Some hardware gives randomly wrong interrupts. Rather * than crashing, do something sensible. ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc-rename-xmon-mcount.patch����������������������������������������������������������������0000664�0000764�0000764�00000004636�11041657731�016605� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Mon May 14 17:19:36 2007 Date: Mon, 14 May 2007 17:19:36 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [patch 4/5] powerpc 2.6.21-rt1: rename mcount variable in xmon to xmon_mcount Rename variable name "mcount" in xmon to xmon_mcount, since it conflicts with mcount() function used by latency trace function. Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- From tsutomu.owa@toshiba.co.jp Mon May 14 17:19:36 2007 Date: Mon, 14 May 2007 17:19:36 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [patch 4/5] powerpc 2.6.21-rt1: rename mcount variable in xmon to xmon_mcount Rename variable name "mcount" in xmon to xmon_mcount, since it conflicts with mcount() function used by latency trace function. 
Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/xmon/xmon.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/powerpc/xmon/xmon.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/xmon/xmon.c +++ linux-2.6.24.7/arch/powerpc/xmon/xmon.c @@ -2129,7 +2129,7 @@ print_address(unsigned long addr) static unsigned long mdest; /* destination address */ static unsigned long msrc; /* source address */ static unsigned long mval; /* byte value to set memory to */ -static unsigned long mcount; /* # bytes to affect */ +static unsigned long xmon_mcount; /* # bytes to affect */ static unsigned long mdiffs; /* max # differences to print */ void @@ -2141,19 +2141,20 @@ memops(int cmd) scanhex((void *)(cmd == 's'? &mval: &msrc)); if( termch != '\n' ) termch = 0; - scanhex((void *)&mcount); + scanhex((void *)&xmon_mcount); switch( cmd ){ case 'm': - memmove((void *)mdest, (void *)msrc, mcount); + memmove((void *)mdest, (void *)msrc, xmon_mcount); break; case 's': - memset((void *)mdest, mval, mcount); + memset((void *)mdest, mval, xmon_mcount); break; case 'd': if( termch != '\n' ) termch = 0; scanhex((void *)&mdiffs); - memdiffs((unsigned char *)mdest, (unsigned char *)msrc, mcount, mdiffs); + memdiffs((unsigned char *)mdest, (unsigned char *)msrc, + xmon_mcount, mdiffs); break; } } ��������������������������������������������������������������������������������������������������patches/powerpc-add-ftrace.patch��������������������������������������������������������������������0000664�0000764�0000764�00000031776�11043103223�015730� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/Kconfig | 1 arch/powerpc/kernel/Makefile | 14 ++ arch/powerpc/kernel/entry_32.S | 130 ++++++++++++++++++++++++ arch/powerpc/kernel/entry_64.S | 62 +++++++++++ arch/powerpc/kernel/ftrace.c | 165 +++++++++++++++++++++++++++++++ arch/powerpc/kernel/io.c | 3 arch/powerpc/kernel/irq.c | 6 - arch/powerpc/kernel/setup_32.c | 11 +- arch/powerpc/kernel/setup_64.c | 5 arch/powerpc/platforms/powermac/Makefile | 5 10 files changed, 395 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/arch/powerpc/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/Kconfig +++ linux-2.6.24.7/arch/powerpc/Kconfig @@ -79,6 +79,7 @@ config ARCH_NO_VIRT_TO_BUS config PPC bool default y + select HAVE_FTRACE config EARLY_PRINTK bool Index: linux-2.6.24.7/arch/powerpc/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/Makefile +++ linux-2.6.24.7/arch/powerpc/kernel/Makefile @@ -10,6 +10,18 @@ CFLAGS_prom_init.o += -fPIC CFLAGS_btext.o += -fPIC endif +ifdef CONFIG_FTRACE +# Do not trace early boot code +CFLAGS_REMOVE_cputable.o = -pg +CFLAGS_REMOVE_prom_init.o = -pg + +ifdef CONFIG_DYNAMIC_FTRACE +# dynamic ftrace setup. 
+CFLAGS_REMOVE_ftrace.o = -pg +endif + +endif + obj-y := semaphore.o cputable.o ptrace.o syscalls.o \ irq.o align.o signal_32.o pmc.o vdso.o \ init_task.o process.o systbl.o idle.o \ @@ -75,6 +87,8 @@ obj-$(CONFIG_KEXEC) += machine_kexec.o obj-$(CONFIG_AUDIT) += audit.o obj64-$(CONFIG_AUDIT) += compat_audit.o +obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o + obj-$(CONFIG_8XX_MINIMAL_FPEMU) += softemu8xx.o ifneq ($(CONFIG_PPC_INDIRECT_IO),y) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_32.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_32.S @@ -1022,3 +1022,133 @@ machine_check_in_rtas: /* XXX load up BATs and panic */ #endif /* CONFIG_PPC_RTAS */ + +#ifdef CONFIG_FTRACE +#ifdef CONFIG_DYNAMIC_FTRACE +_GLOBAL(mcount) +_GLOBAL(_mcount) + stwu r1,-48(r1) + stw r3, 12(r1) + stw r4, 16(r1) + stw r5, 20(r1) + stw r6, 24(r1) + mflr r3 + stw r7, 28(r1) + mfcr r5 + stw r8, 32(r1) + stw r9, 36(r1) + stw r10,40(r1) + stw r3, 44(r1) + stw r5, 8(r1) + .globl mcount_call +mcount_call: + bl ftrace_stub + nop + lwz r6, 8(r1) + lwz r0, 44(r1) + lwz r3, 12(r1) + mtctr r0 + lwz r4, 16(r1) + mtcr r6 + lwz r5, 20(r1) + lwz r6, 24(r1) + lwz r0, 52(r1) + lwz r7, 28(r1) + lwz r8, 32(r1) + mtlr r0 + lwz r9, 36(r1) + lwz r10,40(r1) + addi r1, r1, 48 + bctr + +_GLOBAL(ftrace_caller) + /* Based off of objdump optput from glibc */ + stwu r1,-48(r1) + stw r3, 12(r1) + stw r4, 16(r1) + stw r5, 20(r1) + stw r6, 24(r1) + mflr r3 + lwz r4, 52(r1) + mfcr r5 + stw r7, 28(r1) + stw r8, 32(r1) + stw r9, 36(r1) + stw r10,40(r1) + stw r3, 44(r1) + stw r5, 8(r1) +.globl ftrace_call +ftrace_call: + bl ftrace_stub + nop + lwz r6, 8(r1) + lwz r0, 44(r1) + lwz r3, 12(r1) + mtctr r0 + lwz r4, 16(r1) + mtcr r6 + lwz r5, 20(r1) + lwz r6, 24(r1) + lwz r0, 52(r1) + lwz r7, 28(r1) + lwz r8, 32(r1) + mtlr r0 + lwz r9, 36(r1) + lwz r10,40(r1) + addi r1, r1, 48 + bctr +#else +_GLOBAL(mcount) +_GLOBAL(_mcount) + stwu r1,-48(r1) + stw r3, 12(r1) + stw r4, 16(r1) + stw r5, 20(r1) + stw r6, 24(r1) + mflr r3 + lwz r4, 52(r1) + mfcr r5 + stw r7, 28(r1) + stw r8, 32(r1) + stw r9, 36(r1) + stw r10,40(r1) + stw r3, 44(r1) + stw r5, 8(r1) + + LOAD_REG_ADDR(r5, ftrace_trace_function) +#if 0 + mtctr r3 + mr r1, r5 + bctrl +#endif + lwz r5,0(r5) +#if 1 + mtctr r5 + bctrl +#else + bl ftrace_stub +#endif + nop + + lwz r6, 8(r1) + lwz r0, 44(r1) + lwz r3, 12(r1) + mtctr r0 + lwz r4, 16(r1) + mtcr r6 + lwz r5, 20(r1) + lwz r6, 24(r1) + lwz r0, 52(r1) + lwz r7, 28(r1) + lwz r8, 32(r1) + mtlr r0 + lwz r9, 36(r1) + lwz r10,40(r1) + addi r1, r1, 48 + bctr +#endif + +_GLOBAL(ftrace_stub) + blr + +#endif /* CONFIG_MCOUNT */ Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -846,3 +846,65 @@ _GLOBAL(enter_prom) ld r0,16(r1) mtlr r0 blr + +#ifdef CONFIG_FTRACE +#ifdef CONFIG_DYNAMIC_FTRACE +_GLOBAL(mcount) +_GLOBAL(_mcount) + /* Taken from output of objdump from lib64/glibc */ + mflr r3 + stdu r1, -112(r1) + std r3, 128(r1) + .globl mcount_call +mcount_call: + bl ftrace_stub + nop + ld r0, 128(r1) + mtlr r0 + addi r1, r1, 112 + blr + +_GLOBAL(ftrace_caller) + /* Taken from output of objdump from lib64/glibc */ + mflr r3 + ld r11, 0(r1) + stdu r1, -112(r1) + std r3, 128(r1) + ld r4, 16(r11) +.globl ftrace_call +ftrace_call: + bl ftrace_stub + nop + ld r0, 128(r1) + mtlr r0 + 
addi r1, r1, 112 +_GLOBAL(ftrace_stub) + blr +#else +_GLOBAL(mcount) + blr + +_GLOBAL(_mcount) + /* Taken from output of objdump from lib64/glibc */ + mflr r3 + ld r11, 0(r1) + stdu r1, -112(r1) + std r3, 128(r1) + ld r4, 16(r11) + + + LOAD_REG_ADDR(r5,ftrace_trace_function) + ld r5,0(r5) + ld r5,0(r5) + mtctr r5 + bctrl + + nop + ld r0, 128(r1) + mtlr r0 + addi r1, r1, 112 +_GLOBAL(ftrace_stub) + blr + +#endif +#endif Index: linux-2.6.24.7/arch/powerpc/kernel/ftrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/powerpc/kernel/ftrace.c @@ -0,0 +1,165 @@ +/* + * Code for replacing ftrace calls with jumps. + * + * Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com> + * + * Thanks goes out to P.A. Semi, Inc for supplying me with a PPC64 box. + * + */ + +#include <linux/spinlock.h> +#include <linux/hardirq.h> +#include <linux/ftrace.h> +#include <linux/percpu.h> +#include <linux/init.h> +#include <linux/list.h> + +#include <asm/cacheflush.h> + +#define CALL_BACK 4 + +static unsigned int ftrace_nop = 0x60000000; + +#ifdef CONFIG_PPC32 +# define GET_ADDR(addr) addr +#else +/* PowerPC64's functions are data that points to the functions */ +# define GET_ADDR(addr) *(unsigned long *)addr +#endif + +notrace int ftrace_ip_converted(unsigned long ip) +{ + unsigned int save; + + ip -= CALL_BACK; + save = *(unsigned int *)ip; + + return save == ftrace_nop; +} + +static unsigned int notrace ftrace_calc_offset(long ip, long addr) +{ + return (int)((addr + CALL_BACK) - ip); +} + +notrace unsigned char *ftrace_nop_replace(void) +{ + return (char *)&ftrace_nop; +} + +notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) +{ + static unsigned int op; + + addr = GET_ADDR(addr); + + /* Set to "bl addr" */ + op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffe); + + /* + * No locking needed, this must be called via kstop_machine + * which in essence is like running on a uniprocessor machine. + */ + return (unsigned char *)&op; +} + +#ifdef CONFIG_PPC64 +# define _ASM_ALIGN " .align 3 " +# define _ASM_PTR " .llong " +#else +# define _ASM_ALIGN " .align 2 " +# define _ASM_PTR " .long " +#endif + +notrace int +ftrace_modify_code(unsigned long ip, unsigned char *old_code, + unsigned char *new_code) +{ + unsigned replaced; + unsigned old = *(unsigned *)old_code; + unsigned new = *(unsigned *)new_code; + int faulted = 0; + + /* move the IP back to the start of the call */ + ip -= CALL_BACK; + + /* + * Note: Due to modules and __init, code can + * disappear and change, we need to protect against faulting + * as well as code changing. + * + * No real locking needed, this code is run through + * kstop_machine. 
+ */ + asm volatile ( + "1: lwz %1, 0(%2)\n" + " cmpw %1, %5\n" + " bne 2f\n" + " stwu %3, 0(%2)\n" + "2:\n" + ".section .fixup, \"ax\"\n" + "3: li %0, 1\n" + " b 2b\n" + ".previous\n" + ".section __ex_table,\"a\"\n" + _ASM_ALIGN "\n" + _ASM_PTR "1b, 3b\n" + ".previous" + : "=r"(faulted), "=r"(replaced) + : "r"(ip), "r"(new), + "0"(faulted), "r"(old) + : "memory"); + + if (replaced != old && replaced != new) + faulted = 2; + + if (!faulted) + flush_icache_range(ip, ip + 8); + + return faulted; +} + +notrace int ftrace_update_ftrace_func(ftrace_func_t func) +{ + unsigned long ip = (unsigned long)(&ftrace_call); + unsigned char old[4], *new; + int ret; + + ip += CALL_BACK; + + memcpy(old, &ftrace_call, 4); + new = ftrace_call_replace(ip, (unsigned long)func); + ret = ftrace_modify_code(ip, old, new); + + return ret; +} + +notrace int ftrace_mcount_set(unsigned long *data) +{ + unsigned long ip = (long)(&mcount_call); + unsigned long *addr = data; + unsigned char old[4], *new; + + /* ip is at the location, but modify code will subtact this */ + ip += CALL_BACK; + + /* + * Replace the mcount stub with a pointer to the + * ip recorder function. + */ + memcpy(old, &mcount_call, 4); + new = ftrace_call_replace(ip, *addr); + *addr = ftrace_modify_code(ip, old, new); + + return 0; +} + +int __init ftrace_dyn_arch_init(void *data) +{ + /* This is running in kstop_machine */ + + ftrace_mcount_set(data); + + return 0; +} + Index: linux-2.6.24.7/arch/powerpc/kernel/io.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/io.c +++ linux-2.6.24.7/arch/powerpc/kernel/io.c @@ -120,7 +120,8 @@ EXPORT_SYMBOL(_outsl_ns); #define IO_CHECK_ALIGN(v,a) ((((unsigned long)(v)) & ((a) - 1)) == 0) -void _memset_io(volatile void __iomem *addr, int c, unsigned long n) +notrace void +_memset_io(volatile void __iomem *addr, int c, unsigned long n) { void *p = (void __force *)addr; u32 lc = c; Index: linux-2.6.24.7/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/irq.c +++ linux-2.6.24.7/arch/powerpc/kernel/irq.c @@ -98,7 +98,7 @@ EXPORT_SYMBOL(irq_desc); int distribute_irqs = 1; -static inline unsigned long get_hard_enabled(void) +static inline notrace unsigned long get_hard_enabled(void) { unsigned long enabled; @@ -108,13 +108,13 @@ static inline unsigned long get_hard_ena return enabled; } -static inline void set_soft_enabled(unsigned long enable) +static inline notrace void set_soft_enabled(unsigned long enable) { __asm__ __volatile__("stb %0,%1(13)" : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))); } -void local_irq_restore(unsigned long en) +notrace void local_irq_restore(unsigned long en) { /* * get_paca()->soft_enabled = en; Index: linux-2.6.24.7/arch/powerpc/kernel/setup_32.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/setup_32.c +++ linux-2.6.24.7/arch/powerpc/kernel/setup_32.c @@ -49,6 +49,11 @@ #include <asm/kgdb.h> #endif +#ifdef CONFIG_FTRACE +extern void _mcount(void); +EXPORT_SYMBOL(_mcount); +#endif + extern void bootx_init(unsigned long r4, unsigned long phys); #if defined(CONFIG_BLK_DEV_IDE) || defined(CONFIG_BLK_DEV_IDE_MODULE) @@ -88,7 +93,7 @@ int ucache_bsize; * from the address that it was linked at, so we must use RELOC/PTRRELOC * to access static data (including strings). 
-- paulus */ -unsigned long __init early_init(unsigned long dt_ptr) +notrace unsigned long __init early_init(unsigned long dt_ptr) { unsigned long offset = reloc_offset(); struct cpu_spec *spec; @@ -118,7 +123,7 @@ unsigned long __init early_init(unsigned * This is called very early on the boot process, after a minimal * MMU environment has been set up but before MMU_init is called. */ -void __init machine_init(unsigned long dt_ptr, unsigned long phys) +notrace void __init machine_init(unsigned long dt_ptr, unsigned long phys) { /* Enable early debugging if any specified (see udbg.h) */ udbg_early_init(); @@ -140,7 +145,7 @@ void __init machine_init(unsigned long d #ifdef CONFIG_BOOKE_WDT /* Checks wdt=x and wdt_period=xx command-line option */ -int __init early_parse_wdt(char *p) +notrace int __init early_parse_wdt(char *p) { if (p && strncmp(p, "0", 1) != 0) booke_wdt_enabled = 1; Index: linux-2.6.24.7/arch/powerpc/kernel/setup_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/setup_64.c +++ linux-2.6.24.7/arch/powerpc/kernel/setup_64.c @@ -84,6 +84,11 @@ struct ppc64_caches ppc64_caches = { }; EXPORT_SYMBOL_GPL(ppc64_caches); +#ifdef CONFIG_FTRACE +extern void _mcount(void); +EXPORT_SYMBOL(_mcount); +#endif + /* * These are used in binfmt_elf.c to put aux entries on the stack * for each elf executable being started. Index: linux-2.6.24.7/arch/powerpc/platforms/powermac/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/powermac/Makefile +++ linux-2.6.24.7/arch/powerpc/platforms/powermac/Makefile @@ -1,5 +1,10 @@ CFLAGS_bootx_init.o += -fPIC +ifdef CONFIG_FTRACE +# Do not trace early boot code +CFLAGS_REMOVE_bootx_init.o = -pg +endif + obj-y += pic.o setup.o time.o feature.o pci.o \ sleep.o low_i2c.o cache.o pfunc_core.o \ pfunc_base.o ��patches/powerpc-ftrace-cleanups.patch���������������������������������������������������������������0000664�0000764�0000764�00000006733�11043075255�017022� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/entry_32.S | 11 ++--------- arch/powerpc/kernel/ftrace.c | 8 +++++++- arch/powerpc/kernel/ppc_ksyms.c | 5 +++++ arch/powerpc/kernel/setup_32.c | 5 ----- arch/powerpc/kernel/setup_64.c | 5 ----- include/asm-powerpc/ftrace.h | 6 ++++++ 6 files changed, 20 insertions(+), 20 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_32.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_32.S @@ -1116,18 +1116,11 @@ _GLOBAL(_mcount) stw r5, 8(r1) LOAD_REG_ADDR(r5, ftrace_trace_function) -#if 0 - mtctr r3 - mr r1, r5 - bctrl -#endif lwz r5,0(r5) -#if 1 + mtctr r5 bctrl -#else - bl ftrace_stub -#endif + nop lwz r6, 8(r1) Index: linux-2.6.24.7/arch/powerpc/kernel/ftrace.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ftrace.c +++ linux-2.6.24.7/arch/powerpc/kernel/ftrace.c @@ -51,10 +51,16 @@ notrace unsigned char *ftrace_call_repla { static unsigned int op; + /* + * It would be nice to just use create_function_call, 
but that will + * update the code itself. Here we need to just return the + * instruction that is going to be modified, without modifying the + * code. + */ addr = GET_ADDR(addr); /* Set to "bl addr" */ - op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffe); + op = 0x48000001 | (ftrace_calc_offset(ip, addr) & 0x03fffffc); /* * No locking needed, this must be called via kstop_machine Index: linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ppc_ksyms.c +++ linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c @@ -44,6 +44,7 @@ #include <asm/div64.h> #include <asm/signal.h> #include <asm/dcr.h> +#include <asm/ftrace.h> #ifdef CONFIG_PPC64 EXPORT_SYMBOL(local_irq_restore); @@ -72,6 +73,10 @@ EXPORT_SYMBOL(single_step_exception); EXPORT_SYMBOL(sys_sigreturn); #endif +#ifdef CONFIG_FTRACE +EXPORT_SYMBOL(_mcount); +#endif + EXPORT_SYMBOL(strcpy); EXPORT_SYMBOL(strncpy); EXPORT_SYMBOL(strcat); Index: linux-2.6.24.7/arch/powerpc/kernel/setup_32.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/setup_32.c +++ linux-2.6.24.7/arch/powerpc/kernel/setup_32.c @@ -49,11 +49,6 @@ #include <asm/kgdb.h> #endif -#ifdef CONFIG_FTRACE -extern void _mcount(void); -EXPORT_SYMBOL(_mcount); -#endif - extern void bootx_init(unsigned long r4, unsigned long phys); #if defined(CONFIG_BLK_DEV_IDE) || defined(CONFIG_BLK_DEV_IDE_MODULE) Index: linux-2.6.24.7/arch/powerpc/kernel/setup_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/setup_64.c +++ linux-2.6.24.7/arch/powerpc/kernel/setup_64.c @@ -84,11 +84,6 @@ struct ppc64_caches ppc64_caches = { }; EXPORT_SYMBOL_GPL(ppc64_caches); -#ifdef CONFIG_FTRACE -extern void _mcount(void); -EXPORT_SYMBOL(_mcount); -#endif - /* * These are used in binfmt_elf.c to put aux entries on the stack * for each elf executable being started. 
Index: linux-2.6.24.7/include/asm-powerpc/ftrace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-powerpc/ftrace.h @@ -0,0 +1,6 @@ +#ifndef _ASM_POWERPC_FTRACE +#define _ASM_POWERPC_FTRACE + +extern void _mcount(void); + +#endif �������������������������������������patches/powerpc-remove-ip-converted.patch�����������������������������������������������������������0000664�0000764�0000764�00000001272�11043075254�017630� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/ftrace.c | 10 ---------- 1 file changed, 10 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/ftrace.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ftrace.c +++ linux-2.6.24.7/arch/powerpc/kernel/ftrace.c @@ -27,16 +27,6 @@ static unsigned int ftrace_nop = 0x60000 # define GET_ADDR(addr) *(unsigned long *)addr #endif -notrace int ftrace_ip_converted(unsigned long ip) -{ - unsigned int save; - - ip -= CALL_BACK; - save = *(unsigned int *)ip; - - return save == ftrace_nop; -} - static unsigned int notrace ftrace_calc_offset(long ip, long addr) { return (int)((addr + CALL_BACK) - ip); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/powerpc-ftrace-store-mcount.patch�����������������������������������������������������������0000664�0000764�0000764�00000011014�11043075254�017632� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/entry_32.S | 4 ++++ arch/powerpc/kernel/entry_64.S | 5 ++++- arch/powerpc/kernel/ftrace.c | 21 +++++++-------------- include/asm-powerpc/ftrace.h | 8 ++++++++ 4 files changed, 23 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_32.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_32.S @@ -30,6 +30,7 @@ #include <asm/ppc_asm.h> #include <asm/asm-offsets.h> #include <asm/unistd.h> +#include <asm/ftrace.h> #undef SHOW_SYSCALLS #undef SHOW_SYSCALLS_TASK @@ -1040,6 +1041,7 @@ _GLOBAL(_mcount) stw r10,40(r1) stw r3, 44(r1) stw r5, 8(r1) + subi r3, r3, MCOUNT_INSN_SIZE .globl mcount_call mcount_call: bl ftrace_stub @@ -1077,6 +1079,7 @@ _GLOBAL(ftrace_caller) stw r10,40(r1) stw r3, 44(r1) stw r5, 8(r1) + subi r3, r3, MCOUNT_INSN_SIZE .globl ftrace_call ftrace_call: bl ftrace_stub @@ -1115,6 +1118,7 @@ _GLOBAL(_mcount) stw r3, 44(r1) stw r5, 8(r1) + subi r3, r3, MCOUNT_INSN_SIZE LOAD_REG_ADDR(r5, ftrace_trace_function) lwz r5,0(r5) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S 
=================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -29,6 +29,7 @@ #include <asm/cputable.h> #include <asm/firmware.h> #include <asm/bug.h> +#include <asm/ftrace.h> /* * System calls. @@ -855,6 +856,7 @@ _GLOBAL(_mcount) mflr r3 stdu r1, -112(r1) std r3, 128(r1) + subi r3, r3, MCOUNT_INSN_SIZE .globl mcount_call mcount_call: bl ftrace_stub @@ -871,6 +873,7 @@ _GLOBAL(ftrace_caller) stdu r1, -112(r1) std r3, 128(r1) ld r4, 16(r11) + subi r3, r3, MCOUNT_INSN_SIZE .globl ftrace_call ftrace_call: bl ftrace_stub @@ -892,7 +895,7 @@ _GLOBAL(_mcount) std r3, 128(r1) ld r4, 16(r11) - + subi r3, r3, MCOUNT_INSN_SIZE LOAD_REG_ADDR(r5,ftrace_trace_function) ld r5,0(r5) ld r5,0(r5) Index: linux-2.6.24.7/arch/powerpc/kernel/ftrace.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ftrace.c +++ linux-2.6.24.7/arch/powerpc/kernel/ftrace.c @@ -15,8 +15,8 @@ #include <linux/list.h> #include <asm/cacheflush.h> +#include <asm/ftrace.h> -#define CALL_BACK 4 static unsigned int ftrace_nop = 0x60000000; @@ -27,9 +27,10 @@ static unsigned int ftrace_nop = 0x60000 # define GET_ADDR(addr) *(unsigned long *)addr #endif + static unsigned int notrace ftrace_calc_offset(long ip, long addr) { - return (int)((addr + CALL_BACK) - ip); + return (int)(addr - ip); } notrace unsigned char *ftrace_nop_replace(void) @@ -76,9 +77,6 @@ ftrace_modify_code(unsigned long ip, uns unsigned new = *(unsigned *)new_code; int faulted = 0; - /* move the IP back to the start of the call */ - ip -= CALL_BACK; - /* * Note: Due to modules and __init, code can * disappear and change, we need to protect against faulting @@ -118,12 +116,10 @@ ftrace_modify_code(unsigned long ip, uns notrace int ftrace_update_ftrace_func(ftrace_func_t func) { unsigned long ip = (unsigned long)(&ftrace_call); - unsigned char old[4], *new; + unsigned char old[MCOUNT_INSN_SIZE], *new; int ret; - ip += CALL_BACK; - - memcpy(old, &ftrace_call, 4); + memcpy(old, &ftrace_call, MCOUNT_INSN_SIZE); new = ftrace_call_replace(ip, (unsigned long)func); ret = ftrace_modify_code(ip, old, new); @@ -134,16 +130,13 @@ notrace int ftrace_mcount_set(unsigned l { unsigned long ip = (long)(&mcount_call); unsigned long *addr = data; - unsigned char old[4], *new; - - /* ip is at the location, but modify code will subtact this */ - ip += CALL_BACK; + unsigned char old[MCOUNT_INSN_SIZE], *new; /* * Replace the mcount stub with a pointer to the * ip recorder function. 
*/ - memcpy(old, &mcount_call, 4); + memcpy(old, &mcount_call, MCOUNT_INSN_SIZE); new = ftrace_call_replace(ip, *addr); *addr = ftrace_modify_code(ip, old, new); Index: linux-2.6.24.7/include/asm-powerpc/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/ftrace.h +++ linux-2.6.24.7/include/asm-powerpc/ftrace.h @@ -1,6 +1,14 @@ #ifndef _ASM_POWERPC_FTRACE #define _ASM_POWERPC_FTRACE +#ifdef CONFIG_FTRACE +#define MCOUNT_ADDR ((long)(_mcount)) +#define MCOUNT_INSN_SIZE 4 /* sizeof mcount call */ + +#ifndef __ASSEMBLY__ extern void _mcount(void); +#endif #endif + +#endif /* _ASM_POWERPC_FTRACE */ ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/powerpc-ftrace-stop-on-oops.patch�����������������������������������������������������������0000664�0000764�0000764�00000001624�11043075254�017556� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: powerpc: ftrace stop on crash From: Thomas Gleixner <tglx@linutronix.de> Date: Sun, 27 Jul 2008 09:42:36 +0200 Stop tracing, when we run into an oops/bug. That way we can see what led to that. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/powerpc/kernel/traps.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/arch/powerpc/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/traps.c +++ linux-2.6.24.7/arch/powerpc/kernel/traps.c @@ -34,6 +34,7 @@ #include <linux/backlight.h> #include <linux/bug.h> #include <linux/kdebug.h> +#include <linux/ftrace.h> #include <asm/pgtable.h> #include <asm/uaccess.h> @@ -111,6 +112,8 @@ int die(const char *str, struct pt_regs if (debugger(regs)) return 1; + ftrace_stop(); + oops_enter(); if (die.lock_owner != raw_smp_processor_id()) { ������������������������������������������������������������������������������������������������������������patches/ftrace-m68knommu-add-FTRACE-support.patch���������������������������������������������������0000664�0000764�0000764�00000007222�11041657731�020573� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 674ceaadcb008adc57249a54c1b5b20c74c8c80b Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Wed, 9 Jul 2008 13:33:20 +0200 Subject: [PATCH] m68knommu: add FTRACE support due to a gcc bug or feature or me too stupid, the following patch has to be applied to gcc: |m68k: remove label generation on -pg | |haven't found a reason why this flag is needed. Maybe glibc needs this label. |However this implementation puts the labels too far away. | |Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> | |--- a/gcc/config/m68k/linux.h |+++ b/gcc/config/m68k/linux.h |@@ -143,7 +143,6 @@ along with GCC; see the file COPYING3. 
| #undef FUNCTION_PROFILER | #define FUNCTION_PROFILER(FILE, LABELNO) \ | { \ |- asm_fprintf (FILE, "\tlea (%LLP%d,%Rpc),%Ra1\n", (LABELNO)); \ | if (flag_pic) \ | fprintf (FILE, "\tbsr.l _mcount@PLTPC\n"); \ | else \ |--- a/gcc/config/m68k/m68k.h |+++ b/gcc/config/m68k/m68k.h |@@ -576,7 +576,7 @@ extern enum reg_class regno_reg_class[]; | #define FUNCTION_ARG(CUM, MODE, TYPE, NAMED) 0 | | #define FUNCTION_PROFILER(FILE, LABELNO) \ |- asm_fprintf (FILE, "\tlea %LLP%d,%Ra0\n\tjsr mcount\n", (LABELNO)) |+ asm_fprintf (FILE, "\tjsr mcount\n", (LABELNO)) | | #define EXIT_IGNORE_STACK 1 | Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/m68knommu/Kconfig | 1 arch/m68knommu/kernel/process.c | 2 + arch/m68knommu/platform/coldfire/entry.S | 33 +++++++++++++++++++++++++++++++ kernel/trace/trace.c | 1 4 files changed, 36 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/m68knommu/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/Kconfig +++ linux-2.6.24.7/arch/m68knommu/Kconfig @@ -175,6 +175,7 @@ config M527x config COLDFIRE bool depends on (M5206 || M5206e || M520x || M523x || M5249 || M527x || M5272 || M528x || M5307 || M532x || M5407) + select HAVE_FTRACE default y config CLOCK_SET Index: linux-2.6.24.7/arch/m68knommu/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/process.c +++ linux-2.6.24.7/arch/m68knommu/kernel/process.c @@ -74,7 +74,9 @@ void cpu_idle(void) { /* endless idle loop with no priority at all */ while (1) { + stop_critical_timings(); idle(); + start_critical_timings(); preempt_enable_no_resched(); schedule(); preempt_disable(); Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/coldfire/entry.S +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S @@ -55,6 +55,39 @@ sw_usp: .globl inthandler .globl fasthandler +#ifdef CONFIG_FTRACE +ENTRY(_mcount) + linkw %fp, #0 + + moveal ftrace_trace_function, %a0 + movel #ftrace_stub, %d0 + cmpl %a0@, %d0 + + bnew do_mcount + + unlk %fp + rts + +do_mcount: + + movel %fp, %d0 + moveal %d0, %a1 + + moveal %a1@, %a0 + movel %a0@(4), %sp@- /* push parent ip */ + movel %a1@(4), %sp@- /* push ip */ + + moveal ftrace_trace_function, %a0 + jsr %a0@ + + unlk %fp + +.globl ftrace_stub +ftrace_stub: + rts +END(mcount) +#endif + enosys: mov.l #sys_ni_syscall,%d3 bra 1f Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -31,7 +31,6 @@ #include <linux/stacktrace.h> -#include <asm/asm-offsets.h> #include <asm/unistd.h> #include "trace.h" ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-m68knommu-generic-stacktrace-function.patch������������������������������������������0000664�0000764�0000764�00000006155�11041657730�022753� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 616acbc45cdc24f145edd2960f6f6a0a5c2579b6 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Wed, 9 Jul 2008 13:36:37 +0200 Subject: [PATCH] m68knommu: generic stacktrace function This provides the generic stack trace interface which is based on x86 and required by ftrace. A proper sollution will come once I unify this and the current m68knommu stack trace algo. Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/m68knommu/kernel/Makefile | 5 +- arch/m68knommu/kernel/stacktrace.c | 69 +++++++++++++++++++++++++++++++++++++ 2 files changed, 72 insertions(+), 2 deletions(-) create mode 100644 arch/m68knommu/kernel/stacktrace.c Index: linux-2.6.24.7/arch/m68knommu/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/Makefile +++ linux-2.6.24.7/arch/m68knommu/kernel/Makefile @@ -7,5 +7,6 @@ extra-y := vmlinux.lds obj-y += dma.o entry.o init_task.o irq.o m68k_ksyms.o process.o ptrace.o \ semaphore.o setup.o signal.o syscalltable.o sys_m68k.o time.o traps.o -obj-$(CONFIG_MODULES) += module.o -obj-$(CONFIG_COMEMPCI) += comempci.o +obj-$(CONFIG_MODULES) += module.o +obj-$(CONFIG_COMEMPCI) += comempci.o +obj-$(CONFIG_STACKTRACE) += stacktrace.o Index: linux-2.6.24.7/arch/m68knommu/kernel/stacktrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/m68knommu/kernel/stacktrace.c @@ -0,0 +1,69 @@ +/* + * Quick & dirty stacktrace implementation. + */ +#include <linux/sched.h> +#include <linux/stacktrace.h> + +typedef void (save_stack_addr_t)(void *data, unsigned long addr, int reliable); + +static void save_stack_address(void *data, unsigned long addr, int reliable) +{ + struct stack_trace *trace = data; + if (!reliable) + return; + if (trace->skip > 0) { + trace->skip--; + return; + } + if (trace->nr_entries < trace->max_entries) + trace->entries[trace->nr_entries++] = addr; +} + +static void print_context_stack(unsigned long *stack, + save_stack_addr_t *sstack_func, struct stack_trace *trace) +{ + unsigned long *last_stack; + unsigned long *endstack; + unsigned long addr; + + addr = (unsigned long) stack; + endstack = (unsigned long *) PAGE_ALIGN(addr); + + last_stack = stack - 1; + while (stack <= endstack && stack > last_stack) { + + addr = *(stack + 1); + sstack_func(trace, addr, 1); + + last_stack = stack; + stack = (unsigned long *)*stack; + } +} + +static noinline long *get_current_stack(void) +{ + unsigned long *stack; + + stack = (unsigned long *)&stack; + stack++; + return stack; +} + +static void save_current_stack(save_stack_addr_t *sstack_func, + struct stack_trace *trace) +{ + unsigned long *stack; + + stack = get_current_stack(); + print_context_stack(stack, save_stack_address, trace); +} + +/* + * Save stack-backtrace addresses into a stack_trace buffer. 
+ */ +void save_stack_trace(struct stack_trace *trace) +{ + save_current_stack(save_stack_address, trace); + if (trace->nr_entries < trace->max_entries) + trace->entries[trace->nr_entries++] = ULONG_MAX; +} �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kvm-fix-preemption-bug.patch����������������������������������������������������������������0000664�0000764�0000764�00000002175�11041657733�016606� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Avi Kiviti <avi@qumranet.com> Date: Tue, 15 Jan 2008 15:02:22 +0200 Subject: kvm: check need_resched() inside the irq disabled region The missing need_resched() check inside the irq_disabled region can cause long latencies, if a interrupt with reschedule request happens between preempt_disable() and the local_irq_disable(). In fact the interrupts are disabled inside prepare_guest_switch(), so the race window is rather small. This can be further optimized, but it fixes the bug for now. Mainline bug is fixed in a similar way. Signed-off-by: Avi Kiviti <avi@qumranet.com> Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> --- drivers/kvm/kvm_main.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-2.6.24.7/drivers/kvm/kvm_main.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm_main.c +++ linux-2.6.24.7/drivers/kvm/kvm_main.c @@ -2010,6 +2010,13 @@ again: local_irq_disable(); + if (need_resched()) { + local_irq_enable(); + preempt_enable(); + r = 1; + goto out; + } + if (signal_pending(current)) { local_irq_enable(); preempt_enable(); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kvm-lapic-migrate-latency-fix.patch���������������������������������������������������������0000664�0000764�0000764�00000006012�11041657733�020016� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Avi Kiviti <avi@qumranet.com> Date: Mon, 14 Jan 2008 16:35:08 +0200 Subject: kvm: move the apic timer migration Move apic timer migration to a place where it does not cause the "might sleep while atomic" check. 
The original place calls hrtimer_cancel in a preempt disabled region, which is fine in mainline, but preempt-rt changes hrtimer_cancel, that the caller sleeps on a wait_queue, when the callback of the timer is currently active. Scheduled to go to mainline as well. Signed-off-by: Avi Kiviti <avi@qumranet.com> Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> --- drivers/kvm/irq.h | 2 +- drivers/kvm/kvm.h | 6 ++++++ drivers/kvm/kvm_main.c | 5 +++++ drivers/kvm/lapic.c | 2 +- 4 files changed, 13 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/drivers/kvm/irq.h =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/irq.h +++ linux-2.6.24.7/drivers/kvm/irq.h @@ -160,6 +160,6 @@ void kvm_apic_timer_intr_post(struct kvm void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec); void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu); void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu); -void kvm_migrate_apic_timer(struct kvm_vcpu *vcpu); +void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu); #endif Index: linux-2.6.24.7/drivers/kvm/kvm.h =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm.h +++ linux-2.6.24.7/drivers/kvm/kvm.h @@ -325,6 +325,7 @@ struct kvm_vcpu { u64 pdptrs[4]; /* pae */ u64 shadow_efer; u64 apic_base; + bool migrate_apic_timer; struct kvm_lapic *apic; /* kernel irqchip context */ #define VCPU_MP_STATE_RUNNABLE 0 #define VCPU_MP_STATE_UNINITIALIZED 1 @@ -775,6 +776,11 @@ static inline u32 get_rdx_init_val(void) return 0x600; /* P6 family */ } +static inline void kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) +{ + vcpu->migrate_apic_timer = true; +} + #define ASM_VMX_VMCLEAR_RAX ".byte 0x66, 0x0f, 0xc7, 0x30" #define ASM_VMX_VMLAUNCH ".byte 0x0f, 0x01, 0xc2" #define ASM_VMX_VMRESUME ".byte 0x0f, 0x01, 0xc3" Index: linux-2.6.24.7/drivers/kvm/kvm_main.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm_main.c +++ linux-2.6.24.7/drivers/kvm/kvm_main.c @@ -2003,6 +2003,11 @@ again: if (unlikely(r)) goto out; + if (vcpu->migrate_apic_timer) { + vcpu->migrate_apic_timer = false; + __kvm_migrate_apic_timer(vcpu); + } + preempt_disable(); kvm_x86_ops->prepare_guest_switch(vcpu); Index: linux-2.6.24.7/drivers/kvm/lapic.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/lapic.c +++ linux-2.6.24.7/drivers/kvm/lapic.c @@ -1065,7 +1065,7 @@ void kvm_apic_post_state_restore(struct start_apic_timer(apic); } -void kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) +void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->apic; struct hrtimer *timer; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kvm-make-less-noise.patch�������������������������������������������������������������������0000664�0000764�0000764�00000006556�11041657733�016070� 0����������������������������������������������������������������������������������������������������ustar 
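The defer-via-flag pattern used by the apic timer migration above, reduced to a sketch: because hrtimer_cancel() may sleep on preempt-rt, the sched-in path only records that a migration is needed, and the request is honoured later in the vcpu run loop while the thread is still preemptible. The names below are simplified stand-ins, not the real KVM structures:

struct vcpu_sketch {
	int migrate_apic_timer;			/* request flag, settable from atomic context */
};

/* called from the preempt notifier, preemption already disabled */
static void sched_in_sketch(struct vcpu_sketch *v)
{
	v->migrate_apic_timer = 1;		/* cheap: just remember the pending work */
}

/* called at the top of the vcpu run loop, still fully preemptible */
static void vcpu_run_prologue_sketch(struct vcpu_sketch *v)
{
	if (v->migrate_apic_timer) {
		v->migrate_apic_timer = 0;
		/* the real code calls __kvm_migrate_apic_timer() here, which
		 * may sleep in hrtimer_cancel() on preempt-rt */
	}
}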
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Avi Kiviti <avi@qumranet.com> Date: Tue, 15 Jan 2008 11:42:29 +0200 Subject: kvm: silence the printk noise Will hit mainline in a modified way. Signed-off-by: Avi Kiviti <avi@qumranet.com> Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> --- drivers/kvm/kvm.h | 6 ++++++ drivers/kvm/kvm_main.c | 4 ++-- drivers/kvm/lapic.c | 22 +++++++++++----------- 3 files changed, 19 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/drivers/kvm/kvm.h =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm.h +++ linux-2.6.24.7/drivers/kvm/kvm.h @@ -509,6 +509,7 @@ struct kvm_x86_ops { extern struct kvm_x86_ops *kvm_x86_ops; +#ifdef KVM_DEBUG /* The guest did something we don't support. */ #define pr_unimpl(vcpu, fmt, ...) \ do { \ @@ -518,6 +519,11 @@ extern struct kvm_x86_ops *kvm_x86_ops; } while(0) #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt) +#else +#define pr_unimpl(vcpu, fmt ...) do { } while(0) +#define kvm_printf(kvm, fmt ...) do { } while(0) +#endif + #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt) int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id); Index: linux-2.6.24.7/drivers/kvm/kvm_main.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm_main.c +++ linux-2.6.24.7/drivers/kvm/kvm_main.c @@ -1987,8 +1987,8 @@ static int __vcpu_run(struct kvm_vcpu *v int r; if (unlikely(vcpu->mp_state == VCPU_MP_STATE_SIPI_RECEIVED)) { - printk("vcpu %d received sipi with vector # %x\n", - vcpu->vcpu_id, vcpu->sipi_vector); + vcpu_printf(vcpu, "vcpu %d received sipi with vector # %x\n", + vcpu->vcpu_id, vcpu->sipi_vector); kvm_lapic_reset(vcpu); kvm_x86_ops->vcpu_reset(vcpu); vcpu->mp_state = VCPU_MP_STATE_RUNNABLE; Index: linux-2.6.24.7/drivers/kvm/lapic.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/lapic.c +++ linux-2.6.24.7/drivers/kvm/lapic.c @@ -347,35 +347,35 @@ static int __apic_accept_irq(struct kvm_ break; case APIC_DM_REMRD: - printk(KERN_DEBUG "Ignoring delivery mode 3\n"); + vcpu_printf(vcpu "Ignoring delivery mode 3\n"); break; case APIC_DM_SMI: - printk(KERN_DEBUG "Ignoring guest SMI\n"); + vcpu_printf(vcpu, "Ignoring guest SMI\n"); break; case APIC_DM_NMI: - printk(KERN_DEBUG "Ignoring guest NMI\n"); + vcpu_printf(vcpu, "Ignoring guest NMI\n"); break; case APIC_DM_INIT: if (level) { if (vcpu->mp_state == VCPU_MP_STATE_RUNNABLE) - printk(KERN_DEBUG - "INIT on a runnable vcpu %d\n", - vcpu->vcpu_id); + vcpu_printf(vcpu, + "INIT on a runnable vcpu %d\n", + vcpu->vcpu_id); vcpu->mp_state = VCPU_MP_STATE_INIT_RECEIVED; kvm_vcpu_kick(vcpu); } else { - printk(KERN_DEBUG - "Ignoring de-assert INIT to vcpu %d\n", - vcpu->vcpu_id); + vcpu_printf(vcpu, + "Ignoring de-assert INIT to vcpu %d\n", + vcpu->vcpu_id); } break; case APIC_DM_STARTUP: - printk(KERN_DEBUG "SIPI to vcpu %d vector 0x%02x\n", - vcpu->vcpu_id, vector); + vcpu_printf(vcpu, "SIPI to vcpu %d vector 0x%02x\n", + vcpu->vcpu_id, vector); if (vcpu->mp_state == VCPU_MP_STATE_INIT_RECEIVED) { vcpu->sipi_vector = vector; vcpu->mp_state = VCPU_MP_STATE_SIPI_RECEIVED; 
��������������������������������������������������������������������������������������������������������������������������������������������������patches/kvm-preempt-rt-resched-delayed.patch��������������������������������������������������������0000664�0000764�0000764�00000001267�11041657732�020204� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Thomas Gleixner <tglx@linutronix.de> Date: Tue, 15 Jan 2008 15:02:44 +0200 Subject: kvm: add need_resched_delayed() Check, whether this is really necessary here. Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> --- drivers/kvm/kvm_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/kvm/kvm_main.c =================================================================== --- linux-2.6.24.7.orig/drivers/kvm/kvm_main.c +++ linux-2.6.24.7/drivers/kvm/kvm_main.c @@ -2015,7 +2015,7 @@ again: local_irq_disable(); - if (need_resched()) { + if (need_resched() || need_resched_delayed()) { local_irq_enable(); preempt_enable(); r = 1; �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-enable-irqs-in-preempt-in-notifier-call.patch�����������������������������������������0000664�0000764�0000764�00000002063�11041657735�022766� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Thomas Gleixner <tglx@linutronix.de> Date: Mon, 14 Jan 2008 14:02:44 +0200 Subject: CFS: enable irqs in fire_sched_in_preempt_notifier KVM expects the notifier call with irqs enabled. It's necessary due to a possible IPI call. Make the preempt-rt version behave the same way as mainline. Signed-off-by: Thomas Gleixner <tgxl@linutronix.de> --- kernel/sched.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1821,8 +1821,17 @@ static void fire_sched_in_preempt_notifi struct preempt_notifier *notifier; struct hlist_node *node; + if (hlist_empty(&curr->preempt_notifiers)) + return; + + /* + * The KVM sched in notifier expects to be called with + * interrupts enabled. 
+ */ + local_irq_enable(); hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link) notifier->ops->sched_in(notifier, raw_smp_processor_id()); + local_irq_disable(); } static void �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ep93xx-timer-accuracy.patch�����������������������������������������������������������������0000664�0000764�0000764�00000003240�11041657731�016330� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� The ep93xx has a weird timer tick base (983.04 kHz.) This experimental patch tries to increase time of day accuracy by keeping the number of ticks until the next jiffy in a fractional value representation. Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org> --- arch/arm/mach-ep93xx/core.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-ep93xx/core.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-ep93xx/core.c +++ linux-2.6.24.7/arch/arm/mach-ep93xx/core.c @@ -94,19 +94,32 @@ void __init ep93xx_map_io(void) * track of lost jiffies. 
*/ static unsigned int last_jiffy_time; +static unsigned int next_jiffy_time; +static unsigned int accumulator; -#define TIMER4_TICKS_PER_JIFFY ((CLOCK_TICK_RATE + (HZ/2)) / HZ) +#define TIMER4_TICKS_PER_JIFFY (983040 / HZ) +#define TIMER4_TICKS_MOD_JIFFY (983040 % HZ) + +static int after_eq(unsigned long a, unsigned long b) +{ + return ((signed long)(a - b)) >= 0; +} static int ep93xx_timer_interrupt(int irq, void *dev_id) { write_seqlock(&xtime_lock); __raw_writel(1, EP93XX_TIMER1_CLEAR); - while ((signed long) - (__raw_readl(EP93XX_TIMER4_VALUE_LOW) - last_jiffy_time) - >= TIMER4_TICKS_PER_JIFFY) { - last_jiffy_time += TIMER4_TICKS_PER_JIFFY; + while (after_eq(__raw_readl(EP93XX_TIMER4_VALUE_LOW), next_jiffy_time)) { timer_tick(); + + last_jiffy_time = next_jiffy_time; + next_jiffy_time += TIMER4_TICKS_PER_JIFFY; + accumulator += TIMER4_TICKS_MOD_JIFFY; + if (accumulator >= HZ) { + next_jiffy_time++; + accumulator -= HZ; + } } write_sequnlock(&xtime_lock); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ep93xx-clockevents.patch��������������������������������������������������������������������0000664�0000764�0000764�00000014437�11041657733�015754� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������clockevent support for the EP93xx platform clockevent support for the EP93xx platform (by tglx) Only added a fix for clockevent_ep93xx.mult, which was using the wrong clock tickrate) --- arch/arm/mach-ep93xx/core.c | 125 ++++++++++++++++++++---------- include/asm-arm/arch-ep93xx/ep93xx-regs.h | 6 + 2 files changed, 91 insertions(+), 40 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-ep93xx/core.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-ep93xx/core.c +++ linux-2.6.24.7/arch/arm/mach-ep93xx/core.c @@ -32,6 +32,8 @@ #include <linux/termios.h> #include <linux/amba/bus.h> #include <linux/amba/serial.h> +#include <linux/clocksource.h> +#include <linux/clockchips.h> #include <asm/types.h> #include <asm/setup.h> @@ -50,7 +52,6 @@ #include <asm/hardware/vic.h> - /************************************************************************* * Static I/O mappings that are needed for all EP93xx platforms *************************************************************************/ @@ -93,39 +94,58 @@ void __init ep93xx_map_io(void) * to use this timer for something else. We also use timer 4 for keeping * track of lost jiffies. 
*/ -static unsigned int last_jiffy_time; -static unsigned int next_jiffy_time; -static unsigned int accumulator; +static struct clock_event_device clockevent_ep93xx; + +static int ep93xx_timer_interrupt(int irq, void *dev_id) +{ + __raw_writel(EP93XX_TC_CLEAR, EP93XX_TIMER1_CLEAR); -#define TIMER4_TICKS_PER_JIFFY (983040 / HZ) -#define TIMER4_TICKS_MOD_JIFFY (983040 % HZ) + clockevent_ep93xx.event_handler(&clockevent_ep93xx); -static int after_eq(unsigned long a, unsigned long b) + return IRQ_HANDLED; +} + +static int ep93xx_set_next_event(unsigned long evt, + struct clock_event_device *unused) { - return ((signed long)(a - b)) >= 0; + __raw_writel(evt, EP93XX_TIMER1_LOAD); + return 0; } -static int ep93xx_timer_interrupt(int irq, void *dev_id) +static void ep93xx_set_mode(enum clock_event_mode mode, + struct clock_event_device *evt) { - write_seqlock(&xtime_lock); + u32 tmode = EP93XX_TC123_SEL_508KHZ; - __raw_writel(1, EP93XX_TIMER1_CLEAR); - while (after_eq(__raw_readl(EP93XX_TIMER4_VALUE_LOW), next_jiffy_time)) { - timer_tick(); - - last_jiffy_time = next_jiffy_time; - next_jiffy_time += TIMER4_TICKS_PER_JIFFY; - accumulator += TIMER4_TICKS_MOD_JIFFY; - if (accumulator >= HZ) { - next_jiffy_time++; - accumulator -= HZ; - } + /* Disable timer */ + __raw_writel(tmode, EP93XX_TIMER1_CONTROL); + + switch(mode) { + case CLOCK_EVT_MODE_PERIODIC: + /* Set timer period */ + __raw_writel((508469 / HZ) - 1, EP93XX_TIMER1_LOAD); + tmode |= EP93XX_TC123_PERIODIC; + + case CLOCK_EVT_MODE_ONESHOT: + tmode |= EP93XX_TC123_ENABLE; + __raw_writel(tmode, EP93XX_TIMER1_CONTROL); + break; + + case CLOCK_EVT_MODE_SHUTDOWN: + case CLOCK_EVT_MODE_UNUSED: + case CLOCK_EVT_MODE_RESUME: + return; } +} - write_sequnlock(&xtime_lock); +static struct clock_event_device clockevent_ep93xx = { + .name = "ep93xx-timer1", + .features = CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_PERIODIC, + .shift = 32, + .set_mode = ep93xx_set_mode, + .set_next_event = ep93xx_set_next_event, +}; - return IRQ_HANDLED; -} static struct irqaction ep93xx_timer_irq = { .name = "ep93xx timer", @@ -133,32 +153,58 @@ static struct irqaction ep93xx_timer_irq .handler = ep93xx_timer_interrupt, }; -static void __init ep93xx_timer_init(void) +static void __init ep93xx_clockevent_init(void) { - /* Enable periodic HZ timer. */ - __raw_writel(0x48, EP93XX_TIMER1_CONTROL); - __raw_writel((508469 / HZ) - 1, EP93XX_TIMER1_LOAD); - __raw_writel(0xc8, EP93XX_TIMER1_CONTROL); + setup_irq(IRQ_EP93XX_TIMER1, &ep93xx_timer_irq); - /* Enable lost jiffy timer. */ - __raw_writel(0x100, EP93XX_TIMER4_VALUE_HIGH); + clockevent_ep93xx.mult = div_sc(508469, NSEC_PER_SEC, + clockevent_ep93xx.shift); + clockevent_ep93xx.max_delta_ns = + clockevent_delta2ns(0xfffffffe, &clockevent_ep93xx); + clockevent_ep93xx.min_delta_ns = + clockevent_delta2ns(0xf, &clockevent_ep93xx); + clockevent_ep93xx.cpumask = cpumask_of_cpu(0); + clockevents_register_device(&clockevent_ep93xx); +} - setup_irq(IRQ_EP93XX_TIMER1, &ep93xx_timer_irq); +/* + * timer4 is a 40 Bit timer, separated in a 32bit and a 8 bit + * register, EP93XX_TIMER4_VALUE_LOW stores 32 bit word. 
The + * controlregister is in EP93XX_TIMER4_VALUE_HIGH + */ + +cycle_t ep93xx_get_cycles(void) +{ + return __raw_readl(EP93XX_TIMER4_VALUE_LOW); } -static unsigned long ep93xx_gettimeoffset(void) +static struct clocksource clocksource_ep93xx = { + .name = "ep93xx_timer4", + .rating = 200, + .read = ep93xx_get_cycles, + .mask = 0xFFFFFFFF, + .shift = 20, + .flags = CLOCK_SOURCE_IS_CONTINUOUS, +}; + +static void __init ep93xx_clocksource_init(void) { - int offset; + /* Reset time-stamp counter */ + __raw_writel(0x100, EP93XX_TIMER4_VALUE_HIGH); - offset = __raw_readl(EP93XX_TIMER4_VALUE_LOW) - last_jiffy_time; + clocksource_ep93xx.mult = + clocksource_hz2mult(983040, clocksource_ep93xx.shift); + clocksource_register(&clocksource_ep93xx); +} - /* Calculate (1000000 / 983040) * offset. */ - return offset + (53 * offset / 3072); +static void __init ep93xx_timer_init(void) +{ + ep93xx_clocksource_init(); + ep93xx_clockevent_init(); } struct sys_timer ep93xx_timer = { - .init = ep93xx_timer_init, - .offset = ep93xx_gettimeoffset, + .init = ep93xx_timer_init, }; @@ -510,7 +556,6 @@ static struct platform_device ep93xx_ohc .resource = ep93xx_ohci_resources, }; - void __init ep93xx_init_devices(void) { unsigned int v; Index: linux-2.6.24.7/include/asm-arm/arch-ep93xx/ep93xx-regs.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/arch-ep93xx/ep93xx-regs.h +++ linux-2.6.24.7/include/asm-arm/arch-ep93xx/ep93xx-regs.h @@ -67,6 +67,12 @@ #define EP93XX_TIMER3_CONTROL EP93XX_TIMER_REG(0x88) #define EP93XX_TIMER3_CLEAR EP93XX_TIMER_REG(0x8c) +#define EP93XX_TC_CLEAR 0x00000001 +#define EP93XX_TC123_ENABLE 0x00000080 +#define EP93XX_TC123_PERIODIC 0x00000040 +#define EP93XX_TC123_SEL_508KHZ 0x00000008 +#define EP93XX_TC4_ENABLE 0x00000100 + #define EP93XX_I2S_BASE (EP93XX_APB_VIRT_BASE + 0x00020000) #define EP93XX_SECURITY_BASE (EP93XX_APB_VIRT_BASE + 0x00030000) ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ep93xx-clockevents-fix.patch����������������������������������������������������������������0000664�0000764�0000764�00000002547�11041657732�016536� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: timer patch for ep93xx From: Manfred Gruber <m.gruber@tirol.com> hi ! this patch is necessary to get latencies < 1ms for ep93xx armv4t with 2.6.21.5-rt18. 
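The fractional-tick bookkeeping introduced by ep93xx-timer-accuracy.patch above is easiest to see with concrete numbers. With the 983.04 kHz timer and an assumed HZ of 100, each jiffy is worth 9830 whole ticks plus a remainder of 40; the remainder is accumulated and, whenever it reaches HZ, one extra tick is consumed so no time is lost. A standalone arithmetic sketch (plain C, values assumed, not kernel code):

#include <stdio.h>

int main(void)
{
	const unsigned int rate = 983040, hz = 100;	/* assumed HZ for the example */
	unsigned int ticks_per_jiffy = rate / hz;	/* 9830 */
	unsigned int ticks_mod_jiffy = rate % hz;	/* 40 */
	unsigned int accumulator = 0, next = 0, jiffy;

	for (jiffy = 0; jiffy < hz; jiffy++) {		/* simulate one second */
		next += ticks_per_jiffy;
		accumulator += ticks_mod_jiffy;
		if (accumulator >= hz) {		/* carry the fractional part */
			next++;
			accumulator -= hz;
		}
	}
	printf("ticks consumed in 1s: %u (expected %u)\n", next, rate);
	return 0;
}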
Signed-off-by: Manfred Gruber <m.gruber@tirol.com> --- arch/arm/mach-ep93xx/core.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-ep93xx/core.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-ep93xx/core.c +++ linux-2.6.24.7/arch/arm/mach-ep93xx/core.c @@ -98,9 +98,9 @@ static struct clock_event_device clockev static int ep93xx_timer_interrupt(int irq, void *dev_id) { - __raw_writel(EP93XX_TC_CLEAR, EP93XX_TIMER1_CLEAR); + __raw_writel(EP93XX_TC_CLEAR, EP93XX_TIMER1_CLEAR); - clockevent_ep93xx.event_handler(&clockevent_ep93xx); + clockevent_ep93xx.event_handler(&clockevent_ep93xx); return IRQ_HANDLED; } @@ -108,7 +108,15 @@ static int ep93xx_timer_interrupt(int ir static int ep93xx_set_next_event(unsigned long evt, struct clock_event_device *unused) { + u32 tmode = __raw_readl(EP93XX_TIMER1_CONTROL); + + /* stop timer */ + __raw_writel(tmode & ~EP93XX_TC123_ENABLE, EP93XX_TIMER1_CONTROL); + /* program timer */ __raw_writel(evt, EP93XX_TIMER1_LOAD); + /* start timer */ + __raw_writel(tmode | EP93XX_TC123_ENABLE, EP93XX_TIMER1_CONTROL); + return 0; } ���������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-leds-timer.patch������������������������������������������������������������������������0000664�0000764�0000764�00000001366�11041657734�015116� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������The clockevent layer now handles everything done by the ARM timer_tick() call, except the LED stuff. Here we add an arch_tick_leds() to handle LED toggling which is called by do_timer(). 
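For reference, ep93xx-clockevents.patch above converts nanoseconds to timer ticks with a fixed-point mult/shift pair: mult is roughly (rate << shift) / NSEC_PER_SEC, and a programmed delay becomes (ns * mult) >> shift ticks. A worked example for the 508.469 kHz timer (standalone C, values assumed, not kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t rate = 508469;			/* TIMER1 clock in Hz */
	const unsigned int shift = 32;			/* same shift as clockevent_ep93xx */
	uint64_t mult = (rate << shift) / 1000000000ULL;/* approximately what div_sc() yields */
	uint64_t delay_ns = 10 * 1000 * 1000ULL;	/* ask for an event 10 ms out */
	uint64_t ticks = (delay_ns * mult) >> shift;	/* ~5084 timer ticks */

	printf("mult=%llu, 10ms -> %llu ticks\n",
	       (unsigned long long)mult, (unsigned long long)ticks);
	return 0;
}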
Signed-off-by: Kevin Hilman <khilman@mvista.com> --- arch/arm/kernel/time.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-2.6.24.7/arch/arm/kernel/time.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/time.c +++ linux-2.6.24.7/arch/arm/kernel/time.c @@ -236,6 +236,13 @@ static inline void do_leds(void) #define do_leds() #endif +void arch_tick_leds(void) +{ +#ifdef CONFIG_LEDS_TIMER + do_leds(); +#endif +} + #ifndef CONFIG_GENERIC_TIME void do_gettimeofday(struct timeval *tv) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/spinlock-trylock-cleanup-sungem.patch�������������������������������������������������������0000664�0000764�0000764�00000001166�11041657733�020521� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/net/sungem.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) Index: linux-2.6.24.7/drivers/net/sungem.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/sungem.c +++ linux-2.6.24.7/drivers/net/sungem.c @@ -1031,10 +1031,8 @@ static int gem_start_xmit(struct sk_buff (csum_stuff_off << 21)); } - local_irq_save(flags); - if (!spin_trylock(&gp->tx_lock)) { + if (!spin_trylock_irqsave(&gp->tx_lock, flags)) { /* Tell upper layer to requeue */ - local_irq_restore(flags); return NETDEV_TX_LOCKED; } /* We raced with gem_do_stop() */ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86_64-tsc-sync-irqflags-fix.patch����������������������������������������������������������0000664�0000764�0000764�00000001405�11041657732�017355� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/tsc_sync.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/tsc_sync.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/tsc_sync.c +++ linux-2.6.24.7/arch/x86/kernel/tsc_sync.c @@ -97,6 +97,7 @@ static __cpuinit void check_tsc_warp(voi */ void __cpuinit check_tsc_sync_source(int cpu) { + unsigned long flags; int cpus = 2; /* @@ -117,8 +118,11 @@ void __cpuinit check_tsc_sync_source(int /* * Wait for the target to arrive: */ + local_save_flags(flags); + local_irq_enable(); while (atomic_read(&start_count) != cpus-1) cpu_relax(); + local_irq_restore(flags); /* * Trigger the target to continue into 
the measurement too: */ �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/neptune-no-at-keyboard.patch����������������������������������������������������������������0000664�0000764�0000764�00000003330�11041657730�016551� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������neptune needs this to boot ... --- drivers/input/keyboard/atkbd.c | 14 ++++++++++++++ drivers/input/mouse/psmouse-base.c | 15 +++++++++++++++ 2 files changed, 29 insertions(+) Index: linux-2.6.24.7/drivers/input/keyboard/atkbd.c =================================================================== --- linux-2.6.24.7.orig/drivers/input/keyboard/atkbd.c +++ linux-2.6.24.7/drivers/input/keyboard/atkbd.c @@ -1401,9 +1401,23 @@ static ssize_t atkbd_show_err_count(stru return sprintf(buf, "%lu\n", atkbd->err_count); } +static int __read_mostly noatkbd; + +static int __init noatkbd_setup(char *str) +{ + noatkbd = 1; + printk(KERN_INFO "debug: not setting up AT keyboard.\n"); + + return 1; +} + +__setup("noatkbd", noatkbd_setup); static int __init atkbd_init(void) { + if (noatkbd) + return 0; + return serio_register_driver(&atkbd_drv); } Index: linux-2.6.24.7/drivers/input/mouse/psmouse-base.c =================================================================== --- linux-2.6.24.7.orig/drivers/input/mouse/psmouse-base.c +++ linux-2.6.24.7/drivers/input/mouse/psmouse-base.c @@ -1598,10 +1598,25 @@ static int psmouse_get_maxproto(char *bu return sprintf(buffer, "%s\n", psmouse_protocol_by_type(type)->name); } +static int __read_mostly nopsmouse; + +static int __init nopsmouse_setup(char *str) +{ + nopsmouse = 1; + printk(KERN_INFO "debug: not setting up psmouse.\n"); + + return 1; +} + +__setup("nopsmouse", nopsmouse_setup); + static int __init psmouse_init(void) { int err; + if (nopsmouse) + return 0; + kpsmoused_wq = create_singlethread_workqueue("kpsmoused"); if (!kpsmoused_wq) { printk(KERN_ERR "psmouse: failed to create kpsmoused workqueue\n"); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rtmutex-debug.h-cleanup.patch���������������������������������������������������������������0000664�0000764�0000764�00000003011�11041657735�016731� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] lock debugging: clean up rtmutex-debug.h From: Ingo Molnar <mingo@elte.hu> style cleanups. 
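The neptune-no-at-keyboard.patch above relies on the kernel's __setup() hook: a boot-time command-line option flips a flag before the driver's initcall runs, so registration can be skipped entirely. A minimal sketch of the same pattern with an invented option and driver name:

#include <linux/init.h>
#include <linux/cache.h>
#include <linux/kernel.h>

static int __read_mostly nofoodrv;		/* "foodrv" is a made-up driver */

static int __init nofoodrv_setup(char *str)
{
	nofoodrv = 1;
	printk(KERN_INFO "foodrv: disabled on the kernel command line.\n");
	return 1;				/* option handled */
}
__setup("nofoodrv", nofoodrv_setup);

static int __init foodrv_init(void)
{
	if (nofoodrv)
		return 0;			/* skip registration, exactly like noatkbd */
	/* normal driver registration would happen here */
	return 0;
}
device_initcall(foodrv_init);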
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/rtmutex-debug.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex-debug.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex-debug.h +++ linux-2.6.24.7/kernel/rtmutex-debug.h @@ -17,17 +17,17 @@ extern void debug_rt_mutex_free_waiter(s extern void debug_rt_mutex_init(struct rt_mutex *lock, const char *name); extern void debug_rt_mutex_lock(struct rt_mutex *lock); extern void debug_rt_mutex_unlock(struct rt_mutex *lock); -extern void debug_rt_mutex_proxy_lock(struct rt_mutex *lock, - struct task_struct *powner); +extern void +debug_rt_mutex_proxy_lock(struct rt_mutex *lock, struct task_struct *powner); extern void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock); extern void debug_rt_mutex_deadlock(int detect, struct rt_mutex_waiter *waiter, struct rt_mutex *lock); extern void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter); -# define debug_rt_mutex_reset_waiter(w) \ +# define debug_rt_mutex_reset_waiter(w) \ do { (w)->deadlock_lock = NULL; } while (0) -static inline int debug_rt_mutex_detect_deadlock(struct rt_mutex_waiter *waiter, - int detect) +static inline int +debug_rt_mutex_detect_deadlock(struct rt_mutex_waiter *waiter, int detect) { - return (waiter != NULL); + return waiter != NULL; } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/netpoll-8139too-fix.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001175�11041657733�015656� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/net/8139too.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/net/8139too.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/8139too.c +++ linux-2.6.24.7/drivers/net/8139too.c @@ -2199,7 +2199,11 @@ static irqreturn_t rtl8139_interrupt (in */ static void rtl8139_poll_controller(struct net_device *dev) { - disable_irq(dev->irq); + /* + * use _nosync() variant - might be used by netconsole + * from atomic contexts: + */ + disable_irq_nosync(dev->irq); rtl8139_interrupt(dev->irq, dev); enable_irq(dev->irq); } 
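The netpoll-8139too-fix.patch above matters because netconsole may invoke the poll_controller from atomic context: disable_irq() waits for any in-flight handler to finish, so it can deadlock there, while disable_irq_nosync() only masks the line. The general shape of such a callback, with invented driver names:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

static irqreturn_t foo_interrupt(int irq, void *dev_id)	/* stand-in handler */
{
	/* real handler work would go here */
	return IRQ_HANDLED;
}

static void foo_poll_controller(struct net_device *dev)
{
	/*
	 * Mask the interrupt line without waiting for running handlers;
	 * we may already be in atomic context (netconsole), so the
	 * synchronous disable_irq() is not safe here.
	 */
	disable_irq_nosync(dev->irq);
	foo_interrupt(dev->irq, dev);
	enable_irq(dev->irq);
}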
���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kprobes-preempt-fix.patch�������������������������������������������������������������������0000664�0000764�0000764�00000002636�11041657734�016200� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/x86/kernel/kprobes_32.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/kprobes_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/kprobes_32.c +++ linux-2.6.24.7/arch/x86/kernel/kprobes_32.c @@ -332,7 +332,7 @@ ss_probe: /* Boost up -- we can execute copied instructions directly */ reset_current_kprobe(); regs->eip = (unsigned long)p->ainsn.insn; - preempt_enable_no_resched(); + preempt_enable(); return 1; } #endif @@ -341,7 +341,7 @@ ss_probe: return 1; no_kprobe: - preempt_enable_no_resched(); + preempt_enable(); return ret; } @@ -573,7 +573,7 @@ static int __kprobes post_kprobe_handler } reset_current_kprobe(); out: - preempt_enable_no_resched(); + preempt_enable(); /* * if somebody else is singlestepping across a probe point, eflags @@ -607,7 +607,7 @@ int __kprobes kprobe_fault_handler(struc restore_previous_kprobe(kcb); else reset_current_kprobe(); - preempt_enable_no_resched(); + preempt_enable(); break; case KPROBE_HIT_ACTIVE: case KPROBE_HIT_SSDONE: @@ -739,7 +739,7 @@ int __kprobes longjmp_break_handler(stru *regs = kcb->jprobe_saved_regs; memcpy((kprobe_opcode_t *) stack_addr, kcb->jprobes_stack, MIN_STACK_SIZE(stack_addr)); - preempt_enable_no_resched(); + preempt_enable(); return 1; } return 0; ��������������������������������������������������������������������������������������������������patches/replace-bugon-by-warn-on.patch��������������������������������������������������������������0000664�0000764�0000764�00000001141�11041657735�016776� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/mm/highmem_32.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/mm/highmem_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/highmem_32.c +++ linux-2.6.24.7/arch/x86/mm/highmem_32.c @@ -39,7 +39,7 @@ void *kmap_atomic_prot(struct page *page idx = type + KM_TYPE_NR*smp_processor_id(); vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx); - BUG_ON(!pte_none(*(kmap_pte-idx))); + WARN_ON_ONCE(!pte_none(*(kmap_pte-idx))); set_pte(kmap_pte-idx, mk_pte(page, prot)); arch_flush_lazy_mmu_mode(); 
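kprobes-preempt-fix.patch above swaps preempt_enable_no_resched() for preempt_enable(); only the latter checks for a pending reschedule when the preempt count drops back to zero, which is what keeps -rt latencies bounded after a probe fires. A simplified sketch of the distinction (assuming CONFIG_PREEMPT semantics):

#include <linux/preempt.h>

static void probe_epilogue_sketch(void)
{
	preempt_disable();
	/* ... single-step fixup work would run here ... */

	/*
	 * preempt_enable_no_resched() would merely decrement preempt_count;
	 * preempt_enable() additionally calls preempt_schedule() if a
	 * reschedule was requested while we were non-preemptible.
	 */
	preempt_enable();
}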
�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/i386-mark-atomic-irq-ops-raw.patch����������������������������������������������������������0000664�0000764�0000764�00000001163�11041657731�017336� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/asm-x86/atomic_32.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/asm-x86/atomic_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/atomic_32.h +++ linux-2.6.24.7/include/asm-x86/atomic_32.h @@ -195,10 +195,10 @@ static __inline__ int atomic_add_return( #ifdef CONFIG_M386 no_xadd: /* Legacy 386 processor */ - local_irq_save(flags); + raw_local_irq_save(flags); __i = atomic_read(v); atomic_set(v, i + __i); - local_irq_restore(flags); + raw_local_irq_restore(flags); return i + __i; #endif } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/msi-suspend-resume-workaround.patch���������������������������������������������������������0000664�0000764�0000764�00000000727�11041657732�020230� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/pci/msi.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/drivers/pci/msi.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/msi.c +++ linux-2.6.24.7/drivers/pci/msi.c @@ -241,6 +241,10 @@ static void __pci_restore_msi_state(stru return; entry = get_irq_msi(dev->irq); + if (!entry) { + WARN_ON(1); + return; + } pos = entry->msi_attrib.pos; pci_intx_for_msi(dev, 0); �����������������������������������������patches/floppy-resume-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000004175�11041657733�015667� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] floppy: suspend/resume fix From: Ingo Molnar <mingo@elte.hu> introduce a floppy 
platform-driver and suspend/resume ops to stop/start the floppy driver. Bug reported by Mikael Pettersson. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- drivers/block/floppy.c | 31 ++++++++++++++++++++++++++++++- 1 file changed, 30 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/block/floppy.c =================================================================== --- linux-2.6.24.7.orig/drivers/block/floppy.c +++ linux-2.6.24.7/drivers/block/floppy.c @@ -4149,6 +4149,28 @@ static void floppy_device_release(struct complete(&device_release); } +static int floppy_suspend(struct platform_device *dev, pm_message_t state) +{ + floppy_release_irq_and_dma(); + + return 0; +} + +static int floppy_resume(struct platform_device *dev) +{ + floppy_grab_irq_and_dma(); + + return 0; +} + +static struct platform_driver floppy_driver = { + .suspend = floppy_suspend, + .resume = floppy_resume, + .driver = { + .name = "floppy", + }, +}; + static struct platform_device floppy_device[N_DRIVE]; static struct kobject *floppy_find(dev_t dev, int *part, void *data) @@ -4197,10 +4219,14 @@ static int __init floppy_init(void) if (err) goto out_put_disk; + err = platform_driver_register(&floppy_driver); + if (err) + goto out_unreg_blkdev; + floppy_queue = blk_init_queue(do_fd_request, &floppy_lock); if (!floppy_queue) { err = -ENOMEM; - goto out_unreg_blkdev; + goto out_unreg_driver; } blk_queue_max_sectors(floppy_queue, 64); @@ -4349,6 +4375,8 @@ out_flush_work: out_unreg_region: blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); blk_cleanup_queue(floppy_queue); +out_unreg_driver: + platform_driver_unregister(&floppy_driver); out_unreg_blkdev: unregister_blkdev(FLOPPY_MAJOR, "fd"); out_put_disk: @@ -4544,6 +4572,7 @@ void cleanup_module(void) init_completion(&device_release); blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256); unregister_blkdev(FLOPPY_MAJOR, "fd"); + platform_driver_unregister(&floppy_driver); for (drive = 0; drive < N_DRIVE; drive++) { del_timer_sync(&motor_off_timer[drive]); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/hrtimers-overrun-api.patch������������������������������������������������������������������0000664�0000764�0000764�00000003045�11041673250�016361� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/hrtimer.h | 3 +++ kernel/hrtimer.c | 22 ++++++++++++++++++++++ 2 files changed, 25 insertions(+) Index: linux-2.6.24.7/include/linux/hrtimer.h =================================================================== --- linux-2.6.24.7.orig/include/linux/hrtimer.h +++ linux-2.6.24.7/include/linux/hrtimer.h @@ -297,6 +297,9 @@ static inline int hrtimer_is_queued(stru /* Forward a hrtimer so it expires after now: */ extern unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval); +/* Overrun count: */ +extern unsigned long +hrtimer_overrun(struct hrtimer *timer, ktime_t now, ktime_t interval); /* 
Precise sleep: */ extern long hrtimer_nanosleep(struct timespec *rqtp, Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -730,6 +730,28 @@ hrtimer_forward(struct hrtimer *timer, k } EXPORT_SYMBOL_GPL(hrtimer_forward); +unsigned long +hrtimer_overrun(struct hrtimer *timer, ktime_t now, ktime_t interval) +{ + unsigned long orun = 1; + ktime_t delta; + + delta = ktime_sub(now, timer->expires); + + if (delta.tv64 < 0) + return 0; + + if (interval.tv64 < timer->base->resolution.tv64) + interval.tv64 = timer->base->resolution.tv64; + + if (unlikely(delta.tv64 >= interval.tv64)) + orun = ktime_divns(delta, ktime_to_ns(interval)) + 1; + + return orun; +} +EXPORT_SYMBOL_GPL(hrtimer_overrun); + + /* * enqueue_hrtimer - internal function to (re)start a timer * �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mm-fix-latency.patch������������������������������������������������������������������������0000664�0000764�0000764�00000005571�11041657733�015127� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Hugh Dickins <hugh@veritas.com> Subject: reduce pagetable-freeing latencies 2.6.15-rc1 moved the unlinking of a vma from its prio_tree and anon_vma into free_pgtables: so the vma is hidden from rmap and vmtruncate before freeing its page tables, allowing safe descent without page table lock. But free_pgtables is still called with preemption disabled, and Lee Revell has now detected high latency there. The right fix will be to rework the mmu_gathering, not to need preemption disabled; but for now an ugly CONFIG_PREEMPT block in free_pgtables, to make an initial unlinking pass with preemption enabled - made uglier by CONFIG_IA64 definitions (only ia64 actually uses the start and end given to tlb_finish_mmu, and our floor and ceiling don't quite work for those). These CONFIG choices being to minimize the additional TLB flushing. 
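A possible caller of the hrtimer_overrun() helper exported above (illustrative only; the 10 ms period and callback name are assumptions): unlike hrtimer_forward(), it only reports how many intervals have elapsed without moving the timer, so a handler can account for missed periods before deciding how far to forward.

#include <linux/hrtimer.h>

static enum hrtimer_restart periodic_cb_sketch(struct hrtimer *timer)
{
	ktime_t period = ktime_set(0, 10 * NSEC_PER_MSEC);	/* assumed 10 ms period */
	ktime_t now = timer->base->get_time();
	unsigned long overruns = hrtimer_overrun(timer, now, period);

	if (overruns > 1) {
		/* overruns - 1 whole periods were missed; account for them here */
	}

	hrtimer_forward(timer, now, period);
	return HRTIMER_RESTART;
}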
Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> -- mm/memory.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) Index: linux-2.6.24.7/mm/memory.c =================================================================== --- linux-2.6.24.7.orig/mm/memory.c +++ linux-2.6.24.7/mm/memory.c @@ -261,18 +261,48 @@ void free_pgd_range(struct mmu_gather ** } while (pgd++, addr = next, addr != end); } +#ifdef CONFIG_IA64 +#define tlb_start_addr(tlb) (tlb)->start_addr +#define tlb_end_addr(tlb) (tlb)->end_addr +#else +#define tlb_start_addr(tlb) 0UL /* only ia64 really uses it */ +#define tlb_end_addr(tlb) 0UL /* only ia64 really uses it */ +#endif + void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma, unsigned long floor, unsigned long ceiling) { +#ifdef CONFIG_PREEMPT + struct vm_area_struct *unlink = vma; + int fullmm = (*tlb)->fullmm; + + if (!vma) /* Sometimes when exiting after an oops */ + return; + if (vma->vm_next) + tlb_finish_mmu(*tlb, tlb_start_addr(*tlb), tlb_end_addr(*tlb)); + /* + * Hide vma from rmap and vmtruncate before freeeing pgtables, + * with preemption enabled, except when unmapping just one area. + */ + while (unlink) { + anon_vma_unlink(unlink); + unlink_file_vma(unlink); + unlink = unlink->vm_next; + } + if (vma->vm_next) + *tlb = tlb_gather_mmu(vma->vm_mm, fullmm); +#endif while (vma) { struct vm_area_struct *next = vma->vm_next; unsigned long addr = vma->vm_start; +#ifndef CONFIG_PREEMPT /* * Hide vma from rmap and vmtruncate before freeing pgtables */ anon_vma_unlink(vma); unlink_file_vma(vma); +#endif if (is_vm_hugetlb_page(vma)) { hugetlb_free_pgd_range(tlb, addr, vma->vm_end, @@ -285,8 +315,10 @@ void free_pgtables(struct mmu_gather **t && !is_vm_hugetlb_page(next)) { vma = next; next = vma->vm_next; +#ifndef CONFIG_PREEMPT anon_vma_unlink(vma); unlink_file_vma(vma); +#endif } free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); ���������������������������������������������������������������������������������������������������������������������������������������patches/ioapic-fix-too-fast-clocks.patch������������������������������������������������������������0000664�0000764�0000764�00000002763�11041657730�017330� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Akira Tsukamoto <akira-t@s9.dion.ne.jp> This one line patch adds upper bound testing inside timer_irq_works() when evaluating whether irq timer works or not on boot up. It fix the machines having problem with clock running too fast. What this patch do is, if timer interrupts running too fast through IO-APIC IRQ then false back to i8259A IRQ. I really appreciate for the feedback from ATI Xpress 200 chipset user, It should eliminate the needs of adding no_timer_check on kernel options. I have NEC laptop using ATI Xpress 200 chipset with Pentium M 1.8GHz and its clock keep going forward when kernel compiled with local APIC support. Many machines based on RS200 chipset seem to have the same problem, including Acer Ferrari 400X AMD notebook or Compaq R4000. Also I would like to have comments on upper bound limit, 16 ticks, which I chose in this patch. 
My laptop always reports around 20, which is double from normal. arch/x86/kernel/io_apic_32.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/io_apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_32.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_32.c @@ -1900,7 +1900,7 @@ static int __init timer_irq_works(void) * might have cached one ExtINT interrupt. Finally, at * least one tick may be lost due to delays. */ - if (jiffies - t1 > 4) + if (jiffies - t1 > 4 && jiffies - t1 < 16) return 1; return 0; �������������patches/fix-acpi-build-weirdness.patch��������������������������������������������������������������0000664�0000764�0000764�00000001152�11041657733�017062� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/x86/pci/Makefile_32 | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/pci/Makefile_32 =================================================================== --- linux-2.6.24.7.orig/arch/x86/pci/Makefile_32 +++ linux-2.6.24.7/arch/x86/pci/Makefile_32 @@ -4,8 +4,9 @@ obj-$(CONFIG_PCI_BIOS) += pcbios.o obj-$(CONFIG_PCI_MMCONFIG) += mmconfig_32.o direct.o mmconfig-shared.o obj-$(CONFIG_PCI_DIRECT) += direct.o +obj-$(CONFIG_ACPI) += acpi.o + pci-y := fixup.o -pci-$(CONFIG_ACPI) += acpi.o pci-y += legacy.o irq.o pci-$(CONFIG_X86_VISWS) := visws.o fixup.o ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/write-try-lock-irqsave.patch����������������������������������������������������������������0000664�0000764�0000764�00000001170�11041657732�016627� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/spinlock.h | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -289,6 +289,13 @@ do { \ 1 : ({ local_irq_restore(flags); 0; }); \ }) +#define write_trylock_irqsave(lock, flags) \ +({ \ + local_irq_save(flags); \ + write_trylock(lock) ? \ + 1 : ({ local_irq_restore(flags); 0; }); \ +}) + /* * Locks two spinlocks l1 and l2. * l1_first indicates if spinlock l1 should be taken first. 
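The write_trylock_irqsave() helper added in write-try-lock-irqsave.patch above mirrors the existing spin_trylock_irqsave(): interrupts are disabled first and restored again by the macro itself when the lock cannot be taken. A usage sketch with an invented lock:

#include <linux/spinlock.h>

static DEFINE_RWLOCK(foo_state_lock);		/* hypothetical lock */

static int foo_try_update_sketch(void)
{
	unsigned long flags;

	if (!write_trylock_irqsave(&foo_state_lock, flags))
		return -EBUSY;	/* contended: flags already restored by the macro */

	/* ... modify state protected by foo_state_lock ... */

	write_unlock_irqrestore(&foo_state_lock, flags);
	return 0;
}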
��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/move-native-irq.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000001711�11041657733�015310� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/irq/migration.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/irq/migration.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/migration.c +++ linux-2.6.24.7/kernel/irq/migration.c @@ -61,6 +61,7 @@ void move_masked_irq(int irq) void move_native_irq(int irq) { struct irq_desc *desc = irq_desc + irq; + int mask = 1; if (likely(!(desc->status & IRQ_MOVE_PENDING))) return; @@ -68,8 +69,17 @@ void move_native_irq(int irq) if (unlikely(desc->status & IRQ_DISABLED)) return; - desc->chip->mask(irq); + /* + * If the irq is already in progress, it should be masked. + * If we unmask it, we might cause an interrupt storm on RT. + */ + if (unlikely(desc->status & IRQ_INPROGRESS)) + mask = 0; + + if (mask) + desc->chip->mask(irq); move_masked_irq(irq); - desc->chip->unmask(irq); + if (mask) + desc->chip->unmask(irq); } �������������������������������������������������������patches/dont-unmask-io_apic.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001352�11041657733�016127� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/io_apic_64.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/io_apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_64.c @@ -1440,7 +1440,8 @@ static void ack_apic_level(unsigned int irq_complete_move(irq); #if defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE) /* If we are moving the irq we need to mask it */ - if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING)) { + if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING) && + !(irq_desc[irq].status & IRQ_INPROGRESS)) { do_unmask_irq = 1; mask_IO_APIC_irq(irq); } 
��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/gcc-warnings-shut-up.patch������������������������������������������������������������������0000664�0000764�0000764�00000004104�11041657731�016247� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� kernel/audit.c | 2 +- net/core/flow.c | 2 +- net/sunrpc/svc.c | 2 +- sound/core/control_compat.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/audit.c =================================================================== --- linux-2.6.24.7.orig/kernel/audit.c +++ linux-2.6.24.7/kernel/audit.c @@ -1130,7 +1130,7 @@ struct audit_buffer *audit_log_start(str { struct audit_buffer *ab = NULL; struct timespec t; - unsigned int serial; + unsigned int serial = 0 /* shut up gcc */; int reserve; unsigned long timeout_start = jiffies; Index: linux-2.6.24.7/net/core/flow.c =================================================================== --- linux-2.6.24.7.orig/net/core/flow.c +++ linux-2.6.24.7/net/core/flow.c @@ -169,7 +169,7 @@ static int flow_key_compare(struct flowi void *flow_cache_lookup(struct flowi *key, u16 family, u8 dir, flow_resolve_t resolver) { - struct flow_cache_entry *fle, **head; + struct flow_cache_entry *fle, **head = NULL /* shut up GCC */; unsigned int hash; int cpu; Index: linux-2.6.24.7/net/sunrpc/svc.c =================================================================== --- linux-2.6.24.7.orig/net/sunrpc/svc.c +++ linux-2.6.24.7/net/sunrpc/svc.c @@ -547,7 +547,7 @@ __svc_create_thread(svc_thread_fn func, struct svc_rqst *rqstp; int error = -ENOMEM; int have_oldmask = 0; - cpumask_t oldmask; + cpumask_t oldmask = CPU_MASK_NONE /* shut up GCC */; rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL); if (!rqstp) Index: linux-2.6.24.7/sound/core/control_compat.c =================================================================== --- linux-2.6.24.7.orig/sound/core/control_compat.c +++ linux-2.6.24.7/sound/core/control_compat.c @@ -219,7 +219,7 @@ static int copy_ctl_value_from_user(stru struct snd_ctl_elem_value32 __user *data32, int *typep, int *countp) { - int i, type, count, size; + int i, type, count = 0 /* shut up gcc warning */, size; unsigned int indirect; if (copy_from_user(&data->id, &data32->id, sizeof(data->id))) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/apic-dumpstack.patch������������������������������������������������������������������������0000664�0000764�0000764�00000000733�11041657732�015174� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/x86/kernel/apic_32.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/arch/x86/kernel/apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/apic_32.c +++ linux-2.6.24.7/arch/x86/kernel/apic_32.c @@ -1311,6 +1311,7 @@ void smp_error_interrupt(struct pt_regs */ printk (KERN_DEBUG "APIC error on CPU%d: %02lx(%02lx)\n", smp_processor_id(), v , v1); + dump_stack(); irq_exit(); } �������������������������������������patches/netfilter-more-debugging.patch��������������������������������������������������������������0000664�0000764�0000764�00000001637�11041657735�017163� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� doing netfilter changes and turning on netfilter debug means we've got to interpret netfilter warning messages a bit more. --- include/net/netfilter/nf_conntrack.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/net/netfilter/nf_conntrack.h =================================================================== --- linux-2.6.24.7.orig/include/net/netfilter/nf_conntrack.h +++ linux-2.6.24.7/include/net/netfilter/nf_conntrack.h @@ -63,11 +63,14 @@ union nf_conntrack_help { #ifdef CONFIG_NETFILTER_DEBUG #define NF_CT_ASSERT(x) \ do { \ - if (!(x)) \ + if (!(x)) { \ /* Wooah! I'm tripping my conntrack in a frenzy of \ netplay... 
*/ \ printk("NF_CT_ASSERT: %s:%i(%s)\n", \ __FILE__, __LINE__, __FUNCTION__); \ + if (printk_ratelimit()) \ + WARN_ON(1); \ + } \ } while(0) #else #define NF_CT_ASSERT(x) �������������������������������������������������������������������������������������������������patches/nmi-profiling-base.patch��������������������������������������������������������������������0000664�0000764�0000764�00000027054�11041657732�015756� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] nmi-driven profiling for /proc/profile From: Ingo Molnar <mingo@elte.hu> nmi-driven profiling for /proc/profile Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/crash.c | 8 ---- arch/x86/kernel/irq_64.c | 2 + arch/x86/kernel/nmi_32.c | 89 ++++++++++++++++++++++++++++++++++++++++++---- arch/x86/kernel/nmi_64.c | 64 +++++++++++++++++++++++++++++++-- include/asm-x86/apic_32.h | 2 + include/asm-x86/apic_64.h | 2 + include/linux/profile.h | 1 kernel/profile.c | 9 +++- kernel/time/tick-common.c | 1 kernel/time/tick-sched.c | 2 - 10 files changed, 156 insertions(+), 24 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/crash.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/crash.c +++ linux-2.6.24.7/arch/x86/kernel/crash.c @@ -78,14 +78,6 @@ static int crash_nmi_callback(struct not return 1; } -static void smp_send_nmi_allbutself(void) -{ - cpumask_t mask = cpu_online_map; - cpu_clear(safe_smp_processor_id(), mask); - if (!cpus_empty(mask)) - send_IPI_mask(mask, NMI_VECTOR); -} - static struct notifier_block crash_nmi_nb = { .notifier_call = crash_nmi_callback, }; Index: linux-2.6.24.7/arch/x86/kernel/irq_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/irq_64.c +++ linux-2.6.24.7/arch/x86/kernel/irq_64.c @@ -147,6 +147,8 @@ asmlinkage unsigned int do_IRQ(struct pt unsigned vector = ~regs->orig_rax; unsigned irq; + irq_show_regs_callback(smp_processor_id(), regs); + exit_idle(); irq_enter(); irq = __get_cpu_var(vector_irq)[vector]; Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -25,6 +25,7 @@ #include <asm/smp.h> #include <asm/nmi.h> +#include <asm/mach-default/mach_ipi.h> #include "mach_traps.h" @@ -42,7 +43,7 @@ static cpumask_t backtrace_mask = CPU_MA atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ unsigned int nmi_watchdog = NMI_DEFAULT; -static unsigned int nmi_hz = HZ; +static unsigned int nmi_hz = 1000; static DEFINE_PER_CPU(short, wd_enabled); @@ -93,7 +94,7 @@ static int __init check_nmi_watchdog(voi for_each_possible_cpu(cpu) prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count; local_irq_enable(); - mdelay((20*1000)/nmi_hz); // wait 20 ticks + mdelay((100*1000)/nmi_hz); /* wait 100 ticks */ for_each_possible_cpu(cpu) { #ifdef CONFIG_SMP @@ -318,6 +319,46 @@ EXPORT_SYMBOL(touch_nmi_watchdog); extern void die_nmi(struct pt_regs *, const char *msg); +int nmi_show_regs[NR_CPUS]; + +void nmi_show_all_regs(void) +{ + int i; + + if (system_state == SYSTEM_BOOTING) + return; + + 
printk(KERN_WARNING "nmi_show_all_regs(): start on CPU#%d.\n", + raw_smp_processor_id()); + dump_stack(); + + for_each_online_cpu(i) + nmi_show_regs[i] = 1; + + smp_send_nmi_allbutself(); + + for_each_online_cpu(i) { + while (nmi_show_regs[i] == 1) + barrier(); + } +} + +static DEFINE_SPINLOCK(nmi_print_lock); + +void irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return; + + nmi_show_regs[cpu] = 0; + spin_lock(&nmi_print_lock); + printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); + printk(KERN_WARNING "apic_timer_irqs: %d\n", + per_cpu(irq_stat, cpu).apic_timer_irqs); + show_regs(regs); + spin_unlock(&nmi_print_lock); +} + notrace __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { @@ -332,6 +373,8 @@ nmi_watchdog_tick(struct pt_regs * regs, int cpu = smp_processor_id(); int rc=0; + __profile_tick(CPU_PROFILING, regs); + /* check for other users first */ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP) { @@ -356,6 +399,9 @@ nmi_watchdog_tick(struct pt_regs * regs, sum = per_cpu(irq_stat, cpu).apic_timer_irqs + per_cpu(irq_stat, cpu).irq0_irqs; + irq_show_regs_callback(cpu, regs); + + /* if the apic timer isn't firing, this cpu isn't doing much */ /* if the none of the timers isn't firing, this cpu isn't doing much */ if (!touched && last_irq_sums[cpu] == sum) { /* @@ -363,11 +409,30 @@ nmi_watchdog_tick(struct pt_regs * regs, * wait a few IRQs (5 seconds) before doing the oops ... */ alert_counter[cpu]++; - if (alert_counter[cpu] == 5*nmi_hz) - /* - * die_nmi will return ONLY if NOTIFY_STOP happens.. - */ - die_nmi(regs, "BUG: NMI Watchdog detected LOCKUP"); + if (alert_counter[cpu] && !(alert_counter[cpu] % (5*nmi_hz))) { + int i; + + spin_lock(&nmi_print_lock); + printk(KERN_WARNING "NMI watchdog detected lockup on " + "CPU#%d (%d/%d)\n", cpu, alert_counter[cpu], + 5*nmi_hz); + show_regs(regs); + spin_unlock(&nmi_print_lock); + + for_each_online_cpu(i) { + if (i == cpu) + continue; + nmi_show_regs[i] = 1; + while (nmi_show_regs[i] == 1) + cpu_relax(); + } + printk(KERN_WARNING "NMI watchdog running again ...\n"); + for_each_online_cpu(i) + alert_counter[i] = 0; + + + } + } else { last_irq_sums[cpu] = sum; alert_counter[cpu] = 0; @@ -465,5 +530,15 @@ void __trigger_all_cpu_backtrace(void) } } +void smp_send_nmi_allbutself(void) +{ +#ifdef CONFIG_SMP + cpumask_t mask = cpu_online_map; + cpu_clear(safe_smp_processor_id(), mask); + if (!cpus_empty(mask)) + send_IPI_mask(mask, NMI_VECTOR); +#endif +} + EXPORT_SYMBOL(nmi_active); EXPORT_SYMBOL(nmi_watchdog); Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -20,11 +20,13 @@ #include <linux/kprobes.h> #include <linux/cpumask.h> #include <linux/kdebug.h> +#include <linux/kernel_stat.h> #include <asm/smp.h> #include <asm/nmi.h> #include <asm/proto.h> #include <asm/mce.h> +#include <asm/mach_apic.h> int unknown_nmi_panic; int nmi_watchdog_enabled; @@ -42,7 +44,7 @@ atomic_t nmi_active = ATOMIC_INIT(0); / int panic_on_timeout; unsigned int nmi_watchdog = NMI_DEFAULT; -static unsigned int nmi_hz = HZ; +static unsigned int nmi_hz = 1000; static DEFINE_PER_CPU(short, wd_enabled); @@ -301,7 +303,7 @@ void touch_nmi_watchdog(void) unsigned cpu; /* - * Tell other CPUs to reset their alert counters. We cannot + * Tell other CPUs to reset their alert counters. 
We cannot * do it ourselves because the alert count increase is not * atomic. */ @@ -314,6 +316,41 @@ void touch_nmi_watchdog(void) touch_softlockup_watchdog(); } +int nmi_show_regs[NR_CPUS]; + +void nmi_show_all_regs(void) +{ + int i; + + if (system_state == SYSTEM_BOOTING) + return; + + smp_send_nmi_allbutself(); + + for_each_online_cpu(i) + nmi_show_regs[i] = 1; + + for_each_online_cpu(i) { + while (nmi_show_regs[i] == 1) + barrier(); + } +} + +static DEFINE_SPINLOCK(nmi_print_lock); + +void irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return; + + nmi_show_regs[cpu] = 0; + spin_lock(&nmi_print_lock); + printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); + printk(KERN_WARNING "apic_timer_irqs: %d\n", read_pda(apic_timer_irqs)); + show_regs(regs); + spin_unlock(&nmi_print_lock); +} + notrace int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { @@ -322,6 +359,9 @@ nmi_watchdog_tick(struct pt_regs * regs, int cpu = smp_processor_id(); int rc = 0; + irq_show_regs_callback(cpu, regs); + __profile_tick(CPU_PROFILING, regs); + /* check for other users first */ if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP) { @@ -358,9 +398,20 @@ nmi_watchdog_tick(struct pt_regs * regs, * wait a few IRQs (5 seconds) before doing the oops ... */ local_inc(&__get_cpu_var(alert_counter)); - if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz) + if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz) { + int i; + + for_each_online_cpu(i) { + if (i == cpu) + continue; + nmi_show_regs[i] = 1; + while (nmi_show_regs[i] == 1) + cpu_relax(); + } + die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs, panic_on_timeout); + } } else { __get_cpu_var(last_irq_sum) = sum; local_set(&__get_cpu_var(alert_counter), 0); @@ -478,6 +529,13 @@ void __trigger_all_cpu_backtrace(void) } } +void smp_send_nmi_allbutself(void) +{ +#ifdef CONFIG_SMP + send_IPI_allbutself(NMI_VECTOR); +#endif +} + EXPORT_SYMBOL(nmi_active); EXPORT_SYMBOL(nmi_watchdog); EXPORT_SYMBOL(touch_nmi_watchdog); Index: linux-2.6.24.7/include/asm-x86/apic_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/apic_32.h +++ linux-2.6.24.7/include/asm-x86/apic_32.h @@ -118,6 +118,8 @@ extern int local_apic_timer_c2_ok; extern int local_apic_timer_disabled; +extern void smp_send_nmi_allbutself(void); + #else /* !CONFIG_X86_LOCAL_APIC */ static inline void lapic_shutdown(void) { } #define local_apic_timer_c2_ok 1 Index: linux-2.6.24.7/include/asm-x86/apic_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/apic_64.h +++ linux-2.6.24.7/include/asm-x86/apic_64.h @@ -87,6 +87,8 @@ extern void setup_APIC_extended_lvt(unsi extern int apic_is_clustered_box(void); +extern void smp_send_nmi_allbutself(void); + #define K8_APIC_EXT_LVT_BASE 0x500 #define K8_APIC_EXT_INT_MSG_FIX 0x0 #define K8_APIC_EXT_INT_MSG_SMI 0x2 Index: linux-2.6.24.7/include/linux/profile.h =================================================================== --- linux-2.6.24.7.orig/include/linux/profile.h +++ linux-2.6.24.7/include/linux/profile.h @@ -23,6 +23,7 @@ struct notifier_block; /* init basic kernel profiler */ void __init profile_init(void); +void __profile_tick(int type, struct pt_regs *regs); void profile_tick(int); /* Index: linux-2.6.24.7/kernel/profile.c =================================================================== --- linux-2.6.24.7.orig/kernel/profile.c +++ 
linux-2.6.24.7/kernel/profile.c @@ -412,16 +412,19 @@ void profile_hits(int type, void *__pc, EXPORT_SYMBOL_GPL(profile_hits); -void profile_tick(int type) +void __profile_tick(int type, struct pt_regs *regs) { - struct pt_regs *regs = get_irq_regs(); - if (type == CPU_PROFILING && timer_hook) timer_hook(regs); if (!user_mode(regs) && cpu_isset(smp_processor_id(), prof_cpu_mask)) profile_hit(type, (void *)profile_pc(regs)); } +void profile_tick(int type) +{ + return __profile_tick(type, get_irq_regs()); +} + #ifdef CONFIG_PROC_FS #include <linux/proc_fs.h> #include <asm/uaccess.h> Index: linux-2.6.24.7/kernel/time/tick-common.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-common.c +++ linux-2.6.24.7/kernel/time/tick-common.c @@ -68,7 +68,6 @@ static void tick_periodic(int cpu) } update_process_times(user_mode(get_irq_regs())); - profile_tick(CPU_PROFILING); } /* Index: linux-2.6.24.7/kernel/time/tick-sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-sched.c +++ linux-2.6.24.7/kernel/time/tick-sched.c @@ -440,7 +440,6 @@ static void tick_nohz_handler(struct clo } update_process_times(user_mode(regs)); - profile_tick(CPU_PROFILING); /* Do not restart, when we are in the idle loop */ if (ts->tick_stopped) @@ -554,7 +553,6 @@ static enum hrtimer_restart tick_sched_t */ spin_unlock(&base->lock); update_process_times(user_mode(regs)); - profile_tick(CPU_PROFILING); spin_lock(&base->lock); } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/latency-tracing-ppc.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001437�11041657731�016134� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/ppc/boot/Makefile | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux-2.6.24.7/arch/ppc/boot/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/ppc/boot/Makefile +++ linux-2.6.24.7/arch/ppc/boot/Makefile @@ -15,6 +15,15 @@ # KBUILD_CFLAGS used when building rest of boot (takes effect recursively) KBUILD_CFLAGS += -fno-builtin -D__BOOTER__ -Iarch/$(ARCH)/boot/include + +ifdef CONFIG_MCOUNT +# do not trace the boot loader +nullstring := +space := $(nullstring) # end of the line +pg_flag = $(nullstring) -pg # end of the line +KBUILD_CFLAGS := $(subst ${pg_flag},${space},${KBUILD_CFLAGS}) +endif + HOSTCFLAGS += -Iarch/$(ARCH)/boot/include BOOT_TARGETS = zImage zImage.initrd znetboot znetboot.initrd 
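The nmi-profiling-base.patch above splits profile_tick() so that NMI context, which already owns a pt_regs pointer and must not rely on get_irq_regs(), can feed its registers straight into the profiler, while ordinary tick callers keep the old interface. A short sketch of the two call paths follows; example_nmi_tick() and example_timer_tick() are hypothetical callers used only for illustration, not code from the series.

/*
 * Illustration only: how the profile_tick()/__profile_tick() split
 * introduced above is meant to be used.  NMI context passes its own
 * regs, regular callers keep the wrapper built on get_irq_regs().
 */
static int example_nmi_tick(struct pt_regs *regs)
{
	__profile_tick(CPU_PROFILING, regs);	/* profile from NMI */
	return 0;
}

static void example_timer_tick(void)
{
	profile_tick(CPU_PROFILING);		/* wrapper fetches regs itself */
}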
���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/latency-tracing-arm.patch�������������������������������������������������������������������0000664�0000764�0000764�00000026546�11041657734�016144� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/arm/boot/compressed/head.S | 13 ++++ arch/arm/kernel/entry-common.S | 109 ++++++++++++++++++++++++++++++++++++++++ arch/arm/kernel/fiq.c | 4 - arch/arm/kernel/irq.c | 2 arch/arm/mm/copypage-v4mc.c | 4 - arch/arm/mm/copypage-xscale.c | 4 - arch/arm/mm/fault.c | 14 ++--- include/asm-arm/pgalloc.h | 4 - include/asm-arm/timex.h | 10 +++ include/asm-arm/unistd.h | 4 + 10 files changed, 151 insertions(+), 17 deletions(-) Index: linux-2.6.24.7/arch/arm/boot/compressed/head.S =================================================================== --- linux-2.6.24.7.orig/arch/arm/boot/compressed/head.S +++ linux-2.6.24.7/arch/arm/boot/compressed/head.S @@ -928,6 +928,19 @@ memdump: mov r12, r0 #endif .ltorg +#ifdef CONFIG_MCOUNT +/* CONFIG_MCOUNT causes boot header to be built with -pg requiring this + * trampoline + */ + .text + .align 0 + .type mcount %function + .global mcount +mcount: + mov pc, lr @ just return +#endif + + reloc_end: .align Index: linux-2.6.24.7/arch/arm/kernel/entry-common.S =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/entry-common.S +++ linux-2.6.24.7/arch/arm/kernel/entry-common.S @@ -3,6 +3,8 @@ * * Copyright (C) 2000 Russell King * + * FUNCTION_TRACE/mcount support (C) 2005 Timesys john.cooper@timesys.com + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. @@ -395,5 +397,112 @@ ENTRY(sys_oabi_call_table) #undef ABI #undef OBSOLETE +#ifdef CONFIG_FRAME_POINTER + +#ifdef CONFIG_MCOUNT +/* + * At the point where we are in mcount() we maintain the + * frame of the prologue code and keep the call to mcount() + * out of the stack frame list: + + saved pc <---\ caller of instrumented routine + saved lr | + ip/prev_sp | + fp -----^ | + : | + | + -> saved pc | instrumented routine + | saved lr | + | ip/prev_sp | + | fp ---------/ + | : + | + | mcount + | saved pc + | saved lr + | ip/prev sp + -- fp + r3 + r2 + r1 + sp-> r0 + : + */ + + .text + .align 0 + .type mcount %function + .global mcount + +/* gcc -pg generated FUNCTION_PROLOGUE references mcount() + * and has already created the stack frame invocation for + * the routine we have been called to instrument. We create + * a complete frame nevertheless, as we want to use the same + * call to mcount() from c code. 
+ */ +mcount: + + ldr ip, =mcount_enabled @ leave early, if disabled + ldr ip, [ip] + cmp ip, #0 + moveq pc,lr + + mov ip, sp + stmdb sp!, {r0 - r3, fp, ip, lr, pc} @ create stack frame + + ldr r1, [fp, #-4] @ get lr (the return address + @ of the caller of the + @ instrumented function) + mov r0, lr @ get lr - (the return address + @ of the instrumented function) + + sub fp, ip, #4 @ point fp at this frame + + bl __trace +1: + ldmdb fp, {r0 - r3, fp, sp, pc} @ pop entry frame and return + +#endif + +/* ARM replacement for unsupported gcc __builtin_return_address(n) + * where 0 < n. n == 0 is supported here as well. + * + * Walk up the stack frame until the desired frame is found or a NULL + * fp is encountered, return NULL in the latter case. + * + * Note: it is possible under code optimization for the stack invocation + * of an ancestor function (level N) to be removed before calling a + * descendant function (level N+1). No easy means is available to deduce + * this scenario with the result being [for example] caller_addr(0) when + * called from level N+1 returning level N-1 rather than the expected + * level N. This optimization issue appears isolated to the case of + * a call to a level N+1 routine made at the tail end of a level N + * routine -- the level N frame is deleted and a simple branch is made + * to the level N+1 routine. + */ + + .text + .align 0 + .type arm_return_addr %function + .global arm_return_addr + +arm_return_addr: + mov ip, r0 + mov r0, fp +3: + cmp r0, #0 + beq 1f @ frame list hit end, bail + cmp ip, #0 + beq 2f @ reached desired frame + ldr r0, [r0, #-12] @ else continue, get next fp + sub ip, ip, #1 + b 3b +2: + ldr r0, [r0, #-4] @ get target return address +1: + mov pc, lr + +#endif + #endif Index: linux-2.6.24.7/arch/arm/kernel/fiq.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/fiq.c +++ linux-2.6.24.7/arch/arm/kernel/fiq.c @@ -89,7 +89,7 @@ void set_fiq_handler(void *start, unsign * disable irqs for the duration. Note - these functions are almost * entirely coded in assembly. */ -void __attribute__((naked)) set_fiq_regs(struct pt_regs *regs) +void notrace __attribute__((naked)) set_fiq_regs(struct pt_regs *regs) { register unsigned long tmp; asm volatile ( @@ -107,7 +107,7 @@ void __attribute__((naked)) set_fiq_regs : "r" (®s->ARM_r8), "I" (PSR_I_BIT | PSR_F_BIT | FIQ_MODE)); } -void __attribute__((naked)) get_fiq_regs(struct pt_regs *regs) +void notrace __attribute__((naked)) get_fiq_regs(struct pt_regs *regs) { register unsigned long tmp; asm volatile ( Index: linux-2.6.24.7/arch/arm/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/irq.c +++ linux-2.6.24.7/arch/arm/kernel/irq.c @@ -110,7 +110,7 @@ static struct irq_desc bad_irq_desc = { * come via this function. Instead, they should provide their * own 'handler' */ -asmlinkage void __exception asm_do_IRQ(unsigned int irq, struct pt_regs *regs) +asmlinkage void __exception notrace asm_do_IRQ(unsigned int irq, struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); struct irq_desc *desc = irq_desc + irq; Index: linux-2.6.24.7/arch/arm/mm/copypage-v4mc.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/copypage-v4mc.c +++ linux-2.6.24.7/arch/arm/mm/copypage-v4mc.c @@ -44,7 +44,7 @@ static DEFINE_SPINLOCK(minicache_lock); * instruction. 
If your processor does not supply this, you have to write your * own copy_user_page that does the right thing. */ -static void __attribute__((naked)) +static void notrace __attribute__((naked)) mc_copy_user_page(void *from, void *to) { asm volatile( @@ -88,7 +88,7 @@ void v4_mc_copy_user_page(void *kto, con /* * ARMv4 optimised clear_user_page */ -void __attribute__((naked)) +void notrace __attribute__((naked)) v4_mc_clear_user_page(void *kaddr, unsigned long vaddr) { asm volatile( Index: linux-2.6.24.7/arch/arm/mm/copypage-xscale.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/copypage-xscale.c +++ linux-2.6.24.7/arch/arm/mm/copypage-xscale.c @@ -42,7 +42,7 @@ static DEFINE_SPINLOCK(minicache_lock); * Dcache aliasing issue. The writes will be forwarded to the write buffer, * and merged as appropriate. */ -static void __attribute__((naked)) +static void notrace __attribute__((naked)) mc_copy_user_page(void *from, void *to) { /* @@ -110,7 +110,7 @@ void xscale_mc_copy_user_page(void *kto, /* * XScale optimised clear_user_page */ -void __attribute__((naked)) +void notrace __attribute__((naked)) xscale_mc_clear_user_page(void *kaddr, unsigned long vaddr) { asm volatile( Index: linux-2.6.24.7/arch/arm/mm/fault.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/fault.c +++ linux-2.6.24.7/arch/arm/mm/fault.c @@ -215,7 +215,7 @@ out: return fault; } -static int +static notrace int do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { struct task_struct *tsk; @@ -311,7 +311,7 @@ no_context: * interrupt or a critical region, and should only copy the information * from the master page table, nothing more. */ -static int +static notrace int do_translation_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { @@ -354,7 +354,7 @@ bad_area: * Some section permission faults need to be handled gracefully. * They can happen due to a __{get,put}_user during an oops. */ -static int +static notrace int do_sect_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { do_bad_area(addr, fsr, regs); @@ -364,7 +364,7 @@ do_sect_fault(unsigned long addr, unsign /* * This abort handler always returns "fault". */ -static int +static notrace int do_bad(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { return 1; @@ -419,7 +419,7 @@ static struct fsr_info { { do_bad, SIGBUS, 0, "unknown 31" } }; -void __init +void __init notrace hook_fault_code(int nr, int (*fn)(unsigned long, unsigned int, struct pt_regs *), int sig, const char *name) { @@ -433,7 +433,7 @@ hook_fault_code(int nr, int (*fn)(unsign /* * Dispatch a data abort to the relevant handler. */ -asmlinkage void __exception +asmlinkage void __exception notrace do_DataAbort(unsigned long addr, unsigned int fsr, struct pt_regs *regs) { const struct fsr_info *inf = fsr_info + (fsr & 15) + ((fsr & (1 << 10)) >> 6); @@ -452,7 +452,7 @@ do_DataAbort(unsigned long addr, unsigne arm_notify_die("", regs, &info, fsr, 0); } -asmlinkage void __exception +asmlinkage void __exception notrace do_PrefetchAbort(unsigned long addr, struct pt_regs *regs) { do_translation_fault(addr, 0, regs); Index: linux-2.6.24.7/include/asm-arm/pgalloc.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/pgalloc.h +++ linux-2.6.24.7/include/asm-arm/pgalloc.h @@ -109,7 +109,7 @@ static inline void __pmd_populate(pmd_t * * Ensure that we always set both PMD entries. 
*/ -static inline void +static inline void notrace pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep) { unsigned long pte_ptr = (unsigned long)ptep; @@ -122,7 +122,7 @@ pmd_populate_kernel(struct mm_struct *mm __pmd_populate(pmdp, __pa(pte_ptr) | _PAGE_KERNEL_TABLE); } -static inline void +static inline void notrace pmd_populate(struct mm_struct *mm, pmd_t *pmdp, struct page *ptep) { __pmd_populate(pmdp, page_to_pfn(ptep) << PAGE_SHIFT | _PAGE_USER_TABLE); Index: linux-2.6.24.7/include/asm-arm/timex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/timex.h +++ linux-2.6.24.7/include/asm-arm/timex.h @@ -16,9 +16,17 @@ typedef unsigned long cycles_t; +#ifndef mach_read_cycles + #define mach_read_cycles() (0) +#ifdef CONFIG_LATENCY_TIMING + #define mach_cycles_to_usecs(d) (d) + #define mach_usecs_to_cycles(d) (d) +#endif +#endif + static inline cycles_t get_cycles (void) { - return 0; + return mach_read_cycles(); } #endif Index: linux-2.6.24.7/include/asm-arm/unistd.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/unistd.h +++ linux-2.6.24.7/include/asm-arm/unistd.h @@ -380,6 +380,10 @@ #define __NR_eventfd (__NR_SYSCALL_BASE+351) #define __NR_fallocate (__NR_SYSCALL_BASE+352) +#ifndef __ASSEMBLY__ +#define NR_syscalls (__NR_fallocate + 1 - __NR_SYSCALL_BASE) +#endif + /* * The following SWIs are ARM private. */ ����������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-latency-tracer-support.patch������������������������������������������������������������0000664�0000764�0000764�00000005173�11041657734�017500� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������add latency tracer support for EP93xx boards Add latency tracer support for the EP93xx platform. This is done by: - adding the correct Kconfig options - add (an empty) save_stack_trace implementation. -> Someone needs to implement save_stack_trace for arm :) Maybe we can use the implementation from rmk? 
- implementing mach_read_cycles (read out EP93XX_TIMER4_VALUE_LOW) - implementing mach_cycles_to_usecs (just the same way as for the PXA platform) - implementing mach_usecs_to_cycles (just the same way as for the PXA platform) Signed-off-by: Jan Altenberg <jan@linutronix.de> --- arch/arm/Kconfig | 4 ++++ arch/arm/lib/Makefile | 1 + arch/arm/lib/stacktrace.c | 7 +++++++ include/asm-arm/arch-ep93xx/timex.h | 6 ++++++ 4 files changed, 18 insertions(+) Index: linux-2.6.24.7/arch/arm/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/arm/Kconfig +++ linux-2.6.24.7/arch/arm/Kconfig @@ -33,6 +33,10 @@ config GENERIC_CLOCKEVENTS bool default n +config STACKTRACE_SUPPORT + bool + default y + config MMU bool default y Index: linux-2.6.24.7/arch/arm/lib/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/arm/lib/Makefile +++ linux-2.6.24.7/arch/arm/lib/Makefile @@ -41,6 +41,7 @@ lib-$(CONFIG_ARCH_RPC) += ecard.o io-ac lib-$(CONFIG_ARCH_CLPS7500) += io-acorn.o lib-$(CONFIG_ARCH_L7200) += io-acorn.o lib-$(CONFIG_ARCH_SHARK) += io-shark.o +lib-$(CONFIG_STACKTRACE) += stacktrace.o $(obj)/csumpartialcopy.o: $(obj)/csumpartialcopygeneric.S $(obj)/csumpartialcopyuser.o: $(obj)/csumpartialcopygeneric.S Index: linux-2.6.24.7/arch/arm/lib/stacktrace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/arch/arm/lib/stacktrace.c @@ -0,0 +1,7 @@ +#include <linux/sched.h> +#include <linux/stacktrace.h> + +void save_stack_trace(struct stack_trace *trace) +{ +} + Index: linux-2.6.24.7/include/asm-arm/arch-ep93xx/timex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/arch-ep93xx/timex.h +++ linux-2.6.24.7/include/asm-arm/arch-ep93xx/timex.h @@ -1,5 +1,11 @@ /* * linux/include/asm-arm/arch-ep93xx/timex.h */ +#include <asm-arm/arch-ep93xx/ep93xx-regs.h> +#include <asm-arm/io.h> #define CLOCK_TICK_RATE 983040 + +#define mach_read_cycles() __raw_readl(EP93XX_TIMER4_VALUE_LOW) +#define mach_cycles_to_usecs(d) (((d) * ((1000000LL << 32) / CLOCK_TICK_RATE)) >> 32) +#define mach_usecs_to_cycles(d) (((d) * (((long long)CLOCK_TICK_RATE << 32) / 1000000)) >> 32) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/random-driver-latency-fix.patch�������������������������������������������������������������0000664�0000764�0000764�00000001731�11041657734�017262� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� drivers/char/random.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/drivers/char/random.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/random.c +++ linux-2.6.24.7/drivers/char/random.c @@ -580,8 +580,11 @@ static void add_timer_randomness(struct preempt_disable(); /* if 
over the trickle threshold, use only 1 in 4096 samples */ if (input_pool.entropy_count > trickle_thresh && - (__get_cpu_var(trickle_count)++ & 0xfff)) - goto out; + (__get_cpu_var(trickle_count)++ & 0xfff)) { + preempt_enable(); + return; + } + preempt_enable(); sample.jiffies = jiffies; sample.cycles = get_cycles(); @@ -626,9 +629,6 @@ static void add_timer_randomness(struct if(input_pool.entropy_count >= random_read_wakeup_thresh) wake_up_interruptible(&random_read_wait); - -out: - preempt_enable(); } void add_input_randomness(unsigned int type, unsigned int code, ���������������������������������������patches/latency-measurement-drivers.patch�����������������������������������������������������������0000664�0000764�0000764�00000045535�11041657734�017740� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� this patch adds: - histogram support to /dev/rtc - the /dev/blocker lock-latency test-device - the /dev/lpptest parallel-port irq latency test-device drivers/char/Kconfig | 40 ++++++++++ drivers/char/Makefile | 2 drivers/char/blocker.c | 109 +++++++++++++++++++++++++++++ drivers/char/lpptest.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++++ drivers/char/rtc.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++- scripts/Makefile | 3 scripts/testlpp.c | 159 +++++++++++++++++++++++++++++++++++++++++++ 7 files changed, 668 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/drivers/char/Kconfig =================================================================== --- linux-2.6.24.7.orig/drivers/char/Kconfig +++ linux-2.6.24.7/drivers/char/Kconfig @@ -753,6 +753,46 @@ config JS_RTC To compile this driver as a module, choose M here: the module will be called js-rtc. +config RTC_HISTOGRAM + bool "Real Time Clock Histogram Support" + default n + depends on RTC + ---help--- + If you say Y here then the kernel will track the delivery and + wakeup latency of /dev/rtc using tasks and will report a + histogram to the kernel log when the application closes /dev/rtc. + +config BLOCKER + tristate "Priority Inheritance Debugging (Blocker) Device Support" + depends on X86 + default y + ---help--- + If you say Y here then a device will be created that the userspace + pi_test suite uses to test and measure kernel locking primitives. + +config LPPTEST + tristate "Parallel Port Based Latency Measurement Device" + depends on !PARPORT && X86 + default y + ---help--- + If you say Y here then a device will be created that the userspace + testlpp utility uses to measure IRQ latencies of a target system + from an independent measurement system. + + NOTE: this code assumes x86 PCs and that the parallel port is + bidirectional and is on IRQ 7. + + to use the device, both the target and the source system needs to + run a kernel with CONFIG_LPPTEST enabled. To measure latencies, + use the scripts/testlpp utility in your kernel source directory, + and run it (as root) on the source system - it will start printing + out the latencies it took to get a response from the target system: + + Latency of response: 12.2 usecs (121265 cycles) + + then generate various workloads on the target system to see how + (worst-case-) latencies are impacted. 
+ config SGI_DS1286 tristate "SGI DS1286 RTC support" depends on SGI_IP22 Index: linux-2.6.24.7/drivers/char/Makefile =================================================================== --- linux-2.6.24.7.orig/drivers/char/Makefile +++ linux-2.6.24.7/drivers/char/Makefile @@ -85,6 +85,8 @@ obj-$(CONFIG_TOSHIBA) += toshiba.o obj-$(CONFIG_I8K) += i8k.o obj-$(CONFIG_DS1620) += ds1620.o obj-$(CONFIG_HW_RANDOM) += hw_random/ +obj-$(CONFIG_BLOCKER) += blocker.o +obj-$(CONFIG_LPPTEST) += lpptest.o obj-$(CONFIG_COBALT_LCD) += lcd.o obj-$(CONFIG_PPDEV) += ppdev.o obj-$(CONFIG_NWBUTTON) += nwbutton.o Index: linux-2.6.24.7/drivers/char/blocker.c =================================================================== --- /dev/null +++ linux-2.6.24.7/drivers/char/blocker.c @@ -0,0 +1,109 @@ +/* + * priority inheritance testing device + */ + +#include <linux/fs.h> +#include <linux/miscdevice.h> +#include <linux/timex.h> +#include <linux/sched.h> + +#define BLOCKER_MINOR 221 + +#define BLOCK_IOCTL 4245 +#define BLOCK_SET_DEPTH 4246 + +#define BLOCKER_MAX_LOCK_DEPTH 10 + +void loop(int loops) +{ + int i; + + for (i = 0; i < loops; i++) + get_cycles(); +} + +static spinlock_t blocker_lock[BLOCKER_MAX_LOCK_DEPTH]; + +static unsigned int lock_depth = 1; + +void do_the_lock_and_loop(unsigned int args) +{ + int i, max; + + if (rt_task(current)) + max = lock_depth; + else if (lock_depth > 1) + max = (current->pid % lock_depth) + 1; + else + max = 1; + + /* Always lock from the top down */ + for (i = max-1; i >= 0; i--) + spin_lock(&blocker_lock[i]); + loop(args); + for (i = 0; i < max; i++) + spin_unlock(&blocker_lock[i]); +} + +static int blocker_open(struct inode *in, struct file *file) +{ + printk(KERN_INFO "blocker_open called\n"); + + return 0; +} + +static long blocker_ioctl(struct file *file, + unsigned int cmd, unsigned long args) +{ + switch(cmd) { + case BLOCK_IOCTL: + do_the_lock_and_loop(args); + return 0; + case BLOCK_SET_DEPTH: + if (args >= BLOCKER_MAX_LOCK_DEPTH) + return -EINVAL; + lock_depth = args; + return 0; + default: + return -EINVAL; + } +} + +static struct file_operations blocker_fops = { + .owner = THIS_MODULE, + .llseek = no_llseek, + .unlocked_ioctl = blocker_ioctl, + .open = blocker_open, +}; + +static struct miscdevice blocker_dev = +{ + BLOCKER_MINOR, + "blocker", + &blocker_fops +}; + +static int __init blocker_init(void) +{ + int i; + + if (misc_register(&blocker_dev)) + return -ENODEV; + + for (i = 0; i < BLOCKER_MAX_LOCK_DEPTH; i++) + spin_lock_init(blocker_lock + i); + + return 0; +} + +void __exit blocker_exit(void) +{ + printk(KERN_INFO "blocker device uninstalled\n"); + misc_deregister(&blocker_dev); +} + +module_init(blocker_init); +module_exit(blocker_exit); + +MODULE_LICENSE("GPL"); + Index: linux-2.6.24.7/drivers/char/lpptest.c =================================================================== --- /dev/null +++ linux-2.6.24.7/drivers/char/lpptest.c @@ -0,0 +1,178 @@ +/* + * /dev/lpptest device: test IRQ handling latencies over parallel port + * + * Copyright (C) 2005 Thomas Gleixner, Ingo Molnar + * + * licensed under the GPL + * + * You need to have CONFIG_PARPORT disabled for this device, it is a + * completely self-contained device that assumes sole ownership of the + * parallel port. 
+ */ +#include <linux/sched.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/list.h> +#include <linux/irq.h> +#include <linux/interrupt.h> +#include <linux/fs.h> +#include <linux/delay.h> +#include <asm/uaccess.h> +#include <asm/io.h> +#include <asm/rtc.h> + +/* + * API wrappers so that the code can be shared with the -rt tree: + */ +#ifndef local_irq_disable +# define local_irq_disable local_irq_disable +# define local_irq_enable local_irq_enable +#endif + +#ifndef IRQ_NODELAY +# define IRQ_NODELAY 0 +# define IRQF_NODELAY 0 +#endif + +/* + * Driver: + */ +#define LPPTEST_CHAR_MAJOR 245 +#define LPPTEST_DEVICE_NAME "lpptest" + +#define LPPTEST_IRQ 7 + +#define LPPTEST_TEST _IOR (LPPTEST_CHAR_MAJOR, 1, unsigned long long) +#define LPPTEST_DISABLE _IOR (LPPTEST_CHAR_MAJOR, 2, unsigned long long) +#define LPPTEST_ENABLE _IOR (LPPTEST_CHAR_MAJOR, 3, unsigned long long) + +static char dev_id[] = "lpptest"; + +#define INIT_PORT() outb(0x04, 0x37a) +#define ENABLE_IRQ() outb(0x10, 0x37a) +#define DISABLE_IRQ() outb(0, 0x37a) + +static unsigned char out = 0x5a; + +/** + * Interrupt handler. Flip a bit in the reply. + */ +static int lpptest_irq (int irq, void *dev_id) +{ + out ^= 0xff; + outb(out, 0x378); + + return IRQ_HANDLED; +} + +static cycles_t test_response(void) +{ + cycles_t now, end; + unsigned char in; + int timeout = 0; + + local_irq_disable(); + in = inb(0x379); + inb(0x378); + outb(0x08, 0x378); + now = get_cycles(); + while(1) { + if (inb(0x379) != in) + break; + if (timeout++ > 1000000) { + outb(0x00, 0x378); + local_irq_enable(); + + return 0; + } + } + end = get_cycles(); + outb(0x00, 0x378); + local_irq_enable(); + + return end - now; +} + +static int lpptest_open(struct inode *inode, struct file *file) +{ + return 0; +} + +static int lpptest_close(struct inode *inode, struct file *file) +{ + return 0; +} + +int lpptest_ioctl(struct inode *inode, struct file *file, unsigned int ioctl_num, unsigned long ioctl_param) +{ + int retval = 0; + + switch (ioctl_num) { + + case LPPTEST_DISABLE: + DISABLE_IRQ(); + break; + + case LPPTEST_ENABLE: + ENABLE_IRQ(); + break; + + case LPPTEST_TEST: { + + cycles_t diff = test_response(); + if (copy_to_user((void *)ioctl_param, (void*) &diff, sizeof(diff))) + goto errcpy; + break; + } + default: retval = -EINVAL; + } + + return retval; + + errcpy: + return -EFAULT; +} + +static struct file_operations lpptest_dev_fops = { + .ioctl = lpptest_ioctl, + .open = lpptest_open, + .release = lpptest_close, +}; + +static int __init lpptest_init (void) +{ + if (register_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME, &lpptest_dev_fops)) + { + printk(KERN_NOTICE "Can't allocate major number %d for lpptest.\n", + LPPTEST_CHAR_MAJOR); + return -EAGAIN; + } + + if (request_irq (LPPTEST_IRQ, lpptest_irq, 0, "lpptest", dev_id)) { + printk (KERN_WARNING "lpptest: irq %d in use. 
Unload parport module!\n", LPPTEST_IRQ); + unregister_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME); + return -EAGAIN; + } + irq_desc[LPPTEST_IRQ].status |= IRQ_NODELAY; + irq_desc[LPPTEST_IRQ].action->flags |= IRQF_NODELAY | IRQF_DISABLED; + + INIT_PORT(); + ENABLE_IRQ(); + + return 0; +} +module_init (lpptest_init); + +static void __exit lpptest_exit (void) +{ + DISABLE_IRQ(); + + free_irq(LPPTEST_IRQ, dev_id); + unregister_chrdev(LPPTEST_CHAR_MAJOR, LPPTEST_DEVICE_NAME); +} +module_exit (lpptest_exit); + +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("lpp test module"); + Index: linux-2.6.24.7/drivers/char/rtc.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/rtc.c +++ linux-2.6.24.7/drivers/char/rtc.c @@ -90,6 +90,32 @@ #include <linux/pci.h> #include <asm/ebus.h> +#ifdef CONFIG_MIPS +# include <asm/time.h> +#endif + +#ifdef CONFIG_RTC_HISTOGRAM + +static cycles_t last_interrupt_time; + +#include <asm/timex.h> + +#define CPU_MHZ (cpu_khz / 1000) + +#define HISTSIZE 10000 +static int histogram[HISTSIZE]; + +static int rtc_state; + +enum rtc_states { + S_STARTUP, /* First round - let the application start */ + S_IDLE, /* Waiting for an interrupt */ + S_WAITING_FOR_READ, /* Signal delivered. waiting for rtc_read() */ + S_READ_MISSED, /* Signal delivered, read() deadline missed */ +}; + +#endif + static unsigned long rtc_port; static int rtc_irq = PCI_IRQ_NONE; #endif @@ -222,7 +248,146 @@ static inline unsigned char rtc_is_updat return uip; } +#ifndef RTC_IRQ +# undef CONFIG_RTC_HISTOGRAM +#endif + +static inline void rtc_open_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + int i; + + last_interrupt_time = 0; + rtc_state = S_STARTUP; + rtc_irq_data = 0; + + for (i = 0; i < HISTSIZE; i++) + histogram[i] = 0; +#endif +} + +static inline void rtc_wake_event(void) +{ +#ifndef CONFIG_RTC_HISTOGRAM + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); +#else + if (!(rtc_status & RTC_IS_OPEN)) + return; + + switch (rtc_state) { + /* Startup */ + case S_STARTUP: + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); + break; + /* Waiting for an interrupt */ + case S_IDLE: + kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); + last_interrupt_time = get_cycles(); + rtc_state = S_WAITING_FOR_READ; + break; + + /* Signal has been delivered. waiting for rtc_read() */ + case S_WAITING_FOR_READ: + /* + * Well foo. The usermode application didn't + * schedule and read in time. + */ + last_interrupt_time = get_cycles(); + rtc_state = S_READ_MISSED; + printk("Read missed before next interrupt\n"); + break; + /* Signal has been delivered, read() deadline was missed */ + case S_READ_MISSED: + /* + * Not much we can do here. We're waiting for the usermode + * application to read the rtc + */ + last_interrupt_time = get_cycles(); + break; + } +#endif +} + +static inline void rtc_read_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + cycles_t now = get_cycles(); + + switch (rtc_state) { + /* Startup */ + case S_STARTUP: + rtc_state = S_IDLE; + break; + + /* Waiting for an interrupt */ + case S_IDLE: + printk("bug in rtc_read(): called in state S_IDLE!\n"); + break; + case S_WAITING_FOR_READ: /* + * Signal has been delivered. + * waiting for rtc_read() + */ + /* + * Well done + */ + case S_READ_MISSED: /* + * Signal has been delivered, read() + * deadline was missed + */ + /* + * So, you finally got here. 
+ */ + if (!last_interrupt_time) + printk("bug in rtc_read(): last_interrupt_time = 0\n"); + rtc_state = S_IDLE; + { + cycles_t latency = now - last_interrupt_time; + unsigned long delta; /* Microseconds */ + + delta = latency; + delta /= CPU_MHZ; + + if (delta > 1000 * 1000) { + printk("rtc: eek\n"); + } else { + unsigned long slot = delta; + if (slot >= HISTSIZE) + slot = HISTSIZE - 1; + histogram[slot]++; + if (delta > 2000) + printk("wow! That was a " + "%ld millisec bump\n", + delta / 1000); + } + } + rtc_state = S_IDLE; + break; + } +#endif +} + +static inline void rtc_close_event(void) +{ +#ifdef CONFIG_RTC_HISTOGRAM + int i = 0; + unsigned long total = 0; + + for (i = 0; i < HISTSIZE; i++) + total += histogram[i]; + if (!total) + return; + + printk("\nrtc latency histogram of {%s/%d, %lu samples}:\n", + current->comm, current->pid, total); + for (i = 0; i < HISTSIZE; i++) { + if (histogram[i]) + printk("%d %d\n", i, histogram[i]); + } +#endif +} + #ifdef RTC_IRQ + /* * A very tiny interrupt handler. It runs with IRQF_DISABLED set, * but there is possibility of conflicting with the set_rtc_mmss() @@ -266,9 +431,9 @@ irqreturn_t rtc_interrupt(int irq, void if (rtc_callback) rtc_callback->func(rtc_callback->private_data); spin_unlock(&rtc_task_lock); - wake_up_interruptible(&rtc_wait); - kill_fasync (&rtc_async_queue, SIGIO, POLL_IN); + rtc_wake_event(); + wake_up_interruptible(&rtc_wait); return IRQ_HANDLED; } @@ -378,6 +543,8 @@ static ssize_t rtc_read(struct file *fil schedule(); } while (1); + rtc_read_event(); + if (count == sizeof(unsigned int)) retval = put_user(data, (unsigned int __user *)buf) ?: sizeof(int); else @@ -610,6 +777,11 @@ static int rtc_do_ioctl(unsigned int cmd save_freq_select = CMOS_READ(RTC_FREQ_SELECT); CMOS_WRITE((save_freq_select|RTC_DIV_RESET2), RTC_FREQ_SELECT); + /* + * Make CMOS date writes nonpreemptible even on PREEMPT_RT. + * There's a limit to everything! 
=B-) + */ + preempt_disable(); #ifdef CONFIG_MACH_DECSTATION CMOS_WRITE(real_yrs, RTC_DEC_YEAR); #endif @@ -619,6 +791,7 @@ static int rtc_do_ioctl(unsigned int cmd CMOS_WRITE(hrs, RTC_HOURS); CMOS_WRITE(min, RTC_MINUTES); CMOS_WRITE(sec, RTC_SECONDS); + preempt_enable(); CMOS_WRITE(save_control, RTC_CONTROL); CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT); @@ -717,6 +890,7 @@ static int rtc_open(struct inode *inode, if(rtc_status & RTC_IS_OPEN) goto out_busy; + rtc_open_event(); rtc_status |= RTC_IS_OPEN; rtc_irq_data = 0; @@ -772,6 +946,7 @@ no_irq: rtc_irq_data = 0; rtc_status &= ~RTC_IS_OPEN; spin_unlock_irq (&rtc_lock); + rtc_close_event(); return 0; } Index: linux-2.6.24.7/scripts/Makefile =================================================================== --- linux-2.6.24.7.orig/scripts/Makefile +++ linux-2.6.24.7/scripts/Makefile @@ -12,6 +12,9 @@ hostprogs-$(CONFIG_LOGO) += pnmt hostprogs-$(CONFIG_VT) += conmakehash hostprogs-$(CONFIG_PROM_CONSOLE) += conmakehash hostprogs-$(CONFIG_IKCONFIG) += bin2c +ifdef CONFIG_LPPTEST +hostprogs-y += testlpp +endif always := $(hostprogs-y) $(hostprogs-m) Index: linux-2.6.24.7/scripts/testlpp.c =================================================================== --- /dev/null +++ linux-2.6.24.7/scripts/testlpp.c @@ -0,0 +1,159 @@ +/* + * testlpp.c: use the /dev/lpptest device to test IRQ handling + * latencies over parallel port + * + * Copyright (C) 2005 Thomas Gleixner + * + * licensed under the GPL + */ +#include <unistd.h> +#include <stdio.h> +#include <string.h> +#include <signal.h> +#include <stdlib.h> +#include <fcntl.h> +#include <sys/io.h> +#include <sys/ioctl.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> + +#define LPPTEST_CHAR_MAJOR 245 +#define LPPTEST_DEVICE_NAME "lpptest" + +#define LPPTEST_TEST _IOR (LPPTEST_CHAR_MAJOR, 1, unsigned long long) +#define LPPTEST_DISABLE _IOR (LPPTEST_CHAR_MAJOR, 2, unsigned long long) +#define LPPTEST_ENABLE _IOR (LPPTEST_CHAR_MAJOR, 3, unsigned long long) + +#define HIST_SIZE 10000 + +static int hist_total; +static unsigned long hist[HIST_SIZE]; + +static void hist_hit(unsigned long usecs) +{ + hist_total++; + if (usecs >= HIST_SIZE-1) + hist[HIST_SIZE-1]++; + else + hist[usecs]++; +} + +static void print_hist(void) +{ + int i; + + printf("LPP latency histogram:\n"); + + for (i = 0; i < HIST_SIZE; i++) { + if (hist[i]) + printf("%3d usecs: %9ld\n", i, hist[i]); + } +} + +static inline unsigned long long int rdtsc(void) +{ + unsigned long long int x, y; + for (;;) { + __asm__ volatile ("rdtsc" : "=A" (x)); + __asm__ volatile ("rdtsc" : "=A" (y)); + if (y - x < 1000) + return y; + } +} + +static unsigned long long calibrate_loop(void) +{ + unsigned long long mytime1, mytime2; + + mytime1 = rdtsc(); + usleep(500000); + mytime2 = rdtsc(); + + return (mytime2 - mytime1) * 2; +} + +#define time_to_usecs(time) ((double)time*1000000.0/(double)cycles_per_sec) + +#define time_to_usecs_l(time) (long)(time*1000000/cycles_per_sec) + +int fd, total; +unsigned long long tim, sum_tim, min_tim = -1ULL, max_tim, cycles_per_sec; + +void cleanup(int sig) +{ + ioctl (fd, LPPTEST_ENABLE, &tim); + if (sig) + printf("[ interrupted - exiting ]\n"); + printf("\ntotal number of responses: %d\n", total); + printf("average reponse latency: %.2lf usecs\n", + time_to_usecs(sum_tim/total)); + printf("minimum latency: %.2lf usecs\n", + time_to_usecs(min_tim)); + printf("maximum latency: %.2lf usecs\n", + time_to_usecs(max_tim)); + print_hist(); + exit(0); +} + +#define HZ 3000 + +int main (int argc, char 
**argv) +{ + unsigned int nr_requests = 0; + + if (argc > 2) { + fprintf(stderr, "usage: testlpp [<nr_of_requests>]\n"); + exit(-1); + } + if (argc == 2) + nr_requests = atol(argv[1]); + + if (getuid() != 0) { + fprintf(stderr, "need to run as root!\n"); + exit(-1); + } + mknod("/dev/lpptest", S_IFCHR|0666, makedev(245, 1)); + + fd = open("/dev/lpptest", O_RDWR); + if (fd == -1) { + fprintf(stderr, "could not open /dev/lpptest, your kernel doesnt have CONFIG_LPPTEST enabled?\n"); + exit(-1); + } + + signal(SIGINT,&cleanup); + + ioctl (fd, LPPTEST_DISABLE, &tim); + + fprintf(stderr, "calibrating cycles to usecs: "); + cycles_per_sec = calibrate_loop(); + fprintf(stderr, "%lld cycles per usec\n", cycles_per_sec/1000000); + if (nr_requests) + fprintf(stderr, "[max # of requests: %u]\n", nr_requests); + fprintf(stderr, "starting %dHz test, hit Ctrl-C to stop:\n\n", HZ); + + while(1) { + ioctl (fd, LPPTEST_TEST, &tim); + if (tim == 0) + printf ("No response from target.\n"); + else { + hist_hit(time_to_usecs_l(tim)); + if (tim > max_tim) { + printf ("new max latency: %.2lf usecs (%Ld cycles)\n", time_to_usecs(tim), tim); + max_tim = tim; + } + if (tim < min_tim) + min_tim = tim; + total++; + if (total == nr_requests) + break; + sum_tim += tim; + } + usleep(1000000/HZ); + } + cleanup(0); + + return 0; +} + + �������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/latency-measurement-drivers-fix.patch�������������������������������������������������������0000664�0000764�0000764�00000006552�11041657730�020514� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-rt-users-owner@vger.kernel.org Mon May 21 18:04:37 2007 Return-Path: <linux-rt-users-owner@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.1 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id CCDEA65C065; Mon, 21 May 2007 18:04:37 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754957AbXEUQEg (ORCPT <rfc822;jan.altenberg@linutronix.de> + 1 other); Mon, 21 May 2007 12:04:36 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754985AbXEUQEg (ORCPT <rfc822;linux-rt-users-outgoing>); Mon, 21 May 2007 12:04:36 -0400 Received: from relay00.pair.com ([209.68.5.9]:4558 "HELO relay00.pair.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1754957AbXEUQEf (ORCPT <rfc822;linux-rt-users@vger.kernel.org>); Mon, 21 May 2007 12:04:35 -0400 Received: (qmail 64058 invoked from network); 21 May 2007 16:04:33 -0000 Received: from 24.241.238.207 (HELO ?127.0.0.1?) (24.241.238.207) by relay00.pair.com with SMTP; 21 May 2007 16:04:33 -0000 X-pair-Authenticated: 24.241.238.207 Message-ID: <4651C310.1090008@cybsft.com> Date: Mon, 21 May 2007 11:04:32 -0500 From: "K.R. Foley" <kr@cybsft.com> Organization: Cybersoft Solutions, Inc. 
User-Agent: Thunderbird 2.0.0.0 (X11/20070326) MIME-Version: 1.0 To: Ingo Molnar <mingo@elte.hu> CC: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, Thomas Gleixner <tglx@linutronix.de> Subject: Re: v2.6.21-rt3 References: <20070517194143.GA25394@elte.hu> In-Reply-To: <20070517194143.GA25394@elte.hu> X-Enigmail-Version: 0.95.0 Content-Type: multipart/mixed; boundary="------------020805030708060904050208" Sender: linux-rt-users-owner@vger.kernel.org Precedence: bulk X-Mailing-List: linux-rt-users@vger.kernel.org X-Filter-To: .Kernel.rt-users X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ This is a multi-part message in MIME format. --------------020805030708060904050208 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit Ingo Molnar wrote: > i'm pleased to announce the v2.6.21-rt3 kernel, which can be downloaded > from the usual place: > This is actually regarding v2.6.21-rt5 but I don't remember seeing an announcement for that one. The attached patch is necessary if you happen to have RTC_HISTOGRAM enabled, which I'm guessing most folks don't. BTW, what was the consensus on pagefault_enable and pagefault_disable? -- kr --------------020805030708060904050208 Content-Type: text/x-patch; name="fixrtc.patch" Content-Disposition: inline; filename="fixrtc.patch" Content-Transfer-Encoding: 8bit --- drivers/char/rtc.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/drivers/char/rtc.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/rtc.c +++ linux-2.6.24.7/drivers/char/rtc.c @@ -93,6 +93,9 @@ #ifdef CONFIG_MIPS # include <asm/time.h> #endif +static unsigned long rtc_port; +static int rtc_irq = PCI_IRQ_NONE; +#endif #ifdef CONFIG_RTC_HISTOGRAM @@ -116,10 +119,6 @@ enum rtc_states { #endif -static unsigned long rtc_port; -static int rtc_irq = PCI_IRQ_NONE; -#endif - #ifdef CONFIG_HPET_RTC_IRQ #undef RTC_IRQ #endif ������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep-show-held-locks.patch���������������������������������������������������������������0000664�0000764�0000764�00000003756�11041657734�016723� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] lockdep: show held locks when showing a stackdump From: Ingo Molnar <mingo@elte.hu> show held locks when printing a backtrace. 
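For illustration only (not part of this change): with the hook below in place, every arch show_stack() path ends with a held-locks dump for the task being traced. A debug path that wants the same combined output explicitly could be sketched roughly as follows — dump_task_with_locks() is a made-up helper name; show_stack() and debug_show_held_locks() are the existing interfaces this patch wires together.

	#include <linux/sched.h>
	#include <linux/debug_locks.h>

	static void dump_task_with_locks(struct task_struct *task)
	{
		/* backtrace of the task (current stack when task == current) */
		show_stack(task, NULL);
		/*
		 * Held-lock list: prints "INFO: lockdep is turned off." when
		 * lockdep is disabled, and with this patch it only reports
		 * locks when the task is the current one.
		 */
		debug_show_held_locks(task);
	}
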
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/traps_32.c | 1 + arch/x86/kernel/traps_64.c | 1 + kernel/lockdep.c | 9 +++++++-- 3 files changed, 9 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -271,6 +271,7 @@ static void show_stack_log_lvl(struct ta } printk("\n%sCall Trace:\n", log_lvl); show_trace_log_lvl(task, regs, esp, log_lvl); + debug_show_held_locks(task); } void show_stack(struct task_struct *task, unsigned long *esp) Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -320,6 +320,7 @@ print_trace_warning_symbol(void *data, c { print_symbol(msg, symbol); printk("\n"); + debug_show_held_locks(tsk); } static void print_trace_warning(void *data, char *msg) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -512,7 +512,11 @@ static void print_lock(struct held_lock static void lockdep_print_held_locks(struct task_struct *curr) { - int i, depth = curr->lockdep_depth; + int i, depth; + + if (!curr) + curr = current; + depth = curr->lockdep_depth; if (!depth) { printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr)); @@ -3229,7 +3233,8 @@ void debug_show_held_locks(struct task_s printk("INFO: lockdep is turned off.\n"); return; } - lockdep_print_held_locks(task); + if (task == current) + lockdep_print_held_locks(task); } EXPORT_SYMBOL_GPL(debug_show_held_locks); ������������������patches/lockdep-lock_set_subclass.patch�������������������������������������������������������������0000664�0000764�0000764�00000007201�11041657734�017407� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] lockdep: lock_set_subclass - reset a held lock's subclass From: Peter Zijlstra <a.p.zijlstra@chello.nl> this can be used to reset a held lock's subclass, for arbitrary-depth iterated data structures such as trees or lists which have per-node locks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/lockdep.h | 4 ++ kernel/lockdep.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 73 insertions(+) Index: linux-2.6.24.7/include/linux/lockdep.h =================================================================== --- linux-2.6.24.7.orig/include/linux/lockdep.h +++ linux-2.6.24.7/include/linux/lockdep.h @@ -304,6 +304,9 @@ extern void lock_acquire(struct lockdep_ extern void lock_release(struct lockdep_map *lock, int nested, unsigned long ip); +extern void lock_set_subclass(struct lockdep_map *lock, unsigned int subclass, + unsigned long ip); + # define INIT_LOCKDEP .lockdep_recursion = 0, #define lockdep_depth(tsk) (debug_locks ? 
(tsk)->lockdep_depth : 0) @@ -320,6 +323,7 @@ static inline void lockdep_on(void) # define lock_acquire(l, s, t, r, c, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) +# define lock_set_subclass(l, s, i) do { } while (0) # define lockdep_init() do { } while (0) # define lockdep_info() do { } while (0) # define lockdep_init_map(lock, name, key, sub) do { (void)(key); } while (0) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -2539,6 +2539,55 @@ static int check_unlock(struct task_stru return 1; } +static int +__lock_set_subclass(struct lockdep_map *lock, + unsigned int subclass, unsigned long ip) +{ + struct task_struct *curr = current; + struct held_lock *hlock, *prev_hlock; + struct lock_class *class; + unsigned int depth; + int i; + + depth = curr->lockdep_depth; + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + prev_hlock = NULL; + for (i = depth-1; i >= 0; i--) { + hlock = curr->held_locks + i; + /* + * We must not cross into another context: + */ + if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) + break; + if (hlock->instance == lock) + goto found_it; + prev_hlock = hlock; + } + return print_unlock_inbalance_bug(curr, lock, ip); + +found_it: + class = register_lock_class(lock, subclass, 0); + hlock->class = class; + + curr->lockdep_depth = i; + curr->curr_chain_key = hlock->prev_chain_key; + + for (; i < depth; i++) { + hlock = curr->held_locks + i; + if (!__lock_acquire(hlock->instance, + hlock->class->subclass, hlock->trylock, + hlock->read, hlock->check, hlock->hardirqs_off, + hlock->acquire_ip)) + return 0; + } + + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks in a * potentially non-nested (out of order) manner. 
This is a @@ -2702,6 +2751,26 @@ static void check_flags(unsigned long fl #endif } +void +lock_set_subclass(struct lockdep_map *lock, + unsigned int subclass, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion = 1; + check_flags(flags); + if (__lock_set_subclass(lock, subclass, ip)) + check_chain_key(current); + current->lockdep_recursion = 0; + raw_local_irq_restore(flags); +} + +EXPORT_SYMBOL_GPL(lock_set_subclass); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep-prettify.patch����������������������������������������������������������������������0000664�0000764�0000764�00000003651�11041657735�015561� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] lockdep: prettify output From: Ingo Molnar <mingo@elte.hu> recent changes to the lockdep code made some of the printouts uglier - mend them. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/lockdep.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -581,7 +581,7 @@ static void print_lock_dependencies(stru static void print_kernel_version(void) { - printk("%s %.*s\n", init_utsname()->release, + printk("[ %s %.*s\n", init_utsname()->release, (int)strcspn(init_utsname()->version, " "), init_utsname()->version); } @@ -3129,13 +3129,13 @@ void __init lockdep_info(void) { printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n"); - printk("... MAX_LOCKDEP_SUBCLASSES: %lu\n", MAX_LOCKDEP_SUBCLASSES); - printk("... MAX_LOCK_DEPTH: %lu\n", MAX_LOCK_DEPTH); - printk("... MAX_LOCKDEP_KEYS: %lu\n", MAX_LOCKDEP_KEYS); - printk("... CLASSHASH_SIZE: %lu\n", CLASSHASH_SIZE); - printk("... MAX_LOCKDEP_ENTRIES: %lu\n", MAX_LOCKDEP_ENTRIES); - printk("... MAX_LOCKDEP_CHAINS: %lu\n", MAX_LOCKDEP_CHAINS); - printk("... CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE); + printk("... MAX_LOCKDEP_SUBCLASSES: %6lu\n", MAX_LOCKDEP_SUBCLASSES); + printk("... MAX_LOCK_DEPTH: %6lu\n", MAX_LOCK_DEPTH); + printk("... MAX_LOCKDEP_KEYS: %6lu\n", MAX_LOCKDEP_KEYS); + printk("... CLASSHASH_SIZE: %6lu\n", CLASSHASH_SIZE); + printk("... MAX_LOCKDEP_ENTRIES: %6lu\n", MAX_LOCKDEP_ENTRIES); + printk("... MAX_LOCKDEP_CHAINS: %6lu\n", MAX_LOCKDEP_CHAINS); + printk("... 
CHAINHASH_SIZE: %6lu\n", CHAINHASH_SIZE); printk(" memory used by lock dependency info: %lu kB\n", (sizeof(struct lock_class) * MAX_LOCKDEP_KEYS + ���������������������������������������������������������������������������������������patches/lockdep-more-entries.patch������������������������������������������������������������������0000664�0000764�0000764�00000001337�11041657734�016322� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/lockdep_internals.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/lockdep_internals.h =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep_internals.h +++ linux-2.6.24.7/kernel/lockdep_internals.h @@ -15,12 +15,12 @@ * table (if it's not there yet), and we check it for lock order * conflicts and deadlocks. */ -#define MAX_LOCKDEP_ENTRIES 8192UL +#define MAX_LOCKDEP_ENTRIES 16384UL #define MAX_LOCKDEP_KEYS_BITS 11 #define MAX_LOCKDEP_KEYS (1UL << MAX_LOCKDEP_KEYS_BITS) -#define MAX_LOCKDEP_CHAINS_BITS 14 +#define MAX_LOCKDEP_CHAINS_BITS 15 #define MAX_LOCKDEP_CHAINS (1UL << MAX_LOCKDEP_CHAINS_BITS) /* �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/loopback-revert.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000002270�11041657731�015363� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� revert this commit: commit 58f539740b1ccfc5ef4e509ec2efe82621b546e3 Author: Eric Dumazet <dada1@cosmosbay.com> Date: Fri Oct 20 00:32:41 2006 -0700 [NET]: Can use __get_cpu_var() instead of per_cpu() in loopback driver. As BHs are off in loopback_xmit(), preemption cannot occurs, so we can use __get_cpu_var() instead of per_cpu() (and avoid a preempt_enable()/preempt_disable() pair) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. 
Miller <davem@davemloft.net> --- drivers/net/loopback.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/drivers/net/loopback.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/loopback.c +++ linux-2.6.24.7/drivers/net/loopback.c @@ -154,11 +154,11 @@ static int loopback_xmit(struct sk_buff #endif dev->last_rx = jiffies; - /* it's OK to use per_cpu_ptr() because BHs are off */ pcpu_lstats = netdev_priv(dev); - lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id()); + lb_stats = per_cpu_ptr(pcpu_lstats, get_cpu()); lb_stats->bytes += skb->len; lb_stats->packets++; + put_cpu(); netif_rx(skb); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc-gtod-notrace-fix.patch������������������������������������������������������������������0000664�0000764�0000764�00000000766�11041657734�016231� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/time.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/kernel/time.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/time.c +++ linux-2.6.24.7/arch/powerpc/kernel/time.c @@ -751,7 +751,7 @@ static cycle_t rtc_read(void) return (cycle_t)get_rtc(); } -static cycle_t timebase_read(void) +static cycle_t notrace timebase_read(void) { return (cycle_t)get_tb(); } ����������patches/rcu-new-1.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000136575�11041657734�014025� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Wed Sep 26 23:41:51 2007 Date: Mon, 10 Sep 2007 11:32:08 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 1/9] RCU: Split API to permit multiple RCU implementations Work in progress, not for inclusion. This patch re-organizes the RCU code to enable multiple implementations of RCU. Users of RCU continues to include rcupdate.h and the RCU interfaces remain the same. This is in preparation for subsequently merging the preemptible RCU implementation. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 149 ++++++++++++ include/linux/rcupdate.h | 153 +++--------- kernel/Makefile | 2 kernel/rcuclassic.c | 558 +++++++++++++++++++++++++++++++++++++++++++++ kernel/rcupdate.c | 557 +------------------------------------------- 5 files changed, 777 insertions(+), 642 deletions(-) Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -0,0 +1,149 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (classic version) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright IBM Corporation, 2001 + * + * Author: Dipankar Sarma <dipankar@in.ibm.com> + * + * Based on the original work by Paul McKenney <paulmck@us.ibm.com> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU + * + */ + +#ifndef __LINUX_RCUCLASSIC_H +#define __LINUX_RCUCLASSIC_H + +#ifdef __KERNEL__ + +#include <linux/cache.h> +#include <linux/spinlock.h> +#include <linux/threads.h> +#include <linux/percpu.h> +#include <linux/cpumask.h> +#include <linux/seqlock.h> + + +/* Global control variables for rcupdate callback mechanism. */ +struct rcu_ctrlblk { + long cur; /* Current batch number. */ + long completed; /* Number of the last completed batch */ + int next_pending; /* Is the next batch already waiting? */ + + int signaled; + + spinlock_t lock ____cacheline_internodealigned_in_smp; + cpumask_t cpumask; /* CPUs that need to switch in order */ + /* for current batch to proceed. */ +} ____cacheline_internodealigned_in_smp; + +/* Is batch a before batch b ? */ +static inline int rcu_batch_before(long a, long b) +{ + return (a - b) < 0; +} + +/* Is batch a after batch b ? */ +static inline int rcu_batch_after(long a, long b) +{ + return (a - b) > 0; +} + +/* + * Per-CPU data for Read-Copy UPdate. + * nxtlist - new callbacks are added here + * curlist - current batch for which quiescent cycle started if any + */ +struct rcu_data { + /* 1) quiescent state handling : */ + long quiescbatch; /* Batch # for grace period */ + int passed_quiesc; /* User-mode/idle loop etc. 
*/ + int qs_pending; /* core waits for quiesc state */ + + /* 2) batch handling */ + long batch; /* Batch # for current RCU batch */ + struct rcu_head *nxtlist; + struct rcu_head **nxttail; + long qlen; /* # of queued callbacks */ + struct rcu_head *curlist; + struct rcu_head **curtail; + struct rcu_head *donelist; + struct rcu_head **donetail; + long blimit; /* Upper limit on a processed batch */ + int cpu; + struct rcu_head barrier; +}; + +DECLARE_PER_CPU(struct rcu_data, rcu_data); +DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); + +/* + * Increment the quiescent state counter. + * The counter is a bit degenerated: We do not need to know + * how many quiescent states passed, just if there was at least + * one since the start of the grace period. Thus just a flag. + */ +static inline void rcu_qsctr_inc(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + rdp->passed_quiesc = 1; +} +static inline void rcu_bh_qsctr_inc(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); + rdp->passed_quiesc = 1; +} + +extern int rcu_pending(int cpu); +extern int rcu_needs_cpu(int cpu); + +#define __rcu_read_lock() \ + do { \ + preempt_disable(); \ + __acquire(RCU); \ + } while (0) +#define __rcu_read_unlock() \ + do { \ + __release(RCU); \ + preempt_enable(); \ + } while (0) +#define __rcu_read_lock_bh() \ + do { \ + local_bh_disable(); \ + __acquire(RCU_BH); \ + } while (0) +#define __rcu_read_unlock_bh() \ + do { \ + __release(RCU_BH); \ + local_bh_enable(); \ + } while (0) + +#define __synchronize_sched() synchronize_rcu() + +extern void __rcu_init(void); +extern void rcu_check_callbacks(int cpu, int user); +extern void rcu_restart_cpu(int cpu); +extern long rcu_batches_completed(void); +extern long rcu_batches_completed_bh(void); + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUCLASSIC_H */ Index: linux-2.6.24.7/include/linux/rcupdate.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupdate.h +++ linux-2.6.24.7/include/linux/rcupdate.h @@ -15,7 +15,7 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. * - * Copyright (C) IBM Corporation, 2001 + * Copyright IBM Corporation, 2001 * * Author: Dipankar Sarma <dipankar@in.ibm.com> * @@ -53,6 +53,8 @@ struct rcu_head { void (*func)(struct rcu_head *head); }; +#include <linux/rcuclassic.h> + #define RCU_HEAD_INIT { .next = NULL, .func = NULL } #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT #define INIT_RCU_HEAD(ptr) do { \ @@ -61,79 +63,6 @@ struct rcu_head { -/* Global control variables for rcupdate callback mechanism. */ -struct rcu_ctrlblk { - long cur; /* Current batch number. */ - long completed; /* Number of the last completed batch */ - int next_pending; /* Is the next batch already waiting? */ - - int signaled; - - spinlock_t lock ____cacheline_internodealigned_in_smp; - cpumask_t cpumask; /* CPUs that need to switch in order */ - /* for current batch to proceed. */ -} ____cacheline_internodealigned_in_smp; - -/* Is batch a before batch b ? */ -static inline int rcu_batch_before(long a, long b) -{ - return (a - b) < 0; -} - -/* Is batch a after batch b ? */ -static inline int rcu_batch_after(long a, long b) -{ - return (a - b) > 0; -} - -/* - * Per-CPU data for Read-Copy UPdate. 
- * nxtlist - new callbacks are added here - * curlist - current batch for which quiescent cycle started if any - */ -struct rcu_data { - /* 1) quiescent state handling : */ - long quiescbatch; /* Batch # for grace period */ - int passed_quiesc; /* User-mode/idle loop etc. */ - int qs_pending; /* core waits for quiesc state */ - - /* 2) batch handling */ - long batch; /* Batch # for current RCU batch */ - struct rcu_head *nxtlist; - struct rcu_head **nxttail; - long qlen; /* # of queued callbacks */ - struct rcu_head *curlist; - struct rcu_head **curtail; - struct rcu_head *donelist; - struct rcu_head **donetail; - long blimit; /* Upper limit on a processed batch */ - int cpu; - struct rcu_head barrier; -}; - -DECLARE_PER_CPU(struct rcu_data, rcu_data); -DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); - -/* - * Increment the quiescent state counter. - * The counter is a bit degenerated: We do not need to know - * how many quiescent states passed, just if there was at least - * one since the start of the grace period. Thus just a flag. - */ -static inline void rcu_qsctr_inc(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; -} -static inline void rcu_bh_qsctr_inc(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); - rdp->passed_quiesc = 1; -} - -extern int rcu_pending(int cpu); -extern int rcu_needs_cpu(int cpu); - #ifdef CONFIG_DEBUG_LOCK_ALLOC extern struct lockdep_map rcu_lock_map; # define rcu_read_acquire() lock_acquire(&rcu_lock_map, 0, 0, 2, 1, _THIS_IP_) @@ -172,24 +101,14 @@ extern struct lockdep_map rcu_lock_map; * * It is illegal to block while in an RCU read-side critical section. */ -#define rcu_read_lock() \ - do { \ - preempt_disable(); \ - __acquire(RCU); \ - rcu_read_acquire(); \ - } while(0) +#define rcu_read_lock() __rcu_read_lock() /** * rcu_read_unlock - marks the end of an RCU read-side critical section. * * See rcu_read_lock() for more information. */ -#define rcu_read_unlock() \ - do { \ - rcu_read_release(); \ - __release(RCU); \ - preempt_enable(); \ - } while(0) +#define rcu_read_unlock() __rcu_read_unlock() /* * So where is rcu_write_lock()? It does not exist, as there is no @@ -212,24 +131,14 @@ extern struct lockdep_map rcu_lock_map; * can use just rcu_read_lock(). * */ -#define rcu_read_lock_bh() \ - do { \ - local_bh_disable(); \ - __acquire(RCU_BH); \ - rcu_read_acquire(); \ - } while(0) +#define rcu_read_lock_bh() __rcu_read_lock_bh() /* * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section * * See rcu_read_lock_bh() for more information. */ -#define rcu_read_unlock_bh() \ - do { \ - rcu_read_release(); \ - __release(RCU_BH); \ - local_bh_enable(); \ - } while(0) +#define rcu_read_unlock_bh() __rcu_read_unlock_bh() /* * Prevent the compiler from merging or refetching accesses. The compiler @@ -293,21 +202,49 @@ extern struct lockdep_map rcu_lock_map; * In "classic RCU", these two guarantees happen to be one and * the same, but can differ in realtime RCU implementations. */ -#define synchronize_sched() synchronize_rcu() +#define synchronize_sched() __synchronize_sched() -extern void rcu_init(void); -extern void rcu_check_callbacks(int cpu, int user); -extern void rcu_restart_cpu(int cpu); -extern long rcu_batches_completed(void); -extern long rcu_batches_completed_bh(void); - -/* Exported interfaces */ -extern void FASTCALL(call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *head))); +/** + * call_rcu - Queue an RCU callback for invocation after a grace period. 
+ * @head: structure to be used for queueing the RCU updates. + * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. RCU read-side critical + * sections are delimited by rcu_read_lock() and rcu_read_unlock(), + * delimited by rcu_read_lock() and rcu_read_unlock(), + * and may be nested. + */ +extern void FASTCALL(call_rcu(struct rcu_head *head, + void (*func)(struct rcu_head *head))); + +/** + * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. + * @head: structure to be used for queueing the RCU updates. + * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. call_rcu_bh() assumes + * that the read-side critical sections end on completion of a softirq + * handler. This means that read-side critical sections in process + * context must not be interrupted by softirqs. This interface is to be + * used when most of the read-side critical sections are in softirq context. + * RCU read-side critical sections are delimited by rcu_read_lock() and + * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh() + * and rcu_read_unlock_bh(), if in process context. These may be nested. + */ extern void FASTCALL(call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *head))); + +/* Exported common interfaces */ extern void synchronize_rcu(void); extern void rcu_barrier(void); +/* Internal to kernel */ +extern void rcu_init(void); +extern void rcu_check_callbacks(int cpu, int user); + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o user_namespace.o \ signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o extable.o params.o posix-timers.o \ + rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ utsname.o notifier.o Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -0,0 +1,558 @@ +/* + * Read-Copy Update mechanism for mutual exclusion + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
+ * + * Copyright IBM Corporation, 2001 + * + * Authors: Dipankar Sarma <dipankar@in.ibm.com> + * Manfred Spraul <manfred@colorfullife.com> + * + * Based on the original work by Paul McKenney <paulmck@us.ibm.com> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU + * + */ +#include <linux/types.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/spinlock.h> +#include <linux/smp.h> +#include <linux/rcupdate.h> +#include <linux/interrupt.h> +#include <linux/sched.h> +#include <asm/atomic.h> +#include <linux/bitops.h> +#include <linux/module.h> +#include <linux/completion.h> +#include <linux/moduleparam.h> +#include <linux/percpu.h> +#include <linux/notifier.h> +/* #include <linux/rcupdate.h> @@@ */ +#include <linux/cpu.h> +#include <linux/mutex.h> + +/* Definition for rcupdate control block. */ +static struct rcu_ctrlblk rcu_ctrlblk = { + .cur = -300, + .completed = -300, + .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock), + .cpumask = CPU_MASK_NONE, +}; +static struct rcu_ctrlblk rcu_bh_ctrlblk = { + .cur = -300, + .completed = -300, + .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock), + .cpumask = CPU_MASK_NONE, +}; + +DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L }; +DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L }; + +/* Fake initialization required by compiler */ +static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL}; +static int blimit = 10; +static int qhimark = 10000; +static int qlowmark = 100; + +#ifdef CONFIG_SMP +static void force_quiescent_state(struct rcu_data *rdp, + struct rcu_ctrlblk *rcp) +{ + int cpu; + cpumask_t cpumask; + set_need_resched(); + if (unlikely(!rcp->signaled)) { + rcp->signaled = 1; + /* + * Don't send IPI to itself. With irqs disabled, + * rdp->cpu is the current cpu. + */ + cpumask = rcp->cpumask; + cpu_clear(rdp->cpu, cpumask); + for_each_cpu_mask(cpu, cpumask) + smp_send_reschedule(cpu); + } +} +#else +static inline void force_quiescent_state(struct rcu_data *rdp, + struct rcu_ctrlblk *rcp) +{ + set_need_resched(); +} +#endif + +/** + * call_rcu - Queue an RCU callback for invocation after a grace period. + * @head: structure to be used for queueing the RCU updates. + * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. RCU read-side critical + * sections are delimited by rcu_read_lock() and rcu_read_unlock(), + * and may be nested. + */ +void fastcall call_rcu(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) +{ + unsigned long flags; + struct rcu_data *rdp; + + head->func = func; + head->next = NULL; + local_irq_save(flags); + rdp = &__get_cpu_var(rcu_data); + *rdp->nxttail = head; + rdp->nxttail = &head->next; + if (unlikely(++rdp->qlen > qhimark)) { + rdp->blimit = INT_MAX; + force_quiescent_state(rdp, &rcu_ctrlblk); + } + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(call_rcu); + +/** + * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. + * @head: structure to be used for queueing the RCU updates. 
+ * @func: actual update function to be invoked after the grace period + * + * The update function will be invoked some time after a full grace + * period elapses, in other words after all currently executing RCU + * read-side critical sections have completed. call_rcu_bh() assumes + * that the read-side critical sections end on completion of a softirq + * handler. This means that read-side critical sections in process + * context must not be interrupted by softirqs. This interface is to be + * used when most of the read-side critical sections are in softirq context. + * RCU read-side critical sections are delimited by rcu_read_lock() and + * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh() + * and rcu_read_unlock_bh(), if in process context. These may be nested. + */ +void fastcall call_rcu_bh(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) +{ + unsigned long flags; + struct rcu_data *rdp; + + head->func = func; + head->next = NULL; + local_irq_save(flags); + rdp = &__get_cpu_var(rcu_bh_data); + *rdp->nxttail = head; + rdp->nxttail = &head->next; + + if (unlikely(++rdp->qlen > qhimark)) { + rdp->blimit = INT_MAX; + force_quiescent_state(rdp, &rcu_bh_ctrlblk); + } + + local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(call_rcu_bh); + +/* + * Return the number of RCU batches processed thus far. Useful + * for debug and statistics. + */ +long rcu_batches_completed(void) +{ + return rcu_ctrlblk.completed; +} +EXPORT_SYMBOL_GPL(rcu_batches_completed); + +/* + * Return the number of RCU batches processed thus far. Useful + * for debug and statistics. + */ +long rcu_batches_completed_bh(void) +{ + return rcu_bh_ctrlblk.completed; +} +EXPORT_SYMBOL_GPL(rcu_batches_completed_bh); + +/* + * Invoke the completed RCU callbacks. They are expected to be in + * a per-cpu list. + */ +static void rcu_do_batch(struct rcu_data *rdp) +{ + struct rcu_head *next, *list; + int count = 0; + + list = rdp->donelist; + while (list) { + next = list->next; + prefetch(next); + list->func(list); + list = next; + if (++count >= rdp->blimit) + break; + } + rdp->donelist = list; + + local_irq_disable(); + rdp->qlen -= count; + local_irq_enable(); + if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark) + rdp->blimit = blimit; + + if (!rdp->donelist) + rdp->donetail = &rdp->donelist; + else + tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu)); +} + +/* + * Grace period handling: + * The grace period handling consists out of two steps: + * - A new grace period is started. + * This is done by rcu_start_batch. The start is not broadcasted to + * all cpus, they must pick this up by comparing rcp->cur with + * rdp->quiescbatch. All cpus are recorded in the + * rcu_ctrlblk.cpumask bitmap. + * - All cpus must go through a quiescent state. + * Since the start of the grace period is not broadcasted, at least two + * calls to rcu_check_quiescent_state are required: + * The first call just notices that a new grace period is running. The + * following calls check if there was a quiescent state since the beginning + * of the grace period. If so, it updates rcu_ctrlblk.cpumask. If + * the bitmap is empty, then the grace period is completed. + * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace + * period (if necessary). + */ +/* + * Register a new batch of callbacks, and start it up if there is currently no + * active batch and the batch to be registered has not already occurred. + * Caller must hold rcu_ctrlblk.lock. 
+ */ +static void rcu_start_batch(struct rcu_ctrlblk *rcp) +{ + if (rcp->next_pending && + rcp->completed == rcp->cur) { + rcp->next_pending = 0; + /* + * next_pending == 0 must be visible in + * __rcu_process_callbacks() before it can see new value of cur. + */ + smp_wmb(); + rcp->cur++; + + /* + * Accessing nohz_cpu_mask before incrementing rcp->cur needs a + * Barrier Otherwise it can cause tickless idle CPUs to be + * included in rcp->cpumask, which will extend graceperiods + * unnecessarily. + */ + smp_mb(); + cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask); + + rcp->signaled = 0; + } +} + +/* + * cpu went through a quiescent state since the beginning of the grace period. + * Clear it from the cpu mask and complete the grace period if it was the last + * cpu. Start another grace period if someone has further entries pending + */ +static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp) +{ + cpu_clear(cpu, rcp->cpumask); + if (cpus_empty(rcp->cpumask)) { + /* batch completed ! */ + rcp->completed = rcp->cur; + rcu_start_batch(rcp); + } +} + +/* + * Check if the cpu has gone through a quiescent state (say context + * switch). If so and if it already hasn't done so in this RCU + * quiescent cycle, then indicate that it has done so. + */ +static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp, + struct rcu_data *rdp) +{ + if (rdp->quiescbatch != rcp->cur) { + /* start new grace period: */ + rdp->qs_pending = 1; + rdp->passed_quiesc = 0; + rdp->quiescbatch = rcp->cur; + return; + } + + /* Grace period already completed for this cpu? + * qs_pending is checked instead of the actual bitmap to avoid + * cacheline trashing. + */ + if (!rdp->qs_pending) + return; + + /* + * Was there a quiescent state since the beginning of the grace + * period? If no, then exit and wait for the next call. + */ + if (!rdp->passed_quiesc) + return; + rdp->qs_pending = 0; + + spin_lock(&rcp->lock); + /* + * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync + * during cpu startup. Ignore the quiescent state. + */ + if (likely(rdp->quiescbatch == rcp->cur)) + cpu_quiet(rdp->cpu, rcp); + + spin_unlock(&rcp->lock); +} + + +#ifdef CONFIG_HOTPLUG_CPU + +/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing + * locking requirements, the list it's pulling from has to belong to a cpu + * which is dead and hence not processing interrupts. 
+ */ +static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list, + struct rcu_head **tail) +{ + local_irq_disable(); + *this_rdp->nxttail = list; + if (list) + this_rdp->nxttail = tail; + local_irq_enable(); +} + +static void __rcu_offline_cpu(struct rcu_data *this_rdp, + struct rcu_ctrlblk *rcp, struct rcu_data *rdp) +{ + /* if the cpu going offline owns the grace period + * we can block indefinitely waiting for it, so flush + * it here + */ + spin_lock_bh(&rcp->lock); + if (rcp->cur != rcp->completed) + cpu_quiet(rdp->cpu, rcp); + spin_unlock_bh(&rcp->lock); + rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail); + rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail); + rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); +} + +static void rcu_offline_cpu(int cpu) +{ + struct rcu_data *this_rdp = &get_cpu_var(rcu_data); + struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data); + + __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, + &per_cpu(rcu_data, cpu)); + __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, + &per_cpu(rcu_bh_data, cpu)); + put_cpu_var(rcu_data); + put_cpu_var(rcu_bh_data); + tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu); +} + +#else + +static void rcu_offline_cpu(int cpu) +{ +} + +#endif + +/* + * This does the RCU processing work from tasklet context. + */ +static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, + struct rcu_data *rdp) +{ + if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) { + *rdp->donetail = rdp->curlist; + rdp->donetail = rdp->curtail; + rdp->curlist = NULL; + rdp->curtail = &rdp->curlist; + } + + if (rdp->nxtlist && !rdp->curlist) { + local_irq_disable(); + rdp->curlist = rdp->nxtlist; + rdp->curtail = rdp->nxttail; + rdp->nxtlist = NULL; + rdp->nxttail = &rdp->nxtlist; + local_irq_enable(); + + /* + * start the next batch of callbacks + */ + + /* determine batch number */ + rdp->batch = rcp->cur + 1; + /* see the comment and corresponding wmb() in + * the rcu_start_batch() + */ + smp_rmb(); + + if (!rcp->next_pending) { + /* and start it/schedule start if it's a new batch */ + spin_lock(&rcp->lock); + rcp->next_pending = 1; + rcu_start_batch(rcp); + spin_unlock(&rcp->lock); + } + } + + rcu_check_quiescent_state(rcp, rdp); + if (rdp->donelist) + rcu_do_batch(rdp); +} + +static void rcu_process_callbacks(unsigned long unused) +{ + __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data)); + __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data)); +} + +static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp) +{ + /* This cpu has pending rcu entries and the grace period + * for them has completed. + */ + if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) + return 1; + + /* This cpu has no pending entries, but there are new entries */ + if (!rdp->curlist && rdp->nxtlist) + return 1; + + /* This cpu has finished callbacks to invoke */ + if (rdp->donelist) + return 1; + + /* The rcu core waits for a quiescent state from the cpu */ + if (rdp->quiescbatch != rcp->cur || rdp->qs_pending) + return 1; + + /* nothing to do */ + return 0; +} + +/* + * Check to see if there is any immediate RCU-related work to be done + * by the current CPU, returning 1 if so. This function is part of the + * RCU implementation; it is -not- an exported member of the RCU API. 
+ */ +int rcu_pending(int cpu) +{ + return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) || + __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)); +} + +/* + * Check to see if any future RCU-related work will need to be done + * by the current CPU, even if none need be done immediately, returning + * 1 if so. This function is part of the RCU implementation; it is -not- + * an exported member of the RCU API. + */ +int rcu_needs_cpu(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu); + + return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu)); +} + +void rcu_check_callbacks(int cpu, int user) +{ + if (user || + (idle_cpu(cpu) && !in_softirq() && + hardirq_count() <= (1 << HARDIRQ_SHIFT))) { + rcu_qsctr_inc(cpu); + rcu_bh_qsctr_inc(cpu); + } else if (!in_softirq()) + rcu_bh_qsctr_inc(cpu); + tasklet_schedule(&per_cpu(rcu_tasklet, cpu)); +} + +static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp, + struct rcu_data *rdp) +{ + memset(rdp, 0, sizeof(*rdp)); + rdp->curtail = &rdp->curlist; + rdp->nxttail = &rdp->nxtlist; + rdp->donetail = &rdp->donelist; + rdp->quiescbatch = rcp->completed; + rdp->qs_pending = 0; + rdp->cpu = cpu; + rdp->blimit = blimit; +} + +static void __cpuinit rcu_online_cpu(int cpu) +{ + struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu); + + rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp); + rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp); + tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL); +} + +static int __cpuinit rcu_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + rcu_online_cpu(cpu); + break; + case CPU_DEAD: + case CPU_DEAD_FROZEN: + rcu_offline_cpu(cpu); + break; + default: + break; + } + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata rcu_nb = { + .notifier_call = rcu_cpu_notify, +}; + +/* + * Initializes rcu mechanism. Assumed to be called early. + * That is before local timer(SMP) or jiffie timer (uniproc) is setup. + * Note that rcu_qsctr and friends are implicitly + * initialized due to the choice of ``0'' for RCU_CTR_INVALID. + */ +void __init __rcu_init(void) +{ + rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, + (void *)(long)smp_processor_id()); + /* Register notifier for non-boot CPUs */ + register_cpu_notifier(&rcu_nb); +} + +module_param(blimit, int, 0); +module_param(qhimark, int, 0); +module_param(qlowmark, int, 0); Index: linux-2.6.24.7/kernel/rcupdate.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupdate.c +++ linux-2.6.24.7/kernel/rcupdate.c @@ -15,7 +15,7 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
* - * Copyright (C) IBM Corporation, 2001 + * Copyright IBM Corporation, 2001 * * Authors: Dipankar Sarma <dipankar@in.ibm.com> * Manfred Spraul <manfred@colorfullife.com> @@ -35,14 +35,12 @@ #include <linux/init.h> #include <linux/spinlock.h> #include <linux/smp.h> -#include <linux/rcupdate.h> #include <linux/interrupt.h> #include <linux/sched.h> #include <asm/atomic.h> #include <linux/bitops.h> #include <linux/module.h> #include <linux/completion.h> -#include <linux/moduleparam.h> #include <linux/percpu.h> #include <linux/notifier.h> #include <linux/cpu.h> @@ -56,144 +54,46 @@ struct lockdep_map rcu_lock_map = EXPORT_SYMBOL_GPL(rcu_lock_map); #endif -/* Definition for rcupdate control block. */ -static struct rcu_ctrlblk rcu_ctrlblk = { - .cur = -300, - .completed = -300, - .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock), - .cpumask = CPU_MASK_NONE, -}; -static struct rcu_ctrlblk rcu_bh_ctrlblk = { - .cur = -300, - .completed = -300, - .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock), - .cpumask = CPU_MASK_NONE, +struct rcu_synchronize { + struct rcu_head head; + struct completion completion; }; -DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L }; -DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L }; - -/* Fake initialization required by compiler */ -static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL}; -static int blimit = 10; -static int qhimark = 10000; -static int qlowmark = 100; - +static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL}; static atomic_t rcu_barrier_cpu_count; static DEFINE_MUTEX(rcu_barrier_mutex); static struct completion rcu_barrier_completion; -#ifdef CONFIG_SMP -static void force_quiescent_state(struct rcu_data *rdp, - struct rcu_ctrlblk *rcp) -{ - int cpu; - cpumask_t cpumask; - set_need_resched(); - if (unlikely(!rcp->signaled)) { - rcp->signaled = 1; - /* - * Don't send IPI to itself. With irqs disabled, - * rdp->cpu is the current cpu. - */ - cpumask = rcp->cpumask; - cpu_clear(rdp->cpu, cpumask); - for_each_cpu_mask(cpu, cpumask) - smp_send_reschedule(cpu); - } -} -#else -static inline void force_quiescent_state(struct rcu_data *rdp, - struct rcu_ctrlblk *rcp) +/* Because of FASTCALL declaration of complete, we use this wrapper */ +static void wakeme_after_rcu(struct rcu_head *head) { - set_need_resched(); + struct rcu_synchronize *rcu; + + rcu = container_of(head, struct rcu_synchronize, head); + complete(&rcu->completion); } -#endif /** - * call_rcu - Queue an RCU callback for invocation after a grace period. - * @head: structure to be used for queueing the RCU updates. - * @func: actual update function to be invoked after the grace period + * synchronize_rcu - wait until a grace period has elapsed. * - * The update function will be invoked some time after a full grace - * period elapses, in other words after all currently executing RCU + * Control will return to the caller some time after a full grace + * period has elapsed, in other words after all currently executing RCU * read-side critical sections have completed. RCU read-side critical * sections are delimited by rcu_read_lock() and rcu_read_unlock(), * and may be nested. 
*/ -void fastcall call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *rcu)) -{ - unsigned long flags; - struct rcu_data *rdp; - - head->func = func; - head->next = NULL; - local_irq_save(flags); - rdp = &__get_cpu_var(rcu_data); - *rdp->nxttail = head; - rdp->nxttail = &head->next; - if (unlikely(++rdp->qlen > qhimark)) { - rdp->blimit = INT_MAX; - force_quiescent_state(rdp, &rcu_ctrlblk); - } - local_irq_restore(flags); -} - -/** - * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. - * @head: structure to be used for queueing the RCU updates. - * @func: actual update function to be invoked after the grace period - * - * The update function will be invoked some time after a full grace - * period elapses, in other words after all currently executing RCU - * read-side critical sections have completed. call_rcu_bh() assumes - * that the read-side critical sections end on completion of a softirq - * handler. This means that read-side critical sections in process - * context must not be interrupted by softirqs. This interface is to be - * used when most of the read-side critical sections are in softirq context. - * RCU read-side critical sections are delimited by rcu_read_lock() and - * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh() - * and rcu_read_unlock_bh(), if in process context. These may be nested. - */ -void fastcall call_rcu_bh(struct rcu_head *head, - void (*func)(struct rcu_head *rcu)) +void synchronize_rcu(void) { - unsigned long flags; - struct rcu_data *rdp; - - head->func = func; - head->next = NULL; - local_irq_save(flags); - rdp = &__get_cpu_var(rcu_bh_data); - *rdp->nxttail = head; - rdp->nxttail = &head->next; - - if (unlikely(++rdp->qlen > qhimark)) { - rdp->blimit = INT_MAX; - force_quiescent_state(rdp, &rcu_bh_ctrlblk); - } - - local_irq_restore(flags); -} + struct rcu_synchronize rcu; -/* - * Return the number of RCU batches processed thus far. Useful - * for debug and statistics. - */ -long rcu_batches_completed(void) -{ - return rcu_ctrlblk.completed; -} + init_completion(&rcu.completion); + /* Will wake me after RCU finished */ + call_rcu(&rcu.head, wakeme_after_rcu); -/* - * Return the number of RCU batches processed thus far. Useful - * for debug and statistics. - */ -long rcu_batches_completed_bh(void) -{ - return rcu_bh_ctrlblk.completed; + /* Wait for it */ + wait_for_completion(&rcu.completion); } +EXPORT_SYMBOL_GPL(synchronize_rcu); static void rcu_barrier_callback(struct rcu_head *notused) { @@ -207,10 +107,8 @@ static void rcu_barrier_callback(struct static void rcu_barrier_func(void *notused) { int cpu = smp_processor_id(); - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - struct rcu_head *head; + struct rcu_head *head = &per_cpu(rcu_barrier_head, cpu); - head = &rdp->barrier; atomic_inc(&rcu_barrier_cpu_count); call_rcu(head, rcu_barrier_callback); } @@ -231,414 +129,7 @@ void rcu_barrier(void) } EXPORT_SYMBOL_GPL(rcu_barrier); -/* - * Invoke the completed RCU callbacks. They are expected to be in - * a per-cpu list. 
- */ -static void rcu_do_batch(struct rcu_data *rdp) -{ - struct rcu_head *next, *list; - int count = 0; - - list = rdp->donelist; - while (list) { - next = list->next; - prefetch(next); - list->func(list); - list = next; - if (++count >= rdp->blimit) - break; - } - rdp->donelist = list; - - local_irq_disable(); - rdp->qlen -= count; - local_irq_enable(); - if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark) - rdp->blimit = blimit; - - if (!rdp->donelist) - rdp->donetail = &rdp->donelist; - else - tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu)); -} - -/* - * Grace period handling: - * The grace period handling consists out of two steps: - * - A new grace period is started. - * This is done by rcu_start_batch. The start is not broadcasted to - * all cpus, they must pick this up by comparing rcp->cur with - * rdp->quiescbatch. All cpus are recorded in the - * rcu_ctrlblk.cpumask bitmap. - * - All cpus must go through a quiescent state. - * Since the start of the grace period is not broadcasted, at least two - * calls to rcu_check_quiescent_state are required: - * The first call just notices that a new grace period is running. The - * following calls check if there was a quiescent state since the beginning - * of the grace period. If so, it updates rcu_ctrlblk.cpumask. If - * the bitmap is empty, then the grace period is completed. - * rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace - * period (if necessary). - */ -/* - * Register a new batch of callbacks, and start it up if there is currently no - * active batch and the batch to be registered has not already occurred. - * Caller must hold rcu_ctrlblk.lock. - */ -static void rcu_start_batch(struct rcu_ctrlblk *rcp) -{ - if (rcp->next_pending && - rcp->completed == rcp->cur) { - rcp->next_pending = 0; - /* - * next_pending == 0 must be visible in - * __rcu_process_callbacks() before it can see new value of cur. - */ - smp_wmb(); - rcp->cur++; - - /* - * Accessing nohz_cpu_mask before incrementing rcp->cur needs a - * Barrier Otherwise it can cause tickless idle CPUs to be - * included in rcp->cpumask, which will extend graceperiods - * unnecessarily. - */ - smp_mb(); - cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask); - - rcp->signaled = 0; - } -} - -/* - * cpu went through a quiescent state since the beginning of the grace period. - * Clear it from the cpu mask and complete the grace period if it was the last - * cpu. Start another grace period if someone has further entries pending - */ -static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp) -{ - cpu_clear(cpu, rcp->cpumask); - if (cpus_empty(rcp->cpumask)) { - /* batch completed ! */ - rcp->completed = rcp->cur; - rcu_start_batch(rcp); - } -} - -/* - * Check if the cpu has gone through a quiescent state (say context - * switch). If so and if it already hasn't done so in this RCU - * quiescent cycle, then indicate that it has done so. - */ -static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp, - struct rcu_data *rdp) -{ - if (rdp->quiescbatch != rcp->cur) { - /* start new grace period: */ - rdp->qs_pending = 1; - rdp->passed_quiesc = 0; - rdp->quiescbatch = rcp->cur; - return; - } - - /* Grace period already completed for this cpu? - * qs_pending is checked instead of the actual bitmap to avoid - * cacheline trashing. - */ - if (!rdp->qs_pending) - return; - - /* - * Was there a quiescent state since the beginning of the grace - * period? If no, then exit and wait for the next call. 
- */ - if (!rdp->passed_quiesc) - return; - rdp->qs_pending = 0; - - spin_lock(&rcp->lock); - /* - * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync - * during cpu startup. Ignore the quiescent state. - */ - if (likely(rdp->quiescbatch == rcp->cur)) - cpu_quiet(rdp->cpu, rcp); - - spin_unlock(&rcp->lock); -} - - -#ifdef CONFIG_HOTPLUG_CPU - -/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing - * locking requirements, the list it's pulling from has to belong to a cpu - * which is dead and hence not processing interrupts. - */ -static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list, - struct rcu_head **tail) -{ - local_irq_disable(); - *this_rdp->nxttail = list; - if (list) - this_rdp->nxttail = tail; - local_irq_enable(); -} - -static void __rcu_offline_cpu(struct rcu_data *this_rdp, - struct rcu_ctrlblk *rcp, struct rcu_data *rdp) -{ - /* if the cpu going offline owns the grace period - * we can block indefinitely waiting for it, so flush - * it here - */ - spin_lock_bh(&rcp->lock); - if (rcp->cur != rcp->completed) - cpu_quiet(rdp->cpu, rcp); - spin_unlock_bh(&rcp->lock); - rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail); - rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail); - rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); -} - -static void rcu_offline_cpu(int cpu) -{ - struct rcu_data *this_rdp = &get_cpu_var(rcu_data); - struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data); - - __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, - &per_cpu(rcu_data, cpu)); - __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, - &per_cpu(rcu_bh_data, cpu)); - put_cpu_var(rcu_data); - put_cpu_var(rcu_bh_data); - tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu); -} - -#else - -static void rcu_offline_cpu(int cpu) -{ -} - -#endif - -/* - * This does the RCU processing work from tasklet context. - */ -static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, - struct rcu_data *rdp) -{ - if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) { - *rdp->donetail = rdp->curlist; - rdp->donetail = rdp->curtail; - rdp->curlist = NULL; - rdp->curtail = &rdp->curlist; - } - - if (rdp->nxtlist && !rdp->curlist) { - local_irq_disable(); - rdp->curlist = rdp->nxtlist; - rdp->curtail = rdp->nxttail; - rdp->nxtlist = NULL; - rdp->nxttail = &rdp->nxtlist; - local_irq_enable(); - - /* - * start the next batch of callbacks - */ - - /* determine batch number */ - rdp->batch = rcp->cur + 1; - /* see the comment and corresponding wmb() in - * the rcu_start_batch() - */ - smp_rmb(); - - if (!rcp->next_pending) { - /* and start it/schedule start if it's a new batch */ - spin_lock(&rcp->lock); - rcp->next_pending = 1; - rcu_start_batch(rcp); - spin_unlock(&rcp->lock); - } - } - - rcu_check_quiescent_state(rcp, rdp); - if (rdp->donelist) - rcu_do_batch(rdp); -} - -static void rcu_process_callbacks(unsigned long unused) -{ - __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data)); - __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data)); -} - -static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp) -{ - /* This cpu has pending rcu entries and the grace period - * for them has completed. 
- */ - if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) - return 1; - - /* This cpu has no pending entries, but there are new entries */ - if (!rdp->curlist && rdp->nxtlist) - return 1; - - /* This cpu has finished callbacks to invoke */ - if (rdp->donelist) - return 1; - - /* The rcu core waits for a quiescent state from the cpu */ - if (rdp->quiescbatch != rcp->cur || rdp->qs_pending) - return 1; - - /* nothing to do */ - return 0; -} - -/* - * Check to see if there is any immediate RCU-related work to be done - * by the current CPU, returning 1 if so. This function is part of the - * RCU implementation; it is -not- an exported member of the RCU API. - */ -int rcu_pending(int cpu) -{ - return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) || - __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)); -} - -/* - * Check to see if any future RCU-related work will need to be done - * by the current CPU, even if none need be done immediately, returning - * 1 if so. This function is part of the RCU implementation; it is -not- - * an exported member of the RCU API. - */ -int rcu_needs_cpu(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu); - - return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu)); -} - -void rcu_check_callbacks(int cpu, int user) -{ - if (user || - (idle_cpu(cpu) && !in_softirq() && - hardirq_count() <= (1 << HARDIRQ_SHIFT))) { - rcu_qsctr_inc(cpu); - rcu_bh_qsctr_inc(cpu); - } else if (!in_softirq()) - rcu_bh_qsctr_inc(cpu); - tasklet_schedule(&per_cpu(rcu_tasklet, cpu)); -} - -static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp, - struct rcu_data *rdp) -{ - memset(rdp, 0, sizeof(*rdp)); - rdp->curtail = &rdp->curlist; - rdp->nxttail = &rdp->nxtlist; - rdp->donetail = &rdp->donelist; - rdp->quiescbatch = rcp->completed; - rdp->qs_pending = 0; - rdp->cpu = cpu; - rdp->blimit = blimit; -} - -static void __cpuinit rcu_online_cpu(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu); - - rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp); - rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp); - tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL); -} - -static int __cpuinit rcu_cpu_notify(struct notifier_block *self, - unsigned long action, void *hcpu) -{ - long cpu = (long)hcpu; - switch (action) { - case CPU_UP_PREPARE: - case CPU_UP_PREPARE_FROZEN: - rcu_online_cpu(cpu); - break; - case CPU_DEAD: - case CPU_DEAD_FROZEN: - rcu_offline_cpu(cpu); - break; - default: - break; - } - return NOTIFY_OK; -} - -static struct notifier_block __cpuinitdata rcu_nb = { - .notifier_call = rcu_cpu_notify, -}; - -/* - * Initializes rcu mechanism. Assumed to be called early. - * That is before local timer(SMP) or jiffie timer (uniproc) is setup. - * Note that rcu_qsctr and friends are implicitly - * initialized due to the choice of ``0'' for RCU_CTR_INVALID. 
- */ void __init rcu_init(void) { - rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, - (void *)(long)smp_processor_id()); - /* Register notifier for non-boot CPUs */ - register_cpu_notifier(&rcu_nb); + __rcu_init(); } - -struct rcu_synchronize { - struct rcu_head head; - struct completion completion; -}; - -/* Because of FASTCALL declaration of complete, we use this wrapper */ -static void wakeme_after_rcu(struct rcu_head *head) -{ - struct rcu_synchronize *rcu; - - rcu = container_of(head, struct rcu_synchronize, head); - complete(&rcu->completion); -} - -/** - * synchronize_rcu - wait until a grace period has elapsed. - * - * Control will return to the caller some time after a full grace - * period has elapsed, in other words after all currently executing RCU - * read-side critical sections have completed. RCU read-side critical - * sections are delimited by rcu_read_lock() and rcu_read_unlock(), - * and may be nested. - * - * If your read-side code is not protected by rcu_read_lock(), do -not- - * use synchronize_rcu(). - */ -void synchronize_rcu(void) -{ - struct rcu_synchronize rcu; - - init_completion(&rcu.completion); - /* Will wake me after RCU finished */ - call_rcu(&rcu.head, wakeme_after_rcu); - - /* Wait for it */ - wait_for_completion(&rcu.completion); -} - -module_param(blimit, int, 0); -module_param(qhimark, int, 0); -module_param(qlowmark, int, 0); -EXPORT_SYMBOL_GPL(rcu_batches_completed); -EXPORT_SYMBOL_GPL(rcu_batches_completed_bh); -EXPORT_SYMBOL_GPL(call_rcu); -EXPORT_SYMBOL_GPL(call_rcu_bh); -EXPORT_SYMBOL_GPL(synchronize_rcu); �����������������������������������������������������������������������������������������������������������������������������������patches/rcu-new-2.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000004655�11041657734�014017� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Thu Sep 27 00:03:00 2007 Date: Mon, 10 Sep 2007 11:33:05 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 2/9] RCU: Fix barriers Work in progress, not for inclusion. Fix rcu_barrier() to work properly in preemptive kernel environment. Also, the ordering of callback must be preserved while moving callbacks to another CPU during CPU hotplug. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> --- kernel/rcuclassic.c | 2 +- kernel/rcupdate.c | 10 ++++++++++ 2 files changed, 11 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcuclassic.c +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -353,9 +353,9 @@ static void __rcu_offline_cpu(struct rcu if (rcp->cur != rcp->completed) cpu_quiet(rdp->cpu, rcp); spin_unlock_bh(&rcp->lock); + rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail); rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail); - rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); } static void rcu_offline_cpu(int cpu) Index: linux-2.6.24.7/kernel/rcupdate.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupdate.c +++ linux-2.6.24.7/kernel/rcupdate.c @@ -123,7 +123,17 @@ void rcu_barrier(void) mutex_lock(&rcu_barrier_mutex); init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 0); + /* + * The queueing of callbacks in all CPUs must be atomic with + * respect to RCU, otherwise one CPU may queue a callback, + * wait for a grace period, decrement barrier count and call + * complete(), while other CPUs have not yet queued anything. + * So, we need to make sure that grace periods cannot complete + * until all the callbacks are queued. + */ + rcu_read_lock(); on_each_cpu(rcu_barrier_func, NULL, 0, 1); + rcu_read_unlock(); wait_for_completion(&rcu_barrier_completion); mutex_unlock(&rcu_barrier_mutex); } �����������������������������������������������������������������������������������patches/rcu-new-3.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000132472�11041657731�014014� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Thu Sep 27 00:03:19 2007 Date: Mon, 10 Sep 2007 11:34:12 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 3/9] RCU: Preemptible RCU Work in progress, not for inclusion. This patch implements a new version of RCU which allows its read-side critical sections to be preempted. It uses a set of counter pairs to keep track of the read-side critical sections and flips them when all tasks exit read-side critical section. The details of this implementation can be found in this paper - http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf This patch was developed as a part of the -rt kernel development and meant to provide better latencies when read-side critical sections of RCU don't disable preemption. As a consequence of keeping track of RCU readers, the readers have a slight overhead (optimizations in the paper). This implementation co-exists with the "classic" RCU implementations and can be switched to at compiler. 
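Illustrative aside (not part of the patch): the counter-pair scheme described above is easiest to see in a stripped-down, single-threaded model. The sketch below is only a model; completed and flipctr mirror rcu_ctrlblk.completed and one CPU's rcu_flipctr pair added further down, while model_read_lock(), model_read_unlock(), flip(), old_readers_done() and struct reader are invented names for the example. The real kernel/rcupreempt.c code is per-CPU, IRQ-safe and NMI-aware; the model only shows why draining the "last" counter to zero lets the grace-period machinery advance.

/*
 * Minimal single-threaded model of the counter-pair idea: readers bump
 * the counter selected by the current grace-period parity, remember
 * which slot they bumped, and drop it on unlock.  A "flip" advances the
 * parity; the pre-flip counter draining to zero means every reader that
 * started before the flip has finished.
 */
#include <assert.h>
#include <stdio.h>

static long completed;			/* models rcu_ctrlblk.completed    */
static int flipctr[2];			/* models one CPU's rcu_flipctr    */

struct reader {
	int idx;			/* models task->rcu_flipctr_idx    */
};

static void model_read_lock(struct reader *r)
{
	r->idx = completed & 0x1;	/* parity picks the current slot   */
	flipctr[r->idx]++;
}

static void model_read_unlock(struct reader *r)
{
	flipctr[r->idx]--;		/* drop the slot we incremented    */
}

static void flip(void)
{
	completed++;			/* new readers use the other slot  */
}

static int old_readers_done(void)
{
	int last = !(completed & 0x1);	/* slot in use before the flip     */

	return flipctr[last] == 0;
}

int main(void)
{
	struct reader a, b;

	model_read_lock(&a);		/* a enters in slot 0              */
	flip();				/* readers from now on use slot 1  */
	model_read_lock(&b);		/* b lands in slot 1               */
	assert(!old_readers_done());	/* a still pins the old slot       */
	model_read_unlock(&a);
	assert(old_readers_done());	/* pre-flip readers have drained   */
	model_read_unlock(&b);
	printf("counter flips completed: %ld\n", completed);
	return 0;
}

In the real patch the same drain test is the rcu_try_flip_waitzero() stage, with the "last" counters summed across every possible CPU's rcu_flipctr pair.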
Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for the preemptible variant of RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ) Signed-off-by: Paul McKenney <paulmck@us.ibm.com> --- include/linux/interrupt.h | 1 include/linux/rcuclassic.h | 2 include/linux/rcupdate.h | 7 include/linux/rcupreempt.h | 78 +++ include/linux/rcupreempt_trace.h | 100 +++++ include/linux/sched.h | 5 kernel/Kconfig.preempt | 39 + kernel/Makefile | 7 kernel/fork.c | 4 kernel/rcupreempt.c | 766 +++++++++++++++++++++++++++++++++++++++ kernel/rcupreempt_trace.c | 330 ++++++++++++++++ 11 files changed, 1336 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -256,6 +256,7 @@ enum #ifdef CONFIG_HIGH_RES_TIMERS HRTIMER_SOFTIRQ, #endif + RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */ }; /* softirq mask and active fields moved to irq_cpustat_t in Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -142,8 +142,6 @@ extern int rcu_needs_cpu(int cpu); extern void __rcu_init(void); extern void rcu_check_callbacks(int cpu, int user); extern void rcu_restart_cpu(int cpu); -extern long rcu_batches_completed(void); -extern long rcu_batches_completed_bh(void); #endif /* __KERNEL__ */ #endif /* __LINUX_RCUCLASSIC_H */ Index: linux-2.6.24.7/include/linux/rcupdate.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupdate.h +++ linux-2.6.24.7/include/linux/rcupdate.h @@ -53,7 +53,11 @@ struct rcu_head { void (*func)(struct rcu_head *head); }; +#ifdef CONFIG_CLASSIC_RCU #include <linux/rcuclassic.h> +#else /* #ifdef CONFIG_CLASSIC_RCU */ +#include <linux/rcupreempt.h> +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */ #define RCU_HEAD_INIT { .next = NULL, .func = NULL } #define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT @@ -241,10 +245,13 @@ extern void FASTCALL(call_rcu_bh(struct /* Exported common interfaces */ extern void synchronize_rcu(void); extern void rcu_barrier(void); +extern long rcu_batches_completed(void); +extern long rcu_batches_completed_bh(void); /* Internal to kernel */ extern void rcu_init(void); extern void rcu_check_callbacks(int cpu, int user); +extern int rcu_needs_cpu(int cpu); #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -0,0 +1,78 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (RT implementation) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2006 + * + * Author: Paul McKenney <paulmck@us.ibm.com> + * + * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. + * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU + * + */ + +#ifndef __LINUX_RCUPREEMPT_H +#define __LINUX_RCUPREEMPT_H + +#ifdef __KERNEL__ + +#include <linux/cache.h> +#include <linux/spinlock.h> +#include <linux/threads.h> +#include <linux/percpu.h> +#include <linux/cpumask.h> +#include <linux/seqlock.h> + +#define rcu_qsctr_inc(cpu) +#define rcu_bh_qsctr_inc(cpu) +#define call_rcu_bh(head, rcu) call_rcu(head, rcu) + +extern void __rcu_read_lock(void); +extern void __rcu_read_unlock(void); +extern int rcu_pending(int cpu); +extern int rcu_needs_cpu(int cpu); + +#define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); } +#define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); } + +#define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting) + +extern void __synchronize_sched(void); + +extern void __rcu_init(void); +extern void rcu_check_callbacks(int cpu, int user); +extern void rcu_restart_cpu(int cpu); + +#ifdef CONFIG_RCU_TRACE +struct rcupreempt_trace; +extern int *rcupreempt_flipctr(int cpu); +extern long rcupreempt_data_completed(void); +extern int rcupreempt_flip_flag(int cpu); +extern int rcupreempt_mb_flag(int cpu); +extern char *rcupreempt_try_flip_state_name(void); +extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu); +#endif + +struct softirq_action; + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUPREEMPT_H */ Index: linux-2.6.24.7/include/linux/rcupreempt_trace.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/rcupreempt_trace.h @@ -0,0 +1,100 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (RT implementation) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) IBM Corporation, 2006 + * + * Author: Paul McKenney <paulmck@us.ibm.com> + * + * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. 
+ * Papers: + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001) + * + * For detailed explanation of Read-Copy Update mechanism see - + * http://lse.sourceforge.net/locking/rcupdate.html + * + */ + +#ifndef __LINUX_RCUPREEMPT_TRACE_H +#define __LINUX_RCUPREEMPT_TRACE_H + +#ifdef __KERNEL__ +#include <linux/types.h> +#include <linux/kernel.h> + +#include <asm/atomic.h> + +/* + * PREEMPT_RCU data structures. + */ + +struct rcupreempt_trace { + long next_length; + long next_add; + long wait_length; + long wait_add; + long done_length; + long done_add; + long done_remove; + atomic_t done_invoked; + long rcu_check_callbacks; + atomic_t rcu_try_flip_1; + atomic_t rcu_try_flip_e1; + long rcu_try_flip_i1; + long rcu_try_flip_ie1; + long rcu_try_flip_g1; + long rcu_try_flip_a1; + long rcu_try_flip_ae1; + long rcu_try_flip_a2; + long rcu_try_flip_z1; + long rcu_try_flip_ze1; + long rcu_try_flip_z2; + long rcu_try_flip_m1; + long rcu_try_flip_me1; + long rcu_try_flip_m2; +}; + +#ifdef CONFIG_RCU_TRACE +#define RCU_TRACE(fn, arg) fn(arg); +#else +#define RCU_TRACE(fn, arg) +#endif + +extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace); +extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace); + +#endif /* __KERNEL__ */ +#endif /* __LINUX_RCUPREEMPT_TRACE_H */ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -976,6 +976,11 @@ struct task_struct { int nr_cpus_allowed; unsigned int time_slice; +#ifdef CONFIG_PREEMPT_RCU + int rcu_read_lock_nesting; + int rcu_flipctr_idx; +#endif /* #ifdef CONFIG_PREEMPT_RCU */ + #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -52,6 +52,45 @@ config PREEMPT endchoice +choice + prompt "RCU implementation type:" + default 
CLASSIC_RCU + +config CLASSIC_RCU + bool "Classic RCU" + help + This option selects the classic RCU implementation that is + designed for best read-side performance on non-realtime + systems. + + Say Y if you are unsure. + +config PREEMPT_RCU + bool "Preemptible RCU" + depends on PREEMPT + help + This option reduces the latency of the kernel by making certain + RCU sections preemptible. Normally RCU code is non-preemptible, if + this option is selected then read-only RCU sections become + preemptible. This helps latency, but may expose bugs due to + now-naive assumptions about each RCU read-side critical section + remaining on a given CPU through its execution. + + Say N if you are unsure. + +endchoice + +config RCU_TRACE + bool "Enable tracing for RCU - currently stats in debugfs" + select DEBUG_FS + default y + help + This option provides tracing in RCU which presents stats + in debugfs for debugging RCU implementation. + + Say Y here if you want to enable RCU tracing + Say N if you are unsure. + config PREEMPT_BKL bool "Preempt The Big Kernel Lock" depends on SMP || PREEMPT Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o user_namespace.o \ signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o rcuclassic.o extable.o params.o posix-timers.o \ + rcupdate.o extable.o params.o posix-timers.o \ kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ utsname.o notifier.o @@ -64,6 +64,11 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o +obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o +ifeq ($(CONFIG_PREEMPT_RCU),y) +obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o +endif obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_SYSCTL) += utsname_sysctl.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1045,6 +1045,10 @@ static struct task_struct *copy_process( copy_flags(clone_flags, p); INIT_LIST_HEAD(&p->children); INIT_LIST_HEAD(&p->sibling); +#ifdef CONFIG_PREEMPT_RCU + p->rcu_read_lock_nesting = 0; + p->rcu_flipctr_idx = 0; +#endif /* #ifdef CONFIG_PREEMPT_RCU */ p->vfork_done = NULL; spin_lock_init(&p->alloc_lock); Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -0,0 +1,766 @@ +/* + * Read-Copy Update mechanism for mutual exclusion, realtime implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright IBM Corporation, 2006 + * + * Authors: Paul E. McKenney <paulmck@us.ibm.com> + * With thanks to Esben Nielsen, Bill Huey, and Ingo Molnar + * for pushing me away from locks and towards counters, and + * to Suparna Bhattacharya for pushing me completely away + * from atomic instructions on the read side. + * + * Papers: http://www.rdrop.com/users/paulmck/RCU + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU/ *.txt + * + */ +#include <linux/types.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/spinlock.h> +#include <linux/smp.h> +#include <linux/rcupdate.h> +#include <linux/interrupt.h> +#include <linux/sched.h> +#include <asm/atomic.h> +#include <linux/bitops.h> +#include <linux/module.h> +#include <linux/completion.h> +#include <linux/moduleparam.h> +#include <linux/percpu.h> +#include <linux/notifier.h> +#include <linux/rcupdate.h> +#include <linux/cpu.h> +#include <linux/random.h> +#include <linux/delay.h> +#include <linux/byteorder/swabb.h> +#include <linux/cpumask.h> +#include <linux/rcupreempt_trace.h> + +/* + * PREEMPT_RCU data structures. + */ + +#define GP_STAGES 2 +struct rcu_data { + spinlock_t lock; /* Protect rcu_data fields. */ + long completed; /* Number of last completed batch. */ + int waitlistcount; + struct tasklet_struct rcu_tasklet; + struct rcu_head *nextlist; + struct rcu_head **nexttail; + struct rcu_head *waitlist[GP_STAGES]; + struct rcu_head **waittail[GP_STAGES]; + struct rcu_head *donelist; + struct rcu_head **donetail; +#ifdef CONFIG_RCU_TRACE + struct rcupreempt_trace trace; +#endif /* #ifdef CONFIG_RCU_TRACE */ +}; +struct rcu_ctrlblk { + spinlock_t fliplock; /* Protect state-machine transitions. */ + long completed; /* Number of last completed batch. */ +}; +static DEFINE_PER_CPU(struct rcu_data, rcu_data); +static struct rcu_ctrlblk rcu_ctrlblk = { + .fliplock = SPIN_LOCK_UNLOCKED, + .completed = 0, +}; +static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 }; + +/* + * States for rcu_try_flip() and friends. + */ + +enum rcu_try_flip_states { + rcu_try_flip_idle_state, /* "I" */ + rcu_try_flip_waitack_state, /* "A" */ + rcu_try_flip_waitzero_state, /* "Z" */ + rcu_try_flip_waitmb_state /* "M" */ +}; +static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state; +#ifdef CONFIG_RCU_TRACE +static char *rcu_try_flip_state_names[] = + { "idle", "waitack", "waitzero", "waitmb" }; +#endif /* #ifdef CONFIG_RCU_TRACE */ + +/* + * Enum and per-CPU flag to determine when each CPU has seen + * the most recent counter flip. + */ + +enum rcu_flip_flag_values { + rcu_flip_seen, /* Steady/initial state, last flip seen. */ + /* Only GP detector can update. */ + rcu_flipped /* Flip just completed, need confirmation. */ + /* Only corresponding CPU can update. */ +}; +static DEFINE_PER_CPU(enum rcu_flip_flag_values, rcu_flip_flag) = rcu_flip_seen; + +/* + * Enum and per-CPU flag to determine when each CPU has executed the + * needed memory barrier to fence in memory references from its last RCU + * read-side critical section in the just-completed grace period. + */ + +enum rcu_mb_flag_values { + rcu_mb_done, /* Steady/initial state, no mb()s required. */ + /* Only GP detector can update. */ + rcu_mb_needed /* Flip just completed, need an mb(). 
*/ + /* Only corresponding CPU can update. */ +}; +static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done; + +/* + * Macro that prevents the compiler from reordering accesses, but does + * absolutely -nothing- to prevent CPUs from reordering. This is used + * only to mediate communication between mainline code and hardware + * interrupt and NMI handlers. + */ +#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x)) + +/* + * RCU_DATA_ME: find the current CPU's rcu_data structure. + * RCU_DATA_CPU: find the specified CPU's rcu_data structure. + */ +#define RCU_DATA_ME() (&__get_cpu_var(rcu_data)) +#define RCU_DATA_CPU(cpu) (&per_cpu(rcu_data, cpu)) + +/* + * Helper macro for tracing when the appropriate rcu_data is not + * cached in a local variable, but where the CPU number is so cached. + */ +#define RCU_TRACE_CPU(f, cpu) RCU_TRACE(f, &(RCU_DATA_CPU(cpu)->trace)); + +/* + * Helper macro for tracing when the appropriate rcu_data is not + * cached in a local variable. + */ +#define RCU_TRACE_ME(f) RCU_TRACE(f, &(RCU_DATA_ME()->trace)); + +/* + * Helper macro for tracing when the appropriate rcu_data is pointed + * to by a local variable. + */ +#define RCU_TRACE_RDP(f, rdp) RCU_TRACE(f, &((rdp)->trace)); + +/* + * Return the number of RCU batches processed thus far. Useful + * for debug and statistics. + */ +long rcu_batches_completed(void) +{ + return rcu_ctrlblk.completed; +} +EXPORT_SYMBOL_GPL(rcu_batches_completed); + +/* + * Return the number of RCU batches processed thus far. Useful for debug + * and statistics. The _bh variant is identical to straight RCU. + */ +long rcu_batches_completed_bh(void) +{ + return rcu_ctrlblk.completed; +} +EXPORT_SYMBOL_GPL(rcu_batches_completed_bh); + +void __rcu_read_lock(void) +{ + int idx; + struct task_struct *me = current; + int nesting; + + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting); + if (nesting != 0) { + + /* An earlier rcu_read_lock() covers us, just count it. */ + + me->rcu_read_lock_nesting = nesting + 1; + + } else { + unsigned long oldirq; + + /* + * Disable local interrupts to prevent the grace-period + * detection state machine from seeing us half-done. + * NMIs can still occur, of course, and might themselves + * contain rcu_read_lock(). + */ + + local_irq_save(oldirq); + + /* + * Outermost nesting of rcu_read_lock(), so increment + * the current counter for the current CPU. Use volatile + * casts to prevent the compiler from reordering. + */ + + idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1; + smp_read_barrier_depends(); /* @@@@ might be unneeded */ + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++; + + /* + * Now that the per-CPU counter has been incremented, we + * are protected from races with rcu_read_lock() invoked + * from NMI handlers on this CPU. We can therefore safely + * increment the nesting counter, relieving further NMIs + * of the need to increment the per-CPU counter. + */ + + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1; + + /* + * Now that we have preventing any NMIs from storing + * to the ->rcu_flipctr_idx, we can safely use it to + * remember which counter to decrement in the matching + * rcu_read_unlock(). 
+ */ + + ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx; + local_irq_restore(oldirq); + } +} +EXPORT_SYMBOL_GPL(__rcu_read_lock); + +void __rcu_read_unlock(void) +{ + int idx; + struct task_struct *me = current; + int nesting; + + nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting); + if (nesting > 1) { + + /* + * We are still protected by the enclosing rcu_read_lock(), + * so simply decrement the counter. + */ + + me->rcu_read_lock_nesting = nesting - 1; + + } else { + unsigned long oldirq; + + /* + * Disable local interrupts to prevent the grace-period + * detection state machine from seeing us half-done. + * NMIs can still occur, of course, and might themselves + * contain rcu_read_lock() and rcu_read_unlock(). + */ + + local_irq_save(oldirq); + + /* + * Outermost nesting of rcu_read_unlock(), so we must + * decrement the current counter for the current CPU. + * This must be done carefully, because NMIs can + * occur at any point in this code, and any rcu_read_lock() + * and rcu_read_unlock() pairs in the NMI handlers + * must interact non-destructively with this code. + * Lots of volatile casts, and -very- careful ordering. + * + * Changes to this code, including this one, must be + * inspected, validated, and tested extremely carefully!!! + */ + + /* + * First, pick up the index. Enforce ordering for + * DEC Alpha. + */ + + idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx); + smp_read_barrier_depends(); /* @@@ Needed??? */ + + /* + * Now that we have fetched the counter index, it is + * safe to decrement the per-task RCU nesting counter. + * After this, any interrupts or NMIs will increment and + * decrement the per-CPU counters. + */ + ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1; + + /* + * It is now safe to decrement this task's nesting count. + * NMIs that occur after this statement will route their + * rcu_read_lock() calls through this "else" clause, and + * will thus start incrementing the per-CPU coutner on + * their own. They will also clobber ->rcu_flipctr_idx, + * but that is OK, since we have already fetched it. + */ + + ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--; + local_irq_restore(oldirq); + } +} +EXPORT_SYMBOL_GPL(__rcu_read_unlock); + +/* + * If a global counter flip has occurred since the last time that we + * advanced callbacks, advance them. Hardware interrupts must be + * disabled when calling this function. 
+ */ +static void __rcu_advance_callbacks(struct rcu_data *rdp) +{ + int cpu; + int i; + int wlc = 0; + + if (rdp->completed != rcu_ctrlblk.completed) { + if (rdp->waitlist[GP_STAGES - 1] != NULL) { + *rdp->donetail = rdp->waitlist[GP_STAGES - 1]; + rdp->donetail = rdp->waittail[GP_STAGES - 1]; + RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp); + } + for (i = GP_STAGES - 2; i >= 0; i--) { + if (rdp->waitlist[i] != NULL) { + rdp->waitlist[i + 1] = rdp->waitlist[i]; + rdp->waittail[i + 1] = rdp->waittail[i]; + wlc++; + } else { + rdp->waitlist[i + 1] = NULL; + rdp->waittail[i + 1] = + &rdp->waitlist[i + 1]; + } + } + if (rdp->nextlist != NULL) { + rdp->waitlist[0] = rdp->nextlist; + rdp->waittail[0] = rdp->nexttail; + wlc++; + rdp->nextlist = NULL; + rdp->nexttail = &rdp->nextlist; + RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp); + } else { + rdp->waitlist[0] = NULL; + rdp->waittail[0] = &rdp->waitlist[0]; + } + rdp->waitlistcount = wlc; + rdp->completed = rcu_ctrlblk.completed; + } + + /* + * Check to see if this CPU needs to report that it has seen + * the most recent counter flip, thereby declaring that all + * subsequent rcu_read_lock() invocations will respect this flip. + */ + + cpu = raw_smp_processor_id(); + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) { + smp_mb(); /* Subsequent counter accesses must see new value */ + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen; + smp_mb(); /* Subsequent RCU read-side critical sections */ + /* seen -after- acknowledgement. */ + } +} + +/* + * Get here when RCU is idle. Decide whether we need to + * move out of idle state, and return non-zero if so. + * "Straightforward" approach for the moment, might later + * use callback-list lengths, grace-period duration, or + * some such to determine when to exit idle state. + * Might also need a pre-idle test that does not acquire + * the lock, but let's get the simple case working first... + */ + +static int +rcu_try_flip_idle(void) +{ + int cpu; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1); + if (!rcu_pending(smp_processor_id())) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1); + return 0; + } + + /* + * Do the flip. + */ + + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1); + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */ + + /* + * Need a memory barrier so that other CPUs see the new + * counter value before they see the subsequent change of all + * the rcu_flip_flag instances to rcu_flipped. + */ + + smp_mb(); /* see above block comment. */ + + /* Now ask each CPU for acknowledgement of the flip. */ + + for_each_possible_cpu(cpu) + per_cpu(rcu_flip_flag, cpu) = rcu_flipped; + + return 1; +} + +/* + * Wait for CPUs to acknowledge the flip. + */ + +static int +rcu_try_flip_waitack(void) +{ + int cpu; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_a1); + for_each_possible_cpu(cpu) + if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1); + return 0; + } + + /* + * Make sure our checks above don't bleed into subsequent + * waiting for the sum of the counters to reach zero. + */ + + smp_mb(); /* see above block comment. */ + RCU_TRACE_ME(rcupreempt_trace_try_flip_a2); + return 1; +} + +/* + * Wait for collective ``last'' counter to reach zero, + * then tell all CPUs to do an end-of-grace-period memory barrier. + */ + +static int +rcu_try_flip_waitzero(void) +{ + int cpu; + int lastidx = !(rcu_ctrlblk.completed & 0x1); + int sum = 0; + + /* Check to see if the sum of the "last" counters is zero. 
*/ + + RCU_TRACE_ME(rcupreempt_trace_try_flip_z1); + for_each_possible_cpu(cpu) + sum += per_cpu(rcu_flipctr, cpu)[lastidx]; + if (sum != 0) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1); + return 0; + } + + smp_mb(); /* Don't call for memory barriers before we see zero. */ + + /* Call for a memory barrier from each CPU. */ + + for_each_possible_cpu(cpu) + per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_z2); + return 1; +} + +/* + * Wait for all CPUs to do their end-of-grace-period memory barrier. + * Return 0 once all CPUs have done so. + */ + +static int +rcu_try_flip_waitmb(void) +{ + int cpu; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_m1); + for_each_possible_cpu(cpu) + if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_me1); + return 0; + } + + smp_mb(); /* Ensure that the above checks precede any following flip. */ + RCU_TRACE_ME(rcupreempt_trace_try_flip_m2); + return 1; +} + +/* + * Attempt a single flip of the counters. Remember, a single flip does + * -not- constitute a grace period. Instead, the interval between + * at least three consecutive flips is a grace period. + * + * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation + * on a large SMP, they might want to use a hierarchical organization of + * the per-CPU-counter pairs. + */ +static void rcu_try_flip(void) +{ + unsigned long oldirq; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_1); + if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_e1); + return; + } + + /* + * Take the next transition(s) through the RCU grace-period + * flip-counter state machine. + */ + + switch (rcu_try_flip_state) { + case rcu_try_flip_idle_state: + if (rcu_try_flip_idle()) + rcu_try_flip_state = rcu_try_flip_waitack_state; + break; + case rcu_try_flip_waitack_state: + if (rcu_try_flip_waitack()) + rcu_try_flip_state = rcu_try_flip_waitzero_state; + break; + case rcu_try_flip_waitzero_state: + if (rcu_try_flip_waitzero()) + rcu_try_flip_state = rcu_try_flip_waitmb_state; + break; + case rcu_try_flip_waitmb_state: + if (rcu_try_flip_waitmb()) + rcu_try_flip_state = rcu_try_flip_idle_state; + } + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq); +} + +/* + * Check to see if this CPU needs to do a memory barrier in order to + * ensure that any prior RCU read-side critical sections have committed + * their counter manipulations and critical-section memory references + * before declaring the grace period to be completed. + */ +static void rcu_check_mb(int cpu) +{ + if (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed) { + smp_mb(); /* Ensure RCU read-side accesses are visible. 
*/ + per_cpu(rcu_mb_flag, cpu) = rcu_mb_done; + } +} + +void rcu_check_callbacks(int cpu, int user) +{ + unsigned long oldirq; + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + + rcu_check_mb(cpu); + if (rcu_ctrlblk.completed == rdp->completed) + rcu_try_flip(); + spin_lock_irqsave(&rdp->lock, oldirq); + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp); + __rcu_advance_callbacks(rdp); + if (rdp->donelist == NULL) { + spin_unlock_irqrestore(&rdp->lock, oldirq); + } else { + spin_unlock_irqrestore(&rdp->lock, oldirq); + raise_softirq(RCU_SOFTIRQ); + } +} + +/* + * Needed by dynticks, to make sure all RCU processing has finished + * when we go idle: + */ +void rcu_advance_callbacks(int cpu, int user) +{ + unsigned long oldirq; + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + + if (rcu_ctrlblk.completed == rdp->completed) { + rcu_try_flip(); + if (rcu_ctrlblk.completed == rdp->completed) + return; + } + spin_lock_irqsave(&rdp->lock, oldirq); + RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp); + __rcu_advance_callbacks(rdp); + spin_unlock_irqrestore(&rdp->lock, oldirq); +} + +static void rcu_process_callbacks(struct softirq_action *unused) +{ + unsigned long flags; + struct rcu_head *next, *list; + struct rcu_data *rdp = RCU_DATA_ME(); + + spin_lock_irqsave(&rdp->lock, flags); + list = rdp->donelist; + if (list == NULL) { + spin_unlock_irqrestore(&rdp->lock, flags); + return; + } + rdp->donelist = NULL; + rdp->donetail = &rdp->donelist; + RCU_TRACE_RDP(rcupreempt_trace_done_remove, rdp); + spin_unlock_irqrestore(&rdp->lock, flags); + while (list) { + next = list->next; + list->func(list); + list = next; + RCU_TRACE_ME(rcupreempt_trace_invoke); + } +} + +void fastcall call_rcu(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) +{ + unsigned long oldirq; + struct rcu_data *rdp; + + head->func = func; + head->next = NULL; + local_irq_save(oldirq); + rdp = RCU_DATA_ME(); + spin_lock(&rdp->lock); + __rcu_advance_callbacks(rdp); + *rdp->nexttail = head; + rdp->nexttail = &head->next; + RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp); + spin_unlock(&rdp->lock); + local_irq_restore(oldirq); +} +EXPORT_SYMBOL_GPL(call_rcu); + +/* + * Wait until all currently running preempt_disable() code segments + * (including hardware-irq-disable segments) complete. Note that + * in -rt this does -not- necessarily result in all currently executing + * interrupt -handlers- having completed. + */ +void __synchronize_sched(void) +{ + cpumask_t oldmask; + int cpu; + + if (sched_getaffinity(0, &oldmask) < 0) + oldmask = cpu_possible_map; + for_each_online_cpu(cpu) { + sched_setaffinity(0, cpumask_of_cpu(cpu)); + schedule(); + } + sched_setaffinity(0, oldmask); +} +EXPORT_SYMBOL_GPL(__synchronize_sched); + +/* + * Check to see if any future RCU-related work will need to be done + * by the current CPU, even if none need be done immediately, returning + * 1 if so. Assumes that notifiers would take care of handling any + * outstanding requests from the RCU core. + * + * This function is part of the RCU implementation; it is -not- + * an exported member of the RCU API. + */ +int rcu_needs_cpu(int cpu) +{ + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + + return (rdp->donelist != NULL || + !!rdp->waitlistcount || + rdp->nextlist != NULL); +} + +int rcu_pending(int cpu) +{ + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + + /* The CPU has at least one callback queued somewhere. 
*/ + + if (rdp->donelist != NULL || + !!rdp->waitlistcount || + rdp->nextlist != NULL) + return 1; + + /* The RCU core needs an acknowledgement from this CPU. */ + + if ((per_cpu(rcu_flip_flag, cpu) == rcu_flipped) || + (per_cpu(rcu_mb_flag, cpu) == rcu_mb_needed)) + return 1; + + /* This CPU has fallen behind the global grace-period number. */ + + if (rdp->completed != rcu_ctrlblk.completed) + return 1; + + /* Nothing needed from this CPU. */ + + return 0; +} + +void __init __rcu_init(void) +{ + int cpu; + int i; + struct rcu_data *rdp; + + for_each_possible_cpu(cpu) { + rdp = RCU_DATA_CPU(cpu); + spin_lock_init(&rdp->lock); + rdp->completed = 0; + rdp->waitlistcount = 0; + rdp->nextlist = NULL; + rdp->nexttail = &rdp->nextlist; + for (i = 0; i < GP_STAGES; i++) { + rdp->waitlist[i] = NULL; + rdp->waittail[i] = &rdp->waitlist[i]; + } + rdp->donelist = NULL; + rdp->donetail = &rdp->donelist; + } + open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL); +} + +/* + * Deprecated, use synchronize_rcu() or synchronize_sched() instead. + */ +void synchronize_kernel(void) +{ + synchronize_rcu(); +} + +#ifdef CONFIG_RCU_TRACE +int *rcupreempt_flipctr(int cpu) +{ + return &per_cpu(rcu_flipctr, cpu)[0]; +} +EXPORT_SYMBOL_GPL(rcupreempt_flipctr); + +int rcupreempt_flip_flag(int cpu) +{ + return per_cpu(rcu_flip_flag, cpu); +} +EXPORT_SYMBOL_GPL(rcupreempt_flip_flag); + +int rcupreempt_mb_flag(int cpu) +{ + return per_cpu(rcu_mb_flag, cpu); +} +EXPORT_SYMBOL_GPL(rcupreempt_mb_flag); + +char *rcupreempt_try_flip_state_name(void) +{ + return rcu_try_flip_state_names[rcu_try_flip_state]; +} +EXPORT_SYMBOL_GPL(rcupreempt_try_flip_state_name); + +struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu) +{ + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + + return &rdp->trace; +} +EXPORT_SYMBOL_GPL(rcupreempt_trace_cpu); + +#endif /* #ifdef RCU_TRACE */ Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -0,0 +1,330 @@ +/* + * Read-Copy Update tracing for realtime implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
+ * + * Copyright IBM Corporation, 2006 + * + * Papers: http://www.rdrop.com/users/paulmck/RCU + * + * For detailed explanation of Read-Copy Update mechanism see - + * Documentation/RCU/ *.txt + * + */ +#include <linux/types.h> +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/spinlock.h> +#include <linux/smp.h> +#include <linux/rcupdate.h> +#include <linux/interrupt.h> +#include <linux/sched.h> +#include <asm/atomic.h> +#include <linux/bitops.h> +#include <linux/module.h> +#include <linux/completion.h> +#include <linux/moduleparam.h> +#include <linux/percpu.h> +#include <linux/notifier.h> +#include <linux/rcupdate.h> +#include <linux/cpu.h> +#include <linux/mutex.h> +#include <linux/rcupreempt_trace.h> +#include <linux/debugfs.h> + +static struct mutex rcupreempt_trace_mutex; +static char *rcupreempt_trace_buf; +#define RCUPREEMPT_TRACE_BUF_SIZE 4096 + +void rcupreempt_trace_move2done(struct rcupreempt_trace *trace) +{ + trace->done_length += trace->wait_length; + trace->done_add += trace->wait_length; + trace->wait_length = 0; +} +void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace) +{ + trace->wait_length += trace->next_length; + trace->wait_add += trace->next_length; + trace->next_length = 0; +} +void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace) +{ + atomic_inc(&trace->rcu_try_flip_1); +} +void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace) +{ + atomic_inc(&trace->rcu_try_flip_e1); +} +void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_i1++; +} +void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_ie1++; +} +void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_g1++; +} +void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_a1++; +} +void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_ae1++; +} +void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_a2++; +} +void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_z1++; +} +void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_ze1++; +} +void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_z2++; +} +void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_m1++; +} +void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_me1++; +} +void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace) +{ + trace->rcu_try_flip_m2++; +} +void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace) +{ + trace->rcu_check_callbacks++; +} +void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace) +{ + trace->done_remove += trace->done_length; + trace->done_length = 0; +} +void rcupreempt_trace_invoke(struct rcupreempt_trace *trace) +{ + atomic_inc(&trace->done_invoked); +} +void rcupreempt_trace_next_add(struct rcupreempt_trace *trace) +{ + trace->next_add++; + trace->next_length++; +} + +static void rcupreempt_trace_sum(struct rcupreempt_trace *sp) +{ + struct rcupreempt_trace *cp; + int cpu; + + memset(sp, 0, sizeof(*sp)); + for_each_possible_cpu(cpu) { + cp = rcupreempt_trace_cpu(cpu); + sp->next_length += cp->next_length; + sp->next_add += cp->next_add; + sp->wait_length += cp->wait_length; + sp->wait_add += cp->wait_add; + sp->done_length += cp->done_length; + 
sp->done_add += cp->done_add; + sp->done_remove += cp->done_remove; + atomic_set(&sp->done_invoked, atomic_read(&cp->done_invoked)); + sp->rcu_check_callbacks += cp->rcu_check_callbacks; + atomic_set(&sp->rcu_try_flip_1, + atomic_read(&cp->rcu_try_flip_1)); + atomic_set(&sp->rcu_try_flip_e1, + atomic_read(&cp->rcu_try_flip_e1)); + sp->rcu_try_flip_i1 += cp->rcu_try_flip_i1; + sp->rcu_try_flip_ie1 += cp->rcu_try_flip_ie1; + sp->rcu_try_flip_g1 += cp->rcu_try_flip_g1; + sp->rcu_try_flip_a1 += cp->rcu_try_flip_a1; + sp->rcu_try_flip_ae1 += cp->rcu_try_flip_ae1; + sp->rcu_try_flip_a2 += cp->rcu_try_flip_a2; + sp->rcu_try_flip_z1 += cp->rcu_try_flip_z1; + sp->rcu_try_flip_ze1 += cp->rcu_try_flip_ze1; + sp->rcu_try_flip_z2 += cp->rcu_try_flip_z2; + sp->rcu_try_flip_m1 += cp->rcu_try_flip_m1; + sp->rcu_try_flip_me1 += cp->rcu_try_flip_me1; + sp->rcu_try_flip_m2 += cp->rcu_try_flip_m2; + } +} + +static ssize_t rcustats_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + struct rcupreempt_trace trace; + ssize_t bcount; + int cnt = 0; + + rcupreempt_trace_sum(&trace); + mutex_lock(&rcupreempt_trace_mutex); + snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE - cnt, + "ggp=%ld rcc=%ld\n", + rcu_batches_completed(), + trace.rcu_check_callbacks); + snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE - cnt, + "na=%ld nl=%ld wa=%ld wl=%ld da=%ld dl=%ld dr=%ld di=%d\n" + "1=%d e1=%d i1=%ld ie1=%ld g1=%ld a1=%ld ae1=%ld a2=%ld\n" + "z1=%ld ze1=%ld z2=%ld m1=%ld me1=%ld m2=%ld\n", + + trace.next_add, trace.next_length, + trace.wait_add, trace.wait_length, + trace.done_add, trace.done_length, + trace.done_remove, atomic_read(&trace.done_invoked), + atomic_read(&trace.rcu_try_flip_1), + atomic_read(&trace.rcu_try_flip_e1), + trace.rcu_try_flip_i1, trace.rcu_try_flip_ie1, + trace.rcu_try_flip_g1, + trace.rcu_try_flip_a1, trace.rcu_try_flip_ae1, + trace.rcu_try_flip_a2, + trace.rcu_try_flip_z1, trace.rcu_try_flip_ze1, + trace.rcu_try_flip_z2, + trace.rcu_try_flip_m1, trace.rcu_try_flip_me1, + trace.rcu_try_flip_m2); + bcount = simple_read_from_buffer(buffer, count, ppos, + rcupreempt_trace_buf, strlen(rcupreempt_trace_buf)); + mutex_unlock(&rcupreempt_trace_mutex); + return bcount; +} + +static ssize_t rcugp_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + long oldgp = rcu_batches_completed(); + ssize_t bcount; + + mutex_lock(&rcupreempt_trace_mutex); + synchronize_rcu(); + snprintf(rcupreempt_trace_buf, RCUPREEMPT_TRACE_BUF_SIZE, + "oldggp=%ld newggp=%ld\n", oldgp, rcu_batches_completed()); + bcount = simple_read_from_buffer(buffer, count, ppos, + rcupreempt_trace_buf, strlen(rcupreempt_trace_buf)); + mutex_unlock(&rcupreempt_trace_mutex); + return bcount; +} + +static ssize_t rcuctrs_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + int cnt = 0; + int cpu; + int f = rcu_batches_completed() & 0x1; + ssize_t bcount; + + mutex_lock(&rcupreempt_trace_mutex); + + cnt += snprintf(&rcupreempt_trace_buf[cnt], RCUPREEMPT_TRACE_BUF_SIZE, + "CPU last cur F M\n"); + for_each_online_cpu(cpu) { + int *flipctr = rcupreempt_flipctr(cpu); + cnt += snprintf(&rcupreempt_trace_buf[cnt], + RCUPREEMPT_TRACE_BUF_SIZE - cnt, + "%3d %4d %3d %d %d\n", + cpu, + flipctr[!f], + flipctr[f], + rcupreempt_flip_flag(cpu), + rcupreempt_mb_flag(cpu)); + } + cnt += snprintf(&rcupreempt_trace_buf[cnt], + RCUPREEMPT_TRACE_BUF_SIZE - cnt, + "ggp = %ld, state = %s\n", + rcu_batches_completed(), + rcupreempt_try_flip_state_name()); + cnt 
+= snprintf(&rcupreempt_trace_buf[cnt], + RCUPREEMPT_TRACE_BUF_SIZE - cnt, + "\n"); + bcount = simple_read_from_buffer(buffer, count, ppos, + rcupreempt_trace_buf, strlen(rcupreempt_trace_buf)); + mutex_unlock(&rcupreempt_trace_mutex); + return bcount; +} + +static struct file_operations rcustats_fops = { + .owner = THIS_MODULE, + .read = rcustats_read, +}; + +static struct file_operations rcugp_fops = { + .owner = THIS_MODULE, + .read = rcugp_read, +}; + +static struct file_operations rcuctrs_fops = { + .owner = THIS_MODULE, + .read = rcuctrs_read, +}; + +static struct dentry *rcudir, *statdir, *ctrsdir, *gpdir; +static int rcupreempt_debugfs_init(void) +{ + rcudir = debugfs_create_dir("rcu", NULL); + if (!rcudir) + goto out; + statdir = debugfs_create_file("rcustats", 0444, rcudir, + NULL, &rcustats_fops); + if (!statdir) + goto free_out; + + gpdir = debugfs_create_file("rcugp", 0444, rcudir, NULL, &rcugp_fops); + if (!gpdir) + goto free_out; + + ctrsdir = debugfs_create_file("rcuctrs", 0444, rcudir, + NULL, &rcuctrs_fops); + if (!ctrsdir) + goto free_out; + return 0; +free_out: + if (statdir) + debugfs_remove(statdir); + if (gpdir) + debugfs_remove(gpdir); + debugfs_remove(rcudir); +out: + return 1; +} + +static int __init rcupreempt_trace_init(void) +{ + mutex_init(&rcupreempt_trace_mutex); + rcupreempt_trace_buf = kmalloc(RCUPREEMPT_TRACE_BUF_SIZE, GFP_KERNEL); + if (!rcupreempt_trace_buf) + return 1; + return rcupreempt_debugfs_init(); +} + +static void __exit rcupreempt_trace_cleanup(void) +{ + debugfs_remove(statdir); + debugfs_remove(gpdir); + debugfs_remove(ctrsdir); + debugfs_remove(rcudir); + kfree(rcupreempt_trace_buf); +} + + +module_init(rcupreempt_trace_init); +module_exit(rcupreempt_trace_cleanup); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-new-4.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000054246�11041657733�014021� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Thu Sep 27 00:09:37 2007 Date: Mon, 10 Sep 2007 11:35:25 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 4/9] RCU: synchronize_sched() workaround for CPU hotplug Work in progress, not for inclusion. The combination of CPU hotplug and PREEMPT_RCU has resulted in deadlocks due to the migration-based implementation of synchronize_sched() in -rt. This experimental patch maps synchronize_sched() back onto Classic RCU, eliminating the migration, thus hopefully also eliminating the deadlocks. It is not clear that this is a good long-term approach, but it will at least permit people doing CPU hotplug in -rt kernels additional wiggle room in their design and implementation. 
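Illustrative aside (not part of the patch): "mapping synchronize_sched() back onto Classic RCU" amounts to waiting for one classic grace period instead of migrating the caller across every online CPU the way the -rt __synchronize_sched() shown earlier does. A minimal sketch of that shape, reusing the completion pattern that synchronize_rcu() already uses in this series, might look as follows; synchronize_sched_classic() and wakeme_after_sched_gp() are invented names for the example, while call_rcu_classic() is the classic-RCU queueing primitive this patch introduces.

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>

struct sched_synchronize {
	struct rcu_head head;
	struct completion completion;
};

static void wakeme_after_sched_gp(struct rcu_head *head)
{
	struct sched_synchronize *s;

	s = container_of(head, struct sched_synchronize, head);
	complete(&s->completion);
}

/*
 * Block until a full classic grace period has elapsed, which implies
 * that all preempt_disable()/irq-disable regions that were in flight
 * when we were called have completed.  No cross-CPU migration needed.
 */
void synchronize_sched_classic(void)
{
	struct sched_synchronize s;

	init_completion(&s.completion);
	call_rcu_classic(&s.head, wakeme_after_sched_gp);
	wait_for_completion(&s.completion);
}

Because the caller simply sleeps in wait_for_completion() rather than bouncing across CPUs via sched_setaffinity(), this shape sidesteps the CPU-hotplug interaction described above.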
The basic approach is to cause the -rt kernel to incorporate rcuclassic.c as well as rcupreempt.c, but to #ifdef out the conflicting portions of rcuclassic.c so that only the code needed to implement synchronize_sched() remains in a PREEMPT_RT build. Invocations of grace-period detection from the scheduling-clock interrupt go to rcuclassic.c, which then invokes the corresponding functions in rcupreempt.c (with _rt suffix added to keep the linker happy). Also applies the RCU_SOFTIRQ to classic RCU. The bulk of this patch just moves code around, but likely increases scheduling-clock latency. If this patch does turn out to be the right approach, the #ifdefs in kernel/rcuclassic.c might be dealt with. ;-) At current writing, Gautham Shenoy's most recent CPU-hotplug fixes seem likely to obsolete this patch (which would be a very good thing indeed!). If this really pans out, this portion of the patch will vanish during the forward-porting process. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 79 +++++-------------------------------- include/linux/rcupdate.h | 30 ++++++++++++-- include/linux/rcupreempt.h | 27 ++++++------ kernel/Makefile | 2 kernel/rcuclassic.c | 95 ++++++++++++++++++++++++++++++++++++--------- kernel/rcupdate.c | 22 ++++++++-- kernel/rcupreempt.c | 50 +++++------------------ 7 files changed, 158 insertions(+), 147 deletions(-) Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -42,80 +42,19 @@ #include <linux/cpumask.h> #include <linux/seqlock.h> - -/* Global control variables for rcupdate callback mechanism. */ -struct rcu_ctrlblk { - long cur; /* Current batch number. */ - long completed; /* Number of the last completed batch */ - int next_pending; /* Is the next batch already waiting? */ - - int signaled; - - spinlock_t lock ____cacheline_internodealigned_in_smp; - cpumask_t cpumask; /* CPUs that need to switch in order */ - /* for current batch to proceed. */ -} ____cacheline_internodealigned_in_smp; - -/* Is batch a before batch b ? */ -static inline int rcu_batch_before(long a, long b) -{ - return (a - b) < 0; -} - -/* Is batch a after batch b ? */ -static inline int rcu_batch_after(long a, long b) -{ - return (a - b) > 0; -} +DECLARE_PER_CPU(int, rcu_data_bh_passed_quiesc); /* - * Per-CPU data for Read-Copy UPdate. - * nxtlist - new callbacks are added here - * curlist - current batch for which quiescent cycle started if any - */ -struct rcu_data { - /* 1) quiescent state handling : */ - long quiescbatch; /* Batch # for grace period */ - int passed_quiesc; /* User-mode/idle loop etc. */ - int qs_pending; /* core waits for quiesc state */ - - /* 2) batch handling */ - long batch; /* Batch # for current RCU batch */ - struct rcu_head *nxtlist; - struct rcu_head **nxttail; - long qlen; /* # of queued callbacks */ - struct rcu_head *curlist; - struct rcu_head **curtail; - struct rcu_head *donelist; - struct rcu_head **donetail; - long blimit; /* Upper limit on a processed batch */ - int cpu; - struct rcu_head barrier; -}; - -DECLARE_PER_CPU(struct rcu_data, rcu_data); -DECLARE_PER_CPU(struct rcu_data, rcu_bh_data); - -/* - * Increment the quiescent state counter. + * Increment the bottom-half quiescent state counter. 
* The counter is a bit degenerated: We do not need to know * how many quiescent states passed, just if there was at least * one since the start of the grace period. Thus just a flag. */ -static inline void rcu_qsctr_inc(int cpu) -{ - struct rcu_data *rdp = &per_cpu(rcu_data, cpu); - rdp->passed_quiesc = 1; -} static inline void rcu_bh_qsctr_inc(int cpu) { - struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu); - rdp->passed_quiesc = 1; + per_cpu(rcu_data_bh_passed_quiesc, cpu) = 1; } -extern int rcu_pending(int cpu); -extern int rcu_needs_cpu(int cpu); - #define __rcu_read_lock() \ do { \ preempt_disable(); \ @@ -139,9 +78,15 @@ extern int rcu_needs_cpu(int cpu); #define __synchronize_sched() synchronize_rcu() -extern void __rcu_init(void); -extern void rcu_check_callbacks(int cpu, int user); -extern void rcu_restart_cpu(int cpu); +#define rcu_advance_callbacks_rt(cpu, user) do { } while (0) +#define rcu_check_callbacks_rt(cpu, user) do { } while (0) +#define rcu_init_rt() do { } while (0) +#define rcu_needs_cpu_rt(cpu) 0 +#define rcu_pending_rt(cpu) 0 +#define rcu_process_callbacks_rt(unused) do { } while (0) + +extern void FASTCALL(call_rcu_classic(struct rcu_head *head, + void (*func)(struct rcu_head *head))); #endif /* __KERNEL__ */ #endif /* __LINUX_RCUCLASSIC_H */ Index: linux-2.6.24.7/include/linux/rcupdate.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupdate.h +++ linux-2.6.24.7/include/linux/rcupdate.h @@ -220,8 +220,11 @@ extern struct lockdep_map rcu_lock_map; * delimited by rcu_read_lock() and rcu_read_unlock(), * and may be nested. */ -extern void FASTCALL(call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *head))); +#ifdef CONFIG_CLASSIC_RCU +#define call_rcu(head, func) call_rcu_classic(head, func) +#else /* #ifdef CONFIG_CLASSIC_RCU */ +#define call_rcu(head, func) call_rcu_preempt(head, func) +#endif /* #else #ifdef CONFIG_CLASSIC_RCU */ /** * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. @@ -249,9 +252,28 @@ extern long rcu_batches_completed(void); extern long rcu_batches_completed_bh(void); /* Internal to kernel */ -extern void rcu_init(void); extern void rcu_check_callbacks(int cpu, int user); -extern int rcu_needs_cpu(int cpu); +extern long rcu_batches_completed(void); +extern long rcu_batches_completed_bh(void); +extern void rcu_check_callbacks(int cpu, int user); +extern void rcu_init(void); +extern int rcu_needs_cpu(int cpu); +extern int rcu_pending(int cpu); +struct softirq_action; +extern void rcu_restart_cpu(int cpu); + +DECLARE_PER_CPU(int, rcu_data_passed_quiesc); + +/* + * Increment the quiescent state counter. + * The counter is a bit degenerated: We do not need to know + * how many quiescent states passed, just if there was at least + * one since the start of the grace period. Thus just a flag. 
+ */ +static inline void rcu_qsctr_inc(int cpu) +{ + per_cpu(rcu_data_passed_quiesc, cpu) = 1; +} #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -42,25 +42,26 @@ #include <linux/cpumask.h> #include <linux/seqlock.h> -#define rcu_qsctr_inc(cpu) -#define rcu_bh_qsctr_inc(cpu) #define call_rcu_bh(head, rcu) call_rcu(head, rcu) - -extern void __rcu_read_lock(void); -extern void __rcu_read_unlock(void); -extern int rcu_pending(int cpu); -extern int rcu_needs_cpu(int cpu); - +#define rcu_bh_qsctr_inc(cpu) do { } while (0) #define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); } #define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); } - #define __rcu_read_lock_nesting() (current->rcu_read_lock_nesting) +extern void FASTCALL(call_rcu_classic(struct rcu_head *head, + void (*func)(struct rcu_head *head))); +extern void FASTCALL(call_rcu_preempt(struct rcu_head *head, + void (*func)(struct rcu_head *head))); +extern void __rcu_read_lock(void); +extern void __rcu_read_unlock(void); extern void __synchronize_sched(void); - -extern void __rcu_init(void); -extern void rcu_check_callbacks(int cpu, int user); -extern void rcu_restart_cpu(int cpu); +extern void rcu_advance_callbacks_rt(int cpu, int user); +extern void rcu_check_callbacks_rt(int cpu, int user); +extern void rcu_init_rt(void); +extern int rcu_needs_cpu_rt(int cpu); +extern int rcu_pending_rt(int cpu); +struct softirq_action; +extern void rcu_process_callbacks_rt(struct softirq_action *unused); #ifdef CONFIG_RCU_TRACE struct rcupreempt_trace; Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -65,7 +65,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o -obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o +obj-$(CONFIG_PREEMPT_RCU) += rcuclassic.o rcupreempt.o ifeq ($(CONFIG_PREEMPT_RCU),y) obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o endif Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcuclassic.c +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -45,10 +45,53 @@ #include <linux/moduleparam.h> #include <linux/percpu.h> #include <linux/notifier.h> -/* #include <linux/rcupdate.h> @@@ */ #include <linux/cpu.h> #include <linux/mutex.h> + +/* Global control variables for rcupdate callback mechanism. */ +struct rcu_ctrlblk { + long cur; /* Current batch number. */ + long completed; /* Number of the last completed batch */ + int next_pending; /* Is the next batch already waiting? */ + + int signaled; + + spinlock_t lock ____cacheline_internodealigned_in_smp; + cpumask_t cpumask; /* CPUs that need to switch in order */ + /* for current batch to proceed. */ +} ____cacheline_internodealigned_in_smp; + +/* Is batch a before batch b ? */ +static inline int rcu_batch_before(long a, long b) +{ + return (a - b) < 0; +} + +/* + * Per-CPU data for Read-Copy UPdate. 
+ * nxtlist - new callbacks are added here + * curlist - current batch for which quiescent cycle started if any + */ +struct rcu_data { + /* 1) quiescent state handling : */ + long quiescbatch; /* Batch # for grace period */ + int *passed_quiesc; /* User-mode/idle loop etc. */ + int qs_pending; /* core waits for quiesc state */ + + /* 2) batch handling */ + long batch; /* Batch # for current RCU batch */ + struct rcu_head *nxtlist; + struct rcu_head **nxttail; + long qlen; /* # of queued callbacks */ + struct rcu_head *curlist; + struct rcu_head **curtail; + struct rcu_head *donelist; + struct rcu_head **donetail; + long blimit; /* Upper limit on a processed batch */ + int cpu; +}; + /* Definition for rcupdate control block. */ static struct rcu_ctrlblk rcu_ctrlblk = { .cur = -300, @@ -63,11 +106,11 @@ static struct rcu_ctrlblk rcu_bh_ctrlblk .cpumask = CPU_MASK_NONE, }; -DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L }; -DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L }; +static DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L }; +static DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L }; +DEFINE_PER_CPU(int, rcu_data_bh_passed_quiesc); /* Fake initialization required by compiler */ -static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL}; static int blimit = 10; static int qhimark = 10000; static int qlowmark = 100; @@ -110,8 +153,8 @@ static inline void force_quiescent_state * sections are delimited by rcu_read_lock() and rcu_read_unlock(), * and may be nested. */ -void fastcall call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *rcu)) +void fastcall call_rcu_classic(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) { unsigned long flags; struct rcu_data *rdp; @@ -128,7 +171,9 @@ void fastcall call_rcu(struct rcu_head * } local_irq_restore(flags); } -EXPORT_SYMBOL_GPL(call_rcu); +EXPORT_SYMBOL_GPL(call_rcu_classic); + +#ifdef CONFIG_CLASSIC_RCU /** * call_rcu_bh - Queue an RCU for invocation after a quicker grace period. @@ -166,7 +211,9 @@ void fastcall call_rcu_bh(struct rcu_hea local_irq_restore(flags); } +#ifdef CONFIG_CLASSIC_RCU EXPORT_SYMBOL_GPL(call_rcu_bh); +#endif /* #ifdef CONFIG_CLASSIC_RCU */ /* * Return the number of RCU batches processed thus far. Useful @@ -176,7 +223,9 @@ long rcu_batches_completed(void) { return rcu_ctrlblk.completed; } +#ifdef CONFIG_CLASSIC_RCU EXPORT_SYMBOL_GPL(rcu_batches_completed); +#endif /* #ifdef CONFIG_CLASSIC_RCU */ /* * Return the number of RCU batches processed thus far. Useful @@ -186,7 +235,11 @@ long rcu_batches_completed_bh(void) { return rcu_bh_ctrlblk.completed; } +#ifdef CONFIG_CLASSIC_RCU EXPORT_SYMBOL_GPL(rcu_batches_completed_bh); +#endif /* #ifdef CONFIG_CLASSIC_RCU */ + +#endif /* #ifdef CONFIG_CLASSIC_RCU */ /* * Invoke the completed RCU callbacks. They are expected to be in @@ -217,7 +270,7 @@ static void rcu_do_batch(struct rcu_data if (!rdp->donelist) rdp->donetail = &rdp->donelist; else - tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu)); + raise_softirq(RCU_SOFTIRQ); } /* @@ -294,7 +347,7 @@ static void rcu_check_quiescent_state(st if (rdp->quiescbatch != rcp->cur) { /* start new grace period: */ rdp->qs_pending = 1; - rdp->passed_quiesc = 0; + *rdp->passed_quiesc = 0; rdp->quiescbatch = rcp->cur; return; } @@ -310,7 +363,7 @@ static void rcu_check_quiescent_state(st * Was there a quiescent state since the beginning of the grace * period? If no, then exit and wait for the next call. 
*/ - if (!rdp->passed_quiesc) + if (!*rdp->passed_quiesc) return; rdp->qs_pending = 0; @@ -369,7 +422,6 @@ static void rcu_offline_cpu(int cpu) &per_cpu(rcu_bh_data, cpu)); put_cpu_var(rcu_data); put_cpu_var(rcu_bh_data); - tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu); } #else @@ -381,7 +433,7 @@ static void rcu_offline_cpu(int cpu) #endif /* - * This does the RCU processing work from tasklet context. + * This does the RCU processing work from softirq context. */ static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, struct rcu_data *rdp) @@ -426,10 +478,11 @@ static void __rcu_process_callbacks(stru rcu_do_batch(rdp); } -static void rcu_process_callbacks(unsigned long unused) +static void rcu_process_callbacks(struct softirq_action *unused) { __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data)); __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data)); + rcu_process_callbacks_rt(unused); } static int __rcu_pending(struct rcu_ctrlblk *rcp, struct rcu_data *rdp) @@ -464,7 +517,8 @@ static int __rcu_pending(struct rcu_ctrl int rcu_pending(int cpu) { return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) || - __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)); + __rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)) || + rcu_pending_rt(cpu); } /* @@ -478,7 +532,8 @@ int rcu_needs_cpu(int cpu) struct rcu_data *rdp = &per_cpu(rcu_data, cpu); struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu); - return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu)); + return (!!rdp->curlist || !!rdp_bh->curlist || rcu_pending(cpu) || + rcu_needs_cpu_rt(cpu)); } void rcu_check_callbacks(int cpu, int user) @@ -490,7 +545,8 @@ void rcu_check_callbacks(int cpu, int us rcu_bh_qsctr_inc(cpu); } else if (!in_softirq()) rcu_bh_qsctr_inc(cpu); - tasklet_schedule(&per_cpu(rcu_tasklet, cpu)); + rcu_check_callbacks_rt(cpu, user); + raise_softirq(RCU_SOFTIRQ); } static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp, @@ -512,8 +568,9 @@ static void __cpuinit rcu_online_cpu(int struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu); rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp); + rdp->passed_quiesc = &per_cpu(rcu_data_passed_quiesc, cpu); rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp); - tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL); + bh_rdp->passed_quiesc = &per_cpu(rcu_data_bh_passed_quiesc, cpu); } static int __cpuinit rcu_cpu_notify(struct notifier_block *self, @@ -545,12 +602,14 @@ static struct notifier_block __cpuinitda * Note that rcu_qsctr and friends are implicitly * initialized due to the choice of ``0'' for RCU_CTR_INVALID. 
*/ -void __init __rcu_init(void) +void __init rcu_init(void) { rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, (void *)(long)smp_processor_id()); /* Register notifier for non-boot CPUs */ register_cpu_notifier(&rcu_nb); + rcu_init_rt(); + open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL); } module_param(blimit, int, 0); Index: linux-2.6.24.7/kernel/rcupdate.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupdate.c +++ linux-2.6.24.7/kernel/rcupdate.c @@ -59,6 +59,7 @@ struct rcu_synchronize { struct completion completion; }; +DEFINE_PER_CPU(int, rcu_data_passed_quiesc); static DEFINE_PER_CPU(struct rcu_head, rcu_barrier_head) = {NULL}; static atomic_t rcu_barrier_cpu_count; static DEFINE_MUTEX(rcu_barrier_mutex); @@ -95,6 +96,22 @@ void synchronize_rcu(void) } EXPORT_SYMBOL_GPL(synchronize_rcu); +#ifdef CONFIG_PREEMPT_RCU + +/* + * Map synchronize_sched() to the classic RCU implementation. + */ +void __synchronize_sched(void) +{ + struct rcu_synchronize rcu; + + init_completion(&rcu.completion); + call_rcu_classic(&rcu.head, wakeme_after_rcu); + wait_for_completion(&rcu.completion); +} +EXPORT_SYMBOL_GPL(__synchronize_sched); +#endif /* #ifdef CONFIG_PREEMPT_RCU */ + static void rcu_barrier_callback(struct rcu_head *notused) { if (atomic_dec_and_test(&rcu_barrier_cpu_count)) @@ -138,8 +155,3 @@ void rcu_barrier(void) mutex_unlock(&rcu_barrier_mutex); } EXPORT_SYMBOL_GPL(rcu_barrier); - -void __init rcu_init(void) -{ - __rcu_init(); -} Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -61,7 +61,6 @@ struct rcu_data { spinlock_t lock; /* Protect rcu_data fields. */ long completed; /* Number of last completed batch. */ int waitlistcount; - struct tasklet_struct rcu_tasklet; struct rcu_head *nextlist; struct rcu_head **nexttail; struct rcu_head *waitlist[GP_STAGES]; @@ -550,7 +549,7 @@ static void rcu_check_mb(int cpu) } } -void rcu_check_callbacks(int cpu, int user) +void rcu_check_callbacks_rt(int cpu, int user) { unsigned long oldirq; struct rcu_data *rdp = RCU_DATA_CPU(cpu); @@ -561,19 +560,14 @@ void rcu_check_callbacks(int cpu, int us spin_lock_irqsave(&rdp->lock, oldirq); RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp); __rcu_advance_callbacks(rdp); - if (rdp->donelist == NULL) { - spin_unlock_irqrestore(&rdp->lock, oldirq); - } else { - spin_unlock_irqrestore(&rdp->lock, oldirq); - raise_softirq(RCU_SOFTIRQ); - } + spin_unlock_irqrestore(&rdp->lock, oldirq); } /* * Needed by dynticks, to make sure all RCU processing has finished - * when we go idle: + * when we go idle. (Currently unused, needed?) 
*/ -void rcu_advance_callbacks(int cpu, int user) +void rcu_advance_callbacks_rt(int cpu, int user) { unsigned long oldirq; struct rcu_data *rdp = RCU_DATA_CPU(cpu); @@ -589,7 +583,7 @@ void rcu_advance_callbacks(int cpu, int spin_unlock_irqrestore(&rdp->lock, oldirq); } -static void rcu_process_callbacks(struct softirq_action *unused) +void rcu_process_callbacks_rt(struct softirq_action *unused) { unsigned long flags; struct rcu_head *next, *list; @@ -613,8 +607,8 @@ static void rcu_process_callbacks(struct } } -void fastcall call_rcu(struct rcu_head *head, - void (*func)(struct rcu_head *rcu)) +void fastcall call_rcu_preempt(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) { unsigned long oldirq; struct rcu_data *rdp; @@ -631,28 +625,7 @@ void fastcall call_rcu(struct rcu_head * spin_unlock(&rdp->lock); local_irq_restore(oldirq); } -EXPORT_SYMBOL_GPL(call_rcu); - -/* - * Wait until all currently running preempt_disable() code segments - * (including hardware-irq-disable segments) complete. Note that - * in -rt this does -not- necessarily result in all currently executing - * interrupt -handlers- having completed. - */ -void __synchronize_sched(void) -{ - cpumask_t oldmask; - int cpu; - - if (sched_getaffinity(0, &oldmask) < 0) - oldmask = cpu_possible_map; - for_each_online_cpu(cpu) { - sched_setaffinity(0, cpumask_of_cpu(cpu)); - schedule(); - } - sched_setaffinity(0, oldmask); -} -EXPORT_SYMBOL_GPL(__synchronize_sched); +EXPORT_SYMBOL_GPL(call_rcu_preempt); /* * Check to see if any future RCU-related work will need to be done @@ -663,7 +636,7 @@ EXPORT_SYMBOL_GPL(__synchronize_sched); * This function is part of the RCU implementation; it is -not- * an exported member of the RCU API. */ -int rcu_needs_cpu(int cpu) +int rcu_needs_cpu_rt(int cpu) { struct rcu_data *rdp = RCU_DATA_CPU(cpu); @@ -672,7 +645,7 @@ int rcu_needs_cpu(int cpu) rdp->nextlist != NULL); } -int rcu_pending(int cpu) +int rcu_pending_rt(int cpu) { struct rcu_data *rdp = RCU_DATA_CPU(cpu); @@ -699,7 +672,7 @@ int rcu_pending(int cpu) return 0; } -void __init __rcu_init(void) +void __init rcu_init_rt(void) { int cpu; int i; @@ -719,7 +692,6 @@ void __init __rcu_init(void) rdp->donelist = NULL; rdp->donetail = &rdp->donelist; } - open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL); } /* ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-new-5.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000016717�11041657735�014025� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Thu Sep 27 00:10:09 2007 Date: Mon, 10 Sep 2007 11:36:22 -0700 From: Paul E. 
McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU Work in progress, not for inclusion. This patch allows preemptible RCU to tolerate CPU-hotplug operations. It accomplishes this by maintaining a local copy of a map of online CPUs, which it accesses under its own lock. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 2 include/linux/rcupreempt.h | 2 kernel/rcuclassic.c | 8 +++ kernel/rcupreempt.c | 93 +++++++++++++++++++++++++++++++++++++++++++-- 4 files changed, 100 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -82,6 +82,8 @@ static inline void rcu_bh_qsctr_inc(int #define rcu_check_callbacks_rt(cpu, user) do { } while (0) #define rcu_init_rt() do { } while (0) #define rcu_needs_cpu_rt(cpu) 0 +#define rcu_offline_cpu_rt(cpu) +#define rcu_online_cpu_rt(cpu) #define rcu_pending_rt(cpu) 0 #define rcu_process_callbacks_rt(unused) do { } while (0) Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -59,6 +59,8 @@ extern void rcu_advance_callbacks_rt(int extern void rcu_check_callbacks_rt(int cpu, int user); extern void rcu_init_rt(void); extern int rcu_needs_cpu_rt(int cpu); +extern void rcu_offline_cpu_rt(int cpu); +extern void rcu_online_cpu_rt(int cpu); extern int rcu_pending_rt(int cpu); struct softirq_action; extern void rcu_process_callbacks_rt(struct softirq_action *unused); Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcuclassic.c +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -414,14 +414,19 @@ static void __rcu_offline_cpu(struct rcu static void rcu_offline_cpu(int cpu) { struct rcu_data *this_rdp = &get_cpu_var(rcu_data); +#ifdef CONFIG_CLASSIC_RCU struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data); +#endif /* #ifdef CONFIG_CLASSIC_RCU */ __rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &per_cpu(rcu_data, cpu)); +#ifdef CONFIG_CLASSIC_RCU __rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu)); - put_cpu_var(rcu_data); put_cpu_var(rcu_bh_data); +#endif /* #ifdef CONFIG_CLASSIC_RCU */ + put_cpu_var(rcu_data); + rcu_offline_cpu_rt(cpu); } #else @@ -571,6 +576,7 @@ static void __cpuinit rcu_online_cpu(int rdp->passed_quiesc = &per_cpu(rcu_data_passed_quiesc, cpu); rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp); bh_rdp->passed_quiesc = &per_cpu(rcu_data_bh_passed_quiesc, cpu); + rcu_online_cpu_rt(cpu); } static int __cpuinit rcu_cpu_notify(struct notifier_block *self, Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -125,6 +125,8 @@ enum rcu_mb_flag_values { }; static DEFINE_PER_CPU(enum rcu_mb_flag_values, rcu_mb_flag) = rcu_mb_done; +static cpumask_t 
rcu_cpu_online_map = CPU_MASK_NONE; + /* * Macro that prevents the compiler from reordering accesses, but does * absolutely -nothing- to prevent CPUs from reordering. This is used @@ -404,7 +406,7 @@ rcu_try_flip_idle(void) /* Now ask each CPU for acknowledgement of the flip. */ - for_each_possible_cpu(cpu) + for_each_cpu_mask(cpu, rcu_cpu_online_map) per_cpu(rcu_flip_flag, cpu) = rcu_flipped; return 1; @@ -420,7 +422,7 @@ rcu_try_flip_waitack(void) int cpu; RCU_TRACE_ME(rcupreempt_trace_try_flip_a1); - for_each_possible_cpu(cpu) + for_each_cpu_mask(cpu, rcu_cpu_online_map) if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) { RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1); return 0; @@ -462,7 +464,7 @@ rcu_try_flip_waitzero(void) /* Call for a memory barrier from each CPU. */ - for_each_possible_cpu(cpu) + for_each_cpu_mask(cpu, rcu_cpu_online_map) per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed; RCU_TRACE_ME(rcupreempt_trace_try_flip_z2); @@ -480,7 +482,7 @@ rcu_try_flip_waitmb(void) int cpu; RCU_TRACE_ME(rcupreempt_trace_try_flip_m1); - for_each_possible_cpu(cpu) + for_each_cpu_mask(cpu, rcu_cpu_online_map) if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) { RCU_TRACE_ME(rcupreempt_trace_try_flip_me1); return 0; @@ -583,6 +585,89 @@ void rcu_advance_callbacks_rt(int cpu, i spin_unlock_irqrestore(&rdp->lock, oldirq); } +#ifdef CONFIG_HOTPLUG_CPU + +#define rcu_offline_cpu_rt_enqueue(srclist, srctail, dstlist, dsttail) do { \ + *dsttail = srclist; \ + if (srclist != NULL) { \ + dsttail = srctail; \ + srclist = NULL; \ + srctail = &srclist;\ + } \ + } while (0) + + +void rcu_offline_cpu_rt(int cpu) +{ + int i; + struct rcu_head *list = NULL; + unsigned long oldirq; + struct rcu_data *rdp = RCU_DATA_CPU(cpu); + struct rcu_head **tail = &list; + + /* Remove all callbacks from the newly dead CPU, retaining order. */ + + spin_lock_irqsave(&rdp->lock, oldirq); + rcu_offline_cpu_rt_enqueue(rdp->donelist, rdp->donetail, list, tail); + for (i = GP_STAGES - 1; i >= 0; i--) + rcu_offline_cpu_rt_enqueue(rdp->waitlist[i], rdp->waittail[i], + list, tail); + rcu_offline_cpu_rt_enqueue(rdp->nextlist, rdp->nexttail, list, tail); + spin_unlock_irqrestore(&rdp->lock, oldirq); + rdp->waitlistcount = 0; + + /* Disengage the newly dead CPU from grace-period computation. */ + + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq); + rcu_check_mb(cpu); + if (per_cpu(rcu_flip_flag, cpu) == rcu_flipped) { + smp_mb(); /* Subsequent counter accesses must see new value */ + per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen; + smp_mb(); /* Subsequent RCU read-side critical sections */ + /* seen -after- acknowledgement. */ + } + cpu_clear(cpu, rcu_cpu_online_map); + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq); + + /* + * Place the removed callbacks on the current CPU's queue. + * Make them all start a new grace period: simple approach, + * in theory could starve a given set of callbacks, but + * you would need to be doing some serious CPU hotplugging + * to make this happen. If this becomes a problem, adding + * a synchronize_rcu() to the hotplug path would be a simple + * fix. 
+ */ + + rdp = RCU_DATA_ME(); + spin_lock_irqsave(&rdp->lock, oldirq); + *rdp->nexttail = list; + if (list) + rdp->nexttail = tail; + spin_unlock_irqrestore(&rdp->lock, oldirq); +} + +void __devinit rcu_online_cpu_rt(int cpu) +{ + unsigned long oldirq; + + spin_lock_irqsave(&rcu_ctrlblk.fliplock, oldirq); + cpu_set(cpu, rcu_cpu_online_map); + spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq); +} + +#else /* #ifdef CONFIG_HOTPLUG_CPU */ + +void rcu_offline_cpu(int cpu) +{ +} + +void __devinit rcu_online_cpu_rt(int cpu) +{ +} + +#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */ + void rcu_process_callbacks_rt(struct softirq_action *unused) { unsigned long flags; �������������������������������������������������patches/rcu-new-7.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000020254�11041657732�014013� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Thu Sep 27 15:32:09 2007 Date: Mon, 10 Sep 2007 11:39:46 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 7/9] RCU: rcutorture testing for RCU priority boosting Work in progress, not for inclusion. Still uses xtime because this patch is still against 2.6.22. This patch modifies rcutorture to also torture RCU priority boosting. The torturing involves forcing RCU read-side critical sections (already performed as part of the torturing of RCU) to run for extremely long time periods, increasing the probability of their being preempted and thus needing priority boosting. The fact that rcutorture's "nreaders" module parameter defaults to twice the number of CPUs helps ensure lots of the needed preemption. To cause the torturing to be fully effective in -mm, run in presence of CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- kernel/rcutorture.c | 91 ++++++++++++++++++++++++++++++++++++++-------- kernel/time/timekeeping.c | 2 + 2 files changed, 79 insertions(+), 14 deletions(-) Index: linux-2.6.24.7/kernel/rcutorture.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcutorture.c +++ linux-2.6.24.7/kernel/rcutorture.c @@ -57,6 +57,7 @@ static int stat_interval; /* Interval be static int verbose; /* Print more debug info. */ static int test_no_idle_hz; /* Test RCU's support for tickless idle CPUs. */ static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/ +static int preempt_torture; /* Realtime task preempts torture readers. */ static char *torture_type = "rcu"; /* What RCU implementation to torture. 
*/ module_param(nreaders, int, 0444); @@ -71,6 +72,8 @@ module_param(test_no_idle_hz, bool, 0444 MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs"); module_param(shuffle_interval, int, 0444); MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles"); +module_param(preempt_torture, bool, 0444); +MODULE_PARM_DESC(preempt_torture, "Enable realtime preemption torture"); module_param(torture_type, charp, 0444); MODULE_PARM_DESC(torture_type, "Type of RCU to torture (rcu, rcu_bh, srcu)"); @@ -191,6 +194,8 @@ struct rcu_torture_ops { int (*completed)(void); void (*deferredfree)(struct rcu_torture *p); void (*sync)(void); + long (*preemptstart)(void); + void (*preemptend)(void); int (*stats)(char *page); char *name; }; @@ -255,16 +260,75 @@ static void rcu_torture_deferred_free(st call_rcu(&p->rtort_rcu, rcu_torture_cb); } +static struct task_struct *rcu_preeempt_task; +static unsigned long rcu_torture_preempt_errors; + +static int rcu_torture_preempt(void *arg) +{ + int completedstart; + int err; + time_t gcstart; + struct sched_param sp; + + sp.sched_priority = MAX_RT_PRIO - 1; + err = sched_setscheduler(current, SCHED_RR, &sp); + if (err != 0) + printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n", + err); + current->flags |= PF_NOFREEZE; + + do { + completedstart = rcu_torture_completed(); + gcstart = xtime.tv_sec; + while ((xtime.tv_sec - gcstart < 10) && + (rcu_torture_completed() == completedstart)) + cond_resched(); + if (rcu_torture_completed() == completedstart) + rcu_torture_preempt_errors++; + schedule_timeout_interruptible(1); + } while (!kthread_should_stop()); + return 0; +} + +static long rcu_preempt_start(void) +{ + long retval = 0; + + rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL, + "rcu_torture_preempt"); + if (IS_ERR(rcu_preeempt_task)) { + VERBOSE_PRINTK_ERRSTRING("Failed to create preempter"); + retval = PTR_ERR(rcu_preeempt_task); + rcu_preeempt_task = NULL; + } + return retval; +} + +static void rcu_preempt_end(void) +{ + if (rcu_preeempt_task != NULL) { + VERBOSE_PRINTK_STRING("Stopping rcu_preempt task"); + kthread_stop(rcu_preeempt_task); + } + rcu_preeempt_task = NULL; +} + +static int rcu_preempt_stats(char *page) +{ + return sprintf(page, + "Preemption stalls: %lu\n", rcu_torture_preempt_errors); +} + static struct rcu_torture_ops rcu_ops = { - .init = NULL, - .cleanup = NULL, .readlock = rcu_torture_read_lock, .readdelay = rcu_read_delay, .readunlock = rcu_torture_read_unlock, .completed = rcu_torture_completed, .deferredfree = rcu_torture_deferred_free, .sync = synchronize_rcu, - .stats = NULL, + .preemptstart = rcu_preempt_start, + .preemptend = rcu_preempt_end, + .stats = rcu_preempt_stats, .name = "rcu" }; @@ -296,14 +360,12 @@ static void rcu_sync_torture_init(void) static struct rcu_torture_ops rcu_sync_ops = { .init = rcu_sync_torture_init, - .cleanup = NULL, .readlock = rcu_torture_read_lock, .readdelay = rcu_read_delay, .readunlock = rcu_torture_read_unlock, .completed = rcu_torture_completed, .deferredfree = rcu_sync_torture_deferred_free, .sync = synchronize_rcu, - .stats = NULL, .name = "rcu_sync" }; @@ -355,28 +417,23 @@ static void rcu_bh_torture_synchronize(v } static struct rcu_torture_ops rcu_bh_ops = { - .init = NULL, - .cleanup = NULL, .readlock = rcu_bh_torture_read_lock, .readdelay = rcu_read_delay, /* just reuse rcu's version. 
*/ .readunlock = rcu_bh_torture_read_unlock, .completed = rcu_bh_torture_completed, .deferredfree = rcu_bh_torture_deferred_free, .sync = rcu_bh_torture_synchronize, - .stats = NULL, .name = "rcu_bh" }; static struct rcu_torture_ops rcu_bh_sync_ops = { .init = rcu_sync_torture_init, - .cleanup = NULL, .readlock = rcu_bh_torture_read_lock, .readdelay = rcu_read_delay, /* just reuse rcu's version. */ .readunlock = rcu_bh_torture_read_unlock, .completed = rcu_bh_torture_completed, .deferredfree = rcu_sync_torture_deferred_free, .sync = rcu_bh_torture_synchronize, - .stats = NULL, .name = "rcu_bh_sync" }; @@ -488,14 +545,12 @@ static void sched_torture_synchronize(vo static struct rcu_torture_ops sched_ops = { .init = rcu_sync_torture_init, - .cleanup = NULL, .readlock = sched_torture_read_lock, .readdelay = rcu_read_delay, /* just reuse rcu's version. */ .readunlock = sched_torture_read_unlock, .completed = sched_torture_completed, .deferredfree = rcu_sync_torture_deferred_free, .sync = sched_torture_synchronize, - .stats = NULL, .name = "sched" }; @@ -787,9 +842,10 @@ rcu_torture_print_module_parms(char *tag printk(KERN_ALERT "%s" TORTURE_FLAG "--- %s: nreaders=%d nfakewriters=%d " "stat_interval=%d verbose=%d test_no_idle_hz=%d " - "shuffle_interval = %d\n", + "shuffle_interval=%d preempt_torture=%d\n", torture_type, tag, nrealreaders, nfakewriters, - stat_interval, verbose, test_no_idle_hz, shuffle_interval); + stat_interval, verbose, test_no_idle_hz, shuffle_interval, + preempt_torture); } static void @@ -842,6 +898,8 @@ rcu_torture_cleanup(void) kthread_stop(stats_task); } stats_task = NULL; + if (preempt_torture && (cur_ops->preemptend != NULL)) + cur_ops->preemptend(); /* Wait for all RCU callbacks to fire. */ rcu_barrier(); @@ -984,6 +1042,11 @@ rcu_torture_init(void) goto unwind; } } + if (preempt_torture && (cur_ops->preemptstart != NULL)) { + firsterr = cur_ops->preemptstart(); + if (firsterr != 0) + goto unwind; + } return 0; unwind: Index: linux-2.6.24.7/kernel/time/timekeeping.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/timekeeping.c +++ linux-2.6.24.7/kernel/time/timekeeping.c @@ -26,6 +26,7 @@ */ __cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); +EXPORT_SYMBOL_GPL(xtime_lock); /* * The current time @@ -45,6 +46,7 @@ __cacheline_aligned_in_smp DEFINE_SEQLOC struct timespec xtime __attribute__ ((aligned (16))); struct timespec wall_to_monotonic __attribute__ ((aligned (16))); static unsigned long total_sleep_time; /* seconds */ +EXPORT_SYMBOL_GPL(xtime); static struct timespec xtime_cache __attribute__ ((aligned (16))); static inline void update_xtime_cache(u64 nsec) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-new-9.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000050471�11041657732�014021� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 
paulmck@linux.vnet.ibm.com Thu Sep 27 15:33:20 2007 Date: Mon, 10 Sep 2007 11:42:40 -0700 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org Cc: linux-rt-users@vger.kernel.org, mingo@elte.hu, akpm@linux-foundation.org, dipankar@in.ibm.com, josht@linux.vnet.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, tglx@linutronix.de, a.p.zijlstra@chello.nl, bunk@kernel.org, ego@in.ibm.com, oleg@tv-sign.ru, srostedt@redhat.com Subject: [PATCH RFC 9/9] RCU: preemptible documentation and comment cleanups Work in progress, not for inclusion. This patch updates the RCU documentation to reflect preemptible RCU as well as recent publications. Fix an incorrect comment in the code. Change the name ORDERED_WRT_IRQ() to ACCESS_ONCE() to better describe its function. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- Documentation/RCU/RTFP.txt | 234 ++++++++++++++++++++++++++++++++++++++++-- Documentation/RCU/rcu.txt | 20 +++ Documentation/RCU/torture.txt | 44 ++++++- kernel/rcupreempt.c | 22 +-- 4 files changed, 290 insertions(+), 30 deletions(-) Index: linux-2.6.24.7/Documentation/RCU/RTFP.txt =================================================================== --- linux-2.6.24.7.orig/Documentation/RCU/RTFP.txt +++ linux-2.6.24.7/Documentation/RCU/RTFP.txt @@ -9,8 +9,8 @@ The first thing resembling RCU was publi [Kung80] recommended use of a garbage collector to defer destruction of nodes in a parallel binary search tree in order to simplify its implementation. This works well in environments that have garbage -collectors, but current production garbage collectors incur significant -read-side overhead. +collectors, but most production garbage collectors incur significant +overhead. In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring destruction until all threads running at that time have terminated, again @@ -99,16 +99,25 @@ locking, reduces contention, reduces mem parallelizes pipeline stalls and memory latency for writers. However, these techniques still impose significant read-side overhead in the form of memory barriers. Researchers at Sun worked along similar lines -in the same timeframe [HerlihyLM02,HerlihyLMS03]. These techniques -can be thought of as inside-out reference counts, where the count is -represented by the number of hazard pointers referencing a given data -structure (rather than the more conventional counter field within the -data structure itself). +in the same timeframe [HerlihyLM02]. These techniques can be thought +of as inside-out reference counts, where the count is represented by the +number of hazard pointers referencing a given data structure (rather than +the more conventional counter field within the data structure itself). + +By the same token, RCU can be thought of as a "bulk reference count", +where some form of reference counter covers all reference by a given CPU +or thread during a set timeframe. This timeframe is related to, but +not necessarily exactly the same as, an RCU grace period. In classic +RCU, the reference counter is the per-CPU bit in the "bitmask" field, +and each such bit covers all references that might have been made by +the corresponding CPU during the prior grace period. Of course, RCU +can be thought of in other terms as well. In 2003, the K42 group described how RCU could be used to create -hot-pluggable implementations of operating-system functions. 
Later that -year saw a paper describing an RCU implementation of System V IPC -[Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a]. +hot-pluggable implementations of operating-system functions [Appavoo03a]. +Later that year saw a paper describing an RCU implementation of System +V IPC [Arcangeli03], and an introduction to RCU in Linux Journal +[McKenney03a]. 2004 has seen a Linux-Journal article on use of RCU in dcache [McKenney04a], a performance comparison of locking to RCU on several @@ -117,10 +126,27 @@ number of operating-system kernels [Paul describing how to make RCU safe for soft-realtime applications [Sarma04c], and a paper describing SELinux performance with RCU [JamesMorris04b]. -2005 has seen further adaptation of RCU to realtime use, permitting +2005 brought further adaptation of RCU to realtime use, permitting preemption of RCU realtime critical sections [PaulMcKenney05a, PaulMcKenney05b]. +2006 saw the first best-paper award for an RCU paper [ThomasEHart2006a], +as well as further work on efficient implementations of preemptible +RCU [PaulEMcKenney2006b], but priority-boosting of RCU read-side critical +sections proved elusive. An RCU implementation permitting general +blocking in read-side critical sections appeared [PaulEMcKenney2006c], +Robert Olsson described an RCU-protected trie-hash combination +[RobertOlsson2006a]. + +In 2007, the RCU priority-boosting problem finally was solved +[PaulEMcKenney2007BoostRCU], and an RCU paper was first accepted into +an academic journal [ThomasEHart2007a]. An LWN article on the use of +Promela and spin to validate parallel algorithms [PaulEMcKenney2007QRCUspin] +also described Oleg Nesterov's QRCU, the first RCU implementation that +can boast deep sub-microsecond grace periods (in absence of readers, +and read-side overhead is roughly that of a global reference count). + + Bibtex Entries @article{Kung80 @@ -203,6 +229,41 @@ Bibtex Entries ,Address="New Orleans, LA" } +@conference{Pu95a, +Author = "Calton Pu and Tito Autrey and Andrew Black and Charles Consel and +Crispin Cowan and Jon Inouye and Lakshmi Kethana and Jonathan Walpole and +Ke Zhang", +Title = "Optimistic Incremental Specialization: Streamlining a Commercial +Operating System", +Booktitle = "15\textsuperscript{th} ACM Symposium on +Operating Systems Principles (SOSP'95)", +address = "Copper Mountain, CO", +month="December", +year="1995", +pages="314-321", +annotation=" + Uses a replugger, but with a flag to signal when people are + using the resource at hand. Only one reader at a time. +" +} + +@conference{Cowan96a, +Author = "Crispin Cowan and Tito Autrey and Charles Krasic and +Calton Pu and Jonathan Walpole", +Title = "Fast Concurrent Dynamic Linking for an Adaptive Operating System", +Booktitle = "International Conference on Configurable Distributed Systems +(ICCDS'96)", +address = "Annapolis, MD", +month="May", +year="1996", +pages="108", +isbn="0-8186-7395-8", +annotation=" + Uses a replugger, but with a counter to signal when people are + using the resource at hand. Allows multiple readers. +" +} + @techreport{Slingwine95 ,author="John D. Slingwine and Paul E. McKenney" ,title="Apparatus and Method for Achieving Reduced Overhead Mutual @@ -312,6 +373,49 @@ Andrea Arcangeli and Andi Kleen and Orra [Viewed June 23, 2004]" } +@conference{Michael02a +,author="Maged M. 
Michael" +,title="Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic +Reads and Writes" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 21\textsuperscript{st} Annual ACM +Symposium on Principles of Distributed Computing}" +,pages="21-30" +,annotation=" + Each thread keeps an array of pointers to items that it is + currently referencing. Sort of an inside-out garbage collection + mechanism, but one that requires the accessing code to explicitly + state its needs. Also requires read-side memory barriers on + most architectures. +" +} + +@conference{Michael02b +,author="Maged M. Michael" +,title="High Performance Dynamic Lock-Free Hash Tables and List-Based Sets" +,Year="2002" +,Month="August" +,booktitle="{Proceedings of the 14\textsuperscript{th} Annual ACM +Symposium on Parallel +Algorithms and Architecture}" +,pages="73-82" +,annotation=" + Like the title says... +" +} + +@InProceedings{HerlihyLM02 +,author={Maurice Herlihy and Victor Luchangco and Mark Moir} +,title="The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized, +Lock-Free Data Structures" +,booktitle={Proceedings of 16\textsuperscript{th} International +Symposium on Distributed Computing} +,year=2002 +,month="October" +,pages="339-353" +} + @article{Appavoo03a ,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and @@ -447,3 +551,111 @@ Oregon Health and Sciences University" Realtime turns into making RCU yet more realtime friendly. " } + +@conference{ThomasEHart2006a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown" +,Title="Making Lockless Synchronization Fast: Performance Implications +of Memory Reclamation" +,Booktitle="20\textsuperscript{th} {IEEE} International Parallel and +Distributed Processing Symposium" +,month="April" +,year="2006" +,day="25-29" +,address="Rhodes, Greece" +,annotation=" + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. +" +} + +@Conference{PaulEMcKenney2006b +,Author="Paul E. McKenney and Dipankar Sarma and Ingo Molnar and +Suparna Bhattacharya" +,Title="Extending RCU for Realtime and Embedded Workloads" +,Booktitle="{Ottawa Linux Symposium}" +,Month="July" +,Year="2006" +,pages="v2 123-138" +,note="Available: +\url{http://www.linuxsymposium.org/2006/view_abstract.php?content_key=184} +\url{http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf} +[Viewed January 1, 2007]" +,annotation=" + Described how to improve the -rt implementation of realtime RCU. +" +} + +@unpublished{PaulEMcKenney2006c +,Author="Paul E. McKenney" +,Title="Sleepable {RCU}" +,month="October" +,day="9" +,year="2006" +,note="Available: +\url{http://lwn.net/Articles/202847/} +Revised: +\url{http://www.rdrop.com/users/paulmck/RCU/srcu.2007.01.14a.pdf} +[Viewed August 21, 2006]" +,annotation=" + LWN article introducing SRCU. +" +} + +@unpublished{RobertOlsson2006a +,Author="Robert Olsson and Stefan Nilsson" +,Title="{TRASH}: A dynamic {LC}-trie and hash data structure" +,month="August" +,day="18" +,year="2006" +,note="Available: +\url{http://www.nada.kth.se/~snilsson/public/papers/trash/trash.pdf} +[Viewed February 24, 2007]" +,annotation=" + RCU-protected dynamic trie-hash combination. +" +} + +@unpublished{PaulEMcKenney2007BoostRCU +,Author="Paul E. 
McKenney" +,Title="Priority-Boosting {RCU} Read-Side Critical Sections" +,month="February" +,day="5" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/220677/} +Revised: +\url{http://www.rdrop.com/users/paulmck/RCU/RCUbooststate.2007.04.16a.pdf} +[Viewed September 7, 2007]" +,annotation=" + LWN article introducing RCU priority boosting. +" +} + +@unpublished{ThomasEHart2007a +,Author="Thomas E. Hart and Paul E. McKenney and Angela Demke Brown and Jonathan Walpole" +,Title="Performance of memory reclamation for lockless synchronization" +,journal="J. Parallel Distrib. Comput." +,year="2007" +,note="To appear in J. Parallel Distrib. Comput. + \url{doi=10.1016/j.jpdc.2007.04.010}" +,annotation={ + Compares QSBR (AKA "classic RCU"), HPBR, EBR, and lock-free + reference counting. Journal version of ThomasEHart2006a. +} +} + +@unpublished{PaulEMcKenney2007QRCUspin +,Author="Paul E. McKenney" +,Title="Using Promela and Spin to verify parallel algorithms" +,month="August" +,day="1" +,year="2007" +,note="Available: +\url{http://lwn.net/Articles/243851/} +[Viewed September 8, 2007]" +,annotation=" + LWN article describing Promela and spin, and also using Oleg + Nesterov's QRCU as an example (with Paul McKenney's fastpath). +" +} + Index: linux-2.6.24.7/Documentation/RCU/rcu.txt =================================================================== --- linux-2.6.24.7.orig/Documentation/RCU/rcu.txt +++ linux-2.6.24.7/Documentation/RCU/rcu.txt @@ -36,6 +36,14 @@ o How can the updater tell when a grace executed in user mode, or executed in the idle loop, we can safely free up that item. + Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + same effect, but require that the readers manipulate CPU-local + counters. These counters allow limited types of blocking + within RCU read-side critical sections. SRCU also uses + CPU-local counters, and permits general blocking within + RCU read-side critical sections. These two variants of + RCU detect grace periods by sampling these counters. + o If I am running on a uniprocessor kernel, which can only do one thing at a time, why should I wait for a grace period? @@ -46,7 +54,10 @@ o How can I see where RCU is currently u Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", - "synchronize_net", and "synchronize_srcu". + "synchronize_net", "synchronize_srcu", and the other RCU + primitives. Or grab one of the cscope databases from: + + http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html o What guidelines should I follow when writing code that uses RCU? @@ -67,7 +78,12 @@ o I hear that RCU is patented? What is o I hear that RCU needs work in order to support realtime kernels? - Yes, work in progress. + This work is largely completed. Realtime-friendly RCU can be + enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. + In addition, the CONFIG_PREEMPT_RCU_BOOST kernel configuration + parameter enables priority boosting of preempted RCU read-side + critical sections, though this is only needed if you have + CPU-bound realtime threads. o Where can I find more information on RCU? 
Index: linux-2.6.24.7/Documentation/RCU/torture.txt =================================================================== --- linux-2.6.24.7.orig/Documentation/RCU/torture.txt +++ linux-2.6.24.7/Documentation/RCU/torture.txt @@ -37,6 +37,24 @@ nfakewriters This is the number of RCU f to trigger special cases caused by multiple writers, such as the synchronize_srcu() early return optimization. +preempt_torture Specifies that torturing of preemptible RCU is to be + undertaken, defaults to no such testing. This test + creates a kernel thread that runs at the lowest possible + realtime priority, alternating between ten seconds + of spinning and a short sleep period. The goal is + to preempt lower-priority RCU readers. Note that this + currently does not fail the full test, but instead simply + counts the number of times that a ten-second CPU burst + coincides with a stall in grace-period detection. + + Of course, if the grace period advances during a CPU burst, + that indicates that no RCU reader was preempted, so the + burst ends early in that case. + + Note that such stalls are expected behavior in preemptible + RCU implementations when RCU priority boosting is not + enabled (PREEMPT_RCU_BOOST=n). + stat_interval The number of seconds between output of torture statistics (via printk()). Regardless of the interval, statistics are printed when the module is unloaded. @@ -46,12 +64,13 @@ stat_interval The number of seconds betw shuffle_interval The number of seconds to keep the test threads affinitied - to a particular subset of the CPUs. Used in conjunction - with test_no_idle_hz. + to a particular subset of the CPUs, defaults to 5 seconds. + Used in conjunction with test_no_idle_hz. test_no_idle_hz Whether or not to test the ability of RCU to operate in a kernel that disables the scheduling-clock interrupt to idle CPUs. Boolean parameter, "1" to test, "0" otherwise. + Defaults to omitting this test. torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, "rcu_sync" for rcu_read_lock() with synchronous reclamation, @@ -82,8 +101,6 @@ be evident. ;-) The entries are as follows: -o "ggp": The number of counter flips (or batches) since boot. - o "rtc": The hexadecimal address of the structure currently visible to readers. @@ -117,8 +134,8 @@ o "Reader Pipe": Histogram of "ages" of o "Reader Batch": Another histogram of "ages" of structures seen by readers, but in terms of counter flips (or batches) rather than in terms of grace periods. The legal number of non-zero - entries is again two. The reason for this separate view is - that it is easier to get the third entry to show up in the + entries is again two. The reason for this separate view is that + it is sometimes easier to get the third entry to show up in the "Reader Batch" list than in the "Reader Pipe" list. o "Free-Block Circulation": Shows the number of torture structures @@ -145,6 +162,21 @@ of the "old" and "current" counters for "idx" value maps the "old" and "current" values to the underlying array, and is useful for debugging. 
+In addition, preemptible RCU rcutorture runs will report preemption +stalls: + +rcu-torture: rtc: ffffffff88005a40 ver: 17041 tfle: 1 rta: 17041 rtaf: 7904 rtf: 16941 rtmbe: 0 +rcu-torture: Reader Pipe: 975332139 34406 0 0 0 0 0 0 0 0 0 +rcu-torture: Reader Batch: 975349310 17234 0 0 0 0 0 0 0 0 0 +rcu-torture: Free-Block Circulation: 17040 17030 17028 17022 17009 16994 16982 16969 16955 16941 0 +Preemption stalls: 0 + +The first four lines are as before, and the last line records the number +of times that grace-period processing stalled during a realtime CPU burst. +Note that a non-zero value does not -prove- that RCU priority boosting is +broken, because there are other things that can stall RCU grace-period +processing. Here is hoping that someone comes up with a better test! + USAGE Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -133,7 +133,7 @@ static cpumask_t rcu_cpu_online_map = CP * only to mediate communication between mainline code and hardware * interrupt and NMI handlers. */ -#define ORDERED_WRT_IRQ(x) (*(volatile typeof(x) *)&(x)) +#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) /* * RCU_DATA_ME: find the current CPU's rcu_data structure. @@ -186,7 +186,7 @@ void __rcu_read_lock(void) struct task_struct *me = current; int nesting; - nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting); + nesting = ACCESS_ONCE(me->rcu_read_lock_nesting); if (nesting != 0) { /* An earlier rcu_read_lock() covers us, just count it. */ @@ -211,9 +211,9 @@ void __rcu_read_lock(void) * casts to prevent the compiler from reordering. */ - idx = ORDERED_WRT_IRQ(rcu_ctrlblk.completed) & 0x1; + idx = ACCESS_ONCE(rcu_ctrlblk.completed) & 0x1; smp_read_barrier_depends(); /* @@@@ might be unneeded */ - ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])++; + ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])++; /* * Now that the per-CPU counter has been incremented, we @@ -223,7 +223,7 @@ void __rcu_read_lock(void) * of the need to increment the per-CPU counter. */ - ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting + 1; + ACCESS_ONCE(me->rcu_read_lock_nesting) = nesting + 1; /* * Now that we have preventing any NMIs from storing @@ -232,7 +232,7 @@ void __rcu_read_lock(void) * rcu_read_unlock(). */ - ORDERED_WRT_IRQ(me->rcu_flipctr_idx) = idx; + ACCESS_ONCE(me->rcu_flipctr_idx) = idx; local_irq_restore(oldirq); } } @@ -244,7 +244,7 @@ void __rcu_read_unlock(void) struct task_struct *me = current; int nesting; - nesting = ORDERED_WRT_IRQ(me->rcu_read_lock_nesting); + nesting = ACCESS_ONCE(me->rcu_read_lock_nesting); if (nesting > 1) { /* @@ -284,7 +284,7 @@ void __rcu_read_unlock(void) * DEC Alpha. */ - idx = ORDERED_WRT_IRQ(me->rcu_flipctr_idx); + idx = ACCESS_ONCE(me->rcu_flipctr_idx); smp_read_barrier_depends(); /* @@@ Needed??? */ /* @@ -293,7 +293,7 @@ void __rcu_read_unlock(void) * After this, any interrupts or NMIs will increment and * decrement the per-CPU counters. */ - ORDERED_WRT_IRQ(me->rcu_read_lock_nesting) = nesting - 1; + ACCESS_ONCE(me->rcu_read_lock_nesting) = nesting - 1; /* * It is now safe to decrement this task's nesting count. @@ -304,7 +304,7 @@ void __rcu_read_unlock(void) * but that is OK, since we have already fetched it. 
*/ - ORDERED_WRT_IRQ(__get_cpu_var(rcu_flipctr)[idx])--; + ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])--; local_irq_restore(oldirq); } } @@ -496,7 +496,7 @@ rcu_try_flip_waitmb(void) /* * Attempt a single flip of the counters. Remember, a single flip does * -not- constitute a grace period. Instead, the interval between - * at least three consecutive flips is a grace period. + * at least GP_STAGES+2 consecutive flips is a grace period. * * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation * on a large SMP, they might want to use a hierarchical organization of �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-new-10.patch����������������������������������������������������������������������������0000664�0000764�0000764�00000010016�11041657732�014060� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- --- kernel/rcupreempt.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 84 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -406,8 +406,47 @@ rcu_try_flip_idle(void) /* Now ask each CPU for acknowledgement of the flip. */ - for_each_cpu_mask(cpu, rcu_cpu_online_map) + for_each_cpu_mask(cpu, rcu_cpu_online_map) { per_cpu(rcu_flip_flag, cpu) = rcu_flipped; + per_cpu(rcu_dyntick_snapshot, cpu) = + per_cpu(dynticks_progress_counter, cpu); + } + + return 1; +} + +static inline int +rcu_try_flip_waitack_needed(int cpu) +{ + long curr; + long snap; + + curr = per_cpu(dynticks_progress_counter, cpu); + snap = per_cpu(rcu_dyntick_snapshot, cpu); + smp_mb(); /* force ordering with cpu entering/leaving dynticks. */ + + /* + * If the CPU remained in dynticks mode for the entire time + * and didn't take any interrupts, NMIs, SMIs, or whatever, + * then it cannot be in the middle of an rcu_read_lock(), so + * the next rcu_read_lock() it executes must use the new value + * of the counter. So we can safely pretend that this CPU + * already acknowledged the counter. + */ + + if ((curr == snap) && ((curr & 0x1) == 0)) + return 0; + + /* + * If the CPU passed through or entered a dynticks idle phase with + * no active irq handlers, then, as above, we can safely pretend + * that this CPU already acknowledged the counter. + */ + + if ((curr - snap) > 2 || (snap & 0x1) == 0) + return 0; + + /* We need this CPU to explicitly acknowledge the counter flip. */ return 1; } @@ -423,7 +462,8 @@ rcu_try_flip_waitack(void) RCU_TRACE_ME(rcupreempt_trace_try_flip_a1); for_each_cpu_mask(cpu, rcu_cpu_online_map) - if (per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) { + if (rcu_try_flip_waitack_needed(cpu) && + per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) { RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1); return 0; } @@ -464,13 +504,50 @@ rcu_try_flip_waitzero(void) /* Call for a memory barrier from each CPU. 
*/ - for_each_cpu_mask(cpu, rcu_cpu_online_map) + for_each_cpu_mask(cpu, rcu_cpu_online_map) { per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed; + per_cpu(rcu_dyntick_snapshot, cpu) = + per_cpu(dynticks_progress_counter, cpu); + } RCU_TRACE_ME(rcupreempt_trace_try_flip_z2); return 1; } +static inline int +rcu_try_flip_waitmb_needed(int cpu) +{ + long curr; + long snap; + + curr = per_cpu(dynticks_progress_counter, cpu); + snap = per_cpu(rcu_dyntick_snapshot, cpu); + smp_mb(); /* force ordering with cpu entering/leaving dynticks. */ + + /* + * If the CPU remained in dynticks mode for the entire time + * and didn't take any interrupts, NMIs, SMIs, or whatever, + * then it cannot have executed an RCU read-side critical section + * during that time, so there is no need for it to execute a + * memory barrier. + */ + + if ((curr == snap) && ((curr & 0x1) == 0)) + return 0; + + /* + * If the CPU either entered or exited an outermost interrupt, + * SMI, NMI, or whatever handler, then we know that it executed + * a memory barrier when doing so. So we don't need another one. + */ + if (curr != snap) + return 0; + + /* We need the CPU to execute a memory barrier. */ + + return 1; +} + /* * Wait for all CPUs to do their end-of-grace-period memory barrier. * Return 0 once all CPUs have done so. @@ -483,7 +560,8 @@ rcu_try_flip_waitmb(void) RCU_TRACE_ME(rcupreempt_trace_try_flip_m1); for_each_cpu_mask(cpu, rcu_cpu_online_map) - if (per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) { + if (rcu_try_flip_waitmb_needed(cpu) && + per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) { RCU_TRACE_ME(rcupreempt_trace_try_flip_me1); return 0; } @@ -779,6 +857,8 @@ void __init rcu_init_rt(void) } } +static DEFINE_PER_CPU(long, rcu_dyntick_snapshot); + /* * Deprecated, use synchronize_rcu() or synchronize_sched() instead. */ ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-fix-rcu-preempt.patch�������������������������������������������������������������������0000664�0000764�0000764�00000024106�11041657731�016104� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcupreempt.c | 271 ++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 203 insertions(+), 68 deletions(-) Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -23,6 +23,10 @@ * to Suparna Bhattacharya for pushing me completely away * from atomic instructions on the read side. * + * - Added handling of Dynamic Ticks + * Copyright 2007 - Paul E. 
Mckenney <paulmck@us.ibm.com> + * - Steven Rostedt <srostedt@redhat.com> + * * Papers: http://www.rdrop.com/users/paulmck/RCU * * For detailed explanation of Read-Copy Update mechanism see - @@ -368,51 +372,131 @@ static void __rcu_advance_callbacks(stru } } -/* - * Get here when RCU is idle. Decide whether we need to - * move out of idle state, and return non-zero if so. - * "Straightforward" approach for the moment, might later - * use callback-list lengths, grace-period duration, or - * some such to determine when to exit idle state. - * Might also need a pre-idle test that does not acquire - * the lock, but let's get the simple case working first... - */ +#ifdef CONFIG_NO_HZ -static int -rcu_try_flip_idle(void) +DEFINE_PER_CPU(long, dynticks_progress_counter) = 1; +static DEFINE_PER_CPU(long, rcu_dyntick_snapshot); +static DEFINE_PER_CPU(int, rcu_update_flag); + +/** + * rcu_irq_enter - Called from Hard irq handlers and NMI/SMI. + * + * If the CPU was idle with dynamic ticks active, this updates the + * dynticks_progress_counter to let the RCU handling know that the + * CPU is active. + */ +void rcu_irq_enter(void) { - int cpu; + int cpu = smp_processor_id(); - RCU_TRACE_ME(rcupreempt_trace_try_flip_i1); - if (!rcu_pending(smp_processor_id())) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1); - return 0; - } + if (per_cpu(rcu_update_flag, cpu)) + per_cpu(rcu_update_flag, cpu)++; /* - * Do the flip. + * Only update if we are coming from a stopped ticks mode + * (dynticks_progress_counter is even). */ + if (!in_interrupt() && (per_cpu(dynticks_progress_counter, cpu) & 0x1) == 0) { + /* + * The following might seem like we could have a race + * with NMI/SMIs. But this really isn't a problem. + * Here we do a read/modify/write, and the race happens + * when an NMI/SMI comes in after the read and before + * the write. But NMI/SMIs will increment this counter + * twice before returning, so the zero bit will not + * be corrupted by the NMI/SMI which is the most important + * part. + * + * The only thing is that we would bring back the counter + * to a postion that it was in during the NMI/SMI. + * But the zero bit would be set, so the rest of the + * counter would again be ignored. + * + * On return from the IRQ, the counter may have the zero + * bit be 0 and the counter the same as the return from + * the NMI/SMI. If the state machine was so unlucky to + * see that, it still doesn't matter, since all + * RCU read-side critical sections on this CPU would + * have already completed. + */ + per_cpu(dynticks_progress_counter, cpu)++; + /* + * The following memory barrier ensures that any + * rcu_read_lock() primitives in the irq handler + * are seen by other CPUs to follow the above + * increment to dynticks_progress_counter. This is + * required in order for other CPUs to correctly + * determine when it is safe to advance the RCU + * grace-period state machine. + */ + smp_mb(); /* see above block comment. */ + /* + * Since we can't determine the dynamic tick mode from + * the dynticks_progress_counter after this routine, + * we use a second flag to acknowledge that we came + * from an idle state with ticks stopped. + */ + per_cpu(rcu_update_flag, cpu)++; + /* + * If we take an NMI/SMI now, they will also increment + * the rcu_update_flag, and will not update the + * dynticks_progress_counter on exit. That is for + * this IRQ to do. 
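The invariant behind the comment above is the parity of dynticks_progress_counter: even means the CPU is in dyntick-idle with ticks stopped, odd means it is active or inside a handler. The following hypothetical userspace model covers just the counter arithmetic (function names invented for the example, this is not the kernel code): an NMI/SMI adds an even number of increments, so even when the interrupted read/modify/write discards them, the write-back still leaves the low bit set and the CPU is no longer treated as idle.

#include <assert.h>
#include <stdio.h>

static long counter;			/* models per-CPU dynticks_progress_counter, CPU idle */

static void nmi(void)
{
	/* An NMI/SMI increments on entry and again on exit, so whatever
	 * parity the counter had before the NMI, it has afterwards too. */
	counter++;
	counter++;
}

static void irq_enter_from_idle(void)
{
	long snap = counter;		/* the racy read */

	nmi();				/* NMI lands between read and write */
	counter = snap + 1;		/* the racy write-back */
}

int main(void)
{
	assert((counter & 0x1) == 0);	/* idle: even */
	irq_enter_from_idle();
	/* The write-back discarded the NMI's increments, but the low bit is
	 * set either way, which is all the grace-period state machine looks
	 * at: this CPU can no longer be skipped as idle. */
	assert((counter & 0x1) == 1);
	printf("counter = %ld (odd => CPU counted as active)\n", counter);
	return 0;
}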
+ */ + } +} - RCU_TRACE_ME(rcupreempt_trace_try_flip_g1); - rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */ +/** + * rcu_irq_exit - Called from exiting Hard irq context. + * + * If the CPU was idle with dynamic ticks active, update the + * dynticks_progress_counter to put let the RCU handling be + * aware that the CPU is going back to idle with no ticks. + */ +void rcu_irq_exit(void) +{ + int cpu = smp_processor_id(); /* - * Need a memory barrier so that other CPUs see the new - * counter value before they see the subsequent change of all - * the rcu_flip_flag instances to rcu_flipped. + * rcu_update_flag is set if we interrupted the CPU + * when it was idle with ticks stopped. + * Once this occurs, we keep track of interrupt nesting + * because a NMI/SMI could also come in, and we still + * only want the IRQ that started the increment of the + * dynticks_progress_counter to be the one that modifies + * it on exit. */ + if (per_cpu(rcu_update_flag, cpu)) { + if (--per_cpu(rcu_update_flag, cpu)) + return; - smp_mb(); /* see above block comment. */ + /* This must match the interrupt nesting */ + WARN_ON(in_interrupt()); - /* Now ask each CPU for acknowledgement of the flip. */ + /* + * If an NMI/SMI happens now we are still + * protected by the dynticks_progress_counter being odd. + */ - for_each_cpu_mask(cpu, rcu_cpu_online_map) { - per_cpu(rcu_flip_flag, cpu) = rcu_flipped; - per_cpu(rcu_dyntick_snapshot, cpu) = - per_cpu(dynticks_progress_counter, cpu); + /* + * The following memory barrier ensures that any + * rcu_read_unlock() primitives in the irq handler + * are seen by other CPUs to preceed the following + * increment to dynticks_progress_counter. This + * is required in order for other CPUs to determine + * when it is safe to advance the RCU grace-period + * state machine. + */ + smp_mb(); /* see above block comment. */ + per_cpu(dynticks_progress_counter, cpu)++; + WARN_ON(per_cpu(dynticks_progress_counter, cpu) & 0x1); } +} - return 1; +static void dyntick_save_progress_counter(int cpu) +{ + per_cpu(rcu_dyntick_snapshot, cpu) = + per_cpu(dynticks_progress_counter, cpu); } static inline int @@ -451,6 +535,94 @@ rcu_try_flip_waitack_needed(int cpu) return 1; } +static inline int +rcu_try_flip_waitmb_needed(int cpu) +{ + long curr; + long snap; + + curr = per_cpu(dynticks_progress_counter, cpu); + snap = per_cpu(rcu_dyntick_snapshot, cpu); + smp_mb(); /* force ordering with cpu entering/leaving dynticks. */ + + /* + * If the CPU remained in dynticks mode for the entire time + * and didn't take any interrupts, NMIs, SMIs, or whatever, + * then it cannot have executed an RCU read-side critical section + * during that time, so there is no need for it to execute a + * memory barrier. + */ + + if ((curr == snap) && ((curr & 0x1) == 0)) + return 0; + + /* + * If the CPU either entered or exited an outermost interrupt, + * SMI, NMI, or whatever handler, then we know that it executed + * a memory barrier when doing so. So we don't need another one. + */ + if (curr != snap) + return 0; + + /* We need the CPU to execute a memory barrier. */ + + return 1; +} + +#else /* !CONFIG_NO_HZ */ + +# define dyntick_save_progress_counter(cpu) do { } while (0) +# define rcu_try_flip_waitack_needed(cpu) (1) +# define rcu_try_flip_waitmb_needed(cpu) (1) + +#endif /* CONFIG_NO_HZ */ + +/* + * Get here when RCU is idle. Decide whether we need to + * move out of idle state, and return non-zero if so. 
+ * "Straightforward" approach for the moment, might later + * use callback-list lengths, grace-period duration, or + * some such to determine when to exit idle state. + * Might also need a pre-idle test that does not acquire + * the lock, but let's get the simple case working first... + */ + +static int +rcu_try_flip_idle(void) +{ + int cpu; + + RCU_TRACE_ME(rcupreempt_trace_try_flip_i1); + if (!rcu_pending(smp_processor_id())) { + RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1); + return 0; + } + + /* + * Do the flip. + */ + + RCU_TRACE_ME(rcupreempt_trace_try_flip_g1); + rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */ + + /* + * Need a memory barrier so that other CPUs see the new + * counter value before they see the subsequent change of all + * the rcu_flip_flag instances to rcu_flipped. + */ + + smp_mb(); /* see above block comment. */ + + /* Now ask each CPU for acknowledgement of the flip. */ + + for_each_cpu_mask(cpu, rcu_cpu_online_map) { + per_cpu(rcu_flip_flag, cpu) = rcu_flipped; + dyntick_save_progress_counter(cpu); + } + + return 1; +} + /* * Wait for CPUs to acknowledge the flip. */ @@ -506,48 +678,13 @@ rcu_try_flip_waitzero(void) for_each_cpu_mask(cpu, rcu_cpu_online_map) { per_cpu(rcu_mb_flag, cpu) = rcu_mb_needed; - per_cpu(rcu_dyntick_snapshot, cpu) = - per_cpu(dynticks_progress_counter, cpu); + dyntick_save_progress_counter(cpu); } RCU_TRACE_ME(rcupreempt_trace_try_flip_z2); return 1; } -static inline int -rcu_try_flip_waitmb_needed(int cpu) -{ - long curr; - long snap; - - curr = per_cpu(dynticks_progress_counter, cpu); - snap = per_cpu(rcu_dyntick_snapshot, cpu); - smp_mb(); /* force ordering with cpu entering/leaving dynticks. */ - - /* - * If the CPU remained in dynticks mode for the entire time - * and didn't take any interrupts, NMIs, SMIs, or whatever, - * then it cannot have executed an RCU read-side critical section - * during that time, so there is no need for it to execute a - * memory barrier. - */ - - if ((curr == snap) && ((curr & 0x1) == 0)) - return 0; - - /* - * If the CPU either entered or exited an outermost interrupt, - * SMI, NMI, or whatever handler, then we know that it executed - * a memory barrier when doing so. So we don't need another one. - */ - if (curr != snap) - return 0; - - /* We need the CPU to execute a memory barrier. */ - - return 1; -} - /* * Wait for all CPUs to do their end-of-grace-period memory barrier. * Return 0 once all CPUs have done so. @@ -857,8 +994,6 @@ void __init rcu_init_rt(void) } } -static DEFINE_PER_CPU(long, rcu_dyntick_snapshot); - /* * Deprecated, use synchronize_rcu() or synchronize_sched() instead. 
*/ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-dynticks-update.patch�������������������������������������������������������������������0000664�0000764�0000764�00000007436�11041657731�016174� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/hardirq.h | 10 ++++++++++ include/linux/rcuclassic.h | 2 ++ include/linux/rcupreempt.h | 22 ++++++++++++++++++++++ kernel/softirq.c | 1 + kernel/time/tick-sched.c | 3 +++ 5 files changed, 38 insertions(+) Index: linux-2.6.24.7/include/linux/hardirq.h =================================================================== --- linux-2.6.24.7.orig/include/linux/hardirq.h +++ linux-2.6.24.7/include/linux/hardirq.h @@ -113,6 +113,14 @@ static inline void account_system_vtime( } #endif +#if defined(CONFIG_PREEMPT_RCU) && defined(CONFIG_NO_HZ) +extern void rcu_irq_enter(void); +extern void rcu_irq_exit(void); +#else +# define rcu_irq_enter() do { } while (0) +# define rcu_irq_exit() do { } while (0) +#endif /* CONFIG_PREEMPT_RCU */ + /* * It is safe to do non-atomic ops on ->hardirq_context, * because NMI handlers may not preempt and the ops are @@ -121,6 +129,7 @@ static inline void account_system_vtime( */ #define __irq_enter() \ do { \ + rcu_irq_enter(); \ account_system_vtime(current); \ add_preempt_count(HARDIRQ_OFFSET); \ trace_hardirq_enter(); \ @@ -139,6 +148,7 @@ extern void irq_enter(void); trace_hardirq_exit(); \ account_system_vtime(current); \ sub_preempt_count(HARDIRQ_OFFSET); \ + rcu_irq_exit(); \ } while (0) /* Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -86,6 +86,8 @@ static inline void rcu_bh_qsctr_inc(int #define rcu_online_cpu_rt(cpu) #define rcu_pending_rt(cpu) 0 #define rcu_process_callbacks_rt(unused) do { } while (0) +#define rcu_enter_nohz() do { } while (0) +#define rcu_exit_nohz() do { } while (0) extern void FASTCALL(call_rcu_classic(struct rcu_head *head, void (*func)(struct rcu_head *head))); Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -77,5 +77,27 @@ extern struct rcupreempt_trace *rcupreem struct softirq_action; +#ifdef CONFIG_NO_HZ +DECLARE_PER_CPU(long, dynticks_progress_counter); + +static inline void rcu_enter_nohz(void) +{ + __get_cpu_var(dynticks_progress_counter)++; + WARN_ON(__get_cpu_var(dynticks_progress_counter) & 0x1); + mb(); +} + +static inline void rcu_exit_nohz(void) +{ + mb(); + __get_cpu_var(dynticks_progress_counter)++; + WARN_ON(!(__get_cpu_var(dynticks_progress_counter) & 0x1)); +} + +#else /* CONFIG_NO_HZ */ +#define rcu_enter_nohz() do { } while (0) 
+#define rcu_exit_nohz() do { } while (0) +#endif /* CONFIG_NO_HZ */ + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPREEMPT_H */ Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -306,6 +306,7 @@ void irq_exit(void) /* Make sure that timer wheel updates are propagated */ if (!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched()) tick_nohz_stop_sched_tick(); + rcu_irq_exit(); #endif preempt_enable_no_resched(); } Index: linux-2.6.24.7/kernel/time/tick-sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-sched.c +++ linux-2.6.24.7/kernel/time/tick-sched.c @@ -249,6 +249,7 @@ void tick_nohz_stop_sched_tick(void) ts->idle_tick = ts->sched_timer.expires; ts->tick_stopped = 1; ts->idle_jiffies = last_jiffies; + rcu_enter_nohz(); } /* @@ -337,6 +338,8 @@ void tick_nohz_restart_sched_tick(void) if (!ts->tick_stopped) return; + rcu_exit_nohz(); + /* Update jiffies first */ now = ktime_get(); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-hrt-fixups.patch������������������������������������������������������������������������0000664�0000764�0000764�00000005426�11041657735�015176� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� include/linux/rcuclassic.h | 3 +++ include/linux/rcupreempt.h | 2 ++ kernel/rcuclassic.c | 19 ++++++++++++++++--- 3 files changed, 21 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -92,5 +92,8 @@ static inline void rcu_bh_qsctr_inc(int extern void FASTCALL(call_rcu_classic(struct rcu_head *head, void (*func)(struct rcu_head *head))); +struct softirq_action; +extern void rcu_process_callbacks(struct softirq_action *unused); + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUCLASSIC_H */ Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -99,5 +99,7 @@ static inline void rcu_exit_nohz(void) #define rcu_exit_nohz() do { } while (0) #endif /* CONFIG_NO_HZ */ +extern void rcu_process_callbacks(struct softirq_action *unused); + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPREEMPT_H */ Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcuclassic.c +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -443,6 +443,8 @@ static void rcu_offline_cpu(int cpu) static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp, struct rcu_data *rdp) { + unsigned long flags; + if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) { *rdp->donetail = rdp->curlist; rdp->donetail = rdp->curtail; @@ -451,12 +453,12 @@ static void __rcu_process_callbacks(stru } 
if (rdp->nxtlist && !rdp->curlist) { - local_irq_disable(); + local_irq_save(flags); rdp->curlist = rdp->nxtlist; rdp->curtail = rdp->nxttail; rdp->nxtlist = NULL; rdp->nxttail = &rdp->nxtlist; - local_irq_enable(); + local_irq_restore(flags); /* * start the next batch of callbacks @@ -483,7 +485,7 @@ static void __rcu_process_callbacks(stru rcu_do_batch(rdp); } -static void rcu_process_callbacks(struct softirq_action *unused) +void rcu_process_callbacks(struct softirq_action *unused) { __rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data)); __rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data)); @@ -541,6 +543,17 @@ int rcu_needs_cpu(int cpu) rcu_needs_cpu_rt(cpu)); } +void rcu_advance_callbacks(int cpu, int user) +{ + if (user || + (idle_cpu(cpu) && !in_softirq() && + hardirq_count() <= (1 << HARDIRQ_SHIFT))) { + rcu_qsctr_inc(cpu); + rcu_bh_qsctr_inc(cpu); + } else if (!in_softirq()) + rcu_bh_qsctr_inc(cpu); +} + void rcu_check_callbacks(int cpu, int user) { if (user || ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-cmpxchg.patch���������������������������������������������������������������������������0000664�0000764�0000764�00000002422�11041657733�014473� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� include/asm-arm/atomic.h | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) Index: linux-2.6.24.7/include/asm-arm/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/atomic.h +++ linux-2.6.24.7/include/asm-arm/atomic.h @@ -173,6 +173,41 @@ static inline void atomic_clear_mask(uns raw_local_irq_restore(flags); } +#ifndef CONFIG_SMP +/* + * Atomic compare and exchange. 
+ */ +#define __HAVE_ARCH_CMPXCHG 1 + +extern unsigned long wrong_size_cmpxchg(volatile void *ptr); + +static inline unsigned long __cmpxchg(volatile void *ptr, + unsigned long old, + unsigned long new, int size) +{ + unsigned long flags, prev; + volatile unsigned long *p = ptr; + + if (size == 4) { + local_irq_save(flags); + if ((prev = *p) == old) + *p = new; + local_irq_restore(flags); + return(prev); + } else + return wrong_size_cmpxchg(ptr); +} + +#define cmpxchg(ptr,o,n) \ +({ \ + __typeof__(*(ptr)) _o_ = (o); \ + __typeof__(*(ptr)) _n_ = (n); \ + (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_, \ + (unsigned long)_n_, sizeof(*(ptr))); \ +}) + +#endif + #endif /* __LINUX_ARM_ARCH__ */ #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-fix-atomic-cmpxchg.patch����������������������������������������������������������������0000664�0000764�0000764�00000001177�11041657731�016535� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/asm-arm/atomic.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/asm-arm/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/atomic.h +++ linux-2.6.24.7/include/asm-arm/atomic.h @@ -189,10 +189,10 @@ static inline unsigned long __cmpxchg(vo volatile unsigned long *p = ptr; if (size == 4) { - local_irq_save(flags); + raw_local_irq_save(flags); if ((prev = *p) == old) *p = new; - local_irq_restore(flags); + raw_local_irq_restore(flags); return(prev); } else return wrong_size_cmpxchg(ptr); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-cmpxchg-support-armv6.patch�������������������������������������������������������������0000664�0000764�0000764�00000003166�11041657731�017242� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������[PATCH -rt] cmpxchg support on ARMv6 Current rt patch don't support the cmpxchg on ARMv6. This patch supports cmpxchg in ARMv6. It's tested on OMAP2 (apollon board). Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> p.s., Pleaes cc to me, I'm not subscriber on this mailing list. 
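Callers of cmpxchg() normally use it in a read/compute/retry loop: if the location changed between the load and the cmpxchg, the cmpxchg fails and the caller retries with the fresh value. Below is a userspace sketch of that caller-side pattern; GCC's __sync_val_compare_and_swap() stands in for the kernel cmpxchg() these ARM patches provide, and the helper names are invented for the example.

#include <stdio.h>

/* Stand-in for cmpxchg(): atomically replace *ptr with 'new' only if it
 * still holds 'old', returning the prior value.  The patches above supply
 * this primitive either by briefly disabling interrupts (pre-v6, UP) or
 * with ldrex/strex (ARMv6). */
static unsigned long cmpxchg_ul(unsigned long *ptr, unsigned long old,
				unsigned long new)
{
	return __sync_val_compare_and_swap(ptr, old, new);
}

/* Typical caller: an atomic add built from a cmpxchg retry loop.  If the
 * value changed underneath us, the cmpxchg fails and we retry. */
static unsigned long atomic_add_return_ul(unsigned long *ptr, unsigned long inc)
{
	unsigned long old, new;

	do {
		old = *ptr;
		new = old + inc;
	} while (cmpxchg_ul(ptr, old, new) != old);

	return new;
}

int main(void)
{
	unsigned long v = 40;

	printf("%lu\n", atomic_add_return_ul(&v, 2));	/* prints 42 */
	return 0;
}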
-- --- include/asm-arm/atomic.h | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) Index: linux-2.6.24.7/include/asm-arm/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/atomic.h +++ linux-2.6.24.7/include/asm-arm/atomic.h @@ -114,6 +114,46 @@ static inline void atomic_clear_mask(uns : "cc"); } +/* + * Atomic compare and exchange. + */ +#define __HAVE_ARCH_CMPXCHG 1 + +extern unsigned long wrong_size_cmpxchg(volatile void *ptr); + +static inline unsigned long __cmpxchg(volatile void *ptr, + unsigned long old, + unsigned long new, int size) +{ + volatile unsigned long *p = ptr; + + if (size == 4) { + unsigned long oldval, res; + + do { + __asm__ __volatile__("@ atomic_cmpxchg\n" + "ldrex %1, [%2]\n" + "mov %0, #0\n" + "teq %1, %3\n" + "strexeq %0, %4, [%2]\n" + : "=&r" (res), "=&r" (oldval) + : "r" (p), "Ir" (old), "r" (new) + : "cc"); + } while (res); + + return oldval; + } else + return wrong_size_cmpxchg(ptr); +} + +#define cmpxchg(ptr,o,n) \ +({ \ + __typeof__(*(ptr)) _o_ = (o); \ + __typeof__(*(ptr)) _n_ = (n); \ + (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_, \ + (unsigned long)_n_, sizeof(*(ptr))); \ +}) + #else /* ARM_ARCH_6 */ #include <asm/system.h> ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-futex-atomic-cmpxchg.patch��������������������������������������������������������������0000664�0000764�0000764�00000010277�11041657732�017104� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Implement futex macros for ARM Signed-off-by: Khem Raj <kraj@mvista.com> Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: George Davis <gdavis@mvista.com> arch/arm/kernel/process.c | 2 include/asm-arm/futex.h | 125 ++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 124 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -37,6 +37,8 @@ #include <asm/uaccess.h> #include <asm/mach/time.h> +DEFINE_SPINLOCK(futex_atomic_lock); + static const char *processor_modes[] = { "USER_26", "FIQ_26" , "IRQ_26" , "SVC_26" , "UK4_26" , "UK5_26" , "UK6_26" , "UK7_26" , "UK8_26" , "UK9_26" , "UK10_26", "UK11_26", "UK12_26", "UK13_26", "UK14_26", "UK15_26", Index: linux-2.6.24.7/include/asm-arm/futex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/futex.h +++ linux-2.6.24.7/include/asm-arm/futex.h @@ -1,6 +1,125 @@ -#ifndef _ASM_FUTEX_H -#define _ASM_FUTEX_H +#ifndef _ASM_ARM_FUTEX_H +#define _ASM_ARM_FUTEX_H -#include <asm-generic/futex.h> +#ifdef __KERNEL__ +#include <linux/futex.h> +#include <linux/errno.h> +#include <linux/uaccess.h> + +extern spinlock_t futex_atomic_lock; + +#define 
__futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ + __asm__ __volatile__ ( \ + "1: ldrt %1, [%2] \n" \ + insn \ + "2: strt %0, [%2] \n" \ + " mov %0, #0 \n" \ + "3: \n" \ + " .section __ex_table, \"a\" \n" \ + " .align 3 \n" \ + " .long 1b, 4f, 2b, 4f \n" \ + " .previous \n" \ + " .section .fixup,\"ax\" \n" \ + "4: mov %0, %4 \n" \ + " b 3b \n" \ + " .previous" \ + : "=&r" (ret), "=&r" (oldval) \ + : "r" (uaddr), "r" (oparg), "Ir" (-EFAULT) \ + : "cc", "memory") + +static inline int +futex_atomic_op_inuser (int encoded_op, int __user *uaddr) +{ + int op = (encoded_op >> 28) & 7; + int cmp = (encoded_op >> 24) & 15; + int oparg = (encoded_op << 8) >> 20; + int cmparg = (encoded_op << 20) >> 20; + int oldval = 0, ret; + if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) + oparg = 1 << oparg; + + if (!access_ok (VERIFY_WRITE, uaddr, sizeof(int))) + return -EFAULT; + + pagefault_disable(); + + spin_lock(&futex_atomic_lock); + + switch (op) { + case FUTEX_OP_SET: + __futex_atomic_op(" mov %0, %3\n", + ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_ADD: + __futex_atomic_op(" add %0, %1, %3\n", + ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_OR: + __futex_atomic_op(" orr %0, %1, %3\n", + ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_ANDN: + __futex_atomic_op(" and %0, %1, %3\n", + ret, oldval, uaddr, oparg); + break; + case FUTEX_OP_XOR: + __futex_atomic_op(" eor %0, %1, %3\n", + ret, oldval, uaddr, oparg); + break; + default: + ret = -ENOSYS; + } + + spin_unlock(&futex_atomic_lock); + + pagefault_enable(); + + if (!ret) { + switch (cmp) { + case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break; + case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break; + case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break; + case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break; + case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break; + case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break; + default: ret = -ENOSYS; + } + } + return ret; +} + +static inline int +futex_atomic_cmpxchg_inatomic(int __user *uaddr, int oldval, int newval) +{ + int val; + + if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int))) + return -EFAULT; + + spin_lock(&futex_atomic_lock); + + __asm__ __volatile__( "@futex_atomic_cmpxchg_inatomic \n" + "1: ldrt %0, [%3] \n" + " teq %0, %1 \n" + "2: streqt %2, [%3] \n" + "3: \n" + " .section __ex_table, \"a\" \n" + " .align 3 \n" + " .long 1b, 4f, 2b, 4f \n" + " .previous \n" + " .section .fixup,\"ax\" \n" + "4: mov %0, %4 \n" + " b 3b \n" + " .previous" + : "=&r" (val) + : "r" (oldval), "r" (newval), "r" (uaddr), "Ir" (-EFAULT) + : "cc"); + + spin_unlock(&futex_atomic_lock); + + return val; +} + +#endif #endif ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-preempt-config.patch��������������������������������������������������������������������0000664�0000764�0000764�00000001776�11041657735�015776� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/arm/Kconfig | 13 +------------ 1 file changed, 1 
insertion(+), 12 deletions(-) Index: linux-2.6.24.7/arch/arm/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/arm/Kconfig +++ linux-2.6.24.7/arch/arm/Kconfig @@ -622,18 +622,7 @@ config LOCAL_TIMERS accounting to be spread across the timer interval, preventing a "thundering herd" at every timer tick. -config PREEMPT - bool "Preemptible Kernel (EXPERIMENTAL)" - depends on EXPERIMENTAL - help - This option reduces the latency of the kernel when reacting to - real-time or interactive events by allowing a low priority process to - be preempted even if it is in kernel mode executing a system call. - This allows applications to run more reliably even when the system is - under load. - - Say Y here if you are building a kernel for a desktop, embedded - or real-time system. Say N if you are unsure. +source kernel/Kconfig.preempt config NO_IDLE_HZ bool "Dynamic tick timer" ��patches/m68knommu-add-cmpxchg-in-default-fashion.patch����������������������������������������������0000664�0000764�0000764�00000003626�11041657732�021764� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 46a77f70fc1a6f11c01eb8265feea0ab93c3cbac Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:27 +0200 Subject: [PATCH] m68knommu: add cmpxchg in default fashion not RT-safe, generic Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/asm-m68knommu/system.h | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/include/asm-m68knommu/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/system.h +++ linux-2.6.24.7/include/asm-m68knommu/system.h @@ -192,20 +192,36 @@ static inline unsigned long __xchg(unsig * indicated by comparing RETURN with OLD. 
*/ #define __HAVE_ARCH_CMPXCHG 1 +extern unsigned long __cmpxchg_called_with_bad_pointer(volatile void *p, + unsigned long old, unsigned long new, int size); -static __inline__ unsigned long -cmpxchg(volatile int *p, int old, int new) +static inline unsigned long +__cmpxchg(volatile void *ptr, unsigned long old, unsigned long new, int size) { - unsigned long flags; - int prev; + unsigned long flags, prev; + volatile unsigned int *p = ptr; - local_irq_save(flags); - if ((prev = *p) == old) - *p = new; - local_irq_restore(flags); - return(prev); + if (size == 4) { + + local_irq_save(flags); + if ((prev = *p) == old) + *p = new; + local_irq_restore(flags); + return prev; + } + + /* we should not get here, if you do we end up with a linker error */ + return __cmpxchg_called_with_bad_pointer(p, old, new, size); } +#define cmpxchg(ptr,o,n) \ + ({ \ + __typeof__(*(ptr)) _o_ = (o); \ + __typeof__(*(ptr)) _n_ = (n); \ + (__typeof__(*(ptr))) __cmpxchg((ptr), (unsigned long)_o_, \ + (unsigned long)_n_, sizeof(*(ptr))); \ + }) + #ifdef CONFIG_M68332 #define HARD_RESET_NOW() ({ \ ����������������������������������������������������������������������������������������������������������patches/m68knommu-make-cmpxchg-RT-safe.patch��������������������������������������������������������0000664�0000764�0000764�00000003161�11041657733�017730� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From fb918e8b87e699c8174d8f7fbff1ab558c7389b0 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:29 +0200 Subject: [PATCH] m68knommu: make cmpxchg RT-safe Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/asm-m68knommu/system.h | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/asm-m68knommu/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/system.h +++ linux-2.6.24.7/include/asm-m68knommu/system.h @@ -2,9 +2,10 @@ #define _M68KNOMMU_SYSTEM_H #include <linux/linkage.h> +#include <linux/kernel.h> +#include <linux/irqflags.h> #include <asm/segment.h> #include <asm/entry.h> - /* * switch_to(n) should switch tasks to task ptr, first checking that * ptr isn't the current task, in which case it does nothing. 
This @@ -128,7 +129,7 @@ static inline unsigned long __xchg(unsig { unsigned long tmp, flags; - local_irq_save(flags); + raw_local_irq_save(flags); switch (size) { case 1: @@ -150,7 +151,7 @@ static inline unsigned long __xchg(unsig : "=&d" (tmp) : "d" (x), "m" (*__xg(ptr)) : "memory"); break; } - local_irq_restore(flags); + raw_local_irq_restore(flags); return tmp; } #else @@ -203,10 +204,10 @@ __cmpxchg(volatile void *ptr, unsigned l if (size == 4) { - local_irq_save(flags); + raw_local_irq_save(flags); if ((prev = *p) == old) *p = new; - local_irq_restore(flags); + raw_local_irq_restore(flags); return prev; } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/m68knommu-add-read_barrier_depends-and-irqs_disab.patch�������������������������������������0000664�0000764�0000764�00000002226�11041657734�023656� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 0d8027d99d3e76601b45885253b2412c4d47ee4f Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:29 +0200 Subject: [PATCH] m68knommu: add read_barrier_depends() and irqs_disabled_flags() Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> --- include/asm-m68knommu/system.h | 11 +++++++++++ 1 file changed, 11 insertions(+) Index: linux-2.6.24.7/include/asm-m68knommu/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/system.h +++ linux-2.6.24.7/include/asm-m68knommu/system.h @@ -119,6 +119,8 @@ asmlinkage void resume(void); #define smp_read_barrier_depends() do { } while(0) #endif +#define read_barrier_depends() ((void)0) + #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) struct __xchg_dummy { unsigned long a[100]; }; @@ -350,4 +352,13 @@ __cmpxchg(volatile void *ptr, unsigned l #endif #define arch_align_stack(x) (x) + +static inline int irqs_disabled_flags(unsigned long flags) +{ + if (flags & 0x0700) + return 0; + else + return 1; +} + #endif /* _M68KNOMMU_SYSTEM_H */ ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-softirqs-core.patch�����������������������������������������������������������������0000664�0000764�0000764�00000044041�11041657733�016542� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/bottom_half.h | 1 include/linux/interrupt.h | 13 +- include/linux/sched.h | 17 ++ kernel/Kconfig.preempt | 16 ++ kernel/sched.c | 28 ++++ kernel/softirq.c | 273 ++++++++++++++++++++++++++++++++++---------- 6 files changed, 279 insertions(+), 69 deletions(-) Index: linux-2.6.24.7/include/linux/bottom_half.h =================================================================== --- linux-2.6.24.7.orig/include/linux/bottom_half.h +++ linux-2.6.24.7/include/linux/bottom_half.h @@ -2,7 +2,6 @@ #define _LINUX_BH_H extern void local_bh_disable(void); -extern void __local_bh_enable(void); extern void _local_bh_enable(void); extern void local_bh_enable(void); extern void local_bh_enable_ip(unsigned long ip); Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -257,6 +257,8 @@ enum HRTIMER_SOFTIRQ, #endif RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */ + /* Entries after this are ignored in split softirq mode */ + MAX_SOFTIRQ, }; /* softirq mask and active fields moved to irq_cpustat_t in @@ -269,13 +271,21 @@ struct softirq_action void *data; }; +#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) +#define __do_raise_softirq_irqoff(nr) __raise_softirq_irqoff(nr) + asmlinkage void do_softirq(void); extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data); extern void softirq_init(void); -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) extern void FASTCALL(raise_softirq_irqoff(unsigned int nr)); extern void FASTCALL(raise_softirq(unsigned int nr)); +extern void wakeup_irqd(void); +#ifdef CONFIG_PREEMPT_SOFTIRQS +extern void wait_for_softirq(int softirq); +#else +# define wait_for_softirq(x) do {} while(0) +#endif /* Tasklets --- multithreaded analogue of BHs. 
@@ -387,6 +397,7 @@ extern void tasklet_kill(struct tasklet_ extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu); extern void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data); +void takeover_tasklets(unsigned int cpu); /* * Autoprobing for irqs: Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -91,6 +91,12 @@ struct sched_param { #include <asm/processor.h> +#ifdef CONFIG_PREEMPT_SOFTIRQS +extern int softirq_preemption; +#else +# define softirq_preemption 0 +#endif + struct exec_domain; struct futex_pi_state; struct bio; @@ -1410,6 +1416,7 @@ static inline void put_task_struct(struc #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ +#define PF_SOFTIRQ 0x04000000 /* softirq context */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ @@ -1889,6 +1896,7 @@ static inline int need_resched(void) extern int cond_resched(void); extern int cond_resched_lock(spinlock_t * lock); extern int cond_resched_softirq(void); +extern int cond_resched_softirq_context(void); /* * Does a critical section need to be broken due to another @@ -1904,10 +1912,13 @@ extern int cond_resched_softirq(void); * Does a critical section need to be broken due to another * task waiting or preemption being signalled: */ -static inline int lock_need_resched(spinlock_t *lock) +#define lock_need_resched(lock) \ + unlikely(need_lockbreak(lock) || need_resched()) + +static inline int softirq_need_resched(void) { - if (need_lockbreak(lock) || need_resched()) - return 1; + if (softirq_preemption && (current->flags & PF_SOFTIRQ)) + return need_resched(); return 0; } Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -91,6 +91,22 @@ config RCU_TRACE Say Y here if you want to enable RCU tracing Say N if you are unsure. +config PREEMPT_SOFTIRQS + bool "Thread Softirqs" + default n +# depends on PREEMPT + help + This option reduces the latency of the kernel by 'threading' + soft interrupts. This means that all softirqs will execute + in softirqd's context. While this helps latency, it can also + reduce performance. + + The threading of softirqs can also be controlled via + /proc/sys/kernel/softirq_preemption runtime flag and the + sofirq-preempt=0/1 boot-time option. + + Say N if you are unsure. 
+ config PREEMPT_BKL bool "Preempt The Big Kernel Lock" depends on SMP || PREEMPT Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3461,7 +3461,7 @@ void account_system_time(struct task_str tmp = cputime_to_cputime64(cputime); if (hardirq_count() - hardirq_offset) cpustat->irq = cputime64_add(cpustat->irq, tmp); - else if (softirq_count()) + else if (softirq_count() || (p->flags & PF_SOFTIRQ)) cpustat->softirq = cputime64_add(cpustat->softirq, tmp); else if (p != rq->idle) cpustat->system = cputime64_add(cpustat->system, tmp); @@ -3820,7 +3820,7 @@ asmlinkage void __sched preempt_schedule int saved_lock_depth; #endif /* Catch callers which need to be fixed */ - BUG_ON(ti->preempt_count || !irqs_disabled()); + WARN_ON_ONCE(ti->preempt_count || !irqs_disabled()); do { add_preempt_count(PREEMPT_ACTIVE); @@ -4781,9 +4781,12 @@ int cond_resched_lock(spinlock_t *lock) } EXPORT_SYMBOL(cond_resched_lock); +/* + * Voluntarily preempt a process context that has softirqs disabled: + */ int __sched cond_resched_softirq(void) { - BUG_ON(!in_softirq()); + WARN_ON_ONCE(!in_softirq()); if (need_resched() && system_state == SYSTEM_RUNNING) { local_bh_enable(); @@ -4795,6 +4798,25 @@ int __sched cond_resched_softirq(void) } EXPORT_SYMBOL(cond_resched_softirq); +/* + * Voluntarily preempt a softirq context (possible with softirq threading): + */ +int __sched cond_resched_softirq_context(void) +{ + WARN_ON_ONCE(!in_softirq()); + + if (softirq_need_resched() && system_state == SYSTEM_RUNNING) { + raw_local_irq_disable(); + _local_bh_enable(); + raw_local_irq_enable(); + __cond_resched(); + local_bh_disable(); + return 1; + } + return 0; +} +EXPORT_SYMBOL(cond_resched_softirq_context); + /** * yield - yield the current processor to other threads. * Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -4,9 +4,15 @@ * Copyright (C) 1992 Linus Torvalds * * Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903) + * + * Softirq-split implemetation by + * Copyright (C) 2005 Thomas Gleixner, Ingo Molnar */ #include <linux/module.h> +#include <linux/kallsyms.h> +#include <linux/syscalls.h> +#include <linux/wait.h> #include <linux/kernel_stat.h> #include <linux/interrupt.h> #include <linux/init.h> @@ -46,7 +52,41 @@ EXPORT_SYMBOL(irq_stat); static struct softirq_action softirq_vec[32] __cacheline_aligned_in_smp; -static DEFINE_PER_CPU(struct task_struct *, ksoftirqd); +struct softirqdata { + int nr; + unsigned long cpu; + struct task_struct *tsk; +#ifdef CONFIG_PREEMPT_SOFTIRQS + wait_queue_head_t wait; + int running; +#endif +}; + +static DEFINE_PER_CPU(struct softirqdata [MAX_SOFTIRQ], ksoftirqd); + +#ifdef CONFIG_PREEMPT_SOFTIRQS +/* + * Preempting the softirq causes cases that would not be a + * problem when the softirq is not preempted. That is a + * process may have code to spin while waiting for a softirq + * to finish on another CPU. But if it happens that the + * process has preempted the softirq, this could cause a + * deadlock. 
+ */ +void wait_for_softirq(int softirq) +{ + struct softirqdata *data = &__get_cpu_var(ksoftirqd)[softirq]; + if (data->running) { + DECLARE_WAITQUEUE(wait, current); + set_current_state(TASK_UNINTERRUPTIBLE); + add_wait_queue(&data->wait, &wait); + if (data->running) + schedule(); + remove_wait_queue(&data->wait, &wait); + __set_current_state(TASK_RUNNING); + } +} +#endif /* * we cannot loop indefinitely here to avoid userspace starvation, @@ -54,16 +94,32 @@ static DEFINE_PER_CPU(struct task_struct * to the pending events, so lets the scheduler to balance * the softirq load for us. */ -static inline void wakeup_softirqd(void) +static void wakeup_softirqd(int softirq) { /* Interrupts are disabled: no need to stop preemption */ - struct task_struct *tsk = __get_cpu_var(ksoftirqd); + struct task_struct *tsk = __get_cpu_var(ksoftirqd)[softirq].tsk; if (tsk && tsk->state != TASK_RUNNING) wake_up_process(tsk); } /* + * Wake up the softirq threads which have work + */ +static void trigger_softirqs(void) +{ + u32 pending = local_softirq_pending(); + int curr = 0; + + while (pending) { + if (pending & 1) + wakeup_softirqd(curr); + pending >>= 1; + curr++; + } +} + +/* * This one is for softirq.c-internal use, * where hardirqs are disabled legitimately: */ @@ -98,20 +154,6 @@ void local_bh_disable(void) EXPORT_SYMBOL(local_bh_disable); -void __local_bh_enable(void) -{ - WARN_ON_ONCE(in_irq()); - - /* - * softirqs should never be enabled by __local_bh_enable(), - * it always nests inside local_bh_enable() sections: - */ - WARN_ON_ONCE(softirq_count() == SOFTIRQ_OFFSET); - - sub_preempt_count(SOFTIRQ_OFFSET); -} -EXPORT_SYMBOL_GPL(__local_bh_enable); - /* * Special-case - softirqs can safely be enabled in * cond_resched_softirq(), or by __do_softirq(), @@ -205,7 +247,7 @@ EXPORT_SYMBOL(local_bh_enable_ip); */ #define MAX_SOFTIRQ_RESTART 10 -asmlinkage void __do_softirq(void) +asmlinkage void ___do_softirq(void) { struct softirq_action *h; __u32 pending; @@ -215,9 +257,6 @@ asmlinkage void __do_softirq(void) pending = local_softirq_pending(); account_system_vtime(current); - __local_bh_disable((unsigned long)__builtin_return_address(0)); - trace_softirq_enter(); - cpu = smp_processor_id(); restart: /* Reset the pending bitmask before enabling irqs */ @@ -229,8 +268,17 @@ restart: do { if (pending & 1) { - h->action(h); + { + u32 preempt_count = preempt_count(); + h->action(h); + if (preempt_count != preempt_count()) { + print_symbol("BUG: softirq exited %s with wrong preemption count!\n", (unsigned long) h->action); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + preempt_count() = preempt_count; + } + } rcu_bh_qsctr_inc(cpu); + cond_resched_softirq_context(); } h++; pending >>= 1; @@ -243,12 +291,34 @@ restart: goto restart; if (pending) - wakeup_softirqd(); + trigger_softirqs(); +} + +asmlinkage void __do_softirq(void) +{ +#ifdef CONFIG_PREEMPT_SOFTIRQS + /* + * 'preempt harder'. Push all softirq processing off to ksoftirqd. 
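The dispatch in trigger_softirqs() above is just a walk over the per-CPU pending bitmask, one bit per softirq and one dedicated thread woken per set bit. A userspace sketch of that loop follows; the bit numbers, the names[] table and the printf() are illustrative stand-ins for the softirq enum, the softirq_names[] table added later in this patch and the wakeup_softirqd() call.

#include <stdio.h>
#include <stdint.h>

/* Illustrative mapping only; the real one is the softirq enum. */
static const char *names[] = { "high", "timer", "net-tx", "net-rx",
			       "block", "tasklet", "sched", "hrtimer", "rcu" };

/* Same shape as trigger_softirqs(): scan the pending bitmask and wake one
 * per-softirq thread instead of running the handlers inline. */
static void trigger(uint32_t pending)
{
	int nr = 0;

	while (pending) {
		if (pending & 1)
			printf("wake softirq-%s thread\n", names[nr]);
		pending >>= 1;
		nr++;
	}
}

int main(void)
{
	/* e.g. a timer and a net-rx softirq were raised on this CPU */
	trigger((1u << 1) | (1u << 3));
	return 0;
}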
+ */ + if (softirq_preemption) { + if (local_softirq_pending()) + trigger_softirqs(); + return; + } +#endif + /* + * 'immediate' softirq execution: + */ + __local_bh_disable((unsigned long)__builtin_return_address(0)); + trace_softirq_enter(); + + ___do_softirq(); trace_softirq_exit(); account_system_vtime(current); _local_bh_enable(); + } #ifndef __ARCH_HAS_DO_SOFTIRQ @@ -316,19 +386,11 @@ void irq_exit(void) */ inline fastcall void raise_softirq_irqoff(unsigned int nr) { - __raise_softirq_irqoff(nr); + __do_raise_softirq_irqoff(nr); - /* - * If we're in an interrupt or softirq, we're done - * (this also catches softirq-disabled code). We will - * actually run the softirq once we return from - * the irq or softirq. - * - * Otherwise we wake up ksoftirqd to make sure we - * schedule the softirq soon. - */ - if (!in_interrupt()) - wakeup_softirqd(); +#ifdef CONFIG_PREEMPT_SOFTIRQS + wakeup_softirqd(nr); +#endif } void fastcall raise_softirq(unsigned int nr) @@ -411,7 +473,7 @@ static void tasklet_action(struct softir local_irq_disable(); t->next = __get_cpu_var(tasklet_vec).list; __get_cpu_var(tasklet_vec).list = t; - __raise_softirq_irqoff(TASKLET_SOFTIRQ); + __do_raise_softirq_irqoff(TASKLET_SOFTIRQ); local_irq_enable(); } } @@ -444,7 +506,7 @@ static void tasklet_hi_action(struct sof local_irq_disable(); t->next = __get_cpu_var(tasklet_hi_vec).list; __get_cpu_var(tasklet_hi_vec).list = t; - __raise_softirq_irqoff(HI_SOFTIRQ); + __do_raise_softirq_irqoff(HI_SOFTIRQ); local_irq_enable(); } } @@ -484,13 +546,24 @@ void __init softirq_init(void) open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); } -static int ksoftirqd(void * __bind_cpu) +static int ksoftirqd(void * __data) { + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; + struct softirqdata *data = __data; + u32 mask = (1 << data->nr); + struct softirq_action *h; + +#ifdef CONFIG_PREEMPT_SOFTIRQS + init_waitqueue_head(&data->wait); +#endif + + sys_sched_setscheduler(current->pid, SCHED_FIFO, ¶m); + current->flags |= PF_SOFTIRQ; set_current_state(TASK_INTERRUPTIBLE); while (!kthread_should_stop()) { preempt_disable(); - if (!local_softirq_pending()) { + if (!(local_softirq_pending() & mask)) { preempt_enable_no_resched(); schedule(); preempt_disable(); @@ -498,19 +571,41 @@ static int ksoftirqd(void * __bind_cpu) __set_current_state(TASK_RUNNING); - while (local_softirq_pending()) { +#ifdef CONFIG_PREEMPT_SOFTIRQS + data->running = 1; +#endif + + while (local_softirq_pending() & mask) { /* Preempt disable stops cpu going offline. 
If already offline, we'll be on wrong CPU: don't process */ - if (cpu_is_offline((long)__bind_cpu)) + if (cpu_is_offline(data->cpu)) goto wait_to_die; - do_softirq(); + + local_irq_disable(); preempt_enable_no_resched(); + set_softirq_pending(local_softirq_pending() & ~mask); + local_bh_disable(); + local_irq_enable(); + + h = &softirq_vec[data->nr]; + if (h) + h->action(h); + rcu_bh_qsctr_inc(data->cpu); + + local_irq_disable(); + _local_bh_enable(); + local_irq_enable(); + cond_resched(); preempt_disable(); } preempt_enable(); set_current_state(TASK_INTERRUPTIBLE); +#ifdef CONFIG_PREEMPT_SOFTIRQS + data->running = 0; + wake_up(&data->wait); +#endif } __set_current_state(TASK_RUNNING); return 0; @@ -557,7 +652,7 @@ void tasklet_kill_immediate(struct taskl BUG(); } -static void takeover_tasklets(unsigned int cpu) +void takeover_tasklets(unsigned int cpu) { struct tasklet_struct **i; @@ -579,49 +674,82 @@ static void takeover_tasklets(unsigned i } #endif /* CONFIG_HOTPLUG_CPU */ +static const char *softirq_names [] = +{ + [HI_SOFTIRQ] = "high", + [SCHED_SOFTIRQ] = "sched", + [TIMER_SOFTIRQ] = "timer", + [NET_TX_SOFTIRQ] = "net-tx", + [NET_RX_SOFTIRQ] = "net-rx", + [BLOCK_SOFTIRQ] = "block", + [TASKLET_SOFTIRQ] = "tasklet", +#ifdef CONFIG_HIGH_RES_TIMERS + [HRTIMER_SOFTIRQ] = "hrtimer", +#endif + [RCU_SOFTIRQ] = "rcu", +}; + static int __cpuinit cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) { - int hotcpu = (unsigned long)hcpu; + int hotcpu = (unsigned long)hcpu, i; struct task_struct *p; switch (action) { case CPU_UP_PREPARE: case CPU_UP_PREPARE_FROZEN: - p = kthread_create(ksoftirqd, hcpu, "ksoftirqd/%d", hotcpu); - if (IS_ERR(p)) { - printk("ksoftirqd for %i failed\n", hotcpu); - return NOTIFY_BAD; + for (i = 0; i < MAX_SOFTIRQ; i++) { + per_cpu(ksoftirqd, hotcpu)[i].nr = i; + per_cpu(ksoftirqd, hotcpu)[i].cpu = hotcpu; + per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL; + } + for (i = 0; i < MAX_SOFTIRQ; i++) { + p = kthread_create(ksoftirqd, + &per_cpu(ksoftirqd, hotcpu)[i], + "softirq-%s/%d", softirq_names[i], + hotcpu); + if (IS_ERR(p)) { + printk("ksoftirqd %d for %i failed\n", i, + hotcpu); + return NOTIFY_BAD; + } + kthread_bind(p, hotcpu); + per_cpu(ksoftirqd, hotcpu)[i].tsk = p; } - kthread_bind(p, hotcpu); - per_cpu(ksoftirqd, hotcpu) = p; - break; + break; + break; case CPU_ONLINE: case CPU_ONLINE_FROZEN: - wake_up_process(per_cpu(ksoftirqd, hotcpu)); + for (i = 0; i < MAX_SOFTIRQ; i++) + wake_up_process(per_cpu(ksoftirqd, hotcpu)[i].tsk); break; #ifdef CONFIG_HOTPLUG_CPU case CPU_UP_CANCELED: case CPU_UP_CANCELED_FROZEN: - if (!per_cpu(ksoftirqd, hotcpu)) - break; - /* Unbind so it can run. Fall thru. 
*/ - kthread_bind(per_cpu(ksoftirqd, hotcpu), - any_online_cpu(cpu_online_map)); +#if 0 + for (i = 0; i < MAX_SOFTIRQ; i++) { + if (!per_cpu(ksoftirqd, hotcpu)[i].tsk) + continue; + kthread_bind(per_cpu(ksoftirqd, hotcpu)[i].tsk, + any_online_cpu(cpu_online_map)); + } +#endif case CPU_DEAD: case CPU_DEAD_FROZEN: { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; - p = per_cpu(ksoftirqd, hotcpu); - per_cpu(ksoftirqd, hotcpu) = NULL; sched_setscheduler(p, SCHED_FIFO, ¶m); - kthread_stop(p); + for (i = 0; i < MAX_SOFTIRQ; i++) { + p = per_cpu(ksoftirqd, hotcpu)[i].tsk; + per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL; + kthread_stop(p); + } takeover_tasklets(hotcpu); break; - } #endif /* CONFIG_HOTPLUG_CPU */ } + } return NOTIFY_OK; } @@ -640,6 +768,29 @@ __init int spawn_ksoftirqd(void) return 0; } + +#ifdef CONFIG_PREEMPT_SOFTIRQS + +int softirq_preemption = 1; + +EXPORT_SYMBOL(softirq_preemption); + +static int __init softirq_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + softirq_preemption = 0; + else + get_option(&str, &softirq_preemption); + if (!softirq_preemption) + printk("turning off softirq preemption!\n"); + + return 1; +} + +__setup("softirq-preempt=", softirq_preempt_setup); + +#endif + #ifdef CONFIG_SMP /* * Call a function on all processors �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-core.patch���������������������������������������������������������������������0000664�0000764�0000764�00000065100�11041657734�015646� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/interrupt.h | 19 ++ include/linux/irq.h | 26 +++- include/linux/sched.h | 14 ++ init/main.c | 5 kernel/irq/autoprobe.c | 1 kernel/irq/chip.c | 38 +++++ kernel/irq/handle.c | 37 +++++ kernel/irq/internals.h | 4 kernel/irq/manage.c | 292 +++++++++++++++++++++++++++++++++++++++++++++- kernel/irq/proc.c | 129 ++++++++++++++------ kernel/irq/spurious.c | 11 + kernel/sched.c | 23 +++ 12 files changed, 543 insertions(+), 56 deletions(-) Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -50,10 +50,12 @@ #define IRQF_SAMPLE_RANDOM 0x00000040 #define IRQF_SHARED 0x00000080 #define IRQF_PROBE_SHARED 0x00000100 -#define IRQF_TIMER 0x00000200 +#define __IRQF_TIMER 0x00000200 #define IRQF_PERCPU 0x00000400 #define IRQF_NOBALANCING 0x00000800 #define IRQF_IRQPOLL 0x00001000 +#define IRQF_NODELAY 0x00002000 +#define IRQF_TIMER (__IRQF_TIMER | IRQF_NODELAY) typedef irqreturn_t (*irq_handler_t)(int, void *); @@ -65,7 +67,7 @@ struct irqaction { void *dev_id; struct irqaction *next; int irq; - struct proc_dir_entry *dir; + struct proc_dir_entry *dir, *threaded; }; extern irqreturn_t 
no_action(int cpl, void *dev_id); @@ -196,6 +198,7 @@ static inline int disable_irq_wake(unsig #ifndef __ARCH_SET_SOFTIRQ_PENDING #define set_softirq_pending(x) (local_softirq_pending() = (x)) +// FIXME: PREEMPT_RT: set_bit()? #define or_softirq_pending(x) (local_softirq_pending() |= (x)) #endif @@ -271,12 +274,18 @@ struct softirq_action void *data; }; -#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) -#define __do_raise_softirq_irqoff(nr) __raise_softirq_irqoff(nr) - asmlinkage void do_softirq(void); extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data); extern void softirq_init(void); + +#ifdef CONFIG_PREEMPT_HARDIRQS +# define __raise_softirq_irqoff(nr) raise_softirq_irqoff(nr) +# define __do_raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) +#else +# define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0) +# define __do_raise_softirq_irqoff(nr) __raise_softirq_irqoff(nr) +#endif + extern void FASTCALL(raise_softirq_irqoff(unsigned int nr)); extern void FASTCALL(raise_softirq(unsigned int nr)); extern void wakeup_irqd(void); Index: linux-2.6.24.7/include/linux/irq.h =================================================================== --- linux-2.6.24.7.orig/include/linux/irq.h +++ linux-2.6.24.7/include/linux/irq.h @@ -19,10 +19,12 @@ #include <linux/cpumask.h> #include <linux/irqreturn.h> #include <linux/errno.h> +#include <linux/wait.h> #include <asm/irq.h> #include <asm/ptrace.h> #include <asm/irq_regs.h> +#include <asm/timex.h> struct irq_desc; typedef void fastcall (*irq_flow_handler_t)(unsigned int irq, @@ -61,6 +63,7 @@ typedef void fastcall (*irq_flow_handler #define IRQ_WAKEUP 0x00100000 /* IRQ triggers system wakeup */ #define IRQ_MOVE_PENDING 0x00200000 /* need to re-target IRQ destination */ #define IRQ_NO_BALANCING 0x00400000 /* IRQ is excluded from balancing */ +#define IRQ_NODELAY 0x40000000 /* IRQ must run immediately */ #ifdef CONFIG_IRQ_PER_CPU # define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU) @@ -141,6 +144,9 @@ struct irq_chip { * @irq_count: stats field to detect stalled irqs * @irqs_unhandled: stats field for spurious unhandled interrupts * @last_unhandled: aging timer for unhandled count + * @thread: Thread pointer for threaded preemptible irq handling + * @wait_for_handler: Waitqueue to wait for a running preemptible handler + * @cycles: Timestamp for stats and debugging * @lock: locking for SMP * @affinity: IRQ affinity on SMP * @cpu: cpu index useful for balancing @@ -163,6 +169,9 @@ struct irq_desc { unsigned int irq_count; /* For detecting broken IRQs */ unsigned int irqs_unhandled; unsigned long last_unhandled; /* Aging timer for unhandled count */ + struct task_struct *thread; + wait_queue_head_t wait_for_handler; + cycles_t timestamp; spinlock_t lock; #ifdef CONFIG_SMP cpumask_t affinity; @@ -397,7 +406,22 @@ extern int set_irq_msi(unsigned int irq, #define get_irq_data(irq) (irq_desc[irq].handler_data) #define get_irq_msi(irq) (irq_desc[irq].msi_desc) -#endif /* CONFIG_GENERIC_HARDIRQS */ +/* Early initialization of irqs */ +extern void early_init_hardirqs(void); +extern cycles_t irq_timestamp(unsigned int irq); + +#if defined(CONFIG_PREEMPT_HARDIRQS) +extern void init_hardirqs(void); +#else +static inline void init_hardirqs(void) { } +#endif + +#else /* end GENERIC HARDIRQS */ + +static inline void early_init_hardirqs(void) { } +static inline void init_hardirqs(void) { } + +#endif /* !CONFIG_GENERIC_HARDIRQS */ #endif /* 
!CONFIG_S390 */ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -96,6 +96,11 @@ extern int softirq_preemption; #else # define softirq_preemption 0 #endif +#ifdef CONFIG_PREEMPT_HARDIRQS +extern int hardirq_preemption; +#else +# define hardirq_preemption 0 +#endif struct exec_domain; struct futex_pi_state; @@ -1417,6 +1422,7 @@ static inline void put_task_struct(struc #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ #define PF_SOFTIRQ 0x04000000 /* softirq context */ +#define PF_HARDIRQ 0x08000000 /* hardirq context */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ @@ -1897,6 +1903,7 @@ extern int cond_resched(void); extern int cond_resched_lock(spinlock_t * lock); extern int cond_resched_softirq(void); extern int cond_resched_softirq_context(void); +extern int cond_resched_hardirq_context(void); /* * Does a critical section need to be broken due to another @@ -1922,6 +1929,13 @@ static inline int softirq_need_resched(v return 0; } +static inline int hardirq_need_resched(void) +{ + if (hardirq_preemption && (current->flags & PF_HARDIRQ)) + return need_resched(); + return 0; +} + /* * Reevaluate whether the task has signals pending delivery. * Wake the task if so. Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -47,6 +47,7 @@ #include <linux/delayacct.h> #include <linux/unistd.h> #include <linux/rmap.h> +#include <linux/irq.h> #include <linux/mempolicy.h> #include <linux/key.h> #include <linux/unwind.h> @@ -550,8 +551,10 @@ asmlinkage void __init start_kernel(void * fragile until we cpu_idle() for the first time. 
*/ preempt_disable(); + build_all_zonelists(); page_alloc_init(); + early_init_hardirqs(); printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line); parse_early_param(); parse_args("Booting kernel", static_command_line, __start___param, @@ -825,6 +828,8 @@ static int __init kernel_init(void * unu smp_prepare_cpus(max_cpus); + init_hardirqs(); + do_pre_smp_initcalls(); smp_init(); Index: linux-2.6.24.7/kernel/irq/autoprobe.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/autoprobe.c +++ linux-2.6.24.7/kernel/irq/autoprobe.c @@ -7,6 +7,7 @@ */ #include <linux/irq.h> +#include <linux/delay.h> #include <linux/module.h> #include <linux/interrupt.h> #include <linux/delay.h> Index: linux-2.6.24.7/kernel/irq/chip.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/chip.c +++ linux-2.6.24.7/kernel/irq/chip.c @@ -287,8 +287,10 @@ static inline void mask_ack_irq(struct i if (desc->chip->mask_ack) desc->chip->mask_ack(irq); else { - desc->chip->mask(irq); - desc->chip->ack(irq); + if (desc->chip->mask) + desc->chip->mask(irq); + if (desc->chip->ack) + desc->chip->ack(irq); } } @@ -313,8 +315,10 @@ handle_simple_irq(unsigned int irq, stru spin_lock(&desc->lock); - if (unlikely(desc->status & IRQ_INPROGRESS)) + if (unlikely(desc->status & IRQ_INPROGRESS)) { + desc->status |= IRQ_PENDING; goto out_unlock; + } desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); kstat_cpu(cpu).irqs[irq]++; @@ -323,6 +327,11 @@ handle_simple_irq(unsigned int irq, stru goto out_unlock; desc->status |= IRQ_INPROGRESS; + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); @@ -369,6 +378,13 @@ handle_level_irq(unsigned int irq, struc goto out_unlock; desc->status |= IRQ_INPROGRESS; + + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; + spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); @@ -421,6 +437,15 @@ handle_fasteoi_irq(unsigned int irq, str } desc->status |= IRQ_INPROGRESS; + /* + * In the threaded case we fall back to a mask+eoi sequence: + */ + if (redirect_hardirq(desc)) { + if (desc->chip->mask) + desc->chip->mask(irq); + goto out; + } + desc->status &= ~IRQ_PENDING; spin_unlock(&desc->lock); @@ -432,7 +457,6 @@ handle_fasteoi_irq(unsigned int irq, str desc->status &= ~IRQ_INPROGRESS; out: desc->chip->eoi(irq); - spin_unlock(&desc->lock); } @@ -481,6 +505,12 @@ handle_edge_irq(unsigned int irq, struct /* Mark the IRQ currently in progress.*/ desc->status |= IRQ_INPROGRESS; + /* + * hardirq redirection to the irqd process context: + */ + if (redirect_hardirq(desc)) + goto out_unlock; + do { struct irqaction *action = desc->action; irqreturn_t action_ret; Index: linux-2.6.24.7/kernel/irq/handle.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/handle.c +++ linux-2.6.24.7/kernel/irq/handle.c @@ -13,6 +13,7 @@ #include <linux/irq.h> #include <linux/module.h> #include <linux/random.h> +#include <linux/kallsyms.h> #include <linux/interrupt.h> #include <linux/kernel_stat.h> @@ -133,24 +134,54 @@ irqreturn_t handle_IRQ_event(unsigned in handle_dynamic_tick(action); - if (!(action->flags & IRQF_DISABLED)) - local_irq_enable_in_hardirq(); + /* + * Unconditionally enable interrupts for threaded + * IRQ handlers: + */ + if (!hardirq_count() || !(action->flags & 
IRQF_DISABLED)) + local_irq_enable(); do { + unsigned int preempt_count = preempt_count(); + ret = action->handler(irq, action->dev_id); + if (preempt_count() != preempt_count) { + print_symbol("BUG: unbalanced irq-handler preempt count in %s!\n", (unsigned long) action->handler); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + dump_stack(); + preempt_count() = preempt_count; + } if (ret == IRQ_HANDLED) status |= action->flags; retval |= ret; action = action->next; } while (action); - if (status & IRQF_SAMPLE_RANDOM) + if (status & IRQF_SAMPLE_RANDOM) { + local_irq_enable(); add_interrupt_randomness(irq); + } local_irq_disable(); return retval; } +int redirect_hardirq(struct irq_desc *desc) +{ + /* + * Direct execution: + */ + if (!hardirq_preemption || (desc->status & IRQ_NODELAY) || + !desc->thread) + return 0; + + BUG_ON(!irqs_disabled()); + if (desc->thread && desc->thread->state != TASK_RUNNING) + wake_up_process(desc->thread); + + return 1; +} + #ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ /** * __do_IRQ - original all in one highlevel IRQ handler Index: linux-2.6.24.7/kernel/irq/internals.h =================================================================== --- linux-2.6.24.7.orig/kernel/irq/internals.h +++ linux-2.6.24.7/kernel/irq/internals.h @@ -10,6 +10,10 @@ extern void irq_chip_set_defaults(struct /* Set default handler: */ extern void compat_irq_chip_set_default_handler(struct irq_desc *desc); +extern int redirect_hardirq(struct irq_desc *desc); + +void recalculate_desc_flags(struct irq_desc *desc); + #ifdef CONFIG_PROC_FS extern void register_irq_proc(unsigned int irq); extern void register_handler_proc(unsigned int irq, struct irqaction *action); Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -8,8 +8,10 @@ */ #include <linux/irq.h> -#include <linux/module.h> #include <linux/random.h> +#include <linux/module.h> +#include <linux/kthread.h> +#include <linux/syscalls.h> #include <linux/interrupt.h> #include "internals.h" @@ -41,8 +43,12 @@ void synchronize_irq(unsigned int irq) * Wait until we're out of the critical section. This might * give the wrong answer due to the lack of memory barriers. */ - while (desc->status & IRQ_INPROGRESS) - cpu_relax(); + if (hardirq_preemption && !(desc->status & IRQ_NODELAY)) + wait_event(desc->wait_for_handler, + !(desc->status & IRQ_INPROGRESS)); + else + while (desc->status & IRQ_INPROGRESS) + cpu_relax(); /* Ok, that indicated we're done: double-check carefully. */ spin_lock_irqsave(&desc->lock, flags); @@ -234,6 +240,21 @@ int set_irq_wake(unsigned int irq, unsig EXPORT_SYMBOL(set_irq_wake); /* + * If any action has IRQF_NODELAY then turn IRQ_NODELAY on: + */ +void recalculate_desc_flags(struct irq_desc *desc) +{ + struct irqaction *action; + + desc->status &= ~IRQ_NODELAY; + for (action = desc->action ; action; action = action->next) + if (action->flags & IRQF_NODELAY) + desc->status |= IRQ_NODELAY; +} + +static int start_irq_thread(int irq, struct irq_desc *desc); + +/* * Internal function that tells the architecture code whether a * particular irq has been exclusively allocated or is available * for driver use. 
@@ -298,6 +319,9 @@ int setup_irq(unsigned int irq, struct i rand_initialize_irq(irq); } + if (!(new->flags & IRQF_NODELAY)) + if (start_irq_thread(irq, desc)) + return -ENOMEM; /* * The following block of code has to be executed atomically */ @@ -338,6 +362,11 @@ int setup_irq(unsigned int irq, struct i if (new->flags & IRQF_NOBALANCING) desc->status |= IRQ_NO_BALANCING; + /* + * Propagate any possible IRQF_NODELAY flag into IRQ_NODELAY: + */ + recalculate_desc_flags(desc); + if (!shared) { irq_chip_set_defaults(desc->chip); @@ -384,7 +413,7 @@ int setup_irq(unsigned int irq, struct i new->irq = irq; register_irq_proc(irq); - new->dir = NULL; + new->dir = new->threaded = NULL; register_handler_proc(irq, new); return 0; @@ -455,6 +484,7 @@ void free_irq(unsigned int irq, void *de else desc->chip->disable(irq); } + recalculate_desc_flags(desc); spin_unlock_irqrestore(&desc->lock, flags); unregister_handler_proc(irq, action); @@ -577,3 +607,257 @@ int request_irq(unsigned int irq, irq_ha return retval; } EXPORT_SYMBOL(request_irq); + +#ifdef CONFIG_PREEMPT_HARDIRQS + +int hardirq_preemption = 1; + +EXPORT_SYMBOL(hardirq_preemption); + +static int __init hardirq_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + hardirq_preemption = 0; + else + get_option(&str, &hardirq_preemption); + if (!hardirq_preemption) + printk("turning off hardirq preemption!\n"); + + return 1; +} + +__setup("hardirq-preempt=", hardirq_preempt_setup); + + +/* + * threaded simple handler + */ +static void thread_simple_irq(irq_desc_t *desc) +{ + struct irqaction *action = desc->action; + unsigned int irq = desc - irq_desc; + irqreturn_t action_ret; + + if (action && !desc->depth) { + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } + desc->status &= ~IRQ_INPROGRESS; +} + +/* + * threaded level type irq handler + */ +static void thread_level_irq(irq_desc_t *desc) +{ + unsigned int irq = desc - irq_desc; + + thread_simple_irq(desc); + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); +} + +/* + * threaded fasteoi type irq handler + */ +static void thread_fasteoi_irq(irq_desc_t *desc) +{ + unsigned int irq = desc - irq_desc; + + thread_simple_irq(desc); + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); +} + +/* + * threaded edge type IRQ handler + */ +static void thread_edge_irq(irq_desc_t *desc) +{ + unsigned int irq = desc - irq_desc; + + do { + struct irqaction *action = desc->action; + irqreturn_t action_ret; + + if (unlikely(!action)) { + desc->status &= ~IRQ_INPROGRESS; + desc->chip->mask(irq); + return; + } + + /* + * When another irq arrived while we were handling + * one, we could have masked the irq. + * Renable it, if it was not disabled in meantime. 
+ */ + if (unlikely(((desc->status & (IRQ_PENDING | IRQ_MASKED)) == + (IRQ_PENDING | IRQ_MASKED)) && !desc->depth)) + desc->chip->unmask(irq); + + desc->status &= ~IRQ_PENDING; + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } while ((desc->status & IRQ_PENDING) && !desc->depth); + + desc->status &= ~IRQ_INPROGRESS; +} + +/* + * threaded edge type IRQ handler + */ +static void thread_do_irq(irq_desc_t *desc) +{ + unsigned int irq = desc - irq_desc; + + do { + struct irqaction *action = desc->action; + irqreturn_t action_ret; + + if (unlikely(!action)) { + desc->status &= ~IRQ_INPROGRESS; + desc->chip->disable(irq); + return; + } + + desc->status &= ~IRQ_PENDING; + spin_unlock(&desc->lock); + action_ret = handle_IRQ_event(irq, action); + cond_resched_hardirq_context(); + spin_lock_irq(&desc->lock); + if (!noirqdebug) + note_interrupt(irq, desc, action_ret); + } while ((desc->status & IRQ_PENDING) && !desc->depth); + + desc->status &= ~IRQ_INPROGRESS; + desc->chip->end(irq); +} + +static void do_hardirq(struct irq_desc *desc) +{ + unsigned long flags; + + spin_lock_irqsave(&desc->lock, flags); + + if (!(desc->status & IRQ_INPROGRESS)) + goto out; + + if (desc->handle_irq == handle_simple_irq) + thread_simple_irq(desc); + else if (desc->handle_irq == handle_level_irq) + thread_level_irq(desc); + else if (desc->handle_irq == handle_fasteoi_irq) + thread_fasteoi_irq(desc); + else if (desc->handle_irq == handle_edge_irq) + thread_edge_irq(desc); + else + thread_do_irq(desc); + out: + spin_unlock_irqrestore(&desc->lock, flags); + + if (waitqueue_active(&desc->wait_for_handler)) + wake_up(&desc->wait_for_handler); +} + +extern asmlinkage void __do_softirq(void); + +static int do_irqd(void * __desc) +{ + struct sched_param param = { 0, }; + struct irq_desc *desc = __desc; + +#ifdef CONFIG_SMP + set_cpus_allowed(current, desc->affinity); +#endif + current->flags |= PF_NOFREEZE | PF_HARDIRQ; + + /* + * Set irq thread priority to SCHED_FIFO/50: + */ + param.sched_priority = MAX_USER_RT_PRIO/2; + + sys_sched_setscheduler(current->pid, SCHED_FIFO, ¶m); + + while (!kthread_should_stop()) { + local_irq_disable(); + set_current_state(TASK_INTERRUPTIBLE); + irq_enter(); + do_hardirq(desc); + irq_exit(); + local_irq_enable(); + cond_resched(); +#ifdef CONFIG_SMP + /* + * Did IRQ affinities change? 
+ */ + if (!cpus_equal(current->cpus_allowed, desc->affinity)) + set_cpus_allowed(current, desc->affinity); +#endif + schedule(); + } + __set_current_state(TASK_RUNNING); + + return 0; +} + +static int ok_to_create_irq_threads; + +static int start_irq_thread(int irq, struct irq_desc *desc) +{ + if (desc->thread || !ok_to_create_irq_threads) + return 0; + + desc->thread = kthread_create(do_irqd, desc, "IRQ-%d", irq); + if (!desc->thread) { + printk(KERN_ERR "irqd: could not create IRQ thread %d!\n", irq); + return -ENOMEM; + } + + /* + * An interrupt may have come in before the thread pointer was + * stored in desc->thread; make sure the thread gets woken up in + * such a case: + */ + smp_mb(); + wake_up_process(desc->thread); + + return 0; +} + +void __init init_hardirqs(void) +{ + int i; + ok_to_create_irq_threads = 1; + + for (i = 0; i < NR_IRQS; i++) { + irq_desc_t *desc = irq_desc + i; + + if (desc->action && !(desc->status & IRQ_NODELAY)) + start_irq_thread(i, desc); + } +} + +#else + +static int start_irq_thread(int irq, struct irq_desc *desc) +{ + return 0; +} + +#endif + +void __init early_init_hardirqs(void) +{ + int i; + + for (i = 0; i < NR_IRQS; i++) + init_waitqueue_head(&irq_desc[i].wait_for_handler); +} Index: linux-2.6.24.7/kernel/irq/proc.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/proc.c +++ linux-2.6.24.7/kernel/irq/proc.c @@ -7,6 +7,8 @@ */ #include <linux/irq.h> +#include <asm/uaccess.h> +#include <linux/profile.h> #include <linux/proc_fs.h> #include <linux/interrupt.h> @@ -75,44 +77,6 @@ static int irq_affinity_write_proc(struc #endif -#define MAX_NAMELEN 128 - -static int name_unique(unsigned int irq, struct irqaction *new_action) -{ - struct irq_desc *desc = irq_desc + irq; - struct irqaction *action; - unsigned long flags; - int ret = 1; - - spin_lock_irqsave(&desc->lock, flags); - for (action = desc->action ; action; action = action->next) { - if ((action != new_action) && action->name && - !strcmp(new_action->name, action->name)) { - ret = 0; - break; - } - } - spin_unlock_irqrestore(&desc->lock, flags); - return ret; -} - -void register_handler_proc(unsigned int irq, struct irqaction *action) -{ - char name [MAX_NAMELEN]; - - if (!irq_desc[irq].dir || action->dir || !action->name || - !name_unique(irq, action)) - return; - - memset(name, 0, MAX_NAMELEN); - snprintf(name, MAX_NAMELEN, "%s", action->name); - - /* create /proc/irq/1234/handler/ */ - action->dir = proc_mkdir(name, irq_desc[irq].dir); -} - -#undef MAX_NAMELEN - #define MAX_NAMELEN 10 void register_irq_proc(unsigned int irq) @@ -150,10 +114,96 @@ void register_irq_proc(unsigned int irq) void unregister_handler_proc(unsigned int irq, struct irqaction *action) { + if (action->threaded) + remove_proc_entry(action->threaded->name, action->dir); if (action->dir) remove_proc_entry(action->dir->name, irq_desc[irq].dir); } +#ifndef CONFIG_PREEMPT_RT + +static int threaded_read_proc(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + return sprintf(page, "%c\n", + ((struct irqaction *)data)->flags & IRQF_NODELAY ? 
'0' : '1'); +} + +static int threaded_write_proc(struct file *file, const char __user *buffer, + unsigned long count, void *data) +{ + int c; + struct irqaction *action = data; + irq_desc_t *desc = irq_desc + action->irq; + + if (get_user(c, buffer)) + return -EFAULT; + if (c != '0' && c != '1') + return -EINVAL; + + spin_lock_irq(&desc->lock); + + if (c == '0') + action->flags |= IRQF_NODELAY; + if (c == '1') + action->flags &= ~IRQF_NODELAY; + recalculate_desc_flags(desc); + + spin_unlock_irq(&desc->lock); + + return 1; +} + +#endif + +#define MAX_NAMELEN 128 + +static int name_unique(unsigned int irq, struct irqaction *new_action) +{ + struct irq_desc *desc = irq_desc + irq; + struct irqaction *action; + + for (action = desc->action ; action; action = action->next) + if ((action != new_action) && action->name && + !strcmp(new_action->name, action->name)) + return 0; + return 1; +} + +void register_handler_proc(unsigned int irq, struct irqaction *action) +{ + char name [MAX_NAMELEN]; + + if (!irq_desc[irq].dir || action->dir || !action->name || + !name_unique(irq, action)) + return; + + memset(name, 0, MAX_NAMELEN); + snprintf(name, MAX_NAMELEN, "%s", action->name); + + /* create /proc/irq/1234/handler/ */ + action->dir = proc_mkdir(name, irq_desc[irq].dir); + + if (!action->dir) + return; +#ifndef CONFIG_PREEMPT_RT + { + struct proc_dir_entry *entry; + /* create /proc/irq/1234/handler/threaded */ + entry = create_proc_entry("threaded", 0600, action->dir); + if (!entry) + return; + entry->nlink = 1; + entry->data = (void *)action; + entry->read_proc = threaded_read_proc; + entry->write_proc = threaded_write_proc; + action->threaded = entry; + } +#endif +} + +#undef MAX_NAMELEN + void init_irq_proc(void) { int i; @@ -163,6 +213,9 @@ void init_irq_proc(void) if (!root_irq_dir) return; + /* create /proc/irq/prof_cpu_mask */ + create_prof_cpu_mask(root_irq_dir); + /* * Create entries for all existing IRQs. */ Index: linux-2.6.24.7/kernel/irq/spurious.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/spurious.c +++ linux-2.6.24.7/kernel/irq/spurious.c @@ -10,6 +10,10 @@ #include <linux/module.h> #include <linux/kallsyms.h> #include <linux/interrupt.h> +#ifdef CONFIG_X86_IO_APIC +# include <asm/apicdef.h> +# include <asm/io_apic.h> +#endif static int irqfixup __read_mostly; @@ -203,6 +207,12 @@ void note_interrupt(unsigned int irq, st * The interrupt is stuck */ __report_bad_irq(irq, desc, action_ret); +#ifdef CONFIG_X86_IO_APIC + if (!sis_apic_bug) { + sis_apic_bug = 1; + printk(KERN_ERR "turning off IO-APIC fast mode.\n"); + } +#else /* * Now kill the IRQ */ @@ -210,6 +220,7 @@ void note_interrupt(unsigned int irq, st desc->status |= IRQ_DISABLED; desc->depth = 1; desc->chip->disable(irq); +#endif } desc->irqs_unhandled = 0; } Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3459,7 +3459,7 @@ void account_system_time(struct task_str /* Add system time to cpustat. 
*/ tmp = cputime_to_cputime64(cputime); - if (hardirq_count() - hardirq_offset) + if (hardirq_count() - hardirq_offset || (p->flags & PF_HARDIRQ)) cpustat->irq = cputime64_add(cpustat->irq, tmp); else if (softirq_count() || (p->flags & PF_SOFTIRQ)) cpustat->softirq = cputime64_add(cpustat->softirq, tmp); @@ -4817,6 +4817,27 @@ int __sched cond_resched_softirq_context } EXPORT_SYMBOL(cond_resched_softirq_context); +/* + * Preempt a hardirq context if necessary (possible with hardirq threading): + */ +int cond_resched_hardirq_context(void) +{ + WARN_ON_ONCE(!in_irq()); + WARN_ON_ONCE(!irqs_disabled()); + + if (hardirq_need_resched()) { + irq_exit(); + local_irq_enable(); + __cond_resched(); + local_irq_disable(); + __irq_enter(); + + return 1; + } + return 0; +} +EXPORT_SYMBOL(cond_resched_hardirq_context); + /** * yield - yield the current processor to other threads. * ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-softirq-in-hardirq.patch�������������������������������������������������������0000664�0000764�0000764�00000004134�11041657733�020440� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/interrupt.h | 1 + kernel/softirq.c | 35 +++++++++++++++++++++++++++++++++++ 2 files changed, 36 insertions(+) Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -277,6 +277,7 @@ struct softirq_action asmlinkage void do_softirq(void); extern void open_softirq(int nr, void (*action)(struct softirq_action*), void *data); extern void softirq_init(void); +extern void do_softirq_from_hardirq(void); #ifdef CONFIG_PREEMPT_HARDIRQS # define __raise_softirq_irqoff(nr) raise_softirq_irqoff(nr) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -296,6 +296,8 @@ restart: asmlinkage void __do_softirq(void) { + unsigned long p_flags; + #ifdef CONFIG_PREEMPT_SOFTIRQS /* * 'preempt harder'. Push all softirq processing off to ksoftirqd. @@ -311,6 +313,38 @@ asmlinkage void __do_softirq(void) */ __local_bh_disable((unsigned long)__builtin_return_address(0)); trace_softirq_enter(); + p_flags = current->flags & PF_HARDIRQ; + current->flags &= ~PF_HARDIRQ; + + ___do_softirq(); + + trace_softirq_exit(); + + account_system_vtime(current); + _local_bh_enable(); + + current->flags |= p_flags; +} + +/* + * Process softirqs straight from hardirq context, + * without having to switch to a softirq thread. + * This can reduce the context-switch rate. + * + * NOTE: this is unused right now. 
+ */ +void do_softirq_from_hardirq(void) +{ + unsigned long p_flags; + + if (!local_softirq_pending()) + return; + /* + * 'immediate' softirq execution: + */ + __local_bh_disable((unsigned long)__builtin_return_address(0)); + p_flags = current->flags & PF_HARDIRQ; + current->flags &= ~PF_HARDIRQ; ___do_softirq(); @@ -319,6 +353,7 @@ asmlinkage void __do_softirq(void) account_system_vtime(current); _local_bh_enable(); + current->flags |= p_flags; } #ifndef __ARCH_HAS_DO_SOFTIRQ ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-direct-debug-keyboard.patch����������������������������������������������������0000664�0000764�0000764�00000004754�11041657734�021062� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/sched.h | 6 ++++++ init/main.c | 2 ++ kernel/irq/handle.c | 31 +++++++++++++++++++++++++++++++ 3 files changed, 39 insertions(+) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -271,6 +271,12 @@ extern void account_process_tick(struct extern void update_process_times(int user); extern void scheduler_tick(void); +#ifdef CONFIG_GENERIC_HARDIRQS +extern int debug_direct_keyboard; +#else +# define debug_direct_keyboard 0 +#endif + #ifdef CONFIG_DETECT_SOFTLOCKUP extern void softlockup_tick(void); extern void spawn_softlockup_task(void); Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -858,5 +858,7 @@ static int __init kernel_init(void * unu * initmem segments and start the user-mode stuff.. */ init_post(); + WARN_ON(debug_direct_keyboard); + return 0; } Index: linux-2.6.24.7/kernel/irq/handle.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/handle.c +++ linux-2.6.24.7/kernel/irq/handle.c @@ -132,6 +132,11 @@ irqreturn_t handle_IRQ_event(unsigned in irqreturn_t ret, retval = IRQ_NONE; unsigned int status = 0; +#ifdef __i386__ + if (debug_direct_keyboard && irq == 1) + lockdep_off(); +#endif + handle_dynamic_tick(action); /* @@ -163,9 +168,30 @@ irqreturn_t handle_IRQ_event(unsigned in } local_irq_disable(); +#ifdef __i386__ + if (debug_direct_keyboard && irq == 1) + lockdep_on(); +#endif return retval; } +/* + * Hack - used for development only. 
+ */ +int __read_mostly debug_direct_keyboard = 0; + +int __init debug_direct_keyboard_setup(char *str) +{ + debug_direct_keyboard = 1; + printk(KERN_INFO "Switching IRQ 1 (keyboard) to to direct!\n"); +#ifdef CONFIG_PREEMPT_RT + printk(KERN_INFO "WARNING: kernel may easily crash this way!\n"); +#endif + return 1; +} + +__setup("debug_direct_keyboard", debug_direct_keyboard_setup); + int redirect_hardirq(struct irq_desc *desc) { /* @@ -175,6 +201,11 @@ int redirect_hardirq(struct irq_desc *de !desc->thread) return 0; +#ifdef __i386__ + if (debug_direct_keyboard && (desc - irq_desc == 1)) + return 0; +#endif + BUG_ON(!irqs_disabled()); if (desc->thread && desc->thread->state != TASK_RUNNING) wake_up_process(desc->thread); ��������������������patches/preempt-irqs-timer.patch��������������������������������������������������������������������0000664�0000764�0000764�00000016471�11041657734�016045� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/timer.h | 4 + kernel/timer.c | 128 +++++++++++++++++++++++++++++++++++++------------- 2 files changed, 99 insertions(+), 33 deletions(-) Index: linux-2.6.24.7/include/linux/timer.h =================================================================== --- linux-2.6.24.7.orig/include/linux/timer.h +++ linux-2.6.24.7/include/linux/timer.h @@ -146,10 +146,12 @@ static inline void add_timer(struct time __mod_timer(timer, timer->expires); } -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS) + extern int timer_pending_sync(struct timer_list *timer); extern int try_to_del_timer_sync(struct timer_list *timer); extern int del_timer_sync(struct timer_list *timer); #else +# define timer_pending_sync(t) timer_pending(t) # define try_to_del_timer_sync(t) del_timer(t) # define del_timer_sync(t) del_timer(t) #endif Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -34,6 +34,7 @@ #include <linux/posix-timers.h> #include <linux/cpu.h> #include <linux/syscalls.h> +#include <linux/kallsyms.h> #include <linux/delay.h> #include <linux/tick.h> #include <linux/kallsyms.h> @@ -69,6 +70,7 @@ typedef struct tvec_root_s { struct tvec_t_base_s { spinlock_t lock; struct timer_list *running_timer; + wait_queue_head_t wait_for_running_timer; unsigned long timer_jiffies; tvec_root_t tv1; tvec_t tv2; @@ -249,9 +251,7 @@ EXPORT_SYMBOL_GPL(round_jiffies_relative static inline void set_running_timer(tvec_base_t *base, struct timer_list *timer) { -#ifdef CONFIG_SMP base->running_timer = timer; -#endif } static void internal_add_timer(tvec_base_t *base, struct timer_list *timer) @@ -395,7 +395,7 @@ int __mod_timer(struct timer_list *timer { tvec_base_t *base, *new_base; unsigned long flags; - int ret = 0; + int ret = 0, cpu; timer_stats_timer_set_start_info(timer); BUG_ON(!timer->function); @@ -407,7 +407,8 @@ int __mod_timer(struct timer_list *timer ret = 1; } - new_base = __get_cpu_var(tvec_bases); + cpu = raw_smp_processor_id(); + new_base = per_cpu(tvec_bases, cpu); if (base != new_base) { /* @@ -465,6 +466,18 @@ void add_timer_on(struct timer_list *tim spin_unlock_irqrestore(&base->lock, flags); } +/* + * Wait 
for a running timer + */ +void wait_for_running_timer(struct timer_list *timer) +{ + tvec_base_t *base = timer->base; + + if (base->running_timer == timer) + wait_event(base->wait_for_running_timer, + base->running_timer != timer); +} + /** * mod_timer - modify a timer's timeout * @timer: the timer to be modified @@ -535,7 +548,35 @@ int del_timer(struct timer_list *timer) EXPORT_SYMBOL(del_timer); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_SOFTIRQS) +/* + * This function checks whether a timer is active and not running on any + * CPU. Upon successful (ret >= 0) exit the timer is not queued and the + * handler is not running on any CPU. + * + * It must not be called from interrupt contexts. + */ +int timer_pending_sync(struct timer_list *timer) +{ + tvec_base_t *base; + unsigned long flags; + int ret = -1; + + base = lock_timer_base(timer, &flags); + + if (base->running_timer == timer) + goto out; + + ret = 0; + if (timer_pending(timer)) + ret = 1; +out: + spin_unlock_irqrestore(&base->lock, flags); + + return ret; +} + + /** * try_to_del_timer_sync - Try to deactivate a timer * @timer: timer do del @@ -592,7 +633,7 @@ int del_timer_sync(struct timer_list *ti int ret = try_to_del_timer_sync(timer); if (ret >= 0) return ret; - cpu_relax(); + wait_for_running_timer(timer); } } @@ -638,6 +679,20 @@ static inline void __run_timers(tvec_bas struct list_head *head = &work_list; int index = base->timer_jiffies & TVR_MASK; + if (softirq_need_resched()) { + spin_unlock_irq(&base->lock); + wake_up(&base->wait_for_running_timer); + cond_resched_softirq_context(); + cpu_relax(); + spin_lock_irq(&base->lock); + /* + * We can simply continue after preemption, nobody + * else can touch timer_jiffies so 'index' is still + * valid. Any new jiffy will be taken care of in + * subsequent loops: + */ + } + /* * Cascade timers: */ @@ -665,18 +720,17 @@ static inline void __run_timers(tvec_bas int preempt_count = preempt_count(); fn(data); if (preempt_count != preempt_count()) { - printk(KERN_WARNING "huh, entered %p " - "with preempt_count %08x, exited" - " with %08x?\n", - fn, preempt_count, - preempt_count()); - BUG(); + print_symbol("BUG: unbalanced timer-handler preempt count in %s!\n", (unsigned long) fn); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + preempt_count() = preempt_count; } } + set_running_timer(base, NULL); + cond_resched_softirq_context(); spin_lock_irq(&base->lock); } } - set_running_timer(base, NULL); + wake_up(&base->wait_for_running_timer); spin_unlock_irq(&base->lock); } @@ -849,10 +903,10 @@ void update_process_times(int user_tick) /* Note: this timer irq context must be accounted for as well. */ account_process_tick(p, user_tick); + scheduler_tick(); run_local_timers(); if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_tick); - scheduler_tick(); run_posix_cpu_timers(p); } @@ -898,35 +952,45 @@ static inline void calc_load(unsigned lo } /* - * This function runs timers and the timer-tq in bottom half context. + * Called by the local, per-CPU timer interrupt on SMP. */ -static void run_timer_softirq(struct softirq_action *h) +void run_local_timers(void) { - tvec_base_t *base = __get_cpu_var(tvec_bases); - - hrtimer_run_queues(); - - if (time_after_eq(jiffies, base->timer_jiffies)) - __run_timers(base); + raise_softirq(TIMER_SOFTIRQ); + softlockup_tick(); } /* - * Called by the local, per-CPU timer interrupt on SMP. 
+ * Time of day handling: */ -void run_local_timers(void) +static inline void update_times(void) { - raise_softirq(TIMER_SOFTIRQ); - softlockup_tick(); + static unsigned long last_tick = INITIAL_JIFFIES; + unsigned long ticks, flags; + + write_seqlock_irqsave(&xtime_lock, flags); + ticks = jiffies - last_tick; + if (ticks) { + last_tick += ticks; + update_wall_time(); + calc_load(ticks); + } + write_sequnlock_irqrestore(&xtime_lock, flags); } + /* - * Called by the timer interrupt. xtime_lock must already be taken - * by the timer IRQ! + * This function runs timers and the timer-tq in bottom half context. */ -static inline void update_times(unsigned long ticks) +static void run_timer_softirq(struct softirq_action *h) { - update_wall_time(); - calc_load(ticks); + tvec_base_t *base = __get_cpu_var(tvec_bases); + + update_times(); + hrtimer_run_queues(); + + if (time_after_eq(jiffies, base->timer_jiffies)) + __run_timers(base); } /* @@ -938,7 +1002,6 @@ static inline void update_times(unsigned void do_timer(unsigned long ticks) { jiffies_64 += ticks; - update_times(ticks); } #ifdef __ARCH_WANT_SYS_ALARM @@ -1270,6 +1333,7 @@ static int __cpuinit init_timers_cpu(int spin_lock_init(&base->lock); lockdep_set_class(&base->lock, base_lock_keys + cpu); + init_waitqueue_head(&base->wait_for_running_timer); for (j = 0; j < TVN_SIZE; j++) { INIT_LIST_HEAD(base->tv5.vec + j); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-hrtimer.patch������������������������������������������������������������������0000664�0000764�0000764�00000010316�11041673237�016363� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� include/linux/hrtimer.h | 10 ++++++++++ kernel/hrtimer.c | 35 ++++++++++++++++++++++++++++++++++- kernel/itimer.c | 1 + kernel/posix-timers.c | 3 +++ 4 files changed, 48 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/hrtimer.h =================================================================== --- linux-2.6.24.7.orig/include/linux/hrtimer.h +++ linux-2.6.24.7/include/linux/hrtimer.h @@ -200,6 +200,9 @@ struct hrtimer_cpu_base { struct list_head cb_pending; unsigned long nr_events; #endif +#ifdef CONFIG_PREEMPT_SOFTIRQS + wait_queue_head_t wait; +#endif }; #ifdef CONFIG_HIGH_RES_TIMERS @@ -270,6 +273,13 @@ static inline int hrtimer_restart(struct return hrtimer_start(timer, timer->expires, HRTIMER_MODE_ABS); } +/* Softirq preemption could deadlock timer removal */ +#ifdef CONFIG_PREEMPT_SOFTIRQS + extern void hrtimer_wait_for_timer(const struct hrtimer *timer); +#else +# define hrtimer_wait_for_timer(timer) do { cpu_relax(); } while (0) +#endif + /* Query timers: */ extern ktime_t hrtimer_get_remaining(const struct hrtimer *timer); extern int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp); Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -989,7 +989,7 @@ int hrtimer_cancel(struct hrtimer *timer if (ret >= 0) return ret; - cpu_relax(); + 
hrtimer_wait_for_timer(timer); } } EXPORT_SYMBOL_GPL(hrtimer_cancel); @@ -1100,6 +1100,32 @@ int hrtimer_get_res(const clockid_t whic } EXPORT_SYMBOL_GPL(hrtimer_get_res); +#ifdef CONFIG_PREEMPT_SOFTIRQS +# define wake_up_timer_waiters(b) wake_up(&(b)->wait) + +/** + * hrtimer_wait_for_timer - Wait for a running timer + * + * @timer: timer to wait for + * + * The function waits in case the timers callback function is + * currently executed on the waitqueue of the timer base. The + * waitqueue is woken up after the timer callback function has + * finished execution. + */ +void hrtimer_wait_for_timer(const struct hrtimer *timer) +{ + struct hrtimer_clock_base *base = timer->base; + + if (base && base->cpu_base) + wait_event(base->cpu_base->wait, + !(timer->state & HRTIMER_STATE_CALLBACK)); +} + +#else +# define wake_up_timer_waiters(b) do { } while (0) +#endif + #ifdef CONFIG_HIGH_RES_TIMERS /* @@ -1246,6 +1272,8 @@ static void run_hrtimer_softirq(struct s } } spin_unlock_irq(&cpu_base->lock); + + wake_up_timer_waiters(cpu_base); } #endif /* CONFIG_HIGH_RES_TIMERS */ @@ -1296,6 +1324,8 @@ static inline void run_hrtimer_queue(str } } spin_unlock_irq(&cpu_base->lock); + + wake_up_timer_waiters(cpu_base); } /* @@ -1477,6 +1507,9 @@ static void __cpuinit init_hrtimers_cpu( cpu_base->clock_base[i].cpu_base = cpu_base; hrtimer_init_hres(cpu_base); +#ifdef CONFIG_PREEMPT_SOFTIRQS + init_waitqueue_head(&cpu_base->wait); +#endif } #ifdef CONFIG_HOTPLUG_CPU Index: linux-2.6.24.7/kernel/itimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/itimer.c +++ linux-2.6.24.7/kernel/itimer.c @@ -170,6 +170,7 @@ again: /* We are sharing ->siglock with it_real_fn() */ if (hrtimer_try_to_cancel(timer) < 0) { spin_unlock_irq(&tsk->sighand->siglock); + hrtimer_wait_for_timer(&tsk->signal->real_timer); goto again; } expires = timeval_to_ktime(value->it_value); Index: linux-2.6.24.7/kernel/posix-timers.c =================================================================== --- linux-2.6.24.7.orig/kernel/posix-timers.c +++ linux-2.6.24.7/kernel/posix-timers.c @@ -809,6 +809,7 @@ retry: unlock_timer(timr, flag); if (error == TIMER_RETRY) { + hrtimer_wait_for_timer(&timr->it.real.timer); rtn = NULL; // We already got the old time... 
goto retry; } @@ -848,6 +849,7 @@ retry_delete: if (timer_delete_hook(timer) == TIMER_RETRY) { unlock_timer(timer, flags); + hrtimer_wait_for_timer(&timer->it.real.timer); goto retry_delete; } @@ -880,6 +882,7 @@ retry_delete: if (timer_delete_hook(timer) == TIMER_RETRY) { unlock_timer(timer, flags); + hrtimer_wait_for_timer(&timer->it.real.timer); goto retry_delete; } list_del(&timer->list); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-i386.patch���������������������������������������������������������������������0000664�0000764�0000764�00000011605�11041657733�015407� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/i8259_32.c | 9 ++++++--- arch/x86/kernel/io_apic_32.c | 20 +++++--------------- arch/x86/mach-default/setup.c | 3 ++- arch/x86/mach-visws/visws_apic.c | 2 ++ arch/x86/mach-voyager/setup.c | 3 ++- 5 files changed, 17 insertions(+), 20 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/i8259_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i8259_32.c +++ linux-2.6.24.7/arch/x86/kernel/i8259_32.c @@ -169,6 +169,8 @@ static void mask_and_ack_8259A(unsigned */ if (cached_irq_mask & irqmask) goto spurious_8259A_irq; + if (irq & 8) + outb(0x60+(irq&7),PIC_SLAVE_CMD); /* 'Specific EOI' to slave */ cached_irq_mask |= irqmask; handle_real_irq: @@ -296,10 +298,10 @@ void init_8259A(int auto_eoi) outb_p(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */ outb_p(0x20 + 0, PIC_MASTER_IMR); /* ICW2: 8259A-1 IR0-7 mapped to 0x20-0x27 */ outb_p(1U << PIC_CASCADE_IR, PIC_MASTER_IMR); /* 8259A-1 (the master) has a slave on IR2 */ - if (auto_eoi) /* master does Auto EOI */ - outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); - else /* master expects normal EOI */ + if (!auto_eoi) /* master expects normal EOI */ outb_p(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); + else /* master does Auto EOI */ + outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); outb_p(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ outb_p(0x20 + 8, PIC_SLAVE_IMR); /* ICW2: 8259A-2 IR0-7 mapped to 0x28-0x2f */ @@ -351,6 +353,7 @@ static irqreturn_t math_error_irq(int cp */ static struct irqaction fpu_irq = { .handler = math_error_irq, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "fpu", }; Index: linux-2.6.24.7/arch/x86/kernel/io_apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_32.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_32.c @@ -261,18 +261,6 @@ static void __unmask_IO_APIC_irq (unsign __modify_IO_APIC_irq(irq, 0, 0x00010000); } -/* mask = 1, trigger = 0 */ -static void __mask_and_edge_IO_APIC_irq (unsigned int irq) -{ - __modify_IO_APIC_irq(irq, 0x00010000, 0x00008000); -} - -/* mask = 0, trigger = 1 */ -static void __unmask_and_level_IO_APIC_irq (unsigned int irq) -{ - __modify_IO_APIC_irq(irq, 0x00008000, 0x00010000); -} - static void mask_IO_APIC_irq 
(unsigned int irq) { unsigned long flags; @@ -1493,7 +1481,7 @@ void __init print_IO_APIC(void) return; } -#if 0 +#if 1 static void print_APIC_bitfield (int base) { @@ -1989,8 +1977,10 @@ static void ack_ioapic_quirk_irq(unsigne if (!(v & (1 << (i & 0x1f)))) { atomic_inc(&irq_mis_count); spin_lock(&ioapic_lock); - __mask_and_edge_IO_APIC_irq(irq); - __unmask_and_level_IO_APIC_irq(irq); + /* mask = 1, trigger = 0 */ + __modify_IO_APIC_irq(irq, 0x00010000, 0x00008000); + /* mask = 0, trigger = 1 */ + __modify_IO_APIC_irq(irq, 0x00008000, 0x00010000); spin_unlock(&ioapic_lock); } } Index: linux-2.6.24.7/arch/x86/mach-default/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mach-default/setup.c +++ linux-2.6.24.7/arch/x86/mach-default/setup.c @@ -37,6 +37,7 @@ void __init pre_intr_init_hook(void) */ static struct irqaction irq2 = { .handler = no_action, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "cascade", }; @@ -85,7 +86,7 @@ void __init trap_init_hook(void) static struct irqaction irq0 = { .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL, + .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "timer" }; Index: linux-2.6.24.7/arch/x86/mach-visws/visws_apic.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mach-visws/visws_apic.c +++ linux-2.6.24.7/arch/x86/mach-visws/visws_apic.c @@ -257,11 +257,13 @@ out_unlock: static struct irqaction master_action = { .handler = piix4_master_intr, .name = "PIIX4-8259", + .flags = IRQF_NODELAY, }; static struct irqaction cascade_action = { .handler = no_action, .name = "cascade", + .flags = IRQF_NODELAY, }; Index: linux-2.6.24.7/arch/x86/mach-voyager/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mach-voyager/setup.c +++ linux-2.6.24.7/arch/x86/mach-voyager/setup.c @@ -20,6 +20,7 @@ void __init pre_intr_init_hook(void) */ static struct irqaction irq2 = { .handler = no_action, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "cascade", }; @@ -46,7 +47,7 @@ void __init trap_init_hook(void) static struct irqaction irq0 = { .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL, + .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "timer" }; ���������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-i386-ioapic-mask-quirk.patch���������������������������������������������������0000664�0000764�0000764�00000013712�11041657733�020734� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From mschmidt@redhat.com Thu Jun 21 13:32:02 2007 Return-Path: <mschmidt@redhat.com> Received: from mx1.redhat.com (mx1.redhat.com [66.187.233.31]) by mail.tglx.de (Postfix) with ESMTP id CA11565C065 for <tglx@linutronix.de>; Thu, 21 Jun 2007 13:32:02 +0200 (CEST) Received: from int-mx1.corp.redhat.com (int-mx1.corp.redhat.com [172.16.52.254]) by mx1.redhat.com (8.13.1/8.13.1) with ESMTP id l5LBVoq3016914; Thu, 21 Jun 2007 07:31:50 -0400 
Received: from pobox.stuttgart.redhat.com (pobox.stuttgart.redhat.com [172.16.2.10]) by int-mx1.corp.redhat.com (8.13.1/8.13.1) with ESMTP id l5LBVmp0010104; Thu, 21 Jun 2007 07:31:49 -0400 Received: from [10.34.32.84] (brian.englab.brq.redhat.com [10.34.32.84]) by pobox.stuttgart.redhat.com (8.12.11.20060308/8.12.11) with ESMTP id l5LBVl5k000423; Thu, 21 Jun 2007 13:31:47 +0200 Message-ID: <467A61A3.7060804@redhat.com> Date: Thu, 21 Jun 2007 13:31:47 +0200 From: Michal Schmidt <mschmidt@redhat.com> User-Agent: Thunderbird 1.5.0.12 (X11/20070529) MIME-Version: 1.0 To: Steven Rostedt <rostedt@goodmis.org> CC: Ingo Molnar <mingo@redhat.com>, Thomas Gleixner <tglx@linutronix.de>, linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH -rt] irq nobody cared workaround for i386 References: <4676CF81.2000205@redhat.com> <4677D7AF.7040700@redhat.com> <467932B4.6030800@redhat.com> <467936FE.8050704@redhat.com> In-Reply-To: <467936FE.8050704@redhat.com> X-Enigmail-Version: 0.94.2.0 Content-Type: text/plain; charset=ISO-8859-1 X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Steven Rostedt wrote: > Michal Schmidt wrote: > >> I came to the conclusion that the IO-APICs which need the fix for the >> nobody cared bug don't have the issue ack_ioapic_quirk_irq is designed >> to work-around. It should be safe simply to use the normal >> ack_ioapic_irq as the .eoi method in pcix_ioapic_chip. >> So this is the port of Steven's fix for the nobody cared bug to i386. It >> works fine on IBM LS21 I have access to. >> >> > You want to make that "apic > 0". Note the spacing. If it breaks > 80 characters, then simply put it to a new line. > > [...] > ACK > > -- Steve > OK, I fixed the spacing in both occurences. 
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> --- arch/x86/kernel/io_apic_32.c | 62 ++++++++++++++++++++++++++++++++++++++----- 1 file changed, 55 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/io_apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_32.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_32.c @@ -261,6 +261,18 @@ static void __unmask_IO_APIC_irq (unsign __modify_IO_APIC_irq(irq, 0, 0x00010000); } +/* trigger = 0 (edge mode) */ +static void __pcix_mask_IO_APIC_irq (unsigned int irq) +{ + __modify_IO_APIC_irq(irq, 0, 0x00008000); +} + +/* mask = 0, trigger = 1 (level mode) */ +static void __pcix_unmask_IO_APIC_irq (unsigned int irq) +{ + __modify_IO_APIC_irq(irq, 0x00008000, 0x00010000); +} + static void mask_IO_APIC_irq (unsigned int irq) { unsigned long flags; @@ -279,6 +291,24 @@ static void unmask_IO_APIC_irq (unsigned spin_unlock_irqrestore(&ioapic_lock, flags); } +static void pcix_mask_IO_APIC_irq (unsigned int irq) +{ + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + __pcix_mask_IO_APIC_irq(irq); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +static void pcix_unmask_IO_APIC_irq (unsigned int irq) +{ + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + __pcix_unmask_IO_APIC_irq(irq); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) { struct IO_APIC_route_entry entry; @@ -1224,23 +1254,28 @@ static int assign_irq_vector(int irq) return vector; } + static struct irq_chip ioapic_chip; +static struct irq_chip pcix_ioapic_chip; #define IOAPIC_AUTO -1 #define IOAPIC_EDGE 0 #define IOAPIC_LEVEL 1 -static void ioapic_register_intr(int irq, int vector, unsigned long trigger) +static void ioapic_register_intr(int irq, int vector, unsigned long trigger, + int pcix) { + struct irq_chip *chip = pcix ? &pcix_ioapic_chip : &ioapic_chip; + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || trigger == IOAPIC_LEVEL) { irq_desc[irq].status |= IRQ_LEVEL; - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_fasteoi_irq, "fasteoi"); + set_irq_chip_and_handler_name(irq, chip, handle_fasteoi_irq, + pcix ? "pcix-fasteoi" : "fasteoi"); } else { irq_desc[irq].status &= ~IRQ_LEVEL; - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_edge_irq, "edge"); + set_irq_chip_and_handler_name(irq, chip, handle_edge_irq, + pcix ? 
"pcix-edge" : "edge"); } set_intr_gate(vector, interrupt[irq]); } @@ -1310,7 +1345,8 @@ static void __init setup_IO_APIC_irqs(vo if (IO_APIC_IRQ(irq)) { vector = assign_irq_vector(irq); entry.vector = vector; - ioapic_register_intr(irq, vector, IOAPIC_AUTO); + ioapic_register_intr(irq, vector, IOAPIC_AUTO, + apic > 0); if (!apic && (irq < 16)) disable_8259A_irq(irq); @@ -2005,6 +2041,18 @@ static struct irq_chip ioapic_chip __rea .retrigger = ioapic_retrigger_irq, }; +static struct irq_chip pcix_ioapic_chip __read_mostly = { + .name = "IO-APIC", + .startup = startup_ioapic_irq, + .mask = pcix_mask_IO_APIC_irq, + .unmask = pcix_unmask_IO_APIC_irq, + .ack = ack_ioapic_irq, + .eoi = ack_ioapic_irq, +#ifdef CONFIG_SMP + .set_affinity = set_ioapic_affinity_irq, +#endif + .retrigger = ioapic_retrigger_irq, +}; static inline void init_IO_APIC_traps(void) { @@ -2817,7 +2865,7 @@ int io_apic_set_pci_routing (int ioapic, mp_ioapics[ioapic].mpc_apicid, pin, entry.vector, irq, edge_level, active_high_low); - ioapic_register_intr(irq, entry.vector, edge_level); + ioapic_register_intr(irq, entry.vector, edge_level, ioapic > 0); if (!ioapic && (irq < 16)) disable_8259A_irq(irq); ������������������������������������������������������patches/preempt-irqs-mips.patch���������������������������������������������������������������������0000664�0000764�0000764�00000001100�11041657734�015654� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/mips/kernel/i8253.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/mips/kernel/i8253.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/i8253.c +++ linux-2.6.24.7/arch/mips/kernel/i8253.c @@ -100,7 +100,7 @@ static irqreturn_t timer_interrupt(int i static struct irqaction irq0 = { .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_NOBALANCING, + .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "timer" }; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-x86-64.patch�������������������������������������������������������������������0000664�0000764�0000764�00000002057�11041657735�015575� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/i8259_64.c | 1 + arch/x86/kernel/time_64.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/i8259_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i8259_64.c 
+++ linux-2.6.24.7/arch/x86/kernel/i8259_64.c @@ -397,6 +397,7 @@ device_initcall(i8259A_init_sysfs); static struct irqaction irq2 = { .handler = no_action, + .flags = IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "cascade", }; Index: linux-2.6.24.7/arch/x86/kernel/time_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/time_64.c +++ linux-2.6.24.7/arch/x86/kernel/time_64.c @@ -259,7 +259,8 @@ static unsigned int __init tsc_calibrate static struct irqaction irq0 = { .handler = timer_event_interrupt, - .flags = IRQF_DISABLED | IRQF_IRQPOLL | IRQF_NOBALANCING, + .flags = IRQF_DISABLED | IRQF_IRQPOLL | IRQF_NOBALANCING | + IRQF_NODELAY, .mask = CPU_MASK_NONE, .name = "timer" }; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-x86-64-ioapic-mask-quirk.patch�������������������������������������������������0000664�0000764�0000764�00000006520�11041657730�021113� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/io_apic_64.c | 62 +++++++++++++++++++++++++++++++++---------- 1 file changed, 49 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/io_apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_64.c @@ -354,6 +354,9 @@ DO_ACTION( __mask, 0, |= 0x0 DO_ACTION( __unmask, 0, &= 0xfffeffff, ) /* mask = 0 */ +DO_ACTION( __pcix_mask, 0, &= 0xffff7fff, ) /* edge */ +DO_ACTION( __pcix_unmask, 0, = (reg & 0xfffeffff) | 0x00008000, ) /* level */ + static void mask_IO_APIC_irq (unsigned int irq) { unsigned long flags; @@ -371,6 +374,23 @@ static void unmask_IO_APIC_irq (unsigned __unmask_IO_APIC_irq(irq); spin_unlock_irqrestore(&ioapic_lock, flags); } +static void pcix_mask_IO_APIC_irq (unsigned int irq) +{ + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + __pcix_mask_IO_APIC_irq(irq); + spin_unlock_irqrestore(&ioapic_lock, flags); +} + +static void pcix_unmask_IO_APIC_irq (unsigned int irq) +{ + unsigned long flags; + + spin_lock_irqsave(&ioapic_lock, flags); + __pcix_unmask_IO_APIC_irq(irq); + spin_unlock_irqrestore(&ioapic_lock, flags); +} static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin) { @@ -796,17 +816,20 @@ void __setup_vector_irq(int cpu) static struct irq_chip ioapic_chip; +static struct irq_chip pcix_ioapic_chip; -static void ioapic_register_intr(int irq, unsigned long trigger) +static void ioapic_register_intr(int irq, unsigned long trigger, int pcix) { + struct irq_chip *chip = pcix ? 
&pcix_ioapic_chip : &ioapic_chip; + if (trigger) { irq_desc[irq].status |= IRQ_LEVEL; - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_fasteoi_irq, "fasteoi"); + set_irq_chip_and_handler_name(irq, chip, handle_fasteoi_irq, + pcix ? "pcix-fasteoi" : "fasteoi"); } else { irq_desc[irq].status &= ~IRQ_LEVEL; - set_irq_chip_and_handler_name(irq, &ioapic_chip, - handle_edge_irq, "edge"); + set_irq_chip_and_handler_name(irq, chip, handle_edge_irq, + pcix ? "pcix-edge" : "edge"); } } @@ -851,7 +874,7 @@ static void setup_IO_APIC_irq(int apic, if (trigger) entry.mask = 1; - ioapic_register_intr(irq, trigger); + ioapic_register_intr(irq, trigger, apic > 0); if (irq < 16) disable_8259A_irq(irq); @@ -1488,14 +1511,27 @@ static void ack_apic_level(unsigned int } static struct irq_chip ioapic_chip __read_mostly = { - .name = "IO-APIC", - .startup = startup_ioapic_irq, - .mask = mask_IO_APIC_irq, - .unmask = unmask_IO_APIC_irq, - .ack = ack_apic_edge, - .eoi = ack_apic_level, + .name = "IO-APIC", + .startup = startup_ioapic_irq, + .mask = mask_IO_APIC_irq, + .unmask = unmask_IO_APIC_irq, + .ack = ack_apic_edge, + .eoi = ack_apic_level, +#ifdef CONFIG_SMP + .set_affinity = set_ioapic_affinity_irq, +#endif + .retrigger = ioapic_retrigger_irq, +}; + +static struct irq_chip pcix_ioapic_chip __read_mostly = { + .name = "IO-APIC", + .startup = startup_ioapic_irq, + .mask = pcix_mask_IO_APIC_irq, + .unmask = pcix_unmask_IO_APIC_irq, + .ack = ack_apic_edge, + .eoi = ack_apic_level, #ifdef CONFIG_SMP - .set_affinity = set_ioapic_affinity_irq, + .set_affinity = set_ioapic_affinity_irq, #endif .retrigger = ioapic_retrigger_irq, }; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-arm.patch����������������������������������������������������������������������0000664�0000764�0000764�00000001040�11041657731�015463� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/common/time-acorn.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/common/time-acorn.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/common/time-acorn.c +++ linux-2.6.24.7/arch/arm/common/time-acorn.c @@ -77,7 +77,7 @@ ioc_timer_interrupt(int irq, void *dev_i static struct irqaction ioc_timer_irq = { .name = "timer", - .flags = IRQF_DISABLED, + .flags = IRQF_DISABLED | IRQF_NODELAY, .handler = ioc_timer_interrupt }; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-arm-fix-oprofile.patch���������������������������������������������������������0000664�0000764�0000764�00000001545�11041657733�020100� 
0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Convert XScale performance monitor unit (PMU) interrupt used by oprofile to IRQF_NODELAY. PMU results not useful if ISR is run as thread. Signed-off-by: Kevin Hilman <khilman@mvista.com> arch/arm/oprofile/op_model_xscale.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/oprofile/op_model_xscale.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/oprofile/op_model_xscale.c +++ linux-2.6.24.7/arch/arm/oprofile/op_model_xscale.c @@ -381,8 +381,9 @@ static int xscale_pmu_start(void) { int ret; u32 pmnc = read_pmnc(); + int irq_flags = IRQF_DISABLED | IRQF_NODELAY; - ret = request_irq(XSCALE_PMU_IRQ, xscale_pmu_interrupt, IRQF_DISABLED, + ret = request_irq(XSCALE_PMU_IRQ, xscale_pmu_interrupt, irq_flags, "XScale PMU", (void *)results); if (ret < 0) { �����������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-ppc.patch����������������������������������������������������������������������0000664�0000764�0000764�00000011735�11043037014�015466� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/entry_32.S | 6 +++--- arch/powerpc/kernel/irq.c | 2 -- arch/powerpc/kernel/ppc_ksyms.c | 1 - arch/powerpc/platforms/iseries/setup.c | 6 ++++-- arch/powerpc/platforms/pseries/setup.c | 6 ++++-- include/asm-powerpc/thread_info.h | 5 +++++ 6 files changed, 16 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_32.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_32.S @@ -662,7 +662,7 @@ user_exc_return: /* r10 contains MSR_KE /* Check current_thread_info()->flags */ rlwinm r9,r1,0,0,(31-THREAD_SHIFT) lwz r9,TI_FLAGS(r9) - andi. r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED) + andi. r0,r9,(_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK|_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne do_work restore_user: @@ -897,7 +897,7 @@ global_dbcr0: #endif /* !(CONFIG_4xx || CONFIG_BOOKE) */ do_work: /* r10 contains MSR_KERNEL here */ - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beq do_user_signal do_resched: /* r10 contains MSR_KERNEL here */ @@ -911,7 +911,7 @@ recheck: MTMSRD(r10) /* disable interrupts */ rlwinm r9,r1,0,0,(31-THREAD_SHIFT) lwz r9,TI_FLAGS(r9) - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne- do_resched andi. 
r0,r9,_TIF_SIGPENDING|_TIF_RESTORE_SIGMASK beq restore_user Index: linux-2.6.24.7/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/irq.c +++ linux-2.6.24.7/arch/powerpc/kernel/irq.c @@ -94,8 +94,6 @@ extern atomic_t ipi_sent; #endif #ifdef CONFIG_PPC64 -EXPORT_SYMBOL(irq_desc); - int distribute_irqs = 1; static inline notrace unsigned long get_hard_enabled(void) Index: linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ppc_ksyms.c +++ linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c @@ -167,7 +167,6 @@ EXPORT_SYMBOL(screen_info); #ifdef CONFIG_PPC32 EXPORT_SYMBOL(timer_interrupt); -EXPORT_SYMBOL(irq_desc); EXPORT_SYMBOL(tb_ticks_per_jiffy); EXPORT_SYMBOL(console_drivers); EXPORT_SYMBOL(cacheable_memcpy); Index: linux-2.6.24.7/arch/powerpc/platforms/iseries/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/iseries/setup.c +++ linux-2.6.24.7/arch/powerpc/platforms/iseries/setup.c @@ -564,12 +564,14 @@ static void iseries_shared_idle(void) { while (1) { tick_nohz_stop_sched_tick(); - while (!need_resched() && !hvlpevent_is_pending()) { + while (!need_resched() && !need_resched_delayed() + && !hvlpevent_is_pending()) { local_irq_disable(); ppc64_runlatch_off(); /* Recheck with irqs off */ - if (!need_resched() && !hvlpevent_is_pending()) + if (!need_resched() && !need_resched_delayed() + && !hvlpevent_is_pending()) yield_shared_processor(); HMT_medium(); Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/setup.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/setup.c @@ -413,7 +413,8 @@ static void pseries_dedicated_idle_sleep set_thread_flag(TIF_POLLING_NRFLAG); while (get_tb() < start_snooze) { - if (need_resched() || cpu_is_offline(cpu)) + if (need_resched() || need_resched_delayed() || + cpu_is_offline(cpu)) goto out; ppc64_runlatch_off(); HMT_low(); @@ -424,7 +425,8 @@ static void pseries_dedicated_idle_sleep clear_thread_flag(TIF_POLLING_NRFLAG); smp_mb(); local_irq_disable(); - if (need_resched() || cpu_is_offline(cpu)) + if (need_resched() || need_resched_delayed() || + cpu_is_offline(cpu)) goto out; } Index: linux-2.6.24.7/include/asm-powerpc/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/thread_info.h +++ linux-2.6.24.7/include/asm-powerpc/thread_info.h @@ -124,6 +124,9 @@ static inline struct thread_info *curren #define TIF_FREEZE 14 /* Freezing for suspend */ #define TIF_RUNLATCH 15 /* Is the runlatch enabled? 
*/ #define TIF_ABI_PENDING 16 /* 32/64 bit switch needed */ +#define TIF_NEED_RESCHED_DELAYED \ + 17 /* reschedule on return to userspace */ + /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) @@ -142,6 +145,8 @@ static inline struct thread_info *curren #define _TIF_FREEZE (1<<TIF_FREEZE) #define _TIF_RUNLATCH (1<<TIF_RUNLATCH) #define _TIF_ABI_PENDING (1<<TIF_ABI_PENDING) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) + #define _TIF_SYSCALL_T_OR_A (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP) #define _TIF_USER_WORK_MASK ( _TIF_SIGPENDING | \ �����������������������������������patches/preempt-irqs-ppc-ack-irq-fixups.patch�������������������������������������������������������0000664�0000764�0000764�00000005372�11041657735�020347� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������As fasteoi type chips never had to define their ack() method before the recent Ingo's change to handle_fasteoi_irq(), any attempt to execute handler in thread resulted in the kernel crash. So, define their ack() methods to be the same as their eoi() ones... Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> --- Since there was no feedback on three solutions I suggested, I'm going the way of least resistance and making the fasteoi type chips behave the way that handle_fasteoi_irq() is expecting from them... arch/powerpc/platforms/cell/interrupt.c | 1 + arch/powerpc/platforms/iseries/irq.c | 1 + arch/powerpc/platforms/pseries/xics.c | 2 ++ arch/powerpc/sysdev/mpic.c | 1 + 4 files changed, 5 insertions(+) Index: linux-2.6.24.7/arch/powerpc/platforms/cell/interrupt.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/cell/interrupt.c +++ linux-2.6.24.7/arch/powerpc/platforms/cell/interrupt.c @@ -90,6 +90,7 @@ static struct irq_chip iic_chip = { .typename = " CELL-IIC ", .mask = iic_mask, .unmask = iic_unmask, + .ack = iic_eoi, .eoi = iic_eoi, }; Index: linux-2.6.24.7/arch/powerpc/platforms/iseries/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/iseries/irq.c +++ linux-2.6.24.7/arch/powerpc/platforms/iseries/irq.c @@ -278,6 +278,7 @@ static struct irq_chip iseries_pic = { .shutdown = iseries_shutdown_IRQ, .unmask = iseries_enable_IRQ, .mask = iseries_disable_IRQ, + .ack = iseries_end_IRQ, .eoi = iseries_end_IRQ }; Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/xics.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c @@ -461,6 +461,7 @@ static struct irq_chip xics_pic_direct = .startup = xics_startup, .mask = xics_mask_irq, .unmask = xics_unmask_irq, + .ack = xics_eoi_direct, .eoi = xics_eoi_direct, .set_affinity = xics_set_affinity }; @@ -471,6 +472,7 @@ static struct irq_chip xics_pic_lpar = { .startup = xics_startup, .mask = xics_mask_irq, .unmask = xics_unmask_irq, + .ack = xics_eoi_lpar, .eoi = xics_eoi_lpar, .set_affinity = xics_set_affinity }; Index: linux-2.6.24.7/arch/powerpc/sysdev/mpic.c =================================================================== --- 
linux-2.6.24.7.orig/arch/powerpc/sysdev/mpic.c +++ linux-2.6.24.7/arch/powerpc/sysdev/mpic.c @@ -845,6 +845,7 @@ int mpic_set_irq_type(unsigned int virq, static struct irq_chip mpic_irq_chip = { .mask = mpic_mask_irq, .unmask = mpic_unmask_irq, + .ack = mpic_end_irq, .eoi = mpic_end_irq, .set_type = mpic_set_irq_type, }; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-ppc-fix-b5.patch���������������������������������������������������������������0000664�0000764�0000764�00000003155�11041657732�016570� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following boot time error by removing ack member added by the rt patch. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Processor 1 found. Brought up 2 CPUs ------------[ cut here ]------------ kernel BUG at arch/powerpc/platforms/cell/interrupt.c:86! pu 0x1: Vector: 700 (Program Check) at [c00000000fff3c80] pc: c000000000033f9c: .iic_eoi+0x58/0x64 lr: c00000000009add8: .handle_percpu_irq+0xd4/0xf4 sp: c00000000fff3f00 msr: 9000000000021032 current = 0xc000000000fee040 paca = 0xc000000000509e80 pid = 0, comm = swapper kernel BUG at arch/powerpc/platforms/cell/interrupt.c:86! enter ? for help [link register ] c00000000009add8 .handle_percpu_irq+0xd4/0xf4 [c00000000fff3f00] c00000000009ada8 .handle_percpu_irq+0xa4/0xf4 (unreliable) [c00000000fff3f90] c000000000023bb8 .call_handle_irq+0x1c/0x2c [c000000000ff7950] c00000000000c910 .do_IRQ+0xf8/0x1b8 [c000000000ff79f0] c000000000034f34 .cbe_system_reset_exception+0x74/0xb4 [c000000000ff7a70] c000000000022610 .system_reset_exception+0x40/0xe0 [c000000000ff7af0] c000000000003378 system_reset_common+0xf8/0x100 --- arch/powerpc/platforms/cell/interrupt.c | 1 - 1 file changed, 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/platforms/cell/interrupt.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/cell/interrupt.c +++ linux-2.6.24.7/arch/powerpc/platforms/cell/interrupt.c @@ -90,7 +90,6 @@ static struct irq_chip iic_chip = { .typename = " CELL-IIC ", .mask = iic_mask, .unmask = iic_unmask, - .ack = iic_eoi, .eoi = iic_eoi, }; �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-ppc-fix-b6.patch���������������������������������������������������������������0000664�0000764�0000764�00000002664�11043075234�016566� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following boot time warnings by setting soft_enabled and hard_enabled. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Freeing unused kernel memory: 248k freed BUG: scheduling with irqs disabled: rc.sysinit/0x00000000/373 caller is user_work+0x14/0x2c Call Trace: [C00000001FEC3D10] [C00000000000FAA0] .show_stack+0x68/0x1b0 (unreliable) [C00000001FEC3DB0] [C0000000003E78DC] .schedule+0x78/0x128 [C00000001FEC3E30] [C000000000008C40] user_work+0x14/0x2c BUG: scheduling with irqs disabled: sed/0x00000000/378 caller is user_work+0x14/0x2c Call Trace: [C00000000FA33D10] [C00000000000FAA0] .show_stack+0x68/0x1b0 (unreliable) [C00000000FA33DB0] [C0000000003E78DC] .schedule+0x78/0x128 [C00000000FA33E30] [C000000000008C40] user_work+0x14/0x2c - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/kernel/entry_64.S | 5 +++++ 1 file changed, 5 insertions(+) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -599,6 +599,11 @@ do_work: user_work: #endif + /* here we are preempting the current task */ + li r0,1 + stb r0,PACASOFTIRQEN(r13) + stb r0,PACAHARDIRQEN(r13) + /* Enable interrupts */ ori r10,r10,MSR_EE mtmsrd r10,1 ����������������������������������������������������������������������������patches/preempt-irqs-ppc-celleb-beatic-eoi.patch����������������������������������������������������0000664�0000764�0000764�00000006756�11041657735�020740� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Tue May 15 17:44:07 2007 Date: Tue, 15 May 2007 17:44:07 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [RFC] [patch 1/2] powerpc 2.6.21-rt1: fix kernel hang and/or panic > It occurs on 2.6.21 + patch-2.6.21-rt1 + series of patches that I posted > yesterday. When doing 'hdparm -t /dev/hda' several times, it silently hangs. I think it freezes since It does not response to ping as well. On the other hand, PREEMPT_NONE kernel works just fine. After looking into the rt interrupt handling code, I noticed that code path differs between PREEMPT_NONE and PREEMPT_RT; NONE: mask() -> unmask() -> eoi() RT: mask() -> eoi() -> unmask() The hypervisor underlying the linux on Celleb wants to be called in this "mask() -> unmask() -> eoi()" order. This patch mimics the behavior of PREEPT_NONE even if PREEMPT_RT is specified. Or, would it be better to create/add a new (threaded) irq handler? Any comments? 
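(A minimal user-space sketch of the ordering issue described above and of the shape of the fix in the patch that follows. All names here are placeholders; only the call ordering is taken from this mail and from the patch: with threaded hardirqs the chip's eoi() becomes a no-op and the downcount is issued from unmask() instead, i.e. after the handler thread has run rather than right after mask().)

#include <stdio.h>
#include <stdbool.h>

/* stand-in for the Beat downcount hypercall, which the patch wraps
 * as __beatic_eoi_irq() around beat_downcount_of_interrupt() */
static void downcount(void) { puts("    downcount (hypervisor EOI)"); }

static void simulate(bool fixed)
{
	printf("threaded hardirq flow, %s the fix:\n", fixed ? "with" : "without");
	puts("    mask");
	if (!fixed)
		downcount();	/* old: chip eoi() fires before the handler has run */
	puts("    handler thread runs, then unmasks:");
	if (fixed)
		downcount();	/* new: unmask() issues the downcount itself */
	puts("    unmask");
}

int main(void)
{
	simulate(false);
	simulate(true);
	return 0;
}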
Thanks in advance Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/platforms/celleb/interrupt.c | 39 +++++++++++++++++++++++++----- 1 file changed, 33 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/arch/powerpc/platforms/celleb/interrupt.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/celleb/interrupt.c +++ linux-2.6.24.7/arch/powerpc/platforms/celleb/interrupt.c @@ -29,6 +29,10 @@ #include "interrupt.h" #include "beat_wrapper.h" +#ifdef CONFIG_PREEMPT_HARDIRQS +extern int hardirq_preemption; +#endif /* CONFIG_PREEMPT_HARDIRQS */ + #define MAX_IRQS NR_IRQS static DEFINE_SPINLOCK(beatic_irq_mask_lock); static uint64_t beatic_irq_mask_enable[(MAX_IRQS+255)/64]; @@ -71,12 +75,35 @@ static void beatic_mask_irq(unsigned int spin_unlock_irqrestore(&beatic_irq_mask_lock, flags); } +static void __beatic_eoi_irq(unsigned int irq_plug) +{ + s64 err; + + if ((err = beat_downcount_of_interrupt(irq_plug)) != 0) { + if ((err & 0xFFFFFFFF) != 0xFFFFFFF5) /* -11: wrong state */ + panic("Failed to downcount IRQ! Error = %16lx", err); + + printk(KERN_ERR "IRQ over-downcounted, plug %d\n", irq_plug); + } +} + static void beatic_unmask_irq(unsigned int irq_plug) { unsigned long flags; +#ifdef CONFIG_PREEMPT_HARDIRQS + if (hardirq_preemption) + __beatic_eoi_irq(irq_plug); +#endif /* CONFIG_PREEMPT_HARDIRQS */ + spin_lock_irqsave(&beatic_irq_mask_lock, flags); beatic_irq_mask_enable[irq_plug/64] |= 1UL << (63 - (irq_plug%64)); + +#ifdef CONFIG_PREEMPT_HARDIRQS + if (hardirq_preemption) + beatic_irq_mask_ack[irq_plug/64] |= 1UL << (63 - (irq_plug%64)); +#endif /* CONFIG_PREEMPT_HARDIRQS */ + beatic_update_irq_mask(irq_plug); spin_unlock_irqrestore(&beatic_irq_mask_lock, flags); } @@ -93,15 +120,15 @@ static void beatic_ack_irq(unsigned int static void beatic_end_irq(unsigned int irq_plug) { - s64 err; unsigned long flags; - if ((err = beat_downcount_of_interrupt(irq_plug)) != 0) { - if ((err & 0xFFFFFFFF) != 0xFFFFFFF5) /* -11: wrong state */ - panic("Failed to downcount IRQ! 
Error = %16lx", err); +#ifdef CONFIG_PREEMPT_HARDIRQS + if (hardirq_preemption) + return; +#endif /* CONFIG_PREEMPT_HARDIRQS */ + + __beatic_eoi_irq(irq_plug); - printk(KERN_ERR "IRQ over-downcounted, plug %d\n", irq_plug); - } spin_lock_irqsave(&beatic_irq_mask_lock, flags); beatic_irq_mask_ack[irq_plug/64] |= 1UL << (63 - (irq_plug%64)); beatic_update_irq_mask(irq_plug); ������������������patches/preempt-irqs-ppc-fix-more-fasteoi.patch�����������������������������������������������������0000664�0000764�0000764�00000007014�11041657730�020650� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From sshtylyov@ru.mvista.com Thu May 17 15:18:39 2007 Return-Path: <sshtylyov@ru.mvista.com> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from imap.sh.mvista.com (unknown [63.81.120.155]) by mail.tglx.de (Postfix) with ESMTP id BFD3A65C065 for <tglx@linutronix.de>; Thu, 17 May 2007 15:18:39 +0200 (CEST) Received: from wasted.dev.rtsoft.ru (unknown [10.150.0.9]) by imap.sh.mvista.com (Postfix) with ESMTP id 8E3CB3EC9; Thu, 17 May 2007 06:18:35 -0700 (PDT) From: Sergei Shtylyov <sshtylyov@ru.mvista.com> Organization: MontaVista Software Inc. To: mingo@elte.hu, tglx@linutronix.de Subject: [PATCH 2.6.21-rt2] PowerPC: revert fix for threaded fasteoi IRQ handlers Date: Thu, 17 May 2007 17:20:08 +0400 User-Agent: KMail/1.5 Cc: linux-kernel@vger.kernel.org, linuxppc-dev@ozlabs.org, dwalker@mvista.com References: <200611192243.34850.sshtylyov@ru.mvista.com> In-Reply-To: <200611192243.34850.sshtylyov@ru.mvista.com> MIME-Version: 1.0 Content-Disposition: inline Message-Id: <200705171719.34968.sshtylyov@ru.mvista.com> Content-Type: text/plain; charset="us-ascii" X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Revert the change to the "fasteoi" type chips as after handle_fasteoi_irq() had been fixed, they've become meaningless (and even dangerous -- as was the case with Celleb that has been fixed earlier)... Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> --- The patch in question wasn't even initially accepted but then was erroneously restored along with the TOD patch. I've asked to revert it but to no avail, so here's the formal patch to revert it at last... 
arch/powerpc/platforms/iseries/irq.c | 1 - arch/powerpc/platforms/pseries/xics.c | 2 -- arch/powerpc/sysdev/mpic.c | 1 - 3 files changed, 4 deletions(-) Index: linux-2.6.24.7/arch/powerpc/platforms/iseries/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/iseries/irq.c +++ linux-2.6.24.7/arch/powerpc/platforms/iseries/irq.c @@ -278,7 +278,6 @@ static struct irq_chip iseries_pic = { .shutdown = iseries_shutdown_IRQ, .unmask = iseries_enable_IRQ, .mask = iseries_disable_IRQ, - .ack = iseries_end_IRQ, .eoi = iseries_end_IRQ }; Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/xics.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c @@ -461,7 +461,6 @@ static struct irq_chip xics_pic_direct = .startup = xics_startup, .mask = xics_mask_irq, .unmask = xics_unmask_irq, - .ack = xics_eoi_direct, .eoi = xics_eoi_direct, .set_affinity = xics_set_affinity }; @@ -472,7 +471,6 @@ static struct irq_chip xics_pic_lpar = { .startup = xics_startup, .mask = xics_mask_irq, .unmask = xics_unmask_irq, - .ack = xics_eoi_lpar, .eoi = xics_eoi_lpar, .set_affinity = xics_set_affinity }; Index: linux-2.6.24.7/arch/powerpc/sysdev/mpic.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/sysdev/mpic.c +++ linux-2.6.24.7/arch/powerpc/sysdev/mpic.c @@ -845,7 +845,6 @@ int mpic_set_irq_type(unsigned int virq, static struct irq_chip mpic_irq_chip = { .mask = mpic_mask_irq, .unmask = mpic_unmask_irq, - .ack = mpic_end_irq, .eoi = mpic_end_irq, .set_type = mpic_set_irq_type, }; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-ppc-preempt-schedule-irq-entry-fix.patch���������������������������������������0000664�0000764�0000764�00000010453�11043075234�023450� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Tue May 22 13:47:39 2007 Return-Path: <tsutomu.owa@toshiba.co.jp> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY autolearn=unavailable version=3.1.7-deb Received: from inet-tsb5.toshiba.co.jp (inet-tsb5.toshiba.co.jp [202.33.96.24]) by mail.tglx.de (Postfix) with ESMTP id 57F7E65C065 for <tglx@linutronix.de>; Tue, 22 May 2007 13:47:39 +0200 (CEST) Received: from tsb-wall.toshiba.co.jp ([133.199.160.134]) by inet-tsb5.toshiba.co.jp with ESMTP id l4MBlERT003242; Tue, 22 May 2007 20:47:14 +0900 (JST) Received: (from root@localhost) by tsb-wall.toshiba.co.jp id l4MBlEQK014361; Tue, 22 May 2007 20:47:14 +0900 (JST) Received: from ovp1.toshiba.co.jp [133.199.192.124] by 
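(Why the aliased .ack was "dangerous" rather than merely redundant: flow handlers of this era, handle_percpu_irq() for one, call ->ack before running the handler and ->eoi afterwards, so a chip whose .ack points at its own EOI routine ends up completing each interrupt twice. PICs whose end-of-interrupt must balance one-for-one with delivery cannot tolerate that; going by the oops quoted in preempt-irqs-ppc-fix-b5.patch above, the Cell IIC is such a case. The toy user-space sketch below, with invented names and a printout in place of the real crash, shows the imbalance.)

#include <stdio.h>

typedef void (*chip_op)(void);

static int pending_eoi;		/* one slot per interrupt fetched from the PIC */

static void chip_eoi(void)
{
	if (pending_eoi <= 0) {
		puts("  unbalanced EOI: this is where a real PIC falls over");
		return;
	}
	pending_eoi--;
	puts("  eoi");
}

/* simplified percpu/fasteoi-style flow: ack (if present) first, eoi last */
static void flow_handler(chip_op ack, chip_op eoi)
{
	pending_eoi++;		/* interrupt delivered */
	if (ack)
		ack();
	puts("  handler runs");
	eoi();
}

int main(void)
{
	puts("eoi-only chip (what the revert below restores):");
	flow_handler(NULL, chip_eoi);

	puts("chip with .ack aliased to its EOI routine:");
	flow_handler(chip_eoi, chip_eoi);
	return 0;
}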
tsb-wall.toshiba.co.jp with ESMTP id WAA14360; Tue, 22 May 2007 20:47:14 +0900 Received: from mx2.toshiba.co.jp (localhost [127.0.0.1]) by ovp1.toshiba.co.jp with ESMTP id l4MBlEDs007674; Tue, 22 May 2007 20:47:14 +0900 (JST) Received: from rdcgw.rdc.toshiba.co.jp by toshiba.co.jp id l4MBlDm9015993; Tue, 22 May 2007 20:47:13 +0900 (JST) Received: from island.swc.toshiba.co.jp by rdcgw.rdc.toshiba.co.jp (8.8.8p2+Sun/3.7W) with ESMTP id UAA17003; Tue, 22 May 2007 20:47:13 +0900 (JST) Received: from forest.toshiba.co.jp (forest [133.196.122.2]) by island.swc.toshiba.co.jp (Postfix) with ESMTP id 6A26B40002; Tue, 22 May 2007 20:47:13 +0900 (JST) Date: Tue, 22 May 2007 20:47:13 +0900 Message-ID: <yyi7ir16pxa.wl@toshiba.co.jp> From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: [PATCH] powerpc 2.6.21-rt6: replace preempt_schedule w/ preempt_schedule_irq User-Agent: Wanderlust/2.8.1 (Something) Emacs/20.7 Mule/4.0 (HANANOEN) Organization: Software Engineering Center, TOSHIBA. MIME-Version: 1.0 (generated by SEMI 1.14.4 - "Hosorogi") Content-Type: text/plain; charset=US-ASCII X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Hi Ingo and Thomas, Please apply. Replace preempt_schedule() w/ preempt_schedule_irq() in irq return path, to avoid irq-entry recursion and stack overflow problems for powerpc64. It hits when doing netperf from another machine to the machine running rt kernel. This patch applies on top of linux-2.6.21 + patch-2.6.21-rt6. Compile, boot and netperf tested on celleb. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ~ $ uname -a Linux Linux 2.6.21-rt6 #1 SMP PREEMPT RT Tue May 22 19:18:00 JST 2007 ppc64 unkn own ~ $ Unable to handle kernel paging request for data at address 0xc0000180004cd9b 0 Faulting instruction address: 0xc00000000003da48 cpu 0x0: Vector: 300 (Data Access) at [c00000000fffba00] pc: c00000000003da48: .resched_task+0x34/0xc4 lr: c0000000000410b4: .try_to_wake_up+0x4cc/0x5a8 sp: c00000000fffbc80 msr: 9000000000001032 dar: c0000180004cd9b0 dsisr: 40000000 current = 0xc00000000244ed20 paca = 0xc0000000004cd980 pid = 425, comm = netserver enter ? for help [c00000000fffbd00] c0000000000410b4 .try_to_wake_up+0x4cc/0x5a8 [c00000000fffbde0] c0000000000880c8 .redirect_hardirq+0x68/0x88 [c00000000fffbe60] c00000000008aec8 .handle_level_irq+0x13c/0x220 [c00000000fffbf00] c000000000032538 .spider_irq_cascade+0x98/0xec [c00000000fffbf90] c000000000022280 .call_handle_irq+0x1c/0x2c [c0000000025abea0] c00000000000c33c .do_IRQ+0xc8/0x17c [c0000000025abf30] c00000000000444c hardware_interrupt_entry+0x18/0x4c --- arch/powerpc/kernel/entry_64.S | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -579,14 +579,9 @@ do_work: cmpdi r0,0 crandc eq,cr1*4+eq,eq bne restore - /* here we are preempting the current task */ 1: - li r0,1 - stb r0,PACASOFTIRQEN(r13) - stb r0,PACAHARDIRQEN(r13) - ori r10,r10,MSR_EE - mtmsrd r10,1 /* reenable interrupts */ - bl .preempt_schedule + /* preempt_schedule_irq() expects interrupts disabled. 
*/ + bl .preempt_schedule_irq mfmsr r10 clrrdi r9,r1,THREAD_SHIFT rldicl r10,r10,48,1 /* disable interrupts again */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-irqs-m68knommu-make-timer-interrupt-non-threaded.patch������������������������������0000664�0000764�0000764�00000002642�11041657731�025207� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 0b5363d39c40b92c1c6cc806ecef7086576e5162 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:23 +0200 Subject: [PATCH] m68knommu: make timer interrupt non threaded Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> --- arch/m68knommu/kernel/process.c | 6 ++++-- arch/m68knommu/platform/coldfire/pit.c | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/arch/m68knommu/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/process.c +++ linux-2.6.24.7/arch/m68knommu/kernel/process.c @@ -77,9 +77,11 @@ void cpu_idle(void) stop_critical_timings(); idle(); start_critical_timings(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/pit.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/coldfire/pit.c +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/pit.c @@ -119,7 +119,7 @@ static irqreturn_t pit_tick(int irq, voi static struct irqaction pit_irq = { .name = "timer", - .flags = IRQF_DISABLED | IRQF_TIMER, + .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_NODELAY, .handler = pit_tick, }; ����������������������������������������������������������������������������������������������patches/preempt-irqs-Kconfig.patch������������������������������������������������������������������0000664�0000764�0000764�00000002165�11041657734�016300� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/Kconfig.preempt | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -107,6 +107,25 @@ config PREEMPT_SOFTIRQS Say N if you are unsure. +config PREEMPT_HARDIRQS + bool "Thread Hardirqs" + default n + depends on !GENERIC_HARDIRQS_NO__DO_IRQ + select PREEMPT_SOFTIRQS + help + This option reduces the latency of the kernel by 'threading' + hardirqs. This means that all (or selected) hardirqs will run + in their own kernel thread context. 
While this helps latency, + this feature can also reduce performance. + + The threading of hardirqs can also be controlled via the + /proc/sys/kernel/hardirq_preemption runtime flag and the + hardirq-preempt=0/1 boot-time option. Per-irq threading can + be enabled/disable via the /proc/irq/<IRQ>/<handler>/threaded + runtime flags. + + Say N if you are unsure. + config PREEMPT_BKL bool "Preempt The Big Kernel Lock" depends on SMP || PREEMPT �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-apis.patch�������������������������������������������������������������������������������0000664�0000764�0000764�00000006170�11041657733�013650� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� add new, -rt specific IRQ API variants. Maps to the same as before on non-PREEMPT_RT. include/linux/bottom_half.h | 8 ++++++++ include/linux/interrupt.h | 35 ++++++++++++++++++++++++++++++++++- 2 files changed, 42 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/bottom_half.h =================================================================== --- linux-2.6.24.7.orig/include/linux/bottom_half.h +++ linux-2.6.24.7/include/linux/bottom_half.h @@ -1,9 +1,17 @@ #ifndef _LINUX_BH_H #define _LINUX_BH_H +#ifdef CONFIG_PREEMPT_RT +# define local_bh_disable() do { } while (0) +# define __local_bh_disable(ip) do { } while (0) +# define _local_bh_enable() do { } while (0) +# define local_bh_enable() do { } while (0) +# define local_bh_enable_ip(ip) do { } while (0) +#else extern void local_bh_disable(void); extern void _local_bh_enable(void); extern void local_bh_enable(void); extern void local_bh_enable_ip(unsigned long ip); +#endif #endif /* _LINUX_BH_H */ Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -97,7 +97,7 @@ extern void devm_free_irq(struct device #ifdef CONFIG_LOCKDEP # define local_irq_enable_in_hardirq() do { } while (0) #else -# define local_irq_enable_in_hardirq() local_irq_enable() +# define local_irq_enable_in_hardirq() local_irq_enable_nort() #endif extern void disable_irq_nosync(unsigned int irq); @@ -465,4 +465,37 @@ static inline void init_irq_proc(void) } #endif +#ifdef CONFIG_PREEMPT_RT +# define local_irq_disable_nort() do { } while (0) +# define local_irq_enable_nort() do { } while (0) +# define local_irq_enable_rt() local_irq_enable() +# define local_irq_save_nort(flags) do { local_save_flags(flags); } while (0) +# define local_irq_restore_nort(flags) do { (void)(flags); } while (0) +# define spin_lock_nort(lock) do { } while (0) +# define spin_unlock_nort(lock) do { } while (0) +# define spin_lock_bh_nort(lock) do { } while (0) +# define spin_unlock_bh_nort(lock) do { } while (0) +# define spin_lock_rt(lock) spin_lock(lock) +# define spin_unlock_rt(lock) 
spin_unlock(lock) +# define smp_processor_id_rt(cpu) (cpu) +# define in_atomic_rt() (!oops_in_progress && \ + (in_atomic() || irqs_disabled())) +# define read_trylock_rt(lock) ({read_lock(lock); 1; }) +#else +# define local_irq_disable_nort() local_irq_disable() +# define local_irq_enable_nort() local_irq_enable() +# define local_irq_enable_rt() do { } while (0) +# define local_irq_save_nort(flags) local_irq_save(flags) +# define local_irq_restore_nort(flags) local_irq_restore(flags) +# define spin_lock_rt(lock) do { } while (0) +# define spin_unlock_rt(lock) do { } while (0) +# define spin_lock_nort(lock) spin_lock(lock) +# define spin_unlock_nort(lock) spin_unlock(lock) +# define spin_lock_bh_nort(lock) spin_lock_bh(lock) +# define spin_unlock_bh_nort(lock) spin_unlock_bh(lock) +# define smp_processor_id_rt(cpu) smp_processor_id() +# define in_atomic_rt() 0 +# define read_trylock_rt(lock) read_trylock(lock) +#endif + #endif ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-slab-new.patch���������������������������������������������������������������������������0000664�0000764�0000764�00000113162�11041657734�014425� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� new slab port. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- mm/slab.c | 494 +++++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 317 insertions(+), 177 deletions(-) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -116,6 +116,63 @@ #include <asm/page.h> /* + * On !PREEMPT_RT, raw irq flags are used as a per-CPU locking + * mechanism. + * + * On PREEMPT_RT, we use per-CPU locks for this. That's why the + * calling convention is changed slightly: a new 'flags' argument + * is passed to 'irq disable/enable' - the PREEMPT_RT code stores + * the CPU number of the lock there. + */ +#ifndef CONFIG_PREEMPT_RT +# define slab_irq_disable(cpu) \ + do { local_irq_disable(); (cpu) = smp_processor_id(); } while (0) +# define slab_irq_enable(cpu) local_irq_enable() +# define slab_irq_save(flags, cpu) \ + do { local_irq_save(flags); (cpu) = smp_processor_id(); } while (0) +# define slab_irq_restore(flags, cpu) local_irq_restore(flags) +/* + * In the __GFP_WAIT case we enable/disable interrupts on !PREEMPT_RT, + * which has no per-CPU locking effect since we are holding the cache + * lock in that case already. + * + * (On PREEMPT_RT, these are NOPs, but we have to drop/get the irq locks.) 
+ */ +# define slab_irq_disable_nort() local_irq_disable() +# define slab_irq_enable_nort() local_irq_enable() +# define slab_irq_disable_rt(flags) do { (void)(flags); } while (0) +# define slab_irq_enable_rt(flags) do { (void)(flags); } while (0) +# define slab_spin_lock_irq(lock, cpu) \ + do { spin_lock_irq(lock); (cpu) = smp_processor_id(); } while (0) +# define slab_spin_unlock_irq(lock, cpu) \ + spin_unlock_irq(lock) +# define slab_spin_lock_irqsave(lock, flags, cpu) \ + do { spin_lock_irqsave(lock, flags); (cpu) = smp_processor_id(); } while (0) +# define slab_spin_unlock_irqrestore(lock, flags, cpu) \ + do { spin_unlock_irqrestore(lock, flags); } while (0) +#else +DEFINE_PER_CPU_LOCKED(int, slab_irq_locks) = { 0, }; +# define slab_irq_disable(cpu) (void)get_cpu_var_locked(slab_irq_locks, &(cpu)) +# define slab_irq_enable(cpu) put_cpu_var_locked(slab_irq_locks, cpu) +# define slab_irq_save(flags, cpu) \ + do { slab_irq_disable(cpu); (void) (flags); } while (0) +# define slab_irq_restore(flags, cpu) \ + do { slab_irq_enable(cpu); (void) (flags); } while (0) +# define slab_irq_disable_rt(cpu) slab_irq_disable(cpu) +# define slab_irq_enable_rt(cpu) slab_irq_enable(cpu) +# define slab_irq_disable_nort() do { } while (0) +# define slab_irq_enable_nort() do { } while (0) +# define slab_spin_lock_irq(lock, cpu) \ + do { slab_irq_disable(cpu); spin_lock(lock); } while (0) +# define slab_spin_unlock_irq(lock, cpu) \ + do { spin_unlock(lock); slab_irq_enable(cpu); } while (0) +# define slab_spin_lock_irqsave(lock, flags, cpu) \ + do { slab_irq_disable(cpu); spin_lock_irqsave(lock, flags); } while (0) +# define slab_spin_unlock_irqrestore(lock, flags, cpu) \ + do { spin_unlock_irqrestore(lock, flags); slab_irq_enable(cpu); } while (0) +#endif + +/* * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON. * 0 for faster, smaller code (especially in the critical paths). 
* @@ -313,7 +370,7 @@ struct kmem_list3 __initdata initkmem_li static int drain_freelist(struct kmem_cache *cache, struct kmem_list3 *l3, int tofree); static void free_block(struct kmem_cache *cachep, void **objpp, int len, - int node); + int node, int *this_cpu); static int enable_cpucache(struct kmem_cache *cachep); static void cache_reap(struct work_struct *unused); @@ -757,9 +814,10 @@ int slab_is_available(void) static DEFINE_PER_CPU(struct delayed_work, reap_work); -static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) +static inline struct array_cache * +cpu_cache_get(struct kmem_cache *cachep, int this_cpu) { - return cachep->array[smp_processor_id()]; + return cachep->array[this_cpu]; } static inline struct kmem_cache *__find_general_cachep(size_t size, @@ -993,7 +1051,7 @@ static int transfer_objects(struct array #ifndef CONFIG_NUMA #define drain_alien_cache(cachep, alien) do { } while (0) -#define reap_alien(cachep, l3) do { } while (0) +#define reap_alien(cachep, l3, this_cpu) do { } while (0) static inline struct array_cache **alloc_alien_cache(int node, int limit) { @@ -1004,7 +1062,8 @@ static inline void free_alien_cache(stru { } -static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) +static inline int +cache_free_alien(struct kmem_cache *cachep, void *objp, int *this_cpu) { return 0; } @@ -1016,14 +1075,15 @@ static inline void *alternate_node_alloc } static inline void *____cache_alloc_node(struct kmem_cache *cachep, - gfp_t flags, int nodeid) + gfp_t flags, int nodeid, int *this_cpu) { return NULL; } #else /* CONFIG_NUMA */ -static void *____cache_alloc_node(struct kmem_cache *, gfp_t, int); +static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, + int nodeid, int *this_cpu); static void *alternate_node_alloc(struct kmem_cache *, gfp_t); static struct array_cache **alloc_alien_cache(int node, int limit) @@ -1065,7 +1125,8 @@ static void free_alien_cache(struct arra } static void __drain_alien_cache(struct kmem_cache *cachep, - struct array_cache *ac, int node) + struct array_cache *ac, int node, + int *this_cpu) { struct kmem_list3 *rl3 = cachep->nodelists[node]; @@ -1079,7 +1140,7 @@ static void __drain_alien_cache(struct k if (rl3->shared) transfer_objects(rl3->shared, ac, ac->limit); - free_block(cachep, ac->entry, ac->avail, node); + free_block(cachep, ac->entry, ac->avail, node, this_cpu); ac->avail = 0; spin_unlock(&rl3->list_lock); } @@ -1088,15 +1149,16 @@ static void __drain_alien_cache(struct k /* * Called from cache_reap() to regularly drain alien caches round robin. 
*/ -static void reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3) +static void +reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3, int *this_cpu) { - int node = __get_cpu_var(reap_node); + int node = per_cpu(reap_node, *this_cpu); if (l3->alien) { struct array_cache *ac = l3->alien[node]; if (ac && ac->avail && spin_trylock_irq(&ac->lock)) { - __drain_alien_cache(cachep, ac, node); + __drain_alien_cache(cachep, ac, node, this_cpu); spin_unlock_irq(&ac->lock); } } @@ -1105,21 +1167,22 @@ static void reap_alien(struct kmem_cache static void drain_alien_cache(struct kmem_cache *cachep, struct array_cache **alien) { - int i = 0; + int i = 0, this_cpu; struct array_cache *ac; unsigned long flags; for_each_online_node(i) { ac = alien[i]; if (ac) { - spin_lock_irqsave(&ac->lock, flags); - __drain_alien_cache(cachep, ac, i); - spin_unlock_irqrestore(&ac->lock, flags); + slab_spin_lock_irqsave(&ac->lock, flags, this_cpu); + __drain_alien_cache(cachep, ac, i, &this_cpu); + slab_spin_unlock_irqrestore(&ac->lock, flags, this_cpu); } } } -static inline int cache_free_alien(struct kmem_cache *cachep, void *objp) +static inline int +cache_free_alien(struct kmem_cache *cachep, void *objp, int *this_cpu) { struct slab *slabp = virt_to_slab(objp); int nodeid = slabp->nodeid; @@ -1143,13 +1206,13 @@ static inline int cache_free_alien(struc spin_lock(&alien->lock); if (unlikely(alien->avail == alien->limit)) { STATS_INC_ACOVERFLOW(cachep); - __drain_alien_cache(cachep, alien, nodeid); + __drain_alien_cache(cachep, alien, nodeid, this_cpu); } alien->entry[alien->avail++] = objp; spin_unlock(&alien->lock); } else { spin_lock(&(cachep->nodelists[nodeid])->list_lock); - free_block(cachep, &objp, 1, nodeid); + free_block(cachep, &objp, 1, nodeid, this_cpu); spin_unlock(&(cachep->nodelists[nodeid])->list_lock); } return 1; @@ -1166,6 +1229,7 @@ static void __cpuinit cpuup_canceled(lon struct array_cache *nc; struct array_cache *shared; struct array_cache **alien; + int this_cpu; cpumask_t mask; mask = node_to_cpumask(node); @@ -1177,29 +1241,31 @@ static void __cpuinit cpuup_canceled(lon if (!l3) goto free_array_cache; - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); /* Free limit for this kmem_list3 */ l3->free_limit -= cachep->batchcount; if (nc) - free_block(cachep, nc->entry, nc->avail, node); + free_block(cachep, nc->entry, nc->avail, node, + &this_cpu); if (!cpus_empty(mask)) { - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, + this_cpu); goto free_array_cache; } shared = l3->shared; if (shared) { free_block(cachep, shared->entry, - shared->avail, node); + shared->avail, node, &this_cpu); l3->shared = NULL; } alien = l3->alien; l3->alien = NULL; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); if (alien) { @@ -1228,6 +1294,7 @@ static int __cpuinit cpuup_prepare(long struct kmem_list3 *l3 = NULL; int node = cpu_to_node(cpu); const int memsize = sizeof(struct kmem_list3); + int this_cpu; /* * We need to do this right in the beginning since @@ -1258,11 +1325,11 @@ static int __cpuinit cpuup_prepare(long cachep->nodelists[node] = l3; } - spin_lock_irq(&cachep->nodelists[node]->list_lock); + slab_spin_lock_irq(&cachep->nodelists[node]->list_lock, this_cpu); cachep->nodelists[node]->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - spin_unlock_irq(&cachep->nodelists[node]->list_lock); + slab_spin_unlock_irq(&cachep->nodelists[node]->list_lock, this_cpu); } /* @@ 
-1299,7 +1366,7 @@ static int __cpuinit cpuup_prepare(long l3 = cachep->nodelists[node]; BUG_ON(!l3); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (!l3->shared) { /* * We are serialised from CPU_DEAD or @@ -1314,7 +1381,7 @@ static int __cpuinit cpuup_prepare(long alien = NULL; } #endif - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); free_alien_cache(alien); } @@ -1393,11 +1460,13 @@ static void init_list(struct kmem_cache int nodeid) { struct kmem_list3 *ptr; + int this_cpu; ptr = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, nodeid); BUG_ON(!ptr); - local_irq_disable(); + WARN_ON(spin_is_locked(&list->list_lock)); + slab_irq_disable(this_cpu); memcpy(ptr, list, sizeof(struct kmem_list3)); /* * Do not assume that spinlocks can be initialized via memcpy: @@ -1406,7 +1475,7 @@ static void init_list(struct kmem_cache MAKE_ALL_LISTS(cachep, ptr, nodeid); cachep->nodelists[nodeid] = ptr; - local_irq_enable(); + slab_irq_enable(this_cpu); } /* @@ -1569,36 +1638,34 @@ void __init kmem_cache_init(void) /* 4) Replace the bootstrap head arrays */ { struct array_cache *ptr; + int this_cpu; ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL); - local_irq_disable(); - BUG_ON(cpu_cache_get(&cache_cache) != &initarray_cache.cache); - memcpy(ptr, cpu_cache_get(&cache_cache), - sizeof(struct arraycache_init)); + slab_irq_disable(this_cpu); + BUG_ON(cpu_cache_get(&cache_cache, this_cpu) != &initarray_cache.cache); + memcpy(ptr, cpu_cache_get(&cache_cache, this_cpu), + sizeof(struct arraycache_init)); /* * Do not assume that spinlocks can be initialized via memcpy: */ spin_lock_init(&ptr->lock); - - cache_cache.array[smp_processor_id()] = ptr; - local_irq_enable(); + cache_cache.array[this_cpu] = ptr; + slab_irq_enable(this_cpu); ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL); - local_irq_disable(); - BUG_ON(cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep) - != &initarray_generic.cache); - memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep), - sizeof(struct arraycache_init)); + slab_irq_disable(this_cpu); + BUG_ON(cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep, this_cpu) + != &initarray_generic.cache); + memcpy(ptr, cpu_cache_get(malloc_sizes[INDEX_AC].cs_cachep, this_cpu), + sizeof(struct arraycache_init)); /* * Do not assume that spinlocks can be initialized via memcpy: */ spin_lock_init(&ptr->lock); - - malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] = - ptr; - local_irq_enable(); + malloc_sizes[INDEX_AC].cs_cachep->array[this_cpu] = ptr; + slab_irq_enable(this_cpu); } /* 5) Replace the bootstrap kmem_list3's */ { @@ -1750,7 +1817,7 @@ static void store_stackinfo(struct kmem_ *addr++ = 0x12345678; *addr++ = caller; - *addr++ = smp_processor_id(); + *addr++ = raw_smp_processor_id(); size -= 3 * sizeof(unsigned long); { unsigned long *sptr = &caller; @@ -1905,7 +1972,11 @@ static void check_poison_obj(struct kmem } #endif +static void +__cache_free(struct kmem_cache *cachep, void *objp, int *this_cpu); + #if DEBUG + /** * slab_destroy_objs - destroy a slab and its objects * @cachep: cache pointer being destroyed @@ -1914,7 +1985,8 @@ static void check_poison_obj(struct kmem * Call the registered destructor for each object in a slab that is being * destroyed. 
*/ -static void slab_destroy_objs(struct kmem_cache *cachep, struct slab *slabp) +static void +slab_destroy_objs(struct kmem_cache *cachep, struct slab *slabp) { int i; for (i = 0; i < cachep->num; i++) { @@ -1957,7 +2029,8 @@ static void slab_destroy_objs(struct kme * Before calling the slab must have been unlinked from the cache. The * cache-lock is not held/needed. */ -static void slab_destroy(struct kmem_cache *cachep, struct slab *slabp) +static void +slab_destroy(struct kmem_cache *cachep, struct slab *slabp, int *this_cpu) { void *addr = slabp->s_mem - slabp->colouroff; @@ -1971,8 +2044,12 @@ static void slab_destroy(struct kmem_cac call_rcu(&slab_rcu->head, kmem_rcu_free); } else { kmem_freepages(cachep, addr); - if (OFF_SLAB(cachep)) - kmem_cache_free(cachep->slabp_cache, slabp); + if (OFF_SLAB(cachep)) { + if (this_cpu) + __cache_free(cachep->slabp_cache, slabp, this_cpu); + else + kmem_cache_free(cachep->slabp_cache, slabp); + } } } @@ -2069,6 +2146,8 @@ static size_t calculate_slab_order(struc static int __init_refok setup_cpu_cache(struct kmem_cache *cachep) { + int this_cpu; + if (g_cpucache_up == FULL) return enable_cpucache(cachep); @@ -2112,10 +2191,12 @@ static int __init_refok setup_cpu_cache( jiffies + REAPTIMEOUT_LIST3 + ((unsigned long)cachep) % REAPTIMEOUT_LIST3; - cpu_cache_get(cachep)->avail = 0; - cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES; - cpu_cache_get(cachep)->batchcount = 1; - cpu_cache_get(cachep)->touched = 0; + this_cpu = raw_smp_processor_id(); + + cpu_cache_get(cachep, this_cpu)->avail = 0; + cpu_cache_get(cachep, this_cpu)->limit = BOOT_CPUCACHE_ENTRIES; + cpu_cache_get(cachep, this_cpu)->batchcount = 1; + cpu_cache_get(cachep, this_cpu)->touched = 0; cachep->batchcount = 1; cachep->limit = BOOT_CPUCACHE_ENTRIES; return 0; @@ -2403,19 +2484,19 @@ EXPORT_SYMBOL(kmem_cache_create); #if DEBUG static void check_irq_off(void) { +/* + * On PREEMPT_RT we use locks to protect the per-CPU lists, + * and keep interrupts enabled. + */ +#ifndef CONFIG_PREEMPT_RT BUG_ON(!irqs_disabled()); +#endif } static void check_irq_on(void) { +#ifndef CONFIG_PREEMPT_RT BUG_ON(irqs_disabled()); -} - -static void check_spinlock_acquired(struct kmem_cache *cachep) -{ -#ifdef CONFIG_SMP - check_irq_off(); - assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock); #endif } @@ -2430,7 +2511,6 @@ static void check_spinlock_acquired_node #else #define check_irq_off() do { } while(0) #define check_irq_on() do { } while(0) -#define check_spinlock_acquired(x) do { } while(0) #define check_spinlock_acquired_node(x, y) do { } while(0) #endif @@ -2438,26 +2518,60 @@ static void drain_array(struct kmem_cach struct array_cache *ac, int force, int node); -static void do_drain(void *arg) +static void __do_drain(void *arg, int this_cpu) { struct kmem_cache *cachep = arg; + int node = cpu_to_node(this_cpu); struct array_cache *ac; - int node = numa_node_id(); check_irq_off(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, this_cpu); spin_lock(&cachep->nodelists[node]->list_lock); - free_block(cachep, ac->entry, ac->avail, node); + free_block(cachep, ac->entry, ac->avail, node, &this_cpu); spin_unlock(&cachep->nodelists[node]->list_lock); ac->avail = 0; } +#ifdef CONFIG_PREEMPT_RT +static void do_drain(void *arg, int this_cpu) +{ + __do_drain(arg, this_cpu); +} +#else +static void do_drain(void *arg) +{ + __do_drain(arg, smp_processor_id()); +} +#endif + +#ifdef CONFIG_PREEMPT_RT +/* + * execute func() for all CPUs. 
On PREEMPT_RT we dont actually have + * to run on the remote CPUs - we only have to take their CPU-locks. + * (This is a rare operation, so cacheline bouncing is not an issue.) + */ +static void +slab_on_each_cpu(void (*func)(void *arg, int this_cpu), void *arg) +{ + unsigned int i; + + check_irq_on(); + for_each_online_cpu(i) { + spin_lock(&__get_cpu_lock(slab_irq_locks, i)); + func(arg, i); + spin_unlock(&__get_cpu_lock(slab_irq_locks, i)); + } +} +#else +# define slab_on_each_cpu(func, cachep) on_each_cpu(func, cachep, 1, 1) +#endif + static void drain_cpu_caches(struct kmem_cache *cachep) { struct kmem_list3 *l3; int node; - on_each_cpu(do_drain, cachep, 1, 1); + slab_on_each_cpu(do_drain, cachep); check_irq_on(); for_each_online_node(node) { l3 = cachep->nodelists[node]; @@ -2482,16 +2596,16 @@ static int drain_freelist(struct kmem_ca struct kmem_list3 *l3, int tofree) { struct list_head *p; - int nr_freed; + int nr_freed, this_cpu; struct slab *slabp; nr_freed = 0; while (nr_freed < tofree && !list_empty(&l3->slabs_free)) { - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); p = l3->slabs_free.prev; if (p == &l3->slabs_free) { - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); goto out; } @@ -2500,13 +2614,9 @@ static int drain_freelist(struct kmem_ca BUG_ON(slabp->inuse); #endif list_del(&slabp->list); - /* - * Safe to drop the lock. The slab is no longer linked - * to the cache. - */ l3->free_objects -= cache->num; - spin_unlock_irq(&l3->list_lock); - slab_destroy(cache, slabp); + slab_destroy(cache, slabp, &this_cpu); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); nr_freed++; } out: @@ -2757,8 +2867,8 @@ static void slab_map_pages(struct kmem_c * Grow (by 1) the number of slabs within a cache. This is called by * kmem_cache_alloc() when there are no active objs left in a cache. 
*/ -static int cache_grow(struct kmem_cache *cachep, - gfp_t flags, int nodeid, void *objp) +static int cache_grow(struct kmem_cache *cachep, gfp_t flags, int nodeid, + void *objp, int *this_cpu) { struct slab *slabp; size_t offset; @@ -2787,7 +2897,8 @@ static int cache_grow(struct kmem_cache offset *= cachep->colour_off; if (local_flags & __GFP_WAIT) - local_irq_enable(); + slab_irq_enable_nort(); + slab_irq_enable_rt(*this_cpu); /* * The test for missing atomic flag is performed here, rather than @@ -2817,8 +2928,10 @@ static int cache_grow(struct kmem_cache cache_init_objs(cachep, slabp); + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(); + check_irq_off(); spin_lock(&l3->list_lock); @@ -2831,8 +2944,9 @@ static int cache_grow(struct kmem_cache opps1: kmem_freepages(cachep, objp); failed: + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(); return 0; } @@ -2954,7 +3068,8 @@ bad: #define check_slabp(x,y) do { } while(0) #endif -static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) +static void * +cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { int batchcount; struct kmem_list3 *l3; @@ -2964,7 +3079,7 @@ static void *cache_alloc_refill(struct k retry: check_irq_off(); node = numa_node_id(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); batchcount = ac->batchcount; if (!ac->touched && batchcount > BATCHREFILL_LIMIT) { /* @@ -2974,7 +3089,7 @@ retry: */ batchcount = BATCHREFILL_LIMIT; } - l3 = cachep->nodelists[node]; + l3 = cachep->nodelists[cpu_to_node(*this_cpu)]; BUG_ON(ac->avail > 0 || !l3); spin_lock(&l3->list_lock); @@ -2997,7 +3112,7 @@ retry: slabp = list_entry(entry, struct slab, list); check_slabp(cachep, slabp); - check_spinlock_acquired(cachep); + check_spinlock_acquired_node(cachep, cpu_to_node(*this_cpu)); /* * The slab was either on partial or free list so @@ -3011,8 +3126,9 @@ retry: STATS_INC_ACTIVE(cachep); STATS_SET_HIGH(cachep); - ac->entry[ac->avail++] = slab_get_obj(cachep, slabp, - node); + ac->entry[ac->avail++] = + slab_get_obj(cachep, slabp, + cpu_to_node(*this_cpu)); } check_slabp(cachep, slabp); @@ -3031,10 +3147,10 @@ alloc_done: if (unlikely(!ac->avail)) { int x; - x = cache_grow(cachep, flags | GFP_THISNODE, node, NULL); + x = cache_grow(cachep, flags | GFP_THISNODE, cpu_to_node(*this_cpu), NULL, this_cpu); /* cache_grow can reenable interrupts, then ac could change. */ - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); if (!x && ac->avail == 0) /* no objects in sight? 
abort */ return NULL; @@ -3186,21 +3302,22 @@ static inline int should_failslab(struct #endif /* CONFIG_FAILSLAB */ -static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags) +static inline void * +____cache_alloc(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { void *objp; struct array_cache *ac; check_irq_off(); - ac = cpu_cache_get(cachep); + ac = cpu_cache_get(cachep, *this_cpu); if (likely(ac->avail)) { STATS_INC_ALLOCHIT(cachep); ac->touched = 1; objp = ac->entry[--ac->avail]; } else { STATS_INC_ALLOCMISS(cachep); - objp = cache_alloc_refill(cachep, flags); + objp = cache_alloc_refill(cachep, flags, this_cpu); } return objp; } @@ -3214,7 +3331,7 @@ static inline void *____cache_alloc(stru */ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) { - int nid_alloc, nid_here; + int nid_alloc, nid_here, this_cpu = raw_smp_processor_id(); if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; @@ -3224,7 +3341,7 @@ static void *alternate_node_alloc(struct else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); if (nid_alloc != nid_here) - return ____cache_alloc_node(cachep, flags, nid_alloc); + return ____cache_alloc_node(cachep, flags, nid_alloc, &this_cpu); return NULL; } @@ -3236,7 +3353,7 @@ static void *alternate_node_alloc(struct * allocator to do its reclaim / fallback magic. We then insert the * slab into the proper nodelist and then allocate from it. */ -static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags) +static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags, int *this_cpu) { struct zonelist *zonelist; gfp_t local_flags; @@ -3262,8 +3379,10 @@ retry: if (cpuset_zone_allowed_hardwall(*z, flags) && cache->nodelists[nid] && cache->nodelists[nid]->free_objects) - obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + + obj = ____cache_alloc_node(cache, + flags | GFP_THISNODE, nid, + this_cpu); } if (!obj) { @@ -3274,19 +3393,24 @@ retry: * set and go into memory reserves if necessary. 
*/ if (local_flags & __GFP_WAIT) - local_irq_enable(); + slab_irq_enable_nort(); + slab_irq_enable_rt(*this_cpu); + kmem_flagcheck(cache, flags); obj = kmem_getpages(cache, flags, -1); + + slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - local_irq_disable(); + slab_irq_disable_nort(); + if (obj) { /* * Insert into the appropriate per node queues */ nid = page_to_nid(virt_to_page(obj)); - if (cache_grow(cache, flags, nid, obj)) { + if (cache_grow(cache, flags, nid, obj, this_cpu)) { obj = ____cache_alloc_node(cache, - flags | GFP_THISNODE, nid); + flags | GFP_THISNODE, nid, this_cpu); if (!obj) /* * Another processor may allocate the @@ -3307,7 +3431,7 @@ retry: * A interface to enable slab creation on nodeid */ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, - int nodeid) + int nodeid, int *this_cpu) { struct list_head *entry; struct slab *slabp; @@ -3355,11 +3479,11 @@ retry: must_grow: spin_unlock(&l3->list_lock); - x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL); + x = cache_grow(cachep, flags | GFP_THISNODE, nodeid, NULL, this_cpu); if (x) goto retry; - return fallback_alloc(cachep, flags); + return fallback_alloc(cachep, flags, this_cpu); done: return obj; @@ -3381,39 +3505,41 @@ static __always_inline void * __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, void *caller) { - unsigned long save_flags; + unsigned long irqflags; + int this_cpu; void *ptr; if (should_failslab(cachep, flags)) return NULL; cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); + + slab_irq_save(irqflags, this_cpu); if (unlikely(nodeid == -1)) - nodeid = numa_node_id(); + nodeid = cpu_to_node(this_cpu); if (unlikely(!cachep->nodelists[nodeid])) { /* Node not bootstrapped yet */ - ptr = fallback_alloc(cachep, flags); + ptr = fallback_alloc(cachep, flags, &this_cpu); goto out; } - if (nodeid == numa_node_id()) { + if (nodeid == cpu_to_node(this_cpu)) { /* * Use the locally cached objects if possible. * However ____cache_alloc does not allow fallback * to other nodes. It may fail while we still have * objects on other nodes available. */ - ptr = ____cache_alloc(cachep, flags); + ptr = ____cache_alloc(cachep, flags, &this_cpu); if (ptr) goto out; } /* ___cache_alloc_node can fall back to other nodes */ - ptr = ____cache_alloc_node(cachep, flags, nodeid); + ptr = ____cache_alloc_node(cachep, flags, nodeid, &this_cpu); out: - local_irq_restore(save_flags); + slab_irq_restore(irqflags, this_cpu); ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller); if (unlikely((flags & __GFP_ZERO) && ptr)) @@ -3423,7 +3549,7 @@ __cache_alloc_node(struct kmem_cache *ca } static __always_inline void * -__do_cache_alloc(struct kmem_cache *cache, gfp_t flags) +__do_cache_alloc(struct kmem_cache *cache, gfp_t flags, int *this_cpu) { void *objp; @@ -3432,24 +3558,24 @@ __do_cache_alloc(struct kmem_cache *cach if (objp) goto out; } - objp = ____cache_alloc(cache, flags); + objp = ____cache_alloc(cache, flags, this_cpu); /* * We may just have run out of memory on the local node. 
* ____cache_alloc_node() knows how to locate memory on other nodes */ - if (!objp) - objp = ____cache_alloc_node(cache, flags, numa_node_id()); - + if (!objp) + objp = ____cache_alloc_node(cache, flags, + cpu_to_node(*this_cpu), this_cpu); out: return objp; } #else static __always_inline void * -__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags) +__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { - return ____cache_alloc(cachep, flags); + return ____cache_alloc(cachep, flags, this_cpu); } #endif /* CONFIG_NUMA */ @@ -3458,15 +3584,16 @@ static __always_inline void * __cache_alloc(struct kmem_cache *cachep, gfp_t flags, void *caller) { unsigned long save_flags; + int this_cpu; void *objp; if (should_failslab(cachep, flags)) return NULL; cache_alloc_debugcheck_before(cachep, flags); - local_irq_save(save_flags); - objp = __do_cache_alloc(cachep, flags); - local_irq_restore(save_flags); + slab_irq_save(save_flags, this_cpu); + objp = __do_cache_alloc(cachep, flags, &this_cpu); + slab_irq_restore(save_flags, this_cpu); objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); prefetchw(objp); @@ -3480,7 +3607,7 @@ __cache_alloc(struct kmem_cache *cachep, * Caller needs to acquire correct kmem_list's list_lock */ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects, - int node) + int node, int *this_cpu) { int i; struct kmem_list3 *l3; @@ -3509,7 +3636,7 @@ static void free_block(struct kmem_cache * a different cache, refer to comments before * alloc_slabmgmt. */ - slab_destroy(cachep, slabp); + slab_destroy(cachep, slabp, this_cpu); } else { list_add(&slabp->list, &l3->slabs_free); } @@ -3523,11 +3650,12 @@ static void free_block(struct kmem_cache } } -static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac) +static void +cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac, int *this_cpu) { int batchcount; struct kmem_list3 *l3; - int node = numa_node_id(); + int node = cpu_to_node(*this_cpu); batchcount = ac->batchcount; #if DEBUG @@ -3549,7 +3677,7 @@ static void cache_flusharray(struct kmem } } - free_block(cachep, ac->entry, batchcount, node); + free_block(cachep, ac->entry, batchcount, node, this_cpu); free_done: #if STATS { @@ -3578,9 +3706,9 @@ free_done: * Release an obj back to its cache. If the obj has a constructed state, it must * be in this state _before_ it is released. Called with disabled ints. */ -static inline void __cache_free(struct kmem_cache *cachep, void *objp) +static void __cache_free(struct kmem_cache *cachep, void *objp, int *this_cpu) { - struct array_cache *ac = cpu_cache_get(cachep); + struct array_cache *ac = cpu_cache_get(cachep, *this_cpu); check_irq_off(); objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0)); @@ -3592,7 +3720,7 @@ static inline void __cache_free(struct k * variable to skip the call, which is mostly likely to be present in * the cache. 
*/ - if (numa_platform && cache_free_alien(cachep, objp)) + if (numa_platform && cache_free_alien(cachep, objp, this_cpu)) return; if (likely(ac->avail < ac->limit)) { @@ -3601,7 +3729,7 @@ static inline void __cache_free(struct k return; } else { STATS_INC_FREEMISS(cachep); - cache_flusharray(cachep, ac); + cache_flusharray(cachep, ac, this_cpu); ac->entry[ac->avail++] = objp; } } @@ -3759,11 +3887,12 @@ EXPORT_SYMBOL(__kmalloc); void kmem_cache_free(struct kmem_cache *cachep, void *objp) { unsigned long flags; + int this_cpu; - local_irq_save(flags); + slab_irq_save(flags, this_cpu); debug_check_no_locks_freed(objp, obj_size(cachep)); - __cache_free(cachep, objp); - local_irq_restore(flags); + __cache_free(cachep, objp, &this_cpu); + slab_irq_restore(flags, this_cpu); } EXPORT_SYMBOL(kmem_cache_free); @@ -3780,15 +3909,16 @@ void kfree(const void *objp) { struct kmem_cache *c; unsigned long flags; + int this_cpu; if (unlikely(ZERO_OR_NULL_PTR(objp))) return; - local_irq_save(flags); + slab_irq_save(flags, this_cpu); kfree_debugcheck(objp); c = virt_to_cache(objp); debug_check_no_locks_freed(objp, obj_size(c)); - __cache_free(c, (void *)objp); - local_irq_restore(flags); + __cache_free(c, (void *)objp, &this_cpu); + slab_irq_restore(flags, this_cpu); } EXPORT_SYMBOL(kfree); @@ -3809,7 +3939,7 @@ EXPORT_SYMBOL_GPL(kmem_cache_name); */ static int alloc_kmemlist(struct kmem_cache *cachep) { - int node; + int node, this_cpu; struct kmem_list3 *l3; struct array_cache *new_shared; struct array_cache **new_alien = NULL; @@ -3837,11 +3967,11 @@ static int alloc_kmemlist(struct kmem_ca if (l3) { struct array_cache *shared = l3->shared; - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (shared) free_block(cachep, shared->entry, - shared->avail, node); + shared->avail, node, &this_cpu); l3->shared = new_shared; if (!l3->alien) { @@ -3850,7 +3980,7 @@ static int alloc_kmemlist(struct kmem_ca } l3->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); kfree(shared); free_alien_cache(new_alien); continue; @@ -3897,42 +4027,50 @@ struct ccupdate_struct { struct array_cache *new[NR_CPUS]; }; -static void do_ccupdate_local(void *info) +static void __do_ccupdate_local(void *info, int this_cpu) { struct ccupdate_struct *new = info; struct array_cache *old; check_irq_off(); - old = cpu_cache_get(new->cachep); + old = cpu_cache_get(new->cachep, this_cpu); - new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()]; - new->new[smp_processor_id()] = old; + new->cachep->array[this_cpu] = new->new[this_cpu]; + new->new[this_cpu] = old; } +#ifdef CONFIG_PREEMPT_RT +static void do_ccupdate_local(void *arg, int this_cpu) +{ + __do_ccupdate_local(arg, this_cpu); +} +#else +static void do_ccupdate_local(void *arg) +{ + __do_ccupdate_local(arg, smp_processor_id()); +} +#endif + /* Always called with the cache_chain_mutex held */ static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount, int shared) { - struct ccupdate_struct *new; - int i; - - new = kzalloc(sizeof(*new), GFP_KERNEL); - if (!new) - return -ENOMEM; + struct ccupdate_struct new; + int i, this_cpu; + memset(&new.new, 0, sizeof(new.new)); for_each_online_cpu(i) { - new->new[i] = alloc_arraycache(cpu_to_node(i), limit, + new.new[i] = alloc_arraycache(cpu_to_node(i), limit, batchcount); - if (!new->new[i]) { + if (!new.new[i]) { for (i--; i >= 0; i--) - kfree(new->new[i]); - kfree(new); + 
kfree(new.new[i]); return -ENOMEM; } } - new->cachep = cachep; + new.cachep = cachep; - on_each_cpu(do_ccupdate_local, (void *)new, 1, 1); + slab_on_each_cpu(do_ccupdate_local, (void *)&new); check_irq_on(); cachep->batchcount = batchcount; @@ -3940,15 +4078,15 @@ static int do_tune_cpucache(struct kmem_ cachep->shared = shared; for_each_online_cpu(i) { - struct array_cache *ccold = new->new[i]; + struct array_cache *ccold = new.new[i]; if (!ccold) continue; - spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); - free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i)); - spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock); + slab_spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock, this_cpu); + free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i), &this_cpu); + slab_spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock, this_cpu); kfree(ccold); } - kfree(new); + return alloc_kmemlist(cachep); } @@ -4012,26 +4150,26 @@ static int enable_cpucache(struct kmem_c * if drain_array() is used on the shared array. */ void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, - struct array_cache *ac, int force, int node) + struct array_cache *ac, int force, int node) { - int tofree; + int tofree, this_cpu; if (!ac || !ac->avail) return; if (ac->touched && !force) { ac->touched = 0; } else { - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); if (ac->avail) { tofree = force ? ac->avail : (ac->limit + 4) / 5; if (tofree > ac->avail) tofree = (ac->avail + 1) / 2; - free_block(cachep, ac->entry, tofree, node); + free_block(cachep, ac->entry, tofree, node, &this_cpu); ac->avail -= tofree; memmove(ac->entry, &(ac->entry[tofree]), sizeof(void *) * ac->avail); } - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } } @@ -4049,11 +4187,12 @@ void drain_array(struct kmem_cache *cach */ static void cache_reap(struct work_struct *w) { + int this_cpu = raw_smp_processor_id(), node = cpu_to_node(this_cpu); struct kmem_cache *searchp; struct kmem_list3 *l3; - int node = numa_node_id(); struct delayed_work *work = container_of(w, struct delayed_work, work); + int work_done = 0; if (!mutex_trylock(&cache_chain_mutex)) /* Give up. Setup the next iteration. 
*/ @@ -4069,9 +4208,10 @@ static void cache_reap(struct work_struc */ l3 = searchp->nodelists[node]; - reap_alien(searchp, l3); + reap_alien(searchp, l3, &this_cpu); - drain_array(searchp, l3, cpu_cache_get(searchp), 0, node); + drain_array(searchp, l3, cpu_cache_get(searchp, this_cpu), + 0, node); /* * These are racy checks but it does not matter @@ -4160,7 +4300,7 @@ static int s_show(struct seq_file *m, vo unsigned long num_slabs, free_objects = 0, shared_avail = 0; const char *name; char *error = NULL; - int node; + int this_cpu, node; struct kmem_list3 *l3; active_objs = 0; @@ -4171,7 +4311,7 @@ static int s_show(struct seq_file *m, vo continue; check_irq_on(); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); list_for_each_entry(slabp, &l3->slabs_full, list) { if (slabp->inuse != cachep->num && !error) @@ -4196,7 +4336,7 @@ static int s_show(struct seq_file *m, vo if (l3->shared) shared_avail += l3->shared->avail; - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } num_slabs += active_slabs; num_objs = num_slabs * cachep->num; @@ -4392,7 +4532,7 @@ static int leaks_show(struct seq_file *m struct kmem_list3 *l3; const char *name; unsigned long *n = m->private; - int node; + int node, this_cpu; int i; if (!(cachep->flags & SLAB_STORE_USER)) @@ -4410,13 +4550,13 @@ static int leaks_show(struct seq_file *m continue; check_irq_on(); - spin_lock_irq(&l3->list_lock); + slab_spin_lock_irq(&l3->list_lock, this_cpu); list_for_each_entry(slabp, &l3->slabs_full, list) handle_slab(n, cachep, slabp); list_for_each_entry(slabp, &l3->slabs_partial, list) handle_slab(n, cachep, slabp); - spin_unlock_irq(&l3->list_lock); + slab_spin_unlock_irq(&l3->list_lock, this_cpu); } name = cachep->name; if (n[0] == n[1]) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-page_alloc.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000014510�11041657732�014776� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt-friendly per-cpu pages From: Ingo Molnar <mingo@elte.hu> rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method. 
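For illustration, a minimal sketch of the conversion pattern described above (not part of the patch itself): it uses the lock_cpu_pcp()/unlock_cpu_pcp() helpers and the __drain_pages() function introduced in the diff below, wrapped in a hypothetical caller named example_drain(). On !PREEMPT_RT the helpers fall back to plain local_irq_save()/local_irq_restore(), so the behaviour is unchanged there; on PREEMPT_RT the explicit per-CPU lock makes the section preemptible.

/*
 * Illustrative sketch only, assuming the helpers defined in this patch.
 * Old pattern:
 *	local_irq_save(flags);
 *	__drain_pages(smp_processor_id());
 *	local_irq_restore(flags);
 */
static void example_drain(void)		/* hypothetical wrapper */
{
	unsigned long flags;
	int this_cpu;

	lock_cpu_pcp(&flags, &this_cpu);	/* per-CPU lock on RT, irq-off otherwise */
	__drain_pages(this_cpu);		/* operate on this CPU's pcp lists */
	unlock_cpu_pcp(flags, this_cpu);
}
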
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- mm/page_alloc.c | 107 ++++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 77 insertions(+), 30 deletions(-) Index: linux-2.6.24.7/mm/page_alloc.c =================================================================== --- linux-2.6.24.7.orig/mm/page_alloc.c +++ linux-2.6.24.7/mm/page_alloc.c @@ -159,6 +159,53 @@ static unsigned long __meminitdata dma_r EXPORT_SYMBOL(movable_zone); #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */ +#ifdef CONFIG_PREEMPT_RT +static DEFINE_PER_CPU_LOCKED(int, pcp_locks); +#endif + +static inline void __lock_cpu_pcp(unsigned long *flags, int cpu) +{ +#ifdef CONFIG_PREEMPT_RT + spin_lock(&__get_cpu_lock(pcp_locks, cpu)); + flags = 0; +#else + local_irq_save(*flags); +#endif +} + +static inline void lock_cpu_pcp(unsigned long *flags, int *this_cpu) +{ +#ifdef CONFIG_PREEMPT_RT + (void)get_cpu_var_locked(pcp_locks, this_cpu); + flags = 0; +#else + local_irq_save(*flags); + *this_cpu = smp_processor_id(); +#endif +} + +static inline void unlock_cpu_pcp(unsigned long flags, int this_cpu) +{ +#ifdef CONFIG_PREEMPT_RT + put_cpu_var_locked(pcp_locks, this_cpu); +#else + local_irq_restore(flags); +#endif +} + +static struct per_cpu_pageset * +get_zone_pcp(struct zone *zone, unsigned long *flags, int *this_cpu) +{ + lock_cpu_pcp(flags, this_cpu); + return zone_pcp(zone, *this_cpu); +} + +static void +put_zone_pcp(struct zone *zone, unsigned long flags, int this_cpu) +{ + unlock_cpu_pcp(flags, this_cpu); +} + #if MAX_NUMNODES > 1 int nr_node_ids __read_mostly = MAX_NUMNODES; EXPORT_SYMBOL(nr_node_ids); @@ -410,8 +457,8 @@ static inline int page_is_buddy(struct p * -- wli */ -static inline void __free_one_page(struct page *page, - struct zone *zone, unsigned int order) +static inline void +__free_one_page(struct page *page, struct zone *zone, unsigned int order) { unsigned long page_idx; int order_size = 1 << order; @@ -515,8 +562,9 @@ static void free_one_page(struct zone *z static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; - int i; int reserved = 0; + int this_cpu; + int i; for (i = 0 ; i < (1 << order) ; ++i) reserved += free_pages_check(page + i); @@ -528,10 +576,10 @@ static void __free_pages_ok(struct page arch_free_page(page, order); kernel_map_pages(page, 1 << order, 0); - local_irq_save(flags); - __count_vm_events(PGFREE, 1 << order); + lock_cpu_pcp(&flags, &this_cpu); + count_vm_events(PGFREE, 1 << order); free_one_page(page_zone(page), page, order); - local_irq_restore(flags); + unlock_cpu_pcp(flags, this_cpu); } /* @@ -876,23 +924,19 @@ static int rmqueue_bulk(struct zone *zon */ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { - unsigned long flags; int to_drain; - local_irq_save(flags); if (pcp->count >= pcp->batch) to_drain = pcp->batch; else to_drain = pcp->count; free_pages_bulk(zone, to_drain, &pcp->list, 0); pcp->count -= to_drain; - local_irq_restore(flags); } #endif static void __drain_pages(unsigned int cpu) { - unsigned long flags; struct zone *zone; int i; @@ -903,14 +947,16 @@ static void __drain_pages(unsigned int c continue; pset = zone_pcp(zone, cpu); + if (!pset) { + WARN_ON(1); + continue; + } for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) { struct per_cpu_pages *pcp; pcp = &pset->pcp[i]; - local_irq_save(flags); free_pages_bulk(zone, pcp->count, &pcp->list, 0); pcp->count = 0; - local_irq_restore(flags); } } } @@ -957,10 +1003,11 @@ void mark_free_pages(struct zone *zone) void drain_local_pages(void) { unsigned long flags; + 
int this_cpu; - local_irq_save(flags); - __drain_pages(smp_processor_id()); - local_irq_restore(flags); + lock_cpu_pcp(&flags, &this_cpu); + __drain_pages(this_cpu); + unlock_cpu_pcp(flags, this_cpu); } void smp_drain_local_pages(void *arg) @@ -988,8 +1035,10 @@ void drain_all_local_pages(void) static void fastcall free_hot_cold_page(struct page *page, int cold) { struct zone *zone = page_zone(page); + struct per_cpu_pageset *pset; struct per_cpu_pages *pcp; unsigned long flags; + int this_cpu; if (PageAnon(page)) page->mapping = NULL; @@ -1001,9 +1050,11 @@ static void fastcall free_hot_cold_page( arch_free_page(page, 0); kernel_map_pages(page, 1, 0); - pcp = &zone_pcp(zone, get_cpu())->pcp[cold]; - local_irq_save(flags); - __count_vm_event(PGFREE); + pset = get_zone_pcp(zone, &flags, &this_cpu); + pcp = &pset->pcp[cold]; + + count_vm_event(PGFREE); + list_add(&page->lru, &pcp->list); set_page_private(page, get_pageblock_migratetype(page)); pcp->count++; @@ -1011,8 +1062,7 @@ static void fastcall free_hot_cold_page( free_pages_bulk(zone, pcp->batch, &pcp->list, 0); pcp->count -= pcp->batch; } - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); } void fastcall free_hot_page(struct page *page) @@ -1054,16 +1104,15 @@ static struct page *buffered_rmqueue(str unsigned long flags; struct page *page; int cold = !!(gfp_flags & __GFP_COLD); - int cpu; + struct per_cpu_pageset *pset; int migratetype = allocflags_to_migratetype(gfp_flags); + int this_cpu; again: - cpu = get_cpu(); + pset = get_zone_pcp(zone, &flags, &this_cpu); if (likely(order == 0)) { - struct per_cpu_pages *pcp; + struct per_cpu_pages *pcp = &pset->pcp[cold]; - pcp = &zone_pcp(zone, cpu)->pcp[cold]; - local_irq_save(flags); if (!pcp->count) { pcp->count = rmqueue_bulk(zone, 0, pcp->batch, &pcp->list, migratetype); @@ -1086,7 +1135,7 @@ again: list_del(&page->lru); pcp->count--; } else { - spin_lock_irqsave(&zone->lock, flags); + spin_lock(&zone->lock); page = __rmqueue(zone, order, migratetype); spin_unlock(&zone->lock); if (!page) @@ -1095,8 +1144,7 @@ again: __count_zone_vm_events(PGALLOC, zone, 1 << order); zone_statistics(zonelist, zone); - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); VM_BUG_ON(bad_range(zone, page)); if (prep_new_page(page, order, gfp_flags)) @@ -1104,8 +1152,7 @@ again: return page; failed: - local_irq_restore(flags); - put_cpu(); + put_zone_pcp(zone, flags, this_cpu); return NULL; } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-preempt-debugging.patch������������������������������������������������������������0000664�0000764�0000764�00000013313�11041657733�017456� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/preempt.h | 18 +++++++++++++++--- include/linux/smp.h | 2 +- init/main.c | 2 +- kernel/sched.c | 24 ++++++++++++++++++++++-- kernel/softirq.c | 6 +++--- kernel/stop_machine.c | 2 +- 6 files changed, 43 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/include/linux/preempt.h =================================================================== --- 
linux-2.6.24.7.orig/include/linux/preempt.h +++ linux-2.6.24.7/include/linux/preempt.h @@ -9,6 +9,7 @@ #include <linux/thread_info.h> #include <linux/linkage.h> #include <linux/list.h> +#include <linux/thread_info.h> #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER) || \ defined(CONFIG_PREEMPT_TRACE) @@ -22,11 +23,12 @@ #define inc_preempt_count() add_preempt_count(1) #define dec_preempt_count() sub_preempt_count(1) -#define preempt_count() (current_thread_info()->preempt_count) +#define preempt_count() (current_thread_info()->preempt_count) #ifdef CONFIG_PREEMPT asmlinkage void preempt_schedule(void); +asmlinkage void preempt_schedule_irq(void); #define preempt_disable() \ do { \ @@ -34,12 +36,19 @@ do { \ barrier(); \ } while (0) -#define preempt_enable_no_resched() \ +#define __preempt_enable_no_resched() \ do { \ barrier(); \ dec_preempt_count(); \ } while (0) + +#ifdef CONFIG_DEBUG_PREEMPT +extern void notrace preempt_enable_no_resched(void); +#else +# define preempt_enable_no_resched() __preempt_enable_no_resched() +#endif + #define preempt_check_resched() \ do { \ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \ @@ -48,7 +57,7 @@ do { \ #define preempt_enable() \ do { \ - preempt_enable_no_resched(); \ + __preempt_enable_no_resched(); \ barrier(); \ preempt_check_resched(); \ } while (0) @@ -85,6 +94,7 @@ do { \ #define preempt_disable() do { } while (0) #define preempt_enable_no_resched() do { } while (0) +#define __preempt_enable_no_resched() do { } while (0) #define preempt_enable() do { } while (0) #define preempt_check_resched() do { } while (0) @@ -92,6 +102,8 @@ do { \ #define preempt_enable_no_resched_notrace() do { } while (0) #define preempt_enable_notrace() do { } while (0) +#define preempt_schedule_irq() do { } while (0) + #endif #ifdef CONFIG_PREEMPT_NOTIFIERS Index: linux-2.6.24.7/include/linux/smp.h =================================================================== --- linux-2.6.24.7.orig/include/linux/smp.h +++ linux-2.6.24.7/include/linux/smp.h @@ -137,7 +137,7 @@ static inline void smp_send_reschedule(i #define get_cpu() ({ preempt_disable(); smp_processor_id(); }) #define put_cpu() preempt_enable() -#define put_cpu_no_resched() preempt_enable_no_resched() +#define put_cpu_no_resched() __preempt_enable_no_resched() void smp_setup_processor_id(void); Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -446,7 +446,7 @@ static void noinline __init_refok rest_i * at least once to get things moving: */ init_idle_bootup_task(current); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1859,6 +1859,26 @@ fire_sched_out_preempt_notifiers(struct #endif +#ifdef CONFIG_DEBUG_PREEMPT +void notrace preempt_enable_no_resched(void) +{ + static int once = 1; + + barrier(); + dec_preempt_count(); + + if (once && !preempt_count()) { + once = 0; + printk(KERN_ERR "BUG: %s:%d task might have lost a preemption check!\n", + current->comm, current->pid); + dump_stack(); + } +} + +EXPORT_SYMBOL(preempt_enable_no_resched); +#endif + + /** * prepare_task_switch - prepare to switch tasks * @rq: the runqueue preparing to switch @@ -3753,7 +3773,7 @@ need_resched_nonpreemptible: rq = cpu_rq(cpu); goto need_resched_nonpreemptible; } 
- preempt_enable_no_resched(); + __preempt_enable_no_resched(); if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) goto need_resched; } @@ -7051,7 +7071,7 @@ void __init sched_init(void) current->sched_class = &fair_sched_class; } -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) void __might_sleep(char *file, int line) { #ifdef in_atomic Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -413,7 +413,7 @@ void irq_exit(void) tick_nohz_stop_sched_tick(); rcu_irq_exit(); #endif - preempt_enable_no_resched(); + __preempt_enable_no_resched(); } /* @@ -599,7 +599,7 @@ static int ksoftirqd(void * __data) while (!kthread_should_stop()) { preempt_disable(); if (!(local_softirq_pending() & mask)) { - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); } @@ -618,7 +618,7 @@ static int ksoftirqd(void * __data) goto wait_to_die; local_irq_disable(); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); set_softirq_pending(local_softirq_pending() & ~mask); local_bh_disable(); local_irq_enable(); Index: linux-2.6.24.7/kernel/stop_machine.c =================================================================== --- linux-2.6.24.7.orig/kernel/stop_machine.c +++ linux-2.6.24.7/kernel/stop_machine.c @@ -133,7 +133,7 @@ static void restart_machine(void) { stopmachine_set_state(STOPMACHINE_EXIT); local_irq_enable(); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); } struct stop_machine_data ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-irq-flags-checking.patch�����������������������������������������������������������0000664�0000764�0000764�00000004616�11041657733�017515� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/irqflags.h | 37 ++++++++++++++++++++++++++++++++----- 1 file changed, 32 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/irqflags.h =================================================================== --- linux-2.6.24.7.orig/include/linux/irqflags.h +++ linux-2.6.24.7/include/linux/irqflags.h @@ -11,6 +11,12 @@ #ifndef _LINUX_TRACE_IRQFLAGS_H #define _LINUX_TRACE_IRQFLAGS_H +#define BUILD_CHECK_IRQ_FLAGS(flags) \ + do { \ + BUILD_BUG_ON(sizeof(flags) != sizeof(unsigned long)); \ + typecheck(unsigned long, flags); \ + } while (0) + #ifdef CONFIG_TRACE_IRQFLAGS extern void trace_hardirqs_on(void); extern void trace_hardirqs_off(void); @@ -59,10 +65,15 @@ #define local_irq_disable() \ do { raw_local_irq_disable(); trace_hardirqs_off(); } while (0) #define local_irq_save(flags) \ - do { raw_local_irq_save(flags); trace_hardirqs_off(); } while (0) + do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + raw_local_irq_save(flags); \ + trace_hardirqs_off(); \ + } while (0) #define local_irq_restore(flags) \ do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ if 
(raw_irqs_disabled_flags(flags)) { \ raw_local_irq_restore(flags); \ trace_hardirqs_off(); \ @@ -78,8 +89,16 @@ */ # define raw_local_irq_disable() local_irq_disable() # define raw_local_irq_enable() local_irq_enable() -# define raw_local_irq_save(flags) local_irq_save(flags) -# define raw_local_irq_restore(flags) local_irq_restore(flags) +# define raw_local_irq_save(flags) \ + do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + local_irq_save(flags); \ + } while (0) +# define raw_local_irq_restore(flags) \ + do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + local_irq_restore(flags); \ + } while (0) #endif /* CONFIG_TRACE_IRQFLAGS_SUPPORT */ #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT @@ -89,7 +108,11 @@ raw_safe_halt(); \ } while (0) -#define local_save_flags(flags) raw_local_save_flags(flags) +#define local_save_flags(flags) \ + do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + raw_local_save_flags(flags); \ + } while (0) #define irqs_disabled() \ ({ \ @@ -99,7 +122,11 @@ raw_irqs_disabled_flags(flags); \ }) -#define irqs_disabled_flags(flags) raw_irqs_disabled_flags(flags) +#define irqs_disabled_flags(flags) \ +({ \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + raw_irqs_disabled_flags(flags); \ +}) #endif /* CONFIG_X86 */ #endif ������������������������������������������������������������������������������������������������������������������patches/rt-mutex-trivial-tcp-preempt-fix.patch������������������������������������������������������0000664�0000764�0000764�00000001254�11041657731�020544� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- net/ipv4/tcp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/net/ipv4/tcp.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/tcp.c +++ linux-2.6.24.7/net/ipv4/tcp.c @@ -1155,11 +1155,11 @@ int tcp_recvmsg(struct kiocb *iocb, stru (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) && !sysctl_tcp_low_latency && __get_cpu_var(softnet_data).net_dma) { - preempt_enable_no_resched(); + preempt_enable(); tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len); } else { - preempt_enable_no_resched(); + preempt_enable(); } } #endif ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-trivial-route-cast-fix.patch�������������������������������������������������������0000664�0000764�0000764�00000000771�11041657732�020376� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- net/ipv4/route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/net/ipv4/route.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/route.c +++ 
linux-2.6.24.7/net/ipv4/route.c @@ -240,7 +240,7 @@ static spinlock_t *rt_hash_locks; spin_lock_init(&rt_hash_locks[i]); \ } #else -# define rt_hash_lock_addr(slot) NULL +# define rt_hash_lock_addr(slot) ((spinlock_t *)NULL) # define rt_hash_lock_init() #endif �������patches/rt-mutex-delayed-resched.patch��������������������������������������������������������������0000664�0000764�0000764�00000010117�11041657733�017072� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/acpi/processor_idle.c | 6 +++--- include/linux/preempt.h | 16 ++++++++++++++++ include/linux/sched.h | 22 +++++++++++++++++++++- kernel/sched.c | 4 +++- 4 files changed, 43 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/processor_idle.c +++ linux-2.6.24.7/drivers/acpi/processor_idle.c @@ -209,7 +209,7 @@ static void acpi_safe_halt(void) * test NEED_RESCHED: */ smp_mb(); - if (!need_resched()) + if (!need_resched() || !need_resched_delayed()) safe_halt(); current_thread_info()->status |= TS_POLLING; } @@ -1417,7 +1417,7 @@ static int acpi_idle_enter_simple(struct */ smp_mb(); - if (unlikely(need_resched())) { + if (unlikely(need_resched() || need_resched_delayed())) { current_thread_info()->status |= TS_POLLING; local_irq_enable(); return 0; @@ -1503,7 +1503,7 @@ static int acpi_idle_enter_bm(struct cpu */ smp_mb(); - if (unlikely(need_resched())) { + if (unlikely(need_resched() || need_resched_delayed())) { current_thread_info()->status |= TS_POLLING; local_irq_enable(); return 0; Index: linux-2.6.24.7/include/linux/preempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/preempt.h +++ linux-2.6.24.7/include/linux/preempt.h @@ -55,6 +55,21 @@ do { \ preempt_schedule(); \ } while (0) + +/* + * If the architecture doens't have TIF_NEED_RESCHED_DELAYED + * help it out and define it back to TIF_NEED_RESCHED + */ +#ifndef TIF_NEED_RESCHED_DELAYED +# define TIF_NEED_RESCHED_DELAYED TIF_NEED_RESCHED +#endif + +#define preempt_check_resched_delayed() \ +do { \ + if (unlikely(test_thread_flag(TIF_NEED_RESCHED_DELAYED))) \ + preempt_schedule(); \ +} while (0) + #define preempt_enable() \ do { \ __preempt_enable_no_resched(); \ @@ -97,6 +112,7 @@ do { \ #define __preempt_enable_no_resched() do { } while (0) #define preempt_enable() do { } while (0) #define preempt_check_resched() do { } while (0) +#define preempt_check_resched_delayed() do { } while (0) #define preempt_disable_notrace() do { } while (0) #define preempt_enable_no_resched_notrace() do { } while (0) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1893,11 +1893,31 @@ static inline int signal_pending(struct return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING)); } -static inline int need_resched(void) +static inline int _need_resched(void) { return unlikely(test_thread_flag(TIF_NEED_RESCHED)); } +static inline int need_resched(void) +{ + return _need_resched(); +} + +static inline void 
set_tsk_need_resched_delayed(struct task_struct *tsk) +{ + set_tsk_thread_flag(tsk,TIF_NEED_RESCHED_DELAYED); +} + +static inline void clear_tsk_need_resched_delayed(struct task_struct *tsk) +{ + clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED_DELAYED); +} + +static inline int need_resched_delayed(void) +{ + return unlikely(test_thread_flag(TIF_NEED_RESCHED_DELAYED)); +} + /* * cond_resched() and cond_resched_lock(): latency reduction via * explicit rescheduling in places that are safe. The return Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3735,6 +3735,7 @@ need_resched_nonpreemptible: __update_rq_clock(rq); spin_lock(&rq->lock); clear_tsk_need_resched(prev); + clear_tsk_need_resched_delayed(prev); if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely((prev->state & TASK_INTERRUPTIBLE) && @@ -3774,7 +3775,8 @@ need_resched_nonpreemptible: goto need_resched_nonpreemptible; } __preempt_enable_no_resched(); - if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) + if (unlikely(test_thread_flag(TIF_NEED_RESCHED) || + test_thread_flag(TIF_NEED_RESCHED_DELAYED))) goto need_resched; } EXPORT_SYMBOL(schedule); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-core.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000522705�11041673232�015003� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/input/ff-memless.c | 1 fs/proc/array.c | 28 + include/linux/bit_spinlock.h | 4 include/linux/init_task.h | 3 include/linux/mutex.h | 57 ++ include/linux/plist.h | 4 include/linux/rt_lock.h | 334 +++++++++++++++ include/linux/rtmutex.h | 4 include/linux/rwsem-spinlock.h | 35 - include/linux/rwsem.h | 108 ++++- include/linux/sched.h | 75 ++- include/linux/semaphore.h | 49 ++ include/linux/seqlock.h | 195 ++++++++- include/linux/spinlock.h | 804 +++++++++++++++++++++++++++++--------- include/linux/spinlock_api_smp.h | 91 ++-- include/linux/spinlock_api_up.h | 74 ++- include/linux/spinlock_types.h | 61 ++ include/linux/spinlock_types_up.h | 6 include/linux/spinlock_up.h | 8 kernel/Makefile | 6 kernel/fork.c | 6 kernel/futex.c | 4 kernel/hrtimer.c | 4 kernel/lockdep.c | 2 kernel/rt.c | 571 ++++++++++++++++++++++++++ kernel/rtmutex-debug.c | 108 +---- kernel/rtmutex.c | 433 ++++++++++++++++++-- kernel/rwsem.c | 44 +- kernel/sched.c | 78 ++- kernel/spinlock.c | 266 ++++++++---- lib/dec_and_lock.c | 4 lib/kernel_lock.c | 4 lib/locking-selftest.c | 6 lib/plist.c | 2 lib/rwsem-spinlock.c | 29 - lib/rwsem.c | 6 lib/semaphore-sleepers.c | 16 lib/spinlock_debug.c | 64 +-- 38 files changed, 2919 insertions(+), 675 deletions(-) Index: linux-2.6.24.7/drivers/input/ff-memless.c 
=================================================================== --- linux-2.6.24.7.orig/drivers/input/ff-memless.c +++ linux-2.6.24.7/drivers/input/ff-memless.c @@ -28,6 +28,7 @@ #include <linux/input.h> #include <linux/module.h> #include <linux/mutex.h> +#include <linux/interrupt.h> #include <linux/spinlock.h> #include <linux/jiffies.h> Index: linux-2.6.24.7/fs/proc/array.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/array.c +++ linux-2.6.24.7/fs/proc/array.c @@ -131,17 +131,19 @@ static inline char *task_name(struct tas */ static const char *task_state_array[] = { "R (running)", /* 0 */ - "S (sleeping)", /* 1 */ - "D (disk sleep)", /* 2 */ - "T (stopped)", /* 4 */ - "T (tracing stop)", /* 8 */ - "Z (zombie)", /* 16 */ - "X (dead)" /* 32 */ + "M (running-mutex)", /* 1 */ + "S (sleeping)", /* 2 */ + "D (disk sleep)", /* 4 */ + "T (stopped)", /* 8 */ + "T (tracing stop)", /* 16 */ + "Z (zombie)", /* 32 */ + "X (dead)" /* 64 */ }; static inline const char *get_task_state(struct task_struct *tsk) { unsigned int state = (tsk->state & (TASK_RUNNING | + TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE | TASK_STOPPED | @@ -305,6 +307,19 @@ static inline char *task_context_switch_ p->nivcsw); } +#define get_blocked_on(t) (-1) + +static char *show_blocked_on(struct task_struct *task, char *buffer) +{ + pid_t pid = get_blocked_on(task); + + if (pid < 0) + return buffer; + + return buffer + sprintf(buffer,"BlckOn: %d\n",pid); +} + + int proc_pid_status(struct task_struct *task, char *buffer) { char *orig = buffer; @@ -324,6 +339,7 @@ int proc_pid_status(struct task_struct * buffer = task_show_regs(task, buffer); #endif buffer = task_context_switch_counts(task, buffer); + buffer = show_blocked_on(task,buffer); return buffer - orig; } Index: linux-2.6.24.7/include/linux/bit_spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/bit_spinlock.h +++ linux-2.6.24.7/include/linux/bit_spinlock.h @@ -1,6 +1,8 @@ #ifndef __LINUX_BIT_SPINLOCK_H #define __LINUX_BIT_SPINLOCK_H +#if 0 + /* * bit-based spin_lock() * @@ -91,5 +93,7 @@ static inline int bit_spin_is_locked(int #endif } +#endif + #endif /* __LINUX_BIT_SPINLOCK_H */ Index: linux-2.6.24.7/include/linux/init_task.h =================================================================== --- linux-2.6.24.7.orig/include/linux/init_task.h +++ linux-2.6.24.7/include/linux/init_task.h @@ -10,6 +10,7 @@ #include <linux/pid_namespace.h> #include <linux/user_namespace.h> #include <net/net_namespace.h> +#include <linux/spinlock.h> #define INIT_FDTABLE \ { \ @@ -165,7 +166,7 @@ extern struct group_info init_groups; .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ .fs_excl = ATOMIC_INIT(0), \ - .pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ + .pi_lock = RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ .pids = { \ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \ [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \ Index: linux-2.6.24.7/include/linux/mutex.h =================================================================== --- linux-2.6.24.7.orig/include/linux/mutex.h +++ linux-2.6.24.7/include/linux/mutex.h @@ -12,11 +12,66 @@ #include <linux/list.h> #include <linux/spinlock_types.h> +#include <linux/rt_lock.h> #include <linux/linkage.h> #include <linux/lockdep.h> #include <asm/atomic.h> +#ifdef CONFIG_PREEMPT_RT + +#include <linux/rtmutex.h> + +struct mutex { + struct rt_mutex lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct 
lockdep_map dep_map; +#endif +}; + +#define __MUTEX_INITIALIZER(mutexname) \ + { \ + .lock = __RT_MUTEX_INITIALIZER(mutexname.lock) \ + } + +#define DEFINE_MUTEX(mutexname) \ + struct mutex mutexname = __MUTEX_INITIALIZER(mutexname) + +extern void +_mutex_init(struct mutex *lock, char *name, struct lock_class_key *key); + +extern void __lockfunc _mutex_lock(struct mutex *lock); +extern int __lockfunc _mutex_lock_interruptible(struct mutex *lock); +extern void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass); +extern int __lockfunc _mutex_trylock(struct mutex *lock); +extern void __lockfunc _mutex_unlock(struct mutex *lock); + +#define mutex_is_locked(l) rt_mutex_is_locked(&(l)->lock) +#define mutex_lock(l) _mutex_lock(l) +#define mutex_lock_interruptible(l) _mutex_lock_interruptible(l) +#define mutex_trylock(l) _mutex_trylock(l) +#define mutex_unlock(l) _mutex_unlock(l) +#define mutex_destroy(l) rt_mutex_destroy(&(l)->lock) + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define mutex_lock_nested(l, s) _mutex_lock_nested(l, s) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible_nested(l, s) +#else +# define mutex_lock_nested(l, s) _mutex_lock(l) +# define mutex_lock_interruptible_nested(l, s) \ + _mutex_lock_interruptible(l) +#endif + +# define mutex_init(mutex) \ +do { \ + static struct lock_class_key __key; \ + \ + _mutex_init((mutex), #mutex, &__key); \ +} while (0) + +#else /* * Simple, straightforward mutexes with strict semantics: * @@ -144,3 +199,5 @@ extern int fastcall mutex_trylock(struct extern void fastcall mutex_unlock(struct mutex *lock); #endif + +#endif Index: linux-2.6.24.7/include/linux/plist.h =================================================================== --- linux-2.6.24.7.orig/include/linux/plist.h +++ linux-2.6.24.7/include/linux/plist.h @@ -81,7 +81,7 @@ struct plist_head { struct list_head prio_list; struct list_head node_list; #ifdef CONFIG_DEBUG_PI_LIST - spinlock_t *lock; + raw_spinlock_t *lock; #endif }; @@ -125,7 +125,7 @@ struct plist_node { * @lock: list spinlock, remembered for debugging */ static inline void -plist_head_init(struct plist_head *head, spinlock_t *lock) +plist_head_init(struct plist_head *head, raw_spinlock_t *lock) { INIT_LIST_HEAD(&head->prio_list); INIT_LIST_HEAD(&head->node_list); Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -0,0 +1,334 @@ +#ifndef __LINUX_RT_LOCK_H +#define __LINUX_RT_LOCK_H + +/* + * Real-Time Preemption Support + * + * started by Ingo Molnar: + * + * Copyright (C) 2004, 2005 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> + * + * This file contains the main data structure definitions. 
+ */ +#include <linux/rtmutex.h> +#include <asm/atomic.h> +#include <linux/spinlock_types.h> + +#ifdef CONFIG_PREEMPT_RT +/* + * spinlocks - an RT mutex plus lock-break field: + */ +typedef struct { + struct rt_mutex lock; + unsigned int break_lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} spinlock_t; + +#ifdef CONFIG_DEBUG_RT_MUTEXES +# define __SPIN_LOCK_UNLOCKED(name) \ + (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ + , .save_state = 1, .file = __FILE__, .line = __LINE__ }, SPIN_DEP_MAP_INIT(name) } +#else +# define __SPIN_LOCK_UNLOCKED(name) \ + (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) }, SPIN_DEP_MAP_INIT(name) } +#endif +# define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(spin_old_style) +#else /* !PREEMPT_RT */ + typedef raw_spinlock_t spinlock_t; +# ifdef CONFIG_DEBUG_SPINLOCK +# define _SPIN_LOCK_UNLOCKED \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ + .magic = SPINLOCK_MAGIC, \ + .owner = SPINLOCK_OWNER_INIT, \ + .owner_cpu = -1 } +# else +# define _SPIN_LOCK_UNLOCKED \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED } +# endif +# define SPIN_LOCK_UNLOCKED _SPIN_LOCK_UNLOCKED +# define __SPIN_LOCK_UNLOCKED(name) _SPIN_LOCK_UNLOCKED +#endif + +#define __DEFINE_SPINLOCK(name) \ + spinlock_t name = __SPIN_LOCK_UNLOCKED(name) + +#define DEFINE_SPINLOCK(name) \ + spinlock_t name __cacheline_aligned_in_smp = __SPIN_LOCK_UNLOCKED(name) + +#ifdef CONFIG_PREEMPT_RT + +/* + * RW-semaphores are a spinlock plus a reader-depth count. + * + * Note that the semantics are different from the usual + * Linux rw-sems, in PREEMPT_RT mode we do not allow + * multiple readers to hold the lock at once, we only allow + * a read-lock owner to read-lock recursively. This is + * better for latency, makes the implementation inherently + * fair and makes it simpler as well: + */ +struct rw_semaphore { + struct rt_mutex lock; + int read_depth; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +}; + +/* + * rwlocks - an RW semaphore plus lock-break field: + */ +typedef struct { + struct rt_mutex lock; + int read_depth; + unsigned int break_lock; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif +} rwlock_t; + +# ifdef CONFIG_DEBUG_RT_MUTEXES +# define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ + { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \ + .save_state = 1, .file = __FILE__, .line = __LINE__ } } +# else +# define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ + { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } +# endif +#else /* !PREEMPT_RT */ + + typedef raw_rwlock_t rwlock_t; +# ifdef CONFIG_DEBUG_SPINLOCK +# define _RW_LOCK_UNLOCKED \ + (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ + .magic = RWLOCK_MAGIC, \ + .owner = SPINLOCK_OWNER_INIT, \ + .owner_cpu = -1 } +# else +# define _RW_LOCK_UNLOCKED \ + (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED } +# endif +# define __RW_LOCK_UNLOCKED(name) _RW_LOCK_UNLOCKED +#endif + +#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(rw_old_style) + +#define DEFINE_RWLOCK(name) \ + rwlock_t name __cacheline_aligned_in_smp = __RW_LOCK_UNLOCKED(name) + +#ifdef CONFIG_PREEMPT_RT + +/* + * Semaphores - a spinlock plus the semaphore count: + */ +struct semaphore { + atomic_t count; + struct rt_mutex lock; +}; + +#define DECLARE_MUTEX(name) \ +struct semaphore name = \ + { .count = { 1 }, .lock = __RT_MUTEX_INITIALIZER(name.lock) } + +extern void fastcall +__sema_init(struct semaphore *sem, int val, char *name, char *file, int line); + +#define rt_sema_init(sem, val) \ + 
__sema_init(sem, val, #sem, __FILE__, __LINE__) + +extern void fastcall +__init_MUTEX(struct semaphore *sem, char *name, char *file, int line); +#define rt_init_MUTEX(sem) \ + __init_MUTEX(sem, #sem, __FILE__, __LINE__) + +extern void there_is_no_init_MUTEX_LOCKED_for_RT_semaphores(void); + +/* + * No locked initialization for RT semaphores + */ +#define rt_init_MUTEX_LOCKED(sem) \ + there_is_no_init_MUTEX_LOCKED_for_RT_semaphores() +extern void fastcall rt_down(struct semaphore *sem); +extern int fastcall rt_down_interruptible(struct semaphore *sem); +extern int fastcall rt_down_trylock(struct semaphore *sem); +extern void fastcall rt_up(struct semaphore *sem); + +#define rt_sem_is_locked(s) rt_mutex_is_locked(&(s)->lock) +#define rt_sema_count(s) atomic_read(&(s)->count) + +extern int __bad_func_type(void); + +#undef TYPE_EQUAL +#define TYPE_EQUAL(var, type) \ + __builtin_types_compatible_p(typeof(var), type *) + +#define PICK_FUNC_1ARG(type1, type2, func1, func2, arg) \ +do { \ + if (TYPE_EQUAL((arg), type1)) \ + func1((type1 *)(arg)); \ + else if (TYPE_EQUAL((arg), type2)) \ + func2((type2 *)(arg)); \ + else __bad_func_type(); \ +} while (0) + +#define PICK_FUNC_1ARG_RET(type1, type2, func1, func2, arg) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((arg), type1)) \ + __ret = func1((type1 *)(arg)); \ + else if (TYPE_EQUAL((arg), type2)) \ + __ret = func2((type2 *)(arg)); \ + else __ret = __bad_func_type(); \ + \ + __ret; \ +}) + +#define PICK_FUNC_2ARG(type1, type2, func1, func2, arg0, arg1) \ +do { \ + if (TYPE_EQUAL((arg0), type1)) \ + func1((type1 *)(arg0), arg1); \ + else if (TYPE_EQUAL((arg0), type2)) \ + func2((type2 *)(arg0), arg1); \ + else __bad_func_type(); \ +} while (0) + +#define sema_init(sem, val) \ + PICK_FUNC_2ARG(struct compat_semaphore, struct semaphore, \ + compat_sema_init, rt_sema_init, sem, val) + +#define init_MUTEX(sem) \ + PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ + compat_init_MUTEX, rt_init_MUTEX, sem) + +#define init_MUTEX_LOCKED(sem) \ + PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ + compat_init_MUTEX_LOCKED, rt_init_MUTEX_LOCKED, sem) + +#define down(sem) \ + PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ + compat_down, rt_down, sem) + +#define down_interruptible(sem) \ + PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ + compat_down_interruptible, rt_down_interruptible, sem) + +#define down_trylock(sem) \ + PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ + compat_down_trylock, rt_down_trylock, sem) + +#define up(sem) \ + PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ + compat_up, rt_up, sem) + +#define sem_is_locked(sem) \ + PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ + compat_sem_is_locked, rt_sem_is_locked, sem) + +#define sema_count(sem) \ + PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ + compat_sema_count, rt_sema_count, sem) + +/* + * rwsems: + */ + +#define __RWSEM_INITIALIZER(name) \ + { .lock = __RT_MUTEX_INITIALIZER(name.lock) } + +#define DECLARE_RWSEM(lockname) \ + struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) + +extern void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, + struct lock_class_key *key); + +# define rt_init_rwsem(sem) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_rwsem_init((sem), #sem, &__key); \ +} while (0) + +extern void fastcall rt_down_write(struct rw_semaphore *rwsem); +extern void fastcall +rt_down_read_nested(struct rw_semaphore *rwsem, int 
subclass); +extern void fastcall +rt_down_write_nested(struct rw_semaphore *rwsem, int subclass); +extern void fastcall rt_down_read(struct rw_semaphore *rwsem); +#ifdef CONFIG_DEBUG_LOCK_ALLOC +extern void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem); +#else +# define rt_down_read_non_owner(rwsem) rt_down_read(rwsem) +#endif +extern int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem); +extern int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem); +extern void fastcall rt_up_read(struct rw_semaphore *rwsem); +#ifdef CONFIG_DEBUG_LOCK_ALLOC +extern void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem); +#else +# define rt_up_read_non_owner(rwsem) rt_up_read(rwsem) +#endif +extern void fastcall rt_up_write(struct rw_semaphore *rwsem); +extern void fastcall rt_downgrade_write(struct rw_semaphore *rwsem); + +# define rt_rwsem_is_locked(rws) (rt_mutex_is_locked(&(rws)->lock)) + +#define init_rwsem(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_init_rwsem, rt_init_rwsem, rwsem) + +#define down_read(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_read, rt_down_read, rwsem) + +#define down_read_non_owner(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_read_non_owner, rt_down_read_non_owner, rwsem) + +#define down_read_trylock(rwsem) \ + PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_read_trylock, rt_down_read_trylock, rwsem) + +#define down_write(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_write, rt_down_write, rwsem) + +#define down_read_nested(rwsem, subclass) \ + PICK_FUNC_2ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_read_nested, rt_down_read_nested, rwsem, subclass) + + +#define down_write_nested(rwsem, subclass) \ + PICK_FUNC_2ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_write_nested, rt_down_write_nested, rwsem, subclass) + +#define down_write_trylock(rwsem) \ + PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_down_write_trylock, rt_down_write_trylock, rwsem) + +#define up_read(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_up_read, rt_up_read, rwsem) + +#define up_read_non_owner(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_up_read_non_owner, rt_up_read_non_owner, rwsem) + +#define up_write(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_up_write, rt_up_write, rwsem) + +#define downgrade_write(rwsem) \ + PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_downgrade_write, rt_downgrade_write, rwsem) + +#define rwsem_is_locked(rwsem) \ + PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ + compat_rwsem_is_locked, rt_rwsem_is_locked, rwsem) + +#endif /* CONFIG_PREEMPT_RT */ + +#endif + Index: linux-2.6.24.7/include/linux/rtmutex.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rtmutex.h +++ linux-2.6.24.7/include/linux/rtmutex.h @@ -24,7 +24,7 @@ * @owner: the mutex owner */ struct rt_mutex { - spinlock_t wait_lock; + raw_spinlock_t wait_lock; struct plist_head wait_list; struct task_struct *owner; #ifdef CONFIG_DEBUG_RT_MUTEXES @@ -63,7 +63,7 @@ struct hrtimer_sleeper; #endif #define __RT_MUTEX_INITIALIZER(mutexname) \ - { .wait_lock = 
__SPIN_LOCK_UNLOCKED(mutexname.wait_lock) \ + { .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \ , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, mutexname.wait_lock) \ , .owner = NULL \ __DEBUG_RT_MUTEX_INITIALIZER(mutexname)} Index: linux-2.6.24.7/include/linux/rwsem-spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rwsem-spinlock.h +++ linux-2.6.24.7/include/linux/rwsem-spinlock.h @@ -28,7 +28,7 @@ struct rwsem_waiter; * - if activity is -1 then there is one active writer * - if wait_list is not empty, then there are processes waiting for the semaphore */ -struct rw_semaphore { +struct compat_rw_semaphore { __s32 activity; spinlock_t wait_lock; struct list_head wait_list; @@ -43,33 +43,32 @@ struct rw_semaphore { # define __RWSEM_DEP_MAP_INIT(lockname) #endif -#define __RWSEM_INITIALIZER(name) \ -{ 0, __SPIN_LOCK_UNLOCKED(name.wait_lock), LIST_HEAD_INIT((name).wait_list) \ - __RWSEM_DEP_MAP_INIT(name) } +#define __COMPAT_RWSEM_INITIALIZER(name) \ +{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __COMPAT_RWSEM_INITIALIZER(name) -extern void __init_rwsem(struct rw_semaphore *sem, const char *name, +extern void __compat_init_rwsem(struct compat_rw_semaphore *sem, const char *name, struct lock_class_key *key); -#define init_rwsem(sem) \ +#define compat_init_rwsem(sem) \ do { \ static struct lock_class_key __key; \ \ - __init_rwsem((sem), #sem, &__key); \ + __compat_init_rwsem((sem), #sem, &__key); \ } while (0) -extern void FASTCALL(__down_read(struct rw_semaphore *sem)); -extern int FASTCALL(__down_read_trylock(struct rw_semaphore *sem)); -extern void FASTCALL(__down_write(struct rw_semaphore *sem)); -extern void FASTCALL(__down_write_nested(struct rw_semaphore *sem, int subclass)); -extern int FASTCALL(__down_write_trylock(struct rw_semaphore *sem)); -extern void FASTCALL(__up_read(struct rw_semaphore *sem)); -extern void FASTCALL(__up_write(struct rw_semaphore *sem)); -extern void FASTCALL(__downgrade_write(struct rw_semaphore *sem)); +extern void FASTCALL(__down_read(struct compat_rw_semaphore *sem)); +extern int FASTCALL(__down_read_trylock(struct compat_rw_semaphore *sem)); +extern void FASTCALL(__down_write(struct compat_rw_semaphore *sem)); +extern void FASTCALL(__down_write_nested(struct compat_rw_semaphore *sem, int subclass)); +extern int FASTCALL(__down_write_trylock(struct compat_rw_semaphore *sem)); +extern void FASTCALL(__up_read(struct compat_rw_semaphore *sem)); +extern void FASTCALL(__up_write(struct compat_rw_semaphore *sem)); +extern void FASTCALL(__downgrade_write(struct compat_rw_semaphore *sem)); -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int compat_rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->activity != 0); } Index: linux-2.6.24.7/include/linux/rwsem.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rwsem.h +++ linux-2.6.24.7/include/linux/rwsem.h @@ -9,6 +9,10 @@ #include <linux/linkage.h> +#ifdef CONFIG_PREEMPT_RT +# include <linux/rt_lock.h> +#endif + #ifdef __KERNEL__ #include <linux/types.h> @@ -16,48 +20,59 @@ #include <asm/system.h> #include <asm/atomic.h> -struct rw_semaphore; +#ifndef CONFIG_PREEMPT_RT +/* + * On !PREEMPT_RT all rw-semaphores are compat: + */ +#define 
compat_rw_semaphore rw_semaphore +#endif + +struct compat_rw_semaphore; #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK -#include <linux/rwsem-spinlock.h> /* use a generic implementation */ +# include <linux/rwsem-spinlock.h> /* use a generic implementation */ +# ifndef CONFIG_PREEMPT_RT +# define __RWSEM_INITIALIZER __COMPAT_RWSEM_INITIALIZER +# define DECLARE_RWSEM COMPAT_DECLARE_RWSEM +# endif #else -#include <asm/rwsem.h> /* use an arch-specific implementation */ +# include <asm/rwsem.h> /* use an arch-specific implementation */ #endif /* * lock for reading */ -extern void down_read(struct rw_semaphore *sem); +extern void compat_down_read(struct compat_rw_semaphore *sem); /* * trylock for reading -- returns 1 if successful, 0 if contention */ -extern int down_read_trylock(struct rw_semaphore *sem); +extern int compat_down_read_trylock(struct compat_rw_semaphore *sem); /* * lock for writing */ -extern void down_write(struct rw_semaphore *sem); +extern void compat_down_write(struct compat_rw_semaphore *sem); /* * trylock for writing -- returns 1 if successful, 0 if contention */ -extern int down_write_trylock(struct rw_semaphore *sem); +extern int compat_down_write_trylock(struct compat_rw_semaphore *sem); /* * release a read lock */ -extern void up_read(struct rw_semaphore *sem); +extern void compat_up_read(struct compat_rw_semaphore *sem); /* * release a write lock */ -extern void up_write(struct rw_semaphore *sem); +extern void compat_up_write(struct compat_rw_semaphore *sem); /* * downgrade write lock to read lock */ -extern void downgrade_write(struct rw_semaphore *sem); +extern void compat_downgrade_write(struct compat_rw_semaphore *sem); #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -73,22 +88,79 @@ extern void downgrade_write(struct rw_se * lockdep_set_class() at lock initialization time. * See Documentation/lockdep-design.txt for more details.) */ -extern void down_read_nested(struct rw_semaphore *sem, int subclass); -extern void down_write_nested(struct rw_semaphore *sem, int subclass); +extern void +compat_down_read_nested(struct compat_rw_semaphore *sem, int subclass); +extern void +compat_down_write_nested(struct compat_rw_semaphore *sem, int subclass); /* * Take/release a lock when not the owner will release it. * * [ This API should be avoided as much as possible - the * proper abstraction for this case is completions. ] */ -extern void down_read_non_owner(struct rw_semaphore *sem); -extern void up_read_non_owner(struct rw_semaphore *sem); +extern void +compat_down_read_non_owner(struct compat_rw_semaphore *sem); +extern void +compat_up_read_non_owner(struct compat_rw_semaphore *sem); #else -# define down_read_nested(sem, subclass) down_read(sem) -# define down_write_nested(sem, subclass) down_write(sem) -# define down_read_non_owner(sem) down_read(sem) -# define up_read_non_owner(sem) up_read(sem) +# define compat_down_read_nested(sem, subclass) compat_down_read(sem) +# define compat_down_write_nested(sem, subclass) compat_down_write(sem) +# define compat_down_read_non_owner(sem) compat_down_read(sem) +# define compat_up_read_non_owner(sem) compat_up_read(sem) #endif +#ifndef CONFIG_PREEMPT_RT + +#define DECLARE_RWSEM COMPAT_DECLARE_RWSEM + +/* + * NOTE, lockdep: this has to be a macro, so that separate class-keys + * get generated by the compiler, if the same function does multiple + * init_rwsem() calls to different rwsems. 
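
The PICK_FUNC_* macros introduced in rt_lock.h above dispatch on the static type of the argument, so a single call site such as down(&sem) resolves at compile time to either the compat (classic) or the RT implementation. The following userspace toy is illustrative only: compat_down()/rt_down() bodies and all names are made up, and it needs gcc (typeof, __builtin_types_compatible_p). The real macros also carry an else branch calling __bad_func_type(), which appears to be left undefined so that an unsupported type becomes a link-time error once the dead branches are optimized away; the toy drops that branch so it links even at -O0.

#include <stdio.h>

struct compat_semaphore { int count; };
struct semaphore        { int count; };

static void compat_down(struct compat_semaphore *sem)
{
	(void)sem;
	printf("compat_down: classic semaphore path\n");
}

static void rt_down(struct semaphore *sem)
{
	(void)sem;
	printf("rt_down: rt_mutex-based path\n");
}

#define TYPE_EQUAL(var, type) \
	__builtin_types_compatible_p(typeof(var), type *)

/* Same shape as PICK_FUNC_1ARG() above, minus the __bad_func_type() branch. */
#define PICK_FUNC_1ARG(type1, type2, func1, func2, arg)		\
do {								\
	if (TYPE_EQUAL((arg), type1))				\
		func1((type1 *)(arg));				\
	else if (TYPE_EQUAL((arg), type2))			\
		func2((type2 *)(arg));				\
} while (0)

#define down(sem) \
	PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \
		       compat_down, rt_down, sem)

int main(void)
{
	struct compat_semaphore csem;
	struct semaphore sem;

	down(&csem);	/* resolves to compat_down() at compile time */
	down(&sem);	/* resolves to rt_down() */
	return 0;
}
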
+ */ +#define init_rwsem(rwsem) compat_init_rwsem(rwsem) + +static inline void down_read(struct compat_rw_semaphore *rwsem) +{ + compat_down_read(rwsem); +} +static inline int down_read_trylock(struct compat_rw_semaphore *rwsem) +{ + return compat_down_read_trylock(rwsem); +} +static inline void down_write(struct compat_rw_semaphore *rwsem) +{ + compat_down_write(rwsem); +} +static inline int down_write_trylock(struct compat_rw_semaphore *rwsem) +{ + return compat_down_write_trylock(rwsem); +} +static inline void up_read(struct compat_rw_semaphore *rwsem) +{ + compat_up_read(rwsem); +} +static inline void up_write(struct compat_rw_semaphore *rwsem) +{ + compat_up_write(rwsem); +} +static inline void downgrade_write(struct compat_rw_semaphore *rwsem) +{ + compat_downgrade_write(rwsem); +} +static inline int rwsem_is_locked(struct compat_rw_semaphore *sem) +{ + return compat_rwsem_is_locked(sem); +} +# define down_read_nested(sem, subclass) \ + compat_down_read_nested(sem, subclass) +# define down_write_nested(sem, subclass) \ + compat_down_write_nested(sem, subclass) +# define down_read_non_owner(sem) \ + compat_down_read_non_owner(sem) +# define up_read_non_owner(sem) \ + compat_up_read_non_owner(sem) +#endif /* !CONFIG_PREEMPT_RT */ + #endif /* __KERNEL__ */ #endif /* _LINUX_RWSEM_H */ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -168,6 +168,10 @@ print_cfs_rq(struct seq_file *m, int cpu } #endif +#ifdef CONFIG_PREEMPT_BKL +extern struct semaphore kernel_sem; +#endif + /* * Task state bitmask. NOTE! These bits are also * encoded in fs/proc/array.c: get_task_state(). @@ -179,15 +183,17 @@ print_cfs_rq(struct seq_file *m, int cpu * mistake. */ #define TASK_RUNNING 0 -#define TASK_INTERRUPTIBLE 1 -#define TASK_UNINTERRUPTIBLE 2 -#define TASK_STOPPED 4 -#define TASK_TRACED 8 +#define TASK_RUNNING_MUTEX 1 +#define TASK_INTERRUPTIBLE 2 +#define TASK_UNINTERRUPTIBLE 4 +#define TASK_STOPPED 8 +#define TASK_TRACED 16 /* in tsk->exit_state */ -#define EXIT_ZOMBIE 16 -#define EXIT_DEAD 32 +#define EXIT_ZOMBIE 32 +#define EXIT_DEAD 64 /* in tsk->state again */ -#define TASK_DEAD 64 +#define TASK_NONINTERACTIVE 128 +#define TASK_DEAD 256 #define __set_task_state(tsk, state_value) \ do { (tsk)->state = (state_value); } while (0) @@ -1120,7 +1126,7 @@ struct task_struct { spinlock_t alloc_lock; /* Protection of the PI data structures: */ - spinlock_t pi_lock; + raw_spinlock_t pi_lock; #ifdef CONFIG_RT_MUTEXES /* PI waiters blocked on a rt_mutex held by this task */ @@ -1163,6 +1169,25 @@ struct task_struct { unsigned long preempt_trace_parent_eip[MAX_PREEMPT_TRACE]; #endif +#define MAX_LOCK_STACK MAX_PREEMPT_TRACE +#ifdef CONFIG_DEBUG_PREEMPT + int lock_count; +# ifdef CONFIG_PREEMPT_RT + struct rt_mutex *owned_lock[MAX_LOCK_STACK]; +# endif +#endif +#ifdef CONFIG_DETECT_SOFTLOCKUP + unsigned long softlockup_count; /* Count to keep track how long the + * thread is in the kernel without + * sleeping. 
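
The task-state renumbering earlier in this sched.h hunk makes room for TASK_RUNNING_MUTEX (bit 0, used by the rtmutex wake-up paths, see wake_up_process_mutex() further down) and extends TASK_STATE_TO_CHAR_STR to "RMSDTtZX", so the new state is reported as 'M'. A small standalone illustration of the resulting bit-to-letter mapping; state_char() is a made-up helper, and the kernel derives the index from the lowest set state bit in much the same way:

#include <stdio.h>

#define TASK_RUNNING		0
#define TASK_RUNNING_MUTEX	1
#define TASK_INTERRUPTIBLE	2
#define TASK_UNINTERRUPTIBLE	4

static char state_char(unsigned long state)
{
	const char *str = "RMSDTtZX";	/* TASK_STATE_TO_CHAR_STR */

	/* no bit set -> 'R'; bit 0 -> 'M', bit 1 -> 'S', bit 2 -> 'D', ... */
	return state ? str[__builtin_ffsl(state)] : str[0];
}

int main(void)
{
	printf("%c %c %c\n", state_char(TASK_RUNNING),
	       state_char(TASK_RUNNING_MUTEX),
	       state_char(TASK_UNINTERRUPTIBLE));	/* prints: R M D */
	return 0;
}
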
+ */ +#endif + /* realtime bits */ + +#ifdef CONFIG_DEBUG_RT_MUTEXES + void *last_kernel_lock; +#endif + /* journalling filesystem info */ void *journal_info; @@ -1409,6 +1434,7 @@ static inline void put_task_struct(struc #define PF_EXITING 0x00000004 /* getting shut down */ #define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ +#define PF_NOSCHED 0x00000020 /* Userspace does not expect scheduling */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ @@ -1545,6 +1571,7 @@ extern struct task_struct *curr_task(int extern void set_curr_task(int cpu, struct task_struct *p); void yield(void); +void __yield(void); /* * The default (Linux) execution domain. @@ -1616,6 +1643,9 @@ extern void do_timer(unsigned long ticks extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state)); extern int FASTCALL(wake_up_process(struct task_struct * tsk)); +extern int FASTCALL(wake_up_process_mutex(struct task_struct * tsk)); +extern int FASTCALL(wake_up_process_sync(struct task_struct * tsk)); +extern int FASTCALL(wake_up_process_mutex_sync(struct task_struct * tsk)); extern void FASTCALL(wake_up_new_task(struct task_struct * tsk, unsigned long clone_flags)); #ifdef CONFIG_SMP @@ -1926,7 +1956,22 @@ static inline int need_resched_delayed(v * cond_resched_softirq() will enable bhs before scheduling. */ extern int cond_resched(void); -extern int cond_resched_lock(spinlock_t * lock); +extern int __cond_resched_raw_spinlock(raw_spinlock_t *lock); +extern int __cond_resched_spinlock(spinlock_t *spinlock); + +#define cond_resched_lock(lock) \ +({ \ + int __ret; \ + \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + __ret = __cond_resched_raw_spinlock((raw_spinlock_t *)lock);\ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + __ret = __cond_resched_spinlock((spinlock_t *)lock); \ + else __ret = __bad_spinlock_type(); \ + \ + __ret; \ +}) + extern int cond_resched_softirq(void); extern int cond_resched_softirq_context(void); extern int cond_resched_hardirq_context(void); @@ -1935,12 +1980,18 @@ extern int cond_resched_hardirq_context( * Does a critical section need to be broken due to another * task waiting?: */ -#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) -# define need_lockbreak(lock) ((lock)->break_lock) +#if (defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)) || defined(CONFIG_PREEMPT_RT) +# define need_lockbreak(lock) ({ int __need = ((lock)->break_lock); if (__need) (lock)->break_lock = 0; __need; }) #else # define need_lockbreak(lock) 0 #endif +#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) +# define need_lockbreak_raw(lock) ({ int __need = ((lock)->break_lock); if (__need) (lock)->break_lock = 0; __need; }) +#else +# define need_lockbreak_raw(lock) 0 +#endif + /* * Does a critical section need to be broken due to another * task waiting or preemption being signalled: @@ -2092,7 +2143,7 @@ static inline void migration_init(void) } #endif -#define TASK_STATE_TO_CHAR_STR "RSDTtZX" +#define TASK_STATE_TO_CHAR_STR "RMSDTtZX" #endif /* __KERNEL__ */ Index: linux-2.6.24.7/include/linux/semaphore.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/semaphore.h @@ -0,0 +1,49 @@ +#ifndef _LINUX_SEMAPHORE_H +#define _LINUX_SEMAPHORE_H + +#ifdef CONFIG_PREEMPT_RT +# include <linux/rt_lock.h> +#else + +#define DECLARE_MUTEX COMPAT_DECLARE_MUTEX + +static 
inline void sema_init(struct compat_semaphore *sem, int val) +{ + compat_sema_init(sem, val); +} +static inline void init_MUTEX(struct compat_semaphore *sem) +{ + compat_init_MUTEX(sem); +} +static inline void init_MUTEX_LOCKED(struct compat_semaphore *sem) +{ + compat_init_MUTEX_LOCKED(sem); +} +static inline void down(struct compat_semaphore *sem) +{ + compat_down(sem); +} +static inline int down_interruptible(struct compat_semaphore *sem) +{ + return compat_down_interruptible(sem); +} +static inline int down_trylock(struct compat_semaphore *sem) +{ + return compat_down_trylock(sem); +} +static inline void up(struct compat_semaphore *sem) +{ + compat_up(sem); +} +static inline int sem_is_locked(struct compat_semaphore *sem) +{ + return compat_sem_is_locked(sem); +} +static inline int sema_count(struct compat_semaphore *sem) +{ + return compat_sema_count(sem); +} + +#endif /* CONFIG_PREEMPT_RT */ + +#endif /* _LINUX_SEMAPHORE_H */ Index: linux-2.6.24.7/include/linux/seqlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/seqlock.h +++ linux-2.6.24.7/include/linux/seqlock.h @@ -32,46 +32,72 @@ typedef struct { unsigned sequence; spinlock_t lock; -} seqlock_t; +} __seqlock_t; + +typedef struct { + unsigned sequence; + raw_spinlock_t lock; +} __raw_seqlock_t; + +#define seqlock_need_resched(seq) lock_need_resched(&(seq)->lock) + +#ifdef CONFIG_PREEMPT_RT +typedef __seqlock_t seqlock_t; +#else +typedef __raw_seqlock_t seqlock_t; +#endif + +typedef __raw_seqlock_t raw_seqlock_t; /* * These macros triggered gcc-3.x compile-time problems. We think these are * OK now. Be cautious. */ -#define __SEQLOCK_UNLOCKED(lockname) \ - { 0, __SPIN_LOCK_UNLOCKED(lockname) } +#define __RAW_SEQLOCK_UNLOCKED(lockname) \ + { 0, RAW_SPIN_LOCK_UNLOCKED(lockname) } + +#ifdef CONFIG_PREEMPT_RT +# define __SEQLOCK_UNLOCKED(lockname) { 0, __SPIN_LOCK_UNLOCKED(lockname) } +#else +# define __SEQLOCK_UNLOCKED(lockname) __RAW_SEQLOCK_UNLOCKED(lockname) +#endif #define SEQLOCK_UNLOCKED \ __SEQLOCK_UNLOCKED(old_style_seqlock_init) -#define seqlock_init(x) \ - do { \ - (x)->sequence = 0; \ - spin_lock_init(&(x)->lock); \ - } while (0) +#define raw_seqlock_init(x) \ + do { *(x) = (raw_seqlock_t) __RAW_SEQLOCK_UNLOCKED(x); spin_lock_init(&(x)->lock); } while (0) + +#define seqlock_init(x) \ + do { *(x) = (seqlock_t) __SEQLOCK_UNLOCKED(x); spin_lock_init(&(x)->lock); } while (0) #define DEFINE_SEQLOCK(x) \ seqlock_t x = __SEQLOCK_UNLOCKED(x) +#define DEFINE_RAW_SEQLOCK(name) \ + raw_seqlock_t name __cacheline_aligned_in_smp = \ + __RAW_SEQLOCK_UNLOCKED(name) + + /* Lock out other writers and update the count. * Acts like a normal spin_lock/unlock. * Don't need preempt_disable() because that is in the spin_lock already. 
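
With the type split above, seqlock_t follows the kernel's lock model (on PREEMPT_RT it is backed by the sleeping spinlock_t), while raw_seqlock_t always keeps a raw, spinning lock underneath. A declaration-only sketch with made-up names; code that cannot tolerate a sleeping lock would pick the raw variant explicitly, everything else keeps the plain type:

static DEFINE_SEQLOCK(example_seq);		/* seqlock_t: spinlock_t-backed, may sleep on RT */
static DEFINE_RAW_SEQLOCK(example_raw_seq);	/* raw_seqlock_t: always raw_spinlock_t-backed */
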
*/ -static inline void write_seqlock(seqlock_t *sl) +static inline void __write_seqlock(seqlock_t *sl) { spin_lock(&sl->lock); ++sl->sequence; smp_wmb(); } -static inline void write_sequnlock(seqlock_t *sl) +static inline void __write_sequnlock(seqlock_t *sl) { smp_wmb(); sl->sequence++; spin_unlock(&sl->lock); } -static inline int write_tryseqlock(seqlock_t *sl) +static inline int __write_tryseqlock(seqlock_t *sl) { int ret = spin_trylock(&sl->lock); @@ -83,7 +109,7 @@ static inline int write_tryseqlock(seqlo } /* Start of read calculation -- fetch last complete writer token */ -static __always_inline unsigned read_seqbegin(const seqlock_t *sl) +static __always_inline unsigned __read_seqbegin(const seqlock_t *sl) { unsigned ret = sl->sequence; smp_rmb(); @@ -98,12 +124,118 @@ static __always_inline unsigned read_seq * * Using xor saves one conditional branch. */ -static __always_inline int read_seqretry(const seqlock_t *sl, unsigned iv) +static inline int __read_seqretry(seqlock_t *sl, unsigned iv) +{ + int ret; + + smp_rmb(); + ret = (iv & 1) | (sl->sequence ^ iv); + /* + * If invalid then serialize with the writer, to make sure we + * are not livelocking it: + */ + if (unlikely(ret)) { + unsigned long flags; + spin_lock_irqsave(&sl->lock, flags); + spin_unlock_irqrestore(&sl->lock, flags); + } + return ret; +} + +static __always_inline void __write_seqlock_raw(raw_seqlock_t *sl) +{ + spin_lock(&sl->lock); + ++sl->sequence; + smp_wmb(); +} + +static __always_inline void __write_sequnlock_raw(raw_seqlock_t *sl) +{ + smp_wmb(); + sl->sequence++; + spin_unlock(&sl->lock); +} + +static __always_inline int __write_tryseqlock_raw(raw_seqlock_t *sl) +{ + int ret = spin_trylock(&sl->lock); + + if (ret) { + ++sl->sequence; + smp_wmb(); + } + return ret; +} + +static __always_inline unsigned __read_seqbegin_raw(const raw_seqlock_t *sl) +{ + unsigned ret = sl->sequence; + smp_rmb(); + return ret; +} + +static __always_inline int __read_seqretry_raw(const raw_seqlock_t *sl, unsigned iv) { smp_rmb(); return (iv & 1) | (sl->sequence ^ iv); } +extern int __bad_seqlock_type(void); + +#define PICK_SEQOP(op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + op##_raw((raw_seqlock_t *)(lock)); \ + else if (TYPE_EQUAL((lock), seqlock_t)) \ + op((seqlock_t *)(lock)); \ + else __bad_seqlock_type(); \ +} while (0) + +#define PICK_SEQOP_RET(op, lock) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + __ret = op##_raw((raw_seqlock_t *)(lock)); \ + else if (TYPE_EQUAL((lock), seqlock_t)) \ + __ret = op((seqlock_t *)(lock)); \ + else __ret = __bad_seqlock_type(); \ + \ + __ret; \ +}) + +#define PICK_SEQOP_CONST_RET(op, lock) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + __ret = op##_raw((const raw_seqlock_t *)(lock));\ + else if (TYPE_EQUAL((lock), seqlock_t)) \ + __ret = op((seqlock_t *)(lock)); \ + else __ret = __bad_seqlock_type(); \ + \ + __ret; \ +}) + +#define PICK_SEQOP2_CONST_RET(op, lock, arg) \ + ({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + __ret = op##_raw((const raw_seqlock_t *)(lock), (arg)); \ + else if (TYPE_EQUAL((lock), seqlock_t)) \ + __ret = op((seqlock_t *)(lock), (arg)); \ + else __ret = __bad_seqlock_type(); \ + \ + __ret; \ +}) + + +#define write_seqlock(sl) PICK_SEQOP(__write_seqlock, sl) +#define write_sequnlock(sl) PICK_SEQOP(__write_sequnlock, sl) +#define write_tryseqlock(sl) PICK_SEQOP_RET(__write_tryseqlock, sl) +#define read_seqbegin(sl) PICK_SEQOP_CONST_RET(__read_seqbegin, 
sl) +#define read_seqretry(sl, iv) PICK_SEQOP2_CONST_RET(__read_seqretry, sl, iv) /* * Version using sequence counter only. @@ -155,30 +287,51 @@ static inline void write_seqcount_end(se s->sequence++; } +#define PICK_IRQOP(op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + op(); \ + else if (TYPE_EQUAL((lock), seqlock_t)) \ + { /* nothing */ } \ + else __bad_seqlock_type(); \ +} while (0) + +#define PICK_IRQOP2(op, arg, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_seqlock_t)) \ + op(arg); \ + else if (TYPE_EQUAL(lock, seqlock_t)) \ + { /* nothing */ } \ + else __bad_seqlock_type(); \ +} while (0) + + + /* * Possible sw/hw IRQ protected versions of the interfaces. */ #define write_seqlock_irqsave(lock, flags) \ - do { local_irq_save(flags); write_seqlock(lock); } while (0) + do { PICK_IRQOP2(local_irq_save, flags, lock); write_seqlock(lock); } while (0) #define write_seqlock_irq(lock) \ - do { local_irq_disable(); write_seqlock(lock); } while (0) + do { PICK_IRQOP(local_irq_disable, lock); write_seqlock(lock); } while (0) #define write_seqlock_bh(lock) \ - do { local_bh_disable(); write_seqlock(lock); } while (0) + do { PICK_IRQOP(local_bh_disable, lock); write_seqlock(lock); } while (0) #define write_sequnlock_irqrestore(lock, flags) \ - do { write_sequnlock(lock); local_irq_restore(flags); } while(0) + do { write_sequnlock(lock); PICK_IRQOP2(local_irq_restore, flags, lock); preempt_check_resched(); } while(0) #define write_sequnlock_irq(lock) \ - do { write_sequnlock(lock); local_irq_enable(); } while(0) + do { write_sequnlock(lock); PICK_IRQOP(local_irq_enable, lock); preempt_check_resched(); } while(0) #define write_sequnlock_bh(lock) \ - do { write_sequnlock(lock); local_bh_enable(); } while(0) + do { write_sequnlock(lock); PICK_IRQOP(local_bh_enable, lock); } while(0) #define read_seqbegin_irqsave(lock, flags) \ - ({ local_irq_save(flags); read_seqbegin(lock); }) + ({ PICK_IRQOP2(local_irq_save, flags, lock); read_seqbegin(lock); }) #define read_seqretry_irqrestore(lock, iv, flags) \ ({ \ int ret = read_seqretry(lock, iv); \ - local_irq_restore(flags); \ + PICK_IRQOP2(local_irq_restore, flags, lock); \ + preempt_check_resched(); \ ret; \ }) Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -44,6 +44,42 @@ * builds the _spin_*() APIs. * * linux/spinlock.h: builds the final spin_*() APIs. 
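
Usage is unchanged by the PICK_SEQOP dispatch above: writers bracket updates with write_seqlock()/write_sequnlock(), and readers loop until read_seqretry() reports a stable sequence. A kernel-style sketch with made-up names; on PREEMPT_RT the same code runs on the sleeping seqlock_t, and a failed read additionally serializes briefly against the writer inside __read_seqretry() to avoid livelocking it:

static DEFINE_SEQLOCK(example_stats_lock);	/* made-up example */
static u64 example_a, example_b;

static void example_update(u64 a, u64 b)
{
	write_seqlock(&example_stats_lock);
	example_a = a;
	example_b = b;
	write_sequnlock(&example_stats_lock);
}

static u64 example_read_sum(void)
{
	unsigned seq;
	u64 sum;

	do {
		seq = read_seqbegin(&example_stats_lock);
		sum = example_a + example_b;
	} while (read_seqretry(&example_stats_lock, seq));

	return sum;
}
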
+ * + * + * Public types and naming conventions: + * ------------------------------------ + * spinlock_t: type: sleep-lock + * raw_spinlock_t: type: spin-lock (debug) + * + * spin_lock([raw_]spinlock_t): API: acquire lock, both types + * + * + * Internal types and naming conventions: + * ------------------------------------- + * __raw_spinlock_t: type: lowlevel spin-lock + * + * _spin_lock(struct rt_mutex): API: acquire sleep-lock + * __spin_lock(raw_spinlock_t): API: acquire spin-lock (highlevel) + * _raw_spin_lock(raw_spinlock_t): API: acquire spin-lock (debug) + * __raw_spin_lock(__raw_spinlock_t): API: acquire spin-lock (lowlevel) + * + * + * spin_lock(raw_spinlock_t) translates into the following chain of + * calls/inlines/macros, if spin-lock debugging is enabled: + * + * spin_lock() [include/linux/spinlock.h] + * -> __spin_lock() [kernel/spinlock.c] + * -> _raw_spin_lock() [lib/spinlock_debug.c] + * -> __raw_spin_lock() [include/asm/spinlock.h] + * + * spin_lock(spinlock_t) translates into the following chain of + * calls/inlines/macros: + * + * spin_lock() [include/linux/spinlock.h] + * -> _spin_lock() [include/linux/spinlock.h] + * -> rt_spin_lock() [kernel/rtmutex.c] + * -> rt_spin_lock_fastlock() [kernel/rtmutex.c] + * -> rt_spin_lock_slowlock() [kernel/rtmutex.c] */ #include <linux/preempt.h> @@ -51,29 +87,14 @@ #include <linux/compiler.h> #include <linux/thread_info.h> #include <linux/kernel.h> +#include <linux/cache.h> #include <linux/stringify.h> #include <linux/bottom_half.h> +#include <linux/irqflags.h> #include <asm/system.h> /* - * Must define these before including other files, inline functions need them - */ -#define LOCK_SECTION_NAME ".text.lock."KBUILD_BASENAME - -#define LOCK_SECTION_START(extra) \ - ".subsection 1\n\t" \ - extra \ - ".ifndef " LOCK_SECTION_NAME "\n\t" \ - LOCK_SECTION_NAME ":\n\t" \ - ".endif\n" - -#define LOCK_SECTION_END \ - ".previous\n\t" - -#define __lockfunc fastcall __attribute__((section(".spinlock.text"))) - -/* * Pull the raw_spinlock_t and raw_rwlock_t definitions: */ #include <linux/spinlock_types.h> @@ -89,42 +110,10 @@ extern int __lockfunc generic__raw_read_ # include <linux/spinlock_up.h> #endif -#ifdef CONFIG_DEBUG_SPINLOCK - extern void __spin_lock_init(spinlock_t *lock, const char *name, - struct lock_class_key *key); -# define spin_lock_init(lock) \ -do { \ - static struct lock_class_key __key; \ - \ - __spin_lock_init((lock), #lock, &__key); \ -} while (0) - -#else -# define spin_lock_init(lock) \ - do { *(lock) = SPIN_LOCK_UNLOCKED; } while (0) -#endif - -#ifdef CONFIG_DEBUG_SPINLOCK - extern void __rwlock_init(rwlock_t *lock, const char *name, - struct lock_class_key *key); -# define rwlock_init(lock) \ -do { \ - static struct lock_class_key __key; \ - \ - __rwlock_init((lock), #lock, &__key); \ -} while (0) -#else -# define rwlock_init(lock) \ - do { *(lock) = RW_LOCK_UNLOCKED; } while (0) -#endif - -#define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) - -/** - * spin_unlock_wait - wait until the spinlock gets unlocked - * @lock: the spinlock in question. 
+/* + * Pull the RT types: */ -#define spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock) +#include <linux/rt_lock.h> /* * Pull the _spin_*()/_read_*()/_write_*() functions/declarations: @@ -136,16 +125,16 @@ do { \ #endif #ifdef CONFIG_DEBUG_SPINLOCK - extern void _raw_spin_lock(spinlock_t *lock); -#define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock) - extern int _raw_spin_trylock(spinlock_t *lock); - extern void _raw_spin_unlock(spinlock_t *lock); - extern void _raw_read_lock(rwlock_t *lock); - extern int _raw_read_trylock(rwlock_t *lock); - extern void _raw_read_unlock(rwlock_t *lock); - extern void _raw_write_lock(rwlock_t *lock); - extern int _raw_write_trylock(rwlock_t *lock); - extern void _raw_write_unlock(rwlock_t *lock); + extern __lockfunc void _raw_spin_lock(raw_spinlock_t *lock); +# define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock) + extern __lockfunc int _raw_spin_trylock(raw_spinlock_t *lock); + extern __lockfunc void _raw_spin_unlock(raw_spinlock_t *lock); + extern __lockfunc void _raw_read_lock(raw_rwlock_t *lock); + extern __lockfunc int _raw_read_trylock(raw_rwlock_t *lock); + extern __lockfunc void _raw_read_unlock(raw_rwlock_t *lock); + extern __lockfunc void _raw_write_lock(raw_rwlock_t *lock); + extern __lockfunc int _raw_write_trylock(raw_rwlock_t *lock); + extern __lockfunc void _raw_write_unlock(raw_rwlock_t *lock); #else # define _raw_spin_lock(lock) __raw_spin_lock(&(lock)->raw_lock) # define _raw_spin_lock_flags(lock, flags) \ @@ -160,148 +149,590 @@ do { \ # define _raw_write_unlock(rwlock) __raw_write_unlock(&(rwlock)->raw_lock) #endif -#define read_can_lock(rwlock) __raw_read_can_lock(&(rwlock)->raw_lock) -#define write_can_lock(rwlock) __raw_write_can_lock(&(rwlock)->raw_lock) +extern int __bad_spinlock_type(void); +extern int __bad_rwlock_type(void); + +extern void +__rt_spin_lock_init(spinlock_t *lock, char *name, struct lock_class_key *key); + +extern void __lockfunc rt_spin_lock(spinlock_t *lock); +extern void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass); +extern void __lockfunc rt_spin_unlock(spinlock_t *lock); +extern void __lockfunc rt_spin_unlock_wait(spinlock_t *lock); +extern int __lockfunc +rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags); +extern int __lockfunc rt_spin_trylock(spinlock_t *lock); +extern int _atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock); + +/* + * lockdep-less calls, for derived types like rwlock: + * (for trylock they can use rt_mutex_trylock() directly. 
+ */ +extern void __lockfunc __rt_spin_lock(struct rt_mutex *lock); +extern void __lockfunc __rt_spin_unlock(struct rt_mutex *lock); + +#ifdef CONFIG_PREEMPT_RT +# define _spin_lock(l) rt_spin_lock(l) +# define _spin_lock_nested(l, s) rt_spin_lock_nested(l, s) +# define _spin_lock_bh(l) rt_spin_lock(l) +# define _spin_lock_irq(l) rt_spin_lock(l) +# define _spin_unlock(l) rt_spin_unlock(l) +# define _spin_unlock_no_resched(l) rt_spin_unlock(l) +# define _spin_unlock_bh(l) rt_spin_unlock(l) +# define _spin_unlock_irq(l) rt_spin_unlock(l) +# define _spin_unlock_irqrestore(l, f) rt_spin_unlock(l) +static inline unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +{ + rt_spin_lock(lock); + return 0; +} +static inline unsigned long __lockfunc +_spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +{ + rt_spin_lock_nested(lock, subclass); + return 0; +} +#else +static inline unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +{ + return 0; +} +static inline unsigned long __lockfunc +_spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +{ + return 0; +} +# define _spin_lock(l) do { } while (0) +# define _spin_lock_nested(l, s) do { } while (0) +# define _spin_lock_bh(l) do { } while (0) +# define _spin_lock_irq(l) do { } while (0) +# define _spin_unlock(l) do { } while (0) +# define _spin_unlock_no_resched(l) do { } while (0) +# define _spin_unlock_bh(l) do { } while (0) +# define _spin_unlock_irq(l) do { } while (0) +# define _spin_unlock_irqrestore(l, f) do { } while (0) +#endif + +#define _spin_lock_init(sl, n, f, l) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_spin_lock_init(sl, n, &__key); \ +} while (0) + +# ifdef CONFIG_PREEMPT_RT +# define _spin_can_lock(l) (!rt_mutex_is_locked(&(l)->lock)) +# define _spin_is_locked(l) rt_mutex_is_locked(&(l)->lock) +# define _spin_unlock_wait(l) rt_spin_unlock_wait(l) + +# define _spin_trylock(l) rt_spin_trylock(l) +# define _spin_trylock_bh(l) rt_spin_trylock(l) +# define _spin_trylock_irq(l) rt_spin_trylock(l) +# define _spin_trylock_irqsave(l,f) rt_spin_trylock_irqsave(l, f) +# else + + extern int this_should_never_be_called_on_non_rt(spinlock_t *lock); +# define TSNBCONRT(l) this_should_never_be_called_on_non_rt(l) +# define _spin_can_lock(l) TSNBCONRT(l) +# define _spin_is_locked(l) TSNBCONRT(l) +# define _spin_unlock_wait(l) TSNBCONRT(l) + +# define _spin_trylock(l) TSNBCONRT(l) +# define _spin_trylock_bh(l) TSNBCONRT(l) +# define _spin_trylock_irq(l) TSNBCONRT(l) +# define _spin_trylock_irqsave(l,f) TSNBCONRT(l) +#endif + +#undef TYPE_EQUAL +#define TYPE_EQUAL(lock, type) \ + __builtin_types_compatible_p(typeof(lock), type *) + +#define PICK_OP(op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + __spin##op((raw_spinlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + _spin##op((spinlock_t *)(lock)); \ + else __bad_spinlock_type(); \ +} while (0) + +#define PICK_OP_RET(op, lock...) 
\ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + __ret = __spin##op((raw_spinlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + __ret = _spin##op((spinlock_t *)(lock)); \ + else __ret = __bad_spinlock_type(); \ + \ + __ret; \ +}) + +#define PICK_OP2(op, lock, flags) \ +do { \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + __spin##op((raw_spinlock_t *)(lock), flags); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + _spin##op((spinlock_t *)(lock), flags); \ + else __bad_spinlock_type(); \ +} while (0) + +#define PICK_OP2_RET(op, lock, flags) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + __ret = __spin##op((raw_spinlock_t *)(lock), flags); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + __ret = _spin##op((spinlock_t *)(lock), flags); \ + else __bad_spinlock_type(); \ + \ + __ret; \ +}) + +extern void __lockfunc rt_write_lock(rwlock_t *rwlock); +extern void __lockfunc rt_read_lock(rwlock_t *rwlock); +extern int __lockfunc rt_write_trylock(rwlock_t *rwlock); +extern int __lockfunc rt_read_trylock(rwlock_t *rwlock); +extern void __lockfunc rt_write_unlock(rwlock_t *rwlock); +extern void __lockfunc rt_read_unlock(rwlock_t *rwlock); +extern unsigned long __lockfunc rt_write_lock_irqsave(rwlock_t *rwlock); +extern unsigned long __lockfunc rt_read_lock_irqsave(rwlock_t *rwlock); +extern void +__rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key); + +#define _rwlock_init(rwl, n, f, l) \ +do { \ + static struct lock_class_key __key; \ + \ + __rt_rwlock_init(rwl, n, &__key); \ +} while (0) + +#ifdef CONFIG_PREEMPT_RT +# define rt_read_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) +# define rt_write_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) +#else + extern int rt_rwlock_can_lock_never_call_on_non_rt(rwlock_t *rwlock); +# define rt_read_can_lock(rwl) rt_rwlock_can_lock_never_call_on_non_rt(rwl) +# define rt_write_can_lock(rwl) rt_rwlock_can_lock_never_call_on_non_rt(rwl) +#endif + +# define _read_can_lock(rwl) rt_read_can_lock(rwl) +# define _write_can_lock(rwl) rt_write_can_lock(rwl) + +# define _read_trylock(rwl) rt_read_trylock(rwl) +# define _write_trylock(rwl) rt_write_trylock(rwl) +# define _write_trylock_irqsave(rwl, flags) \ + rt_write_trylock_irqsave(rwl, flags) + +# define _read_lock(rwl) rt_read_lock(rwl) +# define _write_lock(rwl) rt_write_lock(rwl) +# define _read_unlock(rwl) rt_read_unlock(rwl) +# define _write_unlock(rwl) rt_write_unlock(rwl) + +# define _read_lock_bh(rwl) rt_read_lock(rwl) +# define _write_lock_bh(rwl) rt_write_lock(rwl) +# define _read_unlock_bh(rwl) rt_read_unlock(rwl) +# define _write_unlock_bh(rwl) rt_write_unlock(rwl) + +# define _read_lock_irq(rwl) rt_read_lock(rwl) +# define _write_lock_irq(rwl) rt_write_lock(rwl) +# define _read_unlock_irq(rwl) rt_read_unlock(rwl) +# define _write_unlock_irq(rwl) rt_write_unlock(rwl) + +# define _read_lock_irqsave(rwl) rt_read_lock_irqsave(rwl) +# define _write_lock_irqsave(rwl) rt_write_lock_irqsave(rwl) + +# define _read_unlock_irqrestore(rwl, f) rt_read_unlock(rwl) +# define _write_unlock_irqrestore(rwl, f) rt_write_unlock(rwl) + +#define __PICK_RW_OP(optype, op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __##optype##op((raw_rwlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + ##op((rwlock_t *)(lock)); \ + else __bad_rwlock_type(); \ +} while (0) + +#define PICK_RW_OP(optype, op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __##optype##op((raw_rwlock_t *)(lock)); \ + else 
if (TYPE_EQUAL(lock, rwlock_t)) \ + _##optype##op((rwlock_t *)(lock)); \ + else __bad_rwlock_type(); \ +} while (0) + +#define __PICK_RW_OP_RET(optype, op, lock...) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __ret = __##optype##op((raw_rwlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + __ret = _##optype##op((rwlock_t *)(lock)); \ + else __ret = __bad_rwlock_type(); \ + \ + __ret; \ +}) + +#define PICK_RW_OP_RET(optype, op, lock...) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __ret = __##optype##op((raw_rwlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + __ret = _##optype##op((rwlock_t *)(lock)); \ + else __ret = __bad_rwlock_type(); \ + \ + __ret; \ +}) + +#define PICK_RW_OP2(optype, op, lock, flags) \ +do { \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __##optype##op((raw_rwlock_t *)(lock), flags); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + _##optype##op((rwlock_t *)(lock), flags); \ + else __bad_rwlock_type(); \ +} while (0) + +#define PICK_RW_OP2_RET(optype, op, lock, flags) \ +({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + __ret = __##optype##op((raw_rwlock_t *)(lock), flags); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + __ret = _##optype##op((rwlock_t *)(lock), flags); \ + else __bad_rwlock_type(); \ + \ + __ret; \ +}) + +#ifdef CONFIG_DEBUG_SPINLOCK + extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, + struct lock_class_key *key); +# define _raw_spin_lock_init(lock) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_spin_lock_init((lock), #lock, &__key); \ +} while (0) + +#else +#define __raw_spin_lock_init(lock) \ + do { *(lock) = RAW_SPIN_LOCK_UNLOCKED(lock); } while (0) +# define _raw_spin_lock_init(lock) __raw_spin_lock_init(lock) +#endif + +#define PICK_OP_INIT(op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_spinlock_t)) \ + _raw_spin##op((raw_spinlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + _spin##op((spinlock_t *)(lock), #lock, __FILE__, __LINE__); \ + else __bad_spinlock_type(); \ +} while (0) + + +#define spin_lock_init(lock) PICK_OP_INIT(_lock_init, lock) + +#ifdef CONFIG_DEBUG_SPINLOCK + extern void __raw_rwlock_init(raw_rwlock_t *lock, const char *name, + struct lock_class_key *key); +# define _raw_rwlock_init(lock) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_rwlock_init((lock), #lock, &__key); \ +} while (0) +#else +#define __raw_rwlock_init(lock) \ + do { *(lock) = RAW_RW_LOCK_UNLOCKED(lock); } while (0) +# define _raw_rwlock_init(lock) __raw_rwlock_init(lock) +#endif + +#define __PICK_RW_OP_INIT(optype, op, lock) \ +do { \ + if (TYPE_EQUAL((lock), raw_rwlock_t)) \ + _raw_##optype##op((raw_rwlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, rwlock_t)) \ + _##optype##op((rwlock_t *)(lock), #lock, __FILE__, __LINE__);\ + else __bad_spinlock_type(); \ +} while (0) + +#define rwlock_init(lock) __PICK_RW_OP_INIT(rwlock, _init, lock) + +#define __spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) + +#define spin_is_locked(lock) PICK_OP_RET(_is_locked, lock) + +#define __spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock) +#define spin_unlock_wait(lock) PICK_OP(_unlock_wait, lock) /* * Define the various spin_lock and rw_lock methods. Note we define these * regardless of whether CONFIG_SMP or CONFIG_PREEMPT are set. The various * methods are defined as nops in the case they are not required. 
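
The net effect of the PICK_OP layer defined above is that spin_lock() and friends keep a single spelling while the compiler routes each call by the static type of the lock. A kernel-style sketch with made-up names; note that on PREEMPT_RT the spinlock_t variant is a sleeping lock, so spin_lock_irqsave() on it does not disable hardware interrupts (the RT _spin_lock_irqsave() above just takes the rt_mutex and returns 0), whereas a raw_spinlock_t keeps the traditional spinning, IRQ-disabling behaviour:

static DEFINE_SPINLOCK(example_sleep_lock);	/* spinlock_t: rt_mutex-based on PREEMPT_RT */
static raw_spinlock_t example_raw_lock =
	RAW_SPIN_LOCK_UNLOCKED(example_raw_lock);	/* always a spinning lock */

static void example_locking(void)
{
	unsigned long flags;

	spin_lock(&example_sleep_lock);		/* PICK_OP -> _spin_lock() -> rt_spin_lock() on RT */
	spin_unlock(&example_sleep_lock);

	spin_lock_irqsave(&example_raw_lock, flags);	/* PICK_OP_RET -> __spin_lock_irqsave() */
	spin_unlock_irqrestore(&example_raw_lock, flags);
}
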
*/ -#define spin_trylock(lock) __cond_lock(lock, _spin_trylock(lock)) -#define read_trylock(lock) __cond_lock(lock, _read_trylock(lock)) -#define write_trylock(lock) __cond_lock(lock, _write_trylock(lock)) +// #define spin_trylock(lock) _spin_trylock(lock) +#define spin_trylock(lock) __cond_lock(lock, PICK_OP_RET(_trylock, lock)) + +//#define read_trylock(lock) _read_trylock(lock) +#define read_trylock(lock) __cond_lock(lock, PICK_RW_OP_RET(read, _trylock, lock)) + +//#define write_trylock(lock) _write_trylock(lock) +#define write_trylock(lock) __cond_lock(lock, PICK_RW_OP_RET(write, _trylock, lock)) + +#define write_trylock_irqsave(lock, flags) \ + __cond_lock(lock, PICK_RW_OP2_RET(write, _trylock_irqsave, lock, &flags)) + +#define __spin_can_lock(lock) __raw_spin_can_lock(&(lock)->raw_lock) +#define __read_can_lock(lock) __raw_read_can_lock(&(lock)->raw_lock) +#define __write_can_lock(lock) __raw_write_can_lock(&(lock)->raw_lock) + +#define spin_can_lock(lock) \ + __cond_lock(lock, PICK_OP_RET(_can_lock, lock)) -#define spin_lock(lock) _spin_lock(lock) +#define read_can_lock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(read, _can_lock, lock)) + +#define write_can_lock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(write, _can_lock, lock)) + +// #define spin_lock(lock) _spin_lock(lock) +#define spin_lock(lock) PICK_OP(_lock, lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define spin_lock_nested(lock, subclass) _spin_lock_nested(lock, subclass) +# define spin_lock_nested(lock, subclass) PICK_OP2(_lock_nested, lock, subclass) #else -# define spin_lock_nested(lock, subclass) _spin_lock(lock) +# define spin_lock_nested(lock, subclass) spin_lock(lock) #endif -#define write_lock(lock) _write_lock(lock) -#define read_lock(lock) _read_lock(lock) +//#define write_lock(lock) _write_lock(lock) +#define write_lock(lock) PICK_RW_OP(write, _lock, lock) -#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) +// #define read_lock(lock) _read_lock(lock) +#define read_lock(lock) PICK_RW_OP(read, _lock, lock) -#define spin_lock_irqsave(lock, flags) flags = _spin_lock_irqsave(lock) -#define read_lock_irqsave(lock, flags) flags = _read_lock_irqsave(lock) -#define write_lock_irqsave(lock, flags) flags = _write_lock_irqsave(lock) +# define spin_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_OP_RET(_lock_irqsave, lock); \ +} while (0) #ifdef CONFIG_DEBUG_LOCK_ALLOC -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - flags = _spin_lock_irqsave_nested(lock, subclass) +# define spin_lock_irqsave_nested(lock, flags, subclass) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_OP2_RET(_lock_irqsave_nested, lock, subclass); \ +} while (0) #else -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - flags = _spin_lock_irqsave(lock) +# define spin_lock_irqsave_nested(lock, flags, subclass) \ + spin_lock_irqsave(lock, flags) #endif -#else +# define read_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_RW_OP_RET(read, _lock_irqsave, lock); \ +} while (0) -#define spin_lock_irqsave(lock, flags) _spin_lock_irqsave(lock, flags) -#define read_lock_irqsave(lock, flags) _read_lock_irqsave(lock, flags) -#define write_lock_irqsave(lock, flags) _write_lock_irqsave(lock, flags) -#define spin_lock_irqsave_nested(lock, flags, subclass) \ - spin_lock_irqsave(lock, flags) +# define write_lock_irqsave(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + flags = PICK_RW_OP_RET(write, _lock_irqsave, lock); \ +} while (0) -#endif +// #define 
spin_lock_irq(lock) _spin_lock_irq(lock) +// #define spin_lock_bh(lock) _spin_lock_bh(lock) +#define spin_lock_irq(lock) PICK_OP(_lock_irq, lock) +#define spin_lock_bh(lock) PICK_OP(_lock_bh, lock) + +// #define read_lock_irq(lock) _read_lock_irq(lock) +// #define read_lock_bh(lock) _read_lock_bh(lock) +#define read_lock_irq(lock) PICK_RW_OP(read, _lock_irq, lock) +#define read_lock_bh(lock) PICK_RW_OP(read, _lock_bh, lock) + +// #define write_lock_irq(lock) _write_lock_irq(lock) +// #define write_lock_bh(lock) _write_lock_bh(lock) +#define write_lock_irq(lock) PICK_RW_OP(write, _lock_irq, lock) +#define write_lock_bh(lock) PICK_RW_OP(write, _lock_bh, lock) + +// #define spin_unlock(lock) _spin_unlock(lock) +// #define write_unlock(lock) _write_unlock(lock) +// #define read_unlock(lock) _read_unlock(lock) +#define spin_unlock(lock) PICK_OP(_unlock, lock) +#define read_unlock(lock) PICK_RW_OP(read, _unlock, lock) +#define write_unlock(lock) PICK_RW_OP(write, _unlock, lock) + +// #define spin_unlock(lock) _spin_unlock_no_resched(lock) +#define spin_unlock_no_resched(lock) \ + PICK_OP(_unlock_no_resched, lock) + +//#define spin_unlock_irqrestore(lock, flags) +// _spin_unlock_irqrestore(lock, flags) +//#define spin_unlock_irq(lock) _spin_unlock_irq(lock) +//#define spin_unlock_bh(lock) _spin_unlock_bh(lock) +#define spin_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_OP2(_unlock_irqrestore, lock, flags); \ +} while (0) -#define spin_lock_irq(lock) _spin_lock_irq(lock) -#define spin_lock_bh(lock) _spin_lock_bh(lock) +#define spin_unlock_irq(lock) PICK_OP(_unlock_irq, lock) +#define spin_unlock_bh(lock) PICK_OP(_unlock_bh, lock) -#define read_lock_irq(lock) _read_lock_irq(lock) -#define read_lock_bh(lock) _read_lock_bh(lock) +// #define read_unlock_irqrestore(lock, flags) +// _read_unlock_irqrestore(lock, flags) +// #define read_unlock_irq(lock) _read_unlock_irq(lock) +// #define read_unlock_bh(lock) _read_unlock_bh(lock) +#define read_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP2(read, _unlock_irqrestore, lock, flags); \ +} while (0) + +#define read_unlock_irq(lock) PICK_RW_OP(read, _unlock_irq, lock) +#define read_unlock_bh(lock) PICK_RW_OP(read, _unlock_bh, lock) + +// #define write_unlock_irqrestore(lock, flags) +// _write_unlock_irqrestore(lock, flags) +// #define write_unlock_irq(lock) _write_unlock_irq(lock) +// #define write_unlock_bh(lock) _write_unlock_bh(lock) +#define write_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP2(write, _unlock_irqrestore, lock, flags); \ +} while (0) +#define write_unlock_irq(lock) PICK_RW_OP(write, _unlock_irq, lock) +#define write_unlock_bh(lock) PICK_RW_OP(write, _unlock_bh, lock) + +// #define spin_trylock_bh(lock) _spin_trylock_bh(lock) +#define spin_trylock_bh(lock) __cond_lock(lock, PICK_OP_RET(_trylock_bh, lock)) + +// #define spin_trylock_irq(lock) + +#define spin_trylock_irq(lock) __cond_lock(lock, PICK_OP_RET(_trylock_irq, lock)) + +// #define spin_trylock_irqsave(lock, flags) + +#define spin_trylock_irqsave(lock, flags) \ + __cond_lock(lock, PICK_OP2_RET(_trylock_irqsave, lock, &flags)) + +/* "lock on reference count zero" */ +#ifndef ATOMIC_DEC_AND_LOCK +# include <asm/atomic.h> + extern int __atomic_dec_and_spin_lock(atomic_t *atomic, raw_spinlock_t *lock); +#endif + +#define atomic_dec_and_lock(atomic, lock) \ +__cond_lock(lock, ({ \ + unsigned long __ret; \ + \ + if (TYPE_EQUAL(lock, raw_spinlock_t)) \ + __ret = 
__atomic_dec_and_spin_lock(atomic, \ + (raw_spinlock_t *)(lock)); \ + else if (TYPE_EQUAL(lock, spinlock_t)) \ + __ret = _atomic_dec_and_spin_lock(atomic, \ + (spinlock_t *)(lock)); \ + else __ret = __bad_spinlock_type(); \ + \ + __ret; \ +})) -#define write_lock_irq(lock) _write_lock_irq(lock) -#define write_lock_bh(lock) _write_lock_bh(lock) /* - * We inline the unlock functions in the nondebug case: + * bit-based spin_lock() + * + * Don't use this unless you really need to: spin_lock() and spin_unlock() + * are significantly faster. */ -#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \ - !defined(CONFIG_SMP) -# define spin_unlock(lock) _spin_unlock(lock) -# define read_unlock(lock) _read_unlock(lock) -# define write_unlock(lock) _write_unlock(lock) -# define spin_unlock_irq(lock) _spin_unlock_irq(lock) -# define read_unlock_irq(lock) _read_unlock_irq(lock) -# define write_unlock_irq(lock) _write_unlock_irq(lock) -#else -# define spin_unlock(lock) \ - do {__raw_spin_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define read_unlock(lock) \ - do {__raw_read_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define write_unlock(lock) \ - do {__raw_write_unlock(&(lock)->raw_lock); __release(lock); } while (0) -# define spin_unlock_irq(lock) \ -do { \ - __raw_spin_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) -# define read_unlock_irq(lock) \ -do { \ - __raw_read_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) -# define write_unlock_irq(lock) \ -do { \ - __raw_write_unlock(&(lock)->raw_lock); \ - __release(lock); \ - local_irq_enable(); \ -} while (0) +static inline void bit_spin_lock(int bitnum, unsigned long *addr) +{ + /* + * Assuming the lock is uncontended, this never enters + * the body of the outer loop. If it is contended, then + * within the inner loop a non-atomic test is used to + * busywait with less bus contention for a good time to + * attempt to acquire the lock bit. + */ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + while (test_and_set_bit(bitnum, addr)) + while (test_bit(bitnum, addr)) + cpu_relax(); #endif + __acquire(bitlock); +} -#define spin_unlock_irqrestore(lock, flags) \ - _spin_unlock_irqrestore(lock, flags) -#define spin_unlock_bh(lock) _spin_unlock_bh(lock) - -#define read_unlock_irqrestore(lock, flags) \ - _read_unlock_irqrestore(lock, flags) -#define read_unlock_bh(lock) _read_unlock_bh(lock) - -#define write_unlock_irqrestore(lock, flags) \ - _write_unlock_irqrestore(lock, flags) -#define write_unlock_bh(lock) _write_unlock_bh(lock) - -#define spin_trylock_bh(lock) __cond_lock(lock, _spin_trylock_bh(lock)) - -#define spin_trylock_irq(lock) \ -({ \ - local_irq_disable(); \ - spin_trylock(lock) ? \ - 1 : ({ local_irq_enable(); 0; }); \ -}) +/* + * Return true if it was acquired + */ +static inline int bit_spin_trylock(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + if (test_and_set_bit(bitnum, addr)) + return 0; +#endif + __acquire(bitlock); + return 1; +} -#define spin_trylock_irqsave(lock, flags) \ -({ \ - local_irq_save(flags); \ - spin_trylock(lock) ? 
\ - 1 : ({ local_irq_restore(flags); 0; }); \ -}) +/* + * bit-based spin_unlock() + */ +static inline void bit_spin_unlock(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + BUG_ON(!test_bit(bitnum, addr)); + smp_mb__before_clear_bit(); + clear_bit(bitnum, addr); +#endif + __release(bitlock); +} -#define write_trylock_irqsave(lock, flags) \ -({ \ - local_irq_save(flags); \ - write_trylock(lock) ? \ - 1 : ({ local_irq_restore(flags); 0; }); \ -}) +/* + * Return true if the lock is held. + */ +static inline int bit_spin_is_locked(int bitnum, unsigned long *addr) +{ +#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) + return test_bit(bitnum, addr); +#else + return 1; +#endif +} -#define write_trylock_irqsave(lock, flags) \ -({ \ - local_irq_save(flags); \ - write_trylock(lock) ? \ - 1 : ({ local_irq_restore(flags); 0; }); \ -}) +/** + * __raw_spin_can_lock - would __raw_spin_trylock() succeed? + * @lock: the spinlock in question. + */ +#define __raw_spin_can_lock(lock) (!__raw_spin_is_locked(lock)) /* * Locks two spinlocks l1 and l2. * l1_first indicates if spinlock l1 should be taken first. */ -static inline void double_spin_lock(spinlock_t *l1, spinlock_t *l2, - bool l1_first) +static inline void +raw_double_spin_lock(raw_spinlock_t *l1, raw_spinlock_t *l2, bool l1_first) + __acquires(l1) + __acquires(l2) +{ + if (l1_first) { + spin_lock(l1); + spin_lock(l2); + } else { + spin_lock(l2); + spin_lock(l1); + } +} + +static inline void +double_spin_lock(spinlock_t *l1, spinlock_t *l2, bool l1_first) __acquires(l1) __acquires(l2) { @@ -314,13 +745,15 @@ static inline void double_spin_lock(spin } } + /* * Unlocks two spinlocks l1 and l2. * l1_taken_first indicates if spinlock l1 was taken first and therefore * should be released after spinlock l2. */ -static inline void double_spin_unlock(spinlock_t *l1, spinlock_t *l2, - bool l1_taken_first) +static inline void +raw_double_spin_unlock(raw_spinlock_t *l1, raw_spinlock_t *l2, + bool l1_taken_first) __releases(l1) __releases(l2) { @@ -333,24 +766,19 @@ static inline void double_spin_unlock(sp } } -/* - * Pull the atomic_t declaration: - * (asm-mips/atomic.h needs above definitions) - */ -#include <asm/atomic.h> -/** - * atomic_dec_and_lock - lock on reaching reference count zero - * @atomic: the atomic counter - * @lock: the spinlock in question - */ -extern int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); -#define atomic_dec_and_lock(atomic, lock) \ - __cond_lock(lock, _atomic_dec_and_lock(atomic, lock)) - -/** - * spin_can_lock - would spin_trylock() succeed? - * @lock: the spinlock in question. 
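
bit_spin_lock() above packs a lock into a single bit of an existing word; as its header comment says, a real spin_lock()/spin_unlock() is significantly faster, so this is only for places where a full lock will not fit. A compilable userspace model of the same two-level spin (outer atomic test-and-set, inner plain reads to back off) using C11 atomics; the helpers here are simplified stand-ins for the kernel's test_and_set_bit()/test_bit()/clear_bit(), which additionally imply the required memory barriers (cf. smp_mb__before_clear_bit() above) and the __acquire()/__release() sparse annotations:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long word;	/* stand-in for the word holding the lock bit */

static int test_and_set_bit(int nr, _Atomic unsigned long *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static int test_bit(int nr, _Atomic unsigned long *addr)
{
	return (atomic_load(addr) >> nr) & 1;
}

static void clear_bit(int nr, _Atomic unsigned long *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

static void bit_spin_lock(int bitnum, _Atomic unsigned long *addr)
{
	/* outer atomic attempt; inner read-only spin, as in the kernel version */
	while (test_and_set_bit(bitnum, addr))
		while (test_bit(bitnum, addr))
			;	/* cpu_relax() in the kernel */
}

static void bit_spin_unlock(int bitnum, _Atomic unsigned long *addr)
{
	clear_bit(bitnum, addr);
}

int main(void)
{
	bit_spin_lock(3, &word);
	printf("locked: %d\n", test_bit(3, &word));	/* 1 */
	bit_spin_unlock(3, &word);
	printf("locked: %d\n", test_bit(3, &word));	/* 0 */
	return 0;
}
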
- */ -#define spin_can_lock(lock) (!spin_is_locked(lock)) +static inline void +double_spin_unlock(spinlock_t *l1, spinlock_t *l2, bool l1_taken_first) + __releases(l1) + __releases(l2) +{ + if (l1_taken_first) { + spin_unlock(l2); + spin_unlock(l1); + } else { + spin_unlock(l1); + spin_unlock(l2); + } +} #endif /* __LINUX_SPINLOCK_H */ + Index: linux-2.6.24.7/include/linux/spinlock_api_smp.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock_api_smp.h +++ linux-2.6.24.7/include/linux/spinlock_api_smp.h @@ -19,43 +19,58 @@ int in_lock_functions(unsigned long addr #define assert_spin_locked(x) BUG_ON(!spin_is_locked(x)) -void __lockfunc _spin_lock(spinlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass) - __acquires(lock); -void __lockfunc _read_lock(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock(rwlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_bh(spinlock_t *lock) __acquires(lock); -void __lockfunc _read_lock_bh(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock_bh(rwlock_t *lock) __acquires(lock); -void __lockfunc _spin_lock_irq(spinlock_t *lock) __acquires(lock); -void __lockfunc _read_lock_irq(rwlock_t *lock) __acquires(lock); -void __lockfunc _write_lock_irq(rwlock_t *lock) __acquires(lock); -unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) - __acquires(lock); -unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass) - __acquires(lock); -unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock) - __acquires(lock); -unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock) - __acquires(lock); -int __lockfunc _spin_trylock(spinlock_t *lock); -int __lockfunc _read_trylock(rwlock_t *lock); -int __lockfunc _write_trylock(rwlock_t *lock); -int __lockfunc _spin_trylock_bh(spinlock_t *lock); -void __lockfunc _spin_unlock(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_bh(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock_bh(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock_bh(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_irq(spinlock_t *lock) __releases(lock); -void __lockfunc _read_unlock_irq(rwlock_t *lock) __releases(lock); -void __lockfunc _write_unlock_irq(rwlock_t *lock) __releases(lock); -void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) - __releases(lock); -void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags) - __releases(lock); -void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags) - __releases(lock); +#define ACQUIRE_SPIN __acquires(lock) +#define ACQUIRE_RW __acquires(lock) +#define RELEASE_SPIN __releases(lock) +#define RELEASE_RW __releases(lock) + +void __lockfunc __spin_lock(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc __spin_lock_nested(raw_spinlock_t *lock, int subclass) + ACQUIRE_SPIN; +void __lockfunc __read_lock(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __spin_lock_bh(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc __read_lock_bh(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock_bh(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __spin_lock_irq(raw_spinlock_t *lock) ACQUIRE_SPIN; +void __lockfunc 
__read_lock_irq(raw_rwlock_t *lock) ACQUIRE_RW; +void __lockfunc __write_lock_irq(raw_rwlock_t *lock) ACQUIRE_RW; +unsigned long __lockfunc __spin_lock_irqsave(raw_spinlock_t *lock) + ACQUIRE_SPIN; +unsigned long __lockfunc +__spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) ACQUIRE_SPIN; +unsigned long __lockfunc __read_lock_irqsave(raw_rwlock_t *lock) + ACQUIRE_RW; +unsigned long __lockfunc __write_lock_irqsave(raw_rwlock_t *lock) + ACQUIRE_RW; +int __lockfunc __spin_trylock(raw_spinlock_t *lock); +int __lockfunc +__spin_trylock_irqsave(raw_spinlock_t *lock, unsigned long *flags); +int __lockfunc __read_trylock(raw_rwlock_t *lock); +int __lockfunc __write_trylock(raw_rwlock_t *lock); +int __lockfunc +__write_trylock_irqsave(raw_rwlock_t *lock, unsigned long *flags); +int __lockfunc __spin_trylock_bh(raw_spinlock_t *lock); +int __lockfunc __spin_trylock_irq(raw_spinlock_t *lock); +void __lockfunc __spin_unlock(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __spin_unlock_no_resched(raw_spinlock_t *lock) + RELEASE_SPIN; +void __lockfunc __read_unlock(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __spin_unlock_bh(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __read_unlock_bh(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock_bh(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __spin_unlock_irq(raw_spinlock_t *lock) RELEASE_SPIN; +void __lockfunc __read_unlock_irq(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc __write_unlock_irq(raw_rwlock_t *lock) RELEASE_RW; +void __lockfunc +__spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags) + RELEASE_SPIN; +void __lockfunc +__read_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) + RELEASE_RW; +void +__lockfunc __write_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) + RELEASE_RW; #endif /* __LINUX_SPINLOCK_API_SMP_H */ Index: linux-2.6.24.7/include/linux/spinlock_api_up.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock_api_up.h +++ linux-2.6.24.7/include/linux/spinlock_api_up.h @@ -33,12 +33,22 @@ #define __LOCK_IRQ(lock) \ do { local_irq_disable(); __LOCK(lock); } while (0) -#define __LOCK_IRQSAVE(lock, flags) \ - do { local_irq_save(flags); __LOCK(lock); } while (0) +#define __LOCK_IRQSAVE(lock) \ + ({ unsigned long __flags; local_irq_save(__flags); __LOCK(lock); __flags; }) + +#define __TRYLOCK_IRQSAVE(lock, flags) \ + ({ local_irq_save(*(flags)); __LOCK(lock); 1; }) + +#define __spin_trylock_irqsave(lock, flags) __TRYLOCK_IRQSAVE(lock, flags) + +#define __write_trylock_irqsave(lock, flags) __TRYLOCK_IRQSAVE(lock, flags) #define __UNLOCK(lock) \ do { preempt_enable(); __release(lock); (void)(lock); } while (0) +#define __UNLOCK_NO_RESCHED(lock) \ + do { __preempt_enable_no_resched(); __release(lock); (void)(lock); } while (0) + #define __UNLOCK_BH(lock) \ do { preempt_enable_no_resched(); local_bh_enable(); __release(lock); (void)(lock); } while (0) @@ -48,34 +58,36 @@ #define __UNLOCK_IRQRESTORE(lock, flags) \ do { local_irq_restore(flags); __UNLOCK(lock); } while (0) -#define _spin_lock(lock) __LOCK(lock) -#define _spin_lock_nested(lock, subclass) __LOCK(lock) -#define _read_lock(lock) __LOCK(lock) -#define _write_lock(lock) __LOCK(lock) -#define _spin_lock_bh(lock) __LOCK_BH(lock) -#define _read_lock_bh(lock) __LOCK_BH(lock) -#define _write_lock_bh(lock) __LOCK_BH(lock) -#define _spin_lock_irq(lock) __LOCK_IRQ(lock) -#define 
_read_lock_irq(lock) __LOCK_IRQ(lock) -#define _write_lock_irq(lock) __LOCK_IRQ(lock) -#define _spin_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _read_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _write_lock_irqsave(lock, flags) __LOCK_IRQSAVE(lock, flags) -#define _spin_trylock(lock) ({ __LOCK(lock); 1; }) -#define _read_trylock(lock) ({ __LOCK(lock); 1; }) -#define _write_trylock(lock) ({ __LOCK(lock); 1; }) -#define _spin_trylock_bh(lock) ({ __LOCK_BH(lock); 1; }) -#define _spin_unlock(lock) __UNLOCK(lock) -#define _read_unlock(lock) __UNLOCK(lock) -#define _write_unlock(lock) __UNLOCK(lock) -#define _spin_unlock_bh(lock) __UNLOCK_BH(lock) -#define _write_unlock_bh(lock) __UNLOCK_BH(lock) -#define _read_unlock_bh(lock) __UNLOCK_BH(lock) -#define _spin_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _read_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _write_unlock_irq(lock) __UNLOCK_IRQ(lock) -#define _spin_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) -#define _read_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) -#define _write_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __spin_lock(lock) __LOCK(lock) +#define __spin_lock_nested(lock, subclass) __LOCK(lock) +#define __read_lock(lock) __LOCK(lock) +#define __write_lock(lock) __LOCK(lock) +#define __spin_lock_bh(lock) __LOCK_BH(lock) +#define __read_lock_bh(lock) __LOCK_BH(lock) +#define __write_lock_bh(lock) __LOCK_BH(lock) +#define __spin_lock_irq(lock) __LOCK_IRQ(lock) +#define __read_lock_irq(lock) __LOCK_IRQ(lock) +#define __write_lock_irq(lock) __LOCK_IRQ(lock) +#define __spin_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __read_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __write_lock_irqsave(lock) __LOCK_IRQSAVE(lock) +#define __spin_trylock(lock) ({ __LOCK(lock); 1; }) +#define __read_trylock(lock) ({ __LOCK(lock); 1; }) +#define __write_trylock(lock) ({ __LOCK(lock); 1; }) +#define __spin_trylock_bh(lock) ({ __LOCK_BH(lock); 1; }) +#define __spin_trylock_irq(lock) ({ __LOCK_IRQ(lock); 1; }) +#define __spin_unlock(lock) __UNLOCK(lock) +#define __spin_unlock_no_resched(lock) __UNLOCK_NO_RESCHED(lock) +#define __read_unlock(lock) __UNLOCK(lock) +#define __write_unlock(lock) __UNLOCK(lock) +#define __spin_unlock_bh(lock) __UNLOCK_BH(lock) +#define __write_unlock_bh(lock) __UNLOCK_BH(lock) +#define __read_unlock_bh(lock) __UNLOCK_BH(lock) +#define __spin_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __read_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __write_unlock_irq(lock) __UNLOCK_IRQ(lock) +#define __spin_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __read_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) +#define __write_unlock_irqrestore(lock, flags) __UNLOCK_IRQRESTORE(lock, flags) #endif /* __LINUX_SPINLOCK_API_UP_H */ Index: linux-2.6.24.7/include/linux/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock_types.h +++ linux-2.6.24.7/include/linux/spinlock_types.h @@ -15,10 +15,27 @@ # include <linux/spinlock_types_up.h> #endif +/* + * Must define these before including other files, inline functions need them + */ +#define LOCK_SECTION_NAME ".text.lock."KBUILD_BASENAME + +#define LOCK_SECTION_START(extra) \ + ".subsection 1\n\t" \ + extra \ + ".ifndef " LOCK_SECTION_NAME "\n\t" \ + LOCK_SECTION_NAME ":\n\t" \ + ".endif\n" + +#define LOCK_SECTION_END \ + ".previous\n\t" + +#define __lockfunc fastcall 
__attribute__((section(".spinlock.text"))) + #include <linux/lockdep.h> typedef struct { - raw_spinlock_t raw_lock; + __raw_spinlock_t raw_lock; #if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) unsigned int break_lock; #endif @@ -29,12 +46,12 @@ typedef struct { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif -} spinlock_t; +} raw_spinlock_t; #define SPINLOCK_MAGIC 0xdead4ead typedef struct { - raw_rwlock_t raw_lock; + __raw_rwlock_t raw_lock; #if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) unsigned int break_lock; #endif @@ -45,7 +62,7 @@ typedef struct { #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif -} rwlock_t; +} raw_rwlock_t; #define RWLOCK_MAGIC 0xdeaf1eed @@ -64,24 +81,24 @@ typedef struct { #endif #ifdef CONFIG_DEBUG_SPINLOCK -# define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ +# define _RAW_SPIN_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ .magic = SPINLOCK_MAGIC, \ .owner = SPINLOCK_OWNER_INIT, \ .owner_cpu = -1, \ SPIN_DEP_MAP_INIT(lockname) } -#define __RW_LOCK_UNLOCKED(lockname) \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ +#define _RAW_RW_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ .magic = RWLOCK_MAGIC, \ .owner = SPINLOCK_OWNER_INIT, \ .owner_cpu = -1, \ RW_DEP_MAP_INIT(lockname) } #else -# define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ +# define _RAW_SPIN_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ SPIN_DEP_MAP_INIT(lockname) } -#define __RW_LOCK_UNLOCKED(lockname) \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ +# define _RAW_RW_LOCK_UNLOCKED(lockname) \ + { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ RW_DEP_MAP_INIT(lockname) } #endif @@ -91,10 +108,22 @@ typedef struct { * Please use DEFINE_SPINLOCK()/DEFINE_RWLOCK() or * __SPIN_LOCK_UNLOCKED()/__RW_LOCK_UNLOCKED() as appropriate. 
*/ -#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(old_style_spin_init) -#define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(old_style_rw_init) -#define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x) -#define DEFINE_RWLOCK(x) rwlock_t x = __RW_LOCK_UNLOCKED(x) +# define RAW_SPIN_LOCK_UNLOCKED(lockname) \ + (raw_spinlock_t) _RAW_SPIN_LOCK_UNLOCKED(lockname) + +# define RAW_RW_LOCK_UNLOCKED(lockname) \ + (raw_rwlock_t) _RAW_RW_LOCK_UNLOCKED(lockname) + +#define DEFINE_RAW_SPINLOCK(name) \ + raw_spinlock_t name __cacheline_aligned_in_smp = \ + RAW_SPIN_LOCK_UNLOCKED(name) + +#define __DEFINE_RAW_SPINLOCK(name) \ + raw_spinlock_t name = RAW_SPIN_LOCK_UNLOCKED(name) + +#define DEFINE_RAW_RWLOCK(name) \ + raw_rwlock_t name __cacheline_aligned_in_smp = \ + RAW_RW_LOCK_UNLOCKED(name) #endif /* __LINUX_SPINLOCK_TYPES_H */ Index: linux-2.6.24.7/include/linux/spinlock_types_up.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock_types_up.h +++ linux-2.6.24.7/include/linux/spinlock_types_up.h @@ -16,13 +16,13 @@ typedef struct { volatile unsigned int slock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 1 } #else -typedef struct { } raw_spinlock_t; +typedef struct { } __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { } @@ -30,7 +30,7 @@ typedef struct { } raw_spinlock_t; typedef struct { /* no debug version on UP */ -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { } Index: linux-2.6.24.7/include/linux/spinlock_up.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock_up.h +++ linux-2.6.24.7/include/linux/spinlock_up.h @@ -20,19 +20,19 @@ #ifdef CONFIG_DEBUG_SPINLOCK #define __raw_spin_is_locked(x) ((x)->slock == 0) -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline void __raw_spin_lock(__raw_spinlock_t *lock) { lock->slock = 0; } static inline void -__raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +__raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { local_irq_save(flags); lock->slock = 0; } -static inline int __raw_spin_trylock(raw_spinlock_t *lock) +static inline int __raw_spin_trylock(__raw_spinlock_t *lock) { char oldval = lock->slock; @@ -41,7 +41,7 @@ static inline int __raw_spin_trylock(raw return oldval > 0; } -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { lock->slock = 1; } Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -7,7 +7,7 @@ obj-y = sched.o fork.o exec_domain.o sysctl.o capability.o ptrace.o timer.o user.o user_namespace.o \ signal.o sys.o kmod.o workqueue.o pid.o \ rcupdate.o extable.o params.o posix-timers.o \ - kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \ + kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o \ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ utsname.o notifier.o @@ -26,7 +26,10 @@ endif obj-$(CONFIG_SYSCTL) += sysctl_check.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-y += time/ +ifneq ($(CONFIG_PREEMPT_RT),y) +obj-y += mutex.o obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o +endif obj-$(CONFIG_LOCKDEP) += lockdep.o ifeq ($(CONFIG_PROC_FS),y) obj-$(CONFIG_LOCKDEP) += lockdep_proc.o @@ -38,6 +41,7 @@ endif obj-$(CONFIG_RT_MUTEXES) += rtmutex.o obj-$(CONFIG_DEBUG_RT_MUTEXES) += rtmutex-debug.o 
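/*
 * Editor's note -- illustrative sketch, not part of the patch.  With the
 * type split in the spinlock_types.h hunk above, DEFINE_RAW_SPINLOCK()
 * declares a lock that always busy-waits (the old behaviour, needed by
 * code that runs with interrupts hard-disabled), while spinlock_t /
 * DEFINE_SPINLOCK() becomes the rtmutex-backed sleeping lock added in
 * kernel/rt.c and kernel/rtmutex.c further down.  The spin_lock() calls
 * below are assumed to be routed to the matching implementation by the
 * TYPE_EQUAL()-style dispatch this patch uses (see the atomic_dec_and_lock
 * hunk earlier); both locks and their critical sections are hypothetical.
 */
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(hw_fifo_lock);	/* never sleeps; safe with IRQs off */
static DEFINE_SPINLOCK(stats_lock);		/* may sleep under PREEMPT_RT */

static unsigned int hw_fifo_fill, stat_hits;

static void hw_fifo_kick(void)
{
	unsigned long flags;

	spin_lock_irqsave(&hw_fifo_lock, flags);
	hw_fifo_fill++;			/* short, atomic-context critical section */
	spin_unlock_irqrestore(&hw_fifo_lock, flags);
}

static void stats_bump(void)
{
	spin_lock(&stats_lock);		/* rt_spin_lock() under PREEMPT_RT: can block */
	stat_hits++;
	spin_unlock(&stats_lock);
}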
obj-$(CONFIG_RT_MUTEX_TESTER) += rtmutex-tester.o +obj-$(CONFIG_PREEMPT_RT) += rt.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o obj-$(CONFIG_SMP) += cpu.o spinlock.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -959,6 +959,9 @@ static void rt_mutex_init_task(struct ta #ifdef CONFIG_RT_MUTEXES plist_head_init(&p->pi_waiters, &p->pi_lock); p->pi_blocked_on = NULL; +# ifdef CONFIG_DEBUG_RT_MUTEXES + p->last_kernel_lock = NULL; +# endif #endif } @@ -1154,6 +1157,9 @@ static struct task_struct *copy_process( retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs); if (retval) goto bad_fork_cleanup_namespaces; +#ifdef CONFIG_DEBUG_PREEMPT + p->lock_count = 0; +#endif if (pid != &init_struct_pid) { retval = -ENOMEM; Index: linux-2.6.24.7/kernel/futex.c =================================================================== --- linux-2.6.24.7.orig/kernel/futex.c +++ linux-2.6.24.7/kernel/futex.c @@ -2206,7 +2206,11 @@ static int __init init(void) futex_cmpxchg_enabled = 1; for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { +#ifdef CONFIG_PREEMPT_RT + plist_head_init(&futex_queues[i].chain, NULL); +#else plist_head_init(&futex_queues[i].chain, &futex_queues[i].lock); +#endif spin_lock_init(&futex_queues[i].lock); } Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -1544,7 +1544,7 @@ static void migrate_hrtimers(int cpu) tick_cancel_sched_timer(cpu); local_irq_disable(); - double_spin_lock(&new_base->lock, &old_base->lock, + raw_double_spin_lock(&new_base->lock, &old_base->lock, smp_processor_id() < cpu); for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { @@ -1552,7 +1552,7 @@ static void migrate_hrtimers(int cpu) &new_base->clock_base[i]); } - double_spin_unlock(&new_base->lock, &old_base->lock, + raw_double_spin_unlock(&new_base->lock, &old_base->lock, smp_processor_id() < cpu); local_irq_enable(); put_cpu_var(hrtimer_bases); Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -67,7 +67,7 @@ module_param(lock_stat, int, 0644); * to use a raw spinlock - we really dont want the spinlock * code to recurse back into the lockdep code... 
*/ -static raw_spinlock_t lockdep_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; +static __raw_spinlock_t lockdep_lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; static int graph_lock(void) { Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rt.c @@ -0,0 +1,571 @@ +/* + * kernel/rt.c + * + * Real-Time Preemption Support + * + * started by Ingo Molnar: + * + * Copyright (C) 2004-2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner <tglx@timesys.com> + * + * historic credit for proving that Linux spinlocks can be implemented via + * RT-aware mutexes goes to many people: The Pmutex project (Dirk Grambow + * and others) who prototyped it on 2.4 and did lots of comparative + * research and analysis; TimeSys, for proving that you can implement a + * fully preemptible kernel via the use of IRQ threading and mutexes; + * Bill Huey for persuasively arguing on lkml that the mutex model is the + * right one; and to MontaVista, who ported pmutexes to 2.6. + * + * This code is a from-scratch implementation and is not based on pmutexes, + * but the idea of converting spinlocks to mutexes is used here too. + * + * lock debugging, locking tree, deadlock detection: + * + * Copyright (C) 2004, LynuxWorks, Inc., Igor Manyilov, Bill Huey + * Released under the General Public License (GPL). + * + * Includes portions of the generic R/W semaphore implementation from: + * + * Copyright (c) 2001 David Howells (dhowells@redhat.com). + * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de> + * - Derived also from comments by Linus + * + * Pending ownership of locks and ownership stealing: + * + * Copyright (C) 2005, Kihon Technologies Inc., Steven Rostedt + * + * (also by Steven Rostedt) + * - Converted single pi_lock to individual task locks. + * + * By Esben Nielsen: + * Doing priority inheritance with help of the scheduler. + * + * Copyright (C) 2006, Timesys Corp., Thomas Gleixner <tglx@timesys.com> + * - major rework based on Esben Nielsens initial patch + * - replaced thread_info references by task_struct refs + * - removed task->pending_owner dependency + * - BKL drop/reacquire for semaphore style locks to avoid deadlocks + * in the scheduler return path as discussed with Steven Rostedt + * + * Copyright (C) 2006, Kihon Technologies Inc. + * Steven Rostedt <rostedt@goodmis.org> + * - debugged and patched Thomas Gleixner's rework. + * - added back the cmpxchg to the rework. + * - turned atomic require back on for SMP. 
+ */ + +#include <linux/spinlock.h> +#include <linux/rt_lock.h> +#include <linux/sched.h> +#include <linux/delay.h> +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/kallsyms.h> +#include <linux/syscalls.h> +#include <linux/interrupt.h> +#include <linux/plist.h> +#include <linux/fs.h> +#include <linux/futex.h> + +#include "rtmutex_common.h" + +#ifdef CONFIG_PREEMPT_RT +/* + * Unlock these on crash: + */ +void zap_rt_locks(void) +{ + //trace_lock_init(); +} +#endif + +/* + * struct mutex functions + */ +void _mutex_init(struct mutex *lock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&lock->lock, name); +} +EXPORT_SYMBOL(_mutex_init); + +void __lockfunc _mutex_lock(struct mutex *lock) +{ + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + rt_mutex_lock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_lock); + +int __lockfunc _mutex_lock_interruptible(struct mutex *lock) +{ + int ret; + + mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); + ret = rt_mutex_lock_interruptible(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) +{ + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + rt_mutex_lock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_lock_nested); + +int __lockfunc _mutex_lock_interruptible_nested(struct mutex *lock, int subclass) +{ + int ret; + + mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + ret = rt_mutex_lock_interruptible(&lock->lock, 0); + if (ret) + mutex_release(&lock->dep_map, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(_mutex_lock_interruptible_nested); +#endif + +int __lockfunc _mutex_trylock(struct mutex *lock) +{ + int ret = rt_mutex_trylock(&lock->lock); + + if (ret) + mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(_mutex_trylock); + +void __lockfunc _mutex_unlock(struct mutex *lock) +{ + mutex_release(&lock->dep_map, 1, _RET_IP_); + rt_mutex_unlock(&lock->lock); +} +EXPORT_SYMBOL(_mutex_unlock); + +/* + * rwlock_t functions + */ +int __lockfunc rt_write_trylock(rwlock_t *rwlock) +{ + int ret = rt_mutex_trylock(&rwlock->lock); + + if (ret) + rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_write_trylock); + +int __lockfunc rt_write_trylock_irqsave(rwlock_t *rwlock, unsigned long *flags) +{ + *flags = 0; + return rt_write_trylock(rwlock); +} + +int __lockfunc rt_read_trylock(rwlock_t *rwlock) +{ + struct rt_mutex *lock = &rwlock->lock; + unsigned long flags; + int ret; + + /* + * Read locks within the self-held write lock succeed. 
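/*
 * Editor's note -- illustrative sketch, not part of the patch.  Caller's
 * view of the struct mutex wrappers above: under PREEMPT_RT the mutex is
 * backed by an rtmutex, but the calling convention stays the same --
 * the interruptible variant returns 0 with the lock held, or a negative
 * error with the lock NOT held (the wrapper already dropped the lockdep
 * acquisition on failure).  Assumes the usual mutex_lock_interruptible()
 * wrapper maps onto _mutex_lock_interruptible() here; the session
 * structure is hypothetical.
 */
#include <linux/mutex.h>

struct demo_session {
	struct mutex	lock;
	int		users;
};

static int demo_session_attach(struct demo_session *s)
{
	int ret;

	ret = mutex_lock_interruptible(&s->lock);
	if (ret)
		return ret;	/* interrupted while sleeping on the rtmutex */

	s->users++;
	mutex_unlock(&s->lock);
	return 0;
}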
+ */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth++; + rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); + return 1; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + + ret = rt_mutex_trylock(lock); + if (ret) + rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_read_trylock); + +void __lockfunc rt_write_lock(rwlock_t *rwlock) +{ + rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); + __rt_spin_lock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_write_lock); + +void __lockfunc rt_read_lock(rwlock_t *rwlock) +{ + unsigned long flags; + struct rt_mutex *lock = &rwlock->lock; + + rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_); + /* + * Read locks within the write lock succeed. + */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth++; + return; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + __rt_spin_lock(lock); +} + +EXPORT_SYMBOL(rt_read_lock); + +void __lockfunc rt_write_unlock(rwlock_t *rwlock) +{ + /* NOTE: we always pass in '1' for nested, for simplicity */ + rwlock_release(&rwlock->dep_map, 1, _RET_IP_); + __rt_spin_unlock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_write_unlock); + +void __lockfunc rt_read_unlock(rwlock_t *rwlock) +{ + struct rt_mutex *lock = &rwlock->lock; + unsigned long flags; + + rwlock_release(&rwlock->dep_map, 1, _RET_IP_); + // TRACE_WARN_ON(lock->save_state != 1); + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&lock->wait_lock, flags); + if (rt_mutex_real_owner(lock) == current && rwlock->read_depth) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + rwlock->read_depth--; + return; + } + spin_unlock_irqrestore(&lock->wait_lock, flags); + __rt_spin_unlock(&rwlock->lock); +} +EXPORT_SYMBOL(rt_read_unlock); + +unsigned long __lockfunc rt_write_lock_irqsave(rwlock_t *rwlock) +{ + rt_write_lock(rwlock); + + return 0; +} +EXPORT_SYMBOL(rt_write_lock_irqsave); + +unsigned long __lockfunc rt_read_lock_irqsave(rwlock_t *rwlock) +{ + rt_read_lock(rwlock); + + return 0; +} +EXPORT_SYMBOL(rt_read_lock_irqsave); + +void __rt_rwlock_init(rwlock_t *rwlock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)rwlock, sizeof(*rwlock)); + lockdep_init_map(&rwlock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&rwlock->lock, name); + rwlock->read_depth = 0; +} +EXPORT_SYMBOL(__rt_rwlock_init); + +/* + * rw_semaphores + */ + +void fastcall rt_up_write(struct rw_semaphore *rwsem) +{ + rwsem_release(&rwsem->dep_map, 1, _RET_IP_); + rt_mutex_unlock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_up_write); + +void fastcall rt_up_read(struct rw_semaphore *rwsem) +{ + unsigned long flags; + + rwsem_release(&rwsem->dep_map, 1, _RET_IP_); + /* + * Read locks within the self-held write lock succeed. 
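/*
 * Editor's note -- illustrative sketch, not part of the patch.  What the
 * read_depth handling above buys: under PREEMPT_RT a task that already
 * owns the write side of an rwlock_t may take the read side again without
 * deadlocking -- rt_read_lock() sees rt_mutex_real_owner() == current and
 * only bumps ->read_depth, and rt_read_unlock() undoes that.  Assumes the
 * usual read_lock()/write_lock() wrappers map to the rt_* functions above;
 * the table below is made up.
 */
#include <linux/spinlock.h>

static DEFINE_RWLOCK(table_lock);
static int table_generation;

static void table_rewrite(void)
{
	write_lock(&table_lock);

	/* Nested read lock by the write-lock owner: recursion, not deadlock. */
	read_lock(&table_lock);
	table_generation++;
	read_unlock(&table_lock);

	write_unlock(&table_lock);
}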
+ */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth--; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_unlock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_up_read); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC +void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem) +{ + unsigned long flags; + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth--; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_unlock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_up_read_non_owner); +#endif + +/* + * downgrade a write lock into a read lock + * - just wake up any readers at the front of the queue + */ +void fastcall rt_downgrade_write(struct rw_semaphore *rwsem) +{ + BUG(); +} +EXPORT_SYMBOL(rt_downgrade_write); + +int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem) +{ + int ret = rt_mutex_trylock(&rwsem->lock); + + if (ret) + rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(rt_down_write_trylock); + +void fastcall rt_down_write(struct rw_semaphore *rwsem) +{ + rwsem_acquire(&rwsem->dep_map, 0, 0, _RET_IP_); + rt_mutex_lock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_down_write); + +void fastcall rt_down_write_nested(struct rw_semaphore *rwsem, int subclass) +{ + rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_); + rt_mutex_lock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_down_write_nested); + +int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem) +{ + unsigned long flags; + int ret; + + /* + * Read locks within the self-held write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + if (rt_mutex_real_owner(&rwsem->lock) == current) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem_acquire_read(&rwsem->dep_map, 0, 1, _RET_IP_); + rwsem->read_depth++; + return 1; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + + ret = rt_mutex_trylock(&rwsem->lock); + if (ret) + rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); + return ret; +} +EXPORT_SYMBOL(rt_down_read_trylock); + +static void __rt_down_read(struct rw_semaphore *rwsem, int subclass) +{ + unsigned long flags; + + rwsem_acquire_read(&rwsem->dep_map, subclass, 0, _RET_IP_); + + /* + * Read locks within the write lock succeed. + */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + + if (rt_mutex_real_owner(&rwsem->lock) == current) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth++; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_lock(&rwsem->lock); +} + +void fastcall rt_down_read(struct rw_semaphore *rwsem) +{ + __rt_down_read(rwsem, 0); +} +EXPORT_SYMBOL(rt_down_read); + +void fastcall rt_down_read_nested(struct rw_semaphore *rwsem, int subclass) +{ + __rt_down_read(rwsem, subclass); +} +EXPORT_SYMBOL(rt_down_read_nested); + + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + +/* + * Same as rt_down_read() but no lockdep calls: + */ +void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem) +{ + unsigned long flags; + /* + * Read locks within the write lock succeed. 
+ */ + spin_lock_irqsave(&rwsem->lock.wait_lock, flags); + + if (rt_mutex_real_owner(&rwsem->lock) == current) { + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rwsem->read_depth++; + return; + } + spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); + rt_mutex_lock(&rwsem->lock); +} +EXPORT_SYMBOL(rt_down_read_non_owner); + +#endif + +void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, + struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)rwsem, sizeof(*rwsem)); + lockdep_init_map(&rwsem->dep_map, name, key, 0); +#endif + __rt_mutex_init(&rwsem->lock, name); + rwsem->read_depth = 0; +} +EXPORT_SYMBOL(__rt_rwsem_init); + +/* + * Semaphores + */ +/* + * Linux Semaphores implemented via RT-mutexes. + * + * In the down() variants we use the mutex as the semaphore blocking + * object: we always acquire it, decrease the counter and keep the lock + * locked if we did the 1->0 transition. The next down() will then block. + * + * In the up() path we atomically increase the counter and do the + * unlock if we were the one doing the 0->1 transition. + */ + +static inline void __down_complete(struct semaphore *sem) +{ + int count = atomic_dec_return(&sem->count); + + if (unlikely(count > 0)) + rt_mutex_unlock(&sem->lock); +} + +void fastcall rt_down(struct semaphore *sem) +{ + rt_mutex_lock(&sem->lock); + __down_complete(sem); +} +EXPORT_SYMBOL(rt_down); + +int fastcall rt_down_interruptible(struct semaphore *sem) +{ + int ret; + + ret = rt_mutex_lock_interruptible(&sem->lock, 0); + if (ret) + return ret; + __down_complete(sem); + return 0; +} +EXPORT_SYMBOL(rt_down_interruptible); + +/* + * try to down the semaphore, 0 on success and 1 on failure. (inverted) + */ +int fastcall rt_down_trylock(struct semaphore *sem) +{ + /* + * Here we are a tiny bit different from ordinary Linux semaphores, + * because we can get 'transient' locking-failures when say a + * process decreases the count from 9 to 8 and locks/releases the + * embedded mutex internally. 
It would be quite complex to remove + * these transient failures so lets try it the simple way first: + */ + if (rt_mutex_trylock(&sem->lock)) { + __down_complete(sem); + return 0; + } + return 1; +} +EXPORT_SYMBOL(rt_down_trylock); + +void fastcall rt_up(struct semaphore *sem) +{ + int count; + + /* + * Disable preemption to make sure a highprio trylock-er cannot + * preempt us here and get into an infinite loop: + */ + preempt_disable(); + count = atomic_inc_return(&sem->count); + /* + * If we did the 0 -> 1 transition then we are the ones to unlock it: + */ + if (likely(count == 1)) + rt_mutex_unlock(&sem->lock); + preempt_enable(); +} +EXPORT_SYMBOL(rt_up); + +void fastcall __sema_init(struct semaphore *sem, int val, + char *name, char *file, int line) +{ + atomic_set(&sem->count, val); + switch (val) { + case 0: + __rt_mutex_init(&sem->lock, name); + rt_mutex_lock(&sem->lock); + break; + default: + __rt_mutex_init(&sem->lock, name); + break; + } +} +EXPORT_SYMBOL(__sema_init); + +void fastcall __init_MUTEX(struct semaphore *sem, char *name, char *file, + int line) +{ + __sema_init(sem, 1, name, file, line); +} +EXPORT_SYMBOL(__init_MUTEX); + Index: linux-2.6.24.7/kernel/rtmutex-debug.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex-debug.c +++ linux-2.6.24.7/kernel/rtmutex-debug.c @@ -16,6 +16,7 @@ * * See rt.c in preempt-rt for proper credits and further information */ +#include <linux/rt_lock.h> #include <linux/sched.h> #include <linux/delay.h> #include <linux/module.h> @@ -29,61 +30,6 @@ #include "rtmutex_common.h" -# define TRACE_WARN_ON(x) WARN_ON(x) -# define TRACE_BUG_ON(x) BUG_ON(x) - -# define TRACE_OFF() \ -do { \ - if (rt_trace_on) { \ - rt_trace_on = 0; \ - console_verbose(); \ - if (spin_is_locked(¤t->pi_lock)) \ - spin_unlock(¤t->pi_lock); \ - } \ -} while (0) - -# define TRACE_OFF_NOLOCK() \ -do { \ - if (rt_trace_on) { \ - rt_trace_on = 0; \ - console_verbose(); \ - } \ -} while (0) - -# define TRACE_BUG_LOCKED() \ -do { \ - TRACE_OFF(); \ - BUG(); \ -} while (0) - -# define TRACE_WARN_ON_LOCKED(c) \ -do { \ - if (unlikely(c)) { \ - TRACE_OFF(); \ - WARN_ON(1); \ - } \ -} while (0) - -# define TRACE_BUG_ON_LOCKED(c) \ -do { \ - if (unlikely(c)) \ - TRACE_BUG_LOCKED(); \ -} while (0) - -#ifdef CONFIG_SMP -# define SMP_TRACE_BUG_ON_LOCKED(c) TRACE_BUG_ON_LOCKED(c) -#else -# define SMP_TRACE_BUG_ON_LOCKED(c) do { } while (0) -#endif - -/* - * deadlock detection flag. 
We turn it off when we detect - * the first problem because we dont want to recurse back - * into the tracing code when doing error printk or - * executing a BUG(): - */ -static int rt_trace_on = 1; - static void printk_task(struct task_struct *p) { if (p) @@ -111,8 +57,8 @@ static void printk_lock(struct rt_mutex void rt_mutex_debug_task_free(struct task_struct *task) { - WARN_ON(!plist_head_empty(&task->pi_waiters)); - WARN_ON(task->pi_blocked_on); + DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters)); + DEBUG_LOCKS_WARN_ON(task->pi_blocked_on); } /* @@ -125,7 +71,7 @@ void debug_rt_mutex_deadlock(int detect, { struct task_struct *task; - if (!rt_trace_on || detect || !act_waiter) + if (!debug_locks || detect || !act_waiter) return; task = rt_mutex_owner(act_waiter->lock); @@ -139,14 +85,15 @@ void debug_rt_mutex_print_deadlock(struc { struct task_struct *task; - if (!waiter->deadlock_lock || !rt_trace_on) + if (!waiter->deadlock_lock || !debug_locks) return; task = find_task_by_pid(waiter->deadlock_task_pid); if (!task) return; - TRACE_OFF_NOLOCK(); + if (!debug_locks_off()) + return; printk("\n============================================\n"); printk( "[ BUG: circular locking deadlock detected! ]\n"); @@ -176,7 +123,6 @@ void debug_rt_mutex_print_deadlock(struc printk("[ turning off deadlock detection." "Please report this trace. ]\n\n"); - local_irq_disable(); } void debug_rt_mutex_lock(struct rt_mutex *lock) @@ -185,7 +131,8 @@ void debug_rt_mutex_lock(struct rt_mutex void debug_rt_mutex_unlock(struct rt_mutex *lock) { - TRACE_WARN_ON_LOCKED(rt_mutex_owner(lock) != current); + if (debug_locks) + DEBUG_LOCKS_WARN_ON(rt_mutex_owner(lock) != current); } void @@ -195,7 +142,7 @@ debug_rt_mutex_proxy_lock(struct rt_mute void debug_rt_mutex_proxy_unlock(struct rt_mutex *lock) { - TRACE_WARN_ON_LOCKED(!rt_mutex_owner(lock)); + DEBUG_LOCKS_WARN_ON(!rt_mutex_owner(lock)); } void debug_rt_mutex_init_waiter(struct rt_mutex_waiter *waiter) @@ -207,9 +154,9 @@ void debug_rt_mutex_init_waiter(struct r void debug_rt_mutex_free_waiter(struct rt_mutex_waiter *waiter) { - TRACE_WARN_ON(!plist_node_empty(&waiter->list_entry)); - TRACE_WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); - TRACE_WARN_ON(waiter->task); + DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->list_entry)); + DEBUG_LOCKS_WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); + DEBUG_LOCKS_WARN_ON(waiter->task); memset(waiter, 0x22, sizeof(*waiter)); } @@ -225,9 +172,36 @@ void debug_rt_mutex_init(struct rt_mutex void rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task) { +#ifdef CONFIG_DEBUG_PREEMPT + if (task->lock_count >= MAX_LOCK_STACK) { + if (!debug_locks_off()) + return; + printk("BUG: %s/%d: lock count overflow!\n", + task->comm, task->pid); + dump_stack(); + return; + } +#ifdef CONFIG_PREEMPT_RT + task->owned_lock[task->lock_count] = lock; +#endif + task->lock_count++; +#endif } void rt_mutex_deadlock_account_unlock(struct task_struct *task) { +#ifdef CONFIG_DEBUG_PREEMPT + if (!task->lock_count) { + if (!debug_locks_off()) + return; + printk("BUG: %s/%d: lock count underflow!\n", + task->comm, task->pid); + dump_stack(); + return; + } + task->lock_count--; +#ifdef CONFIG_PREEMPT_RT + task->owned_lock[task->lock_count] = NULL; +#endif +#endif } - Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -97,6 +97,22 @@ static inline void mark_rt_mutex_waiters } #endif 
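/*
 * Editor's note -- illustrative sketch, not part of the patch.  A small
 * user-space model of the counting scheme described in the kernel/rt.c
 * semaphore comment above: an atomic counter plus one binary "blocking
 * object".  down() always acquires the blocking object and releases it
 * again only if the counter did not drop to zero; up() increments and
 * releases the blocking object only on the 0 -> 1 transition.  POSIX
 * sem_t stands in for the rtmutex because, like the lock in rt_up(), it
 * may be released by a different task than the one that acquired it.
 * (The preempt_disable() liveness detail in rt_up() is not modelled.)
 */
#include <semaphore.h>
#include <stdatomic.h>

struct model_sem {
	atomic_int	count;	/* free resources */
	sem_t		block;	/* 1 = the next down() may proceed */
};

static void model_sem_init(struct model_sem *s, int val)
{
	atomic_init(&s->count, val);
	sem_init(&s->block, 0, val ? 1 : 0);
}

static void model_down(struct model_sem *s)
{
	sem_wait(&s->block);			/* "rt_mutex_lock()" */
	if (atomic_fetch_sub(&s->count, 1) - 1 > 0)
		sem_post(&s->block);		/* resources left: pass it on */
	/* else: we did the 1 -> 0 transition, keep it held so the next
	 * down() blocks */
}

static void model_up(struct model_sem *s)
{
	if (atomic_fetch_add(&s->count, 1) + 1 == 1)
		sem_post(&s->block);		/* we did the 0 -> 1 transition */
}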
+int pi_initialized; + +/* + * we initialize the wait_list runtime. (Could be done build-time and/or + * boot-time.) + */ +static inline void init_lists(struct rt_mutex *lock) +{ + if (unlikely(!lock->wait_list.prio_list.prev)) { + plist_head_init(&lock->wait_list, &lock->wait_lock); +#ifdef CONFIG_DEBUG_RT_MUTEXES + pi_initialized++; +#endif + } +} + /* * Calculate task priority from the waiter list priority * @@ -253,13 +269,13 @@ static int rt_mutex_adjust_prio_chain(st plist_add(&waiter->list_entry, &lock->wait_list); /* Release the task */ - spin_unlock_irqrestore(&task->pi_lock, flags); + spin_unlock(&task->pi_lock); put_task_struct(task); /* Grab the next task */ task = rt_mutex_owner(lock); get_task_struct(task); - spin_lock_irqsave(&task->pi_lock, flags); + spin_lock(&task->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { /* Boost the owner */ @@ -277,10 +293,10 @@ static int rt_mutex_adjust_prio_chain(st __rt_mutex_adjust_prio(task); } - spin_unlock_irqrestore(&task->pi_lock, flags); + spin_unlock(&task->pi_lock); top_waiter = rt_mutex_top_waiter(lock); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); if (!detect_deadlock && waiter != top_waiter) goto out_put_task; @@ -304,7 +320,6 @@ static inline int try_to_steal_lock(stru { struct task_struct *pendowner = rt_mutex_owner(lock); struct rt_mutex_waiter *next; - unsigned long flags; if (!rt_mutex_owner_pending(lock)) return 0; @@ -312,9 +327,9 @@ static inline int try_to_steal_lock(stru if (pendowner == current) return 1; - spin_lock_irqsave(&pendowner->pi_lock, flags); + spin_lock(&pendowner->pi_lock); if (current->prio >= pendowner->prio) { - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); return 0; } @@ -324,7 +339,7 @@ static inline int try_to_steal_lock(stru * priority. 
*/ if (likely(!rt_mutex_has_waiters(lock))) { - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); return 1; } @@ -332,7 +347,7 @@ static inline int try_to_steal_lock(stru next = rt_mutex_top_waiter(lock); plist_del(&next->pi_list_entry, &pendowner->pi_waiters); __rt_mutex_adjust_prio(pendowner); - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); /* * We are going to steal the lock and a waiter was @@ -349,10 +364,10 @@ static inline int try_to_steal_lock(stru * might be current: */ if (likely(next->task != current)) { - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); plist_add(&next->pi_list_entry, ¤t->pi_waiters); __rt_mutex_adjust_prio(current); - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); } return 1; } @@ -411,14 +426,13 @@ static int try_to_take_rt_mutex(struct r */ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, struct rt_mutex_waiter *waiter, - int detect_deadlock) + int detect_deadlock, unsigned long flags) { struct task_struct *owner = rt_mutex_owner(lock); struct rt_mutex_waiter *top_waiter = waiter; - unsigned long flags; int chain_walk = 0, res; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); __rt_mutex_adjust_prio(current); waiter->task = current; waiter->lock = lock; @@ -432,17 +446,17 @@ static int task_blocks_on_rt_mutex(struc current->pi_blocked_on = waiter; - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { - spin_lock_irqsave(&owner->pi_lock, flags); + spin_lock(&owner->pi_lock); plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters); plist_add(&waiter->pi_list_entry, &owner->pi_waiters); __rt_mutex_adjust_prio(owner); if (owner->pi_blocked_on) chain_walk = 1; - spin_unlock_irqrestore(&owner->pi_lock, flags); + spin_unlock(&owner->pi_lock); } else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) chain_walk = 1; @@ -457,12 +471,12 @@ static int task_blocks_on_rt_mutex(struc */ get_task_struct(owner); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter, current); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); return res; } @@ -475,13 +489,12 @@ static int task_blocks_on_rt_mutex(struc * * Called with lock->wait_lock held. */ -static void wakeup_next_waiter(struct rt_mutex *lock) +static void wakeup_next_waiter(struct rt_mutex *lock, int savestate) { struct rt_mutex_waiter *waiter; struct task_struct *pendowner; - unsigned long flags; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); waiter = rt_mutex_top_waiter(lock); plist_del(&waiter->list_entry, &lock->wait_list); @@ -498,7 +511,7 @@ static void wakeup_next_waiter(struct rt rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING); - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); /* * Clear the pi_blocked_on variable and enqueue a possible @@ -507,7 +520,7 @@ static void wakeup_next_waiter(struct rt * waiter with higher priority than pending-owner->normal_prio * is blocked on the unboosted (pending) owner. 
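/*
 * Editor's note -- illustrative sketch, not part of the patch.  The hunks
 * above (and the rt_mutex_slowlock() changes below) switch to one locking
 * convention: lock->wait_lock is taken with spin_lock_irqsave() by the
 * outermost caller, so the nested task->pi_lock acquisitions no longer
 * need their own irqsave/irqrestore and become plain spin_lock() calls.
 * The structures below are stand-ins, not real rtmutex code.
 */
#include <linux/spinlock.h>

struct demo_waiter_lock { raw_spinlock_t wait_lock; };
struct demo_task	{ raw_spinlock_t pi_lock; int prio; };

static void demo_boost_owner(struct demo_waiter_lock *lock,
			     struct demo_task *owner, int new_prio)
{
	unsigned long flags;

	spin_lock_irqsave(&lock->wait_lock, flags);	/* IRQs off from here on */

	spin_lock(&owner->pi_lock);			/* inner lock: no irqsave */
	if (new_prio < owner->prio)
		owner->prio = new_prio;
	spin_unlock(&owner->pi_lock);

	spin_unlock_irqrestore(&lock->wait_lock, flags);
}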
*/ - spin_lock_irqsave(&pendowner->pi_lock, flags); + spin_lock(&pendowner->pi_lock); WARN_ON(!pendowner->pi_blocked_on); WARN_ON(pendowner->pi_blocked_on != waiter); @@ -521,9 +534,12 @@ static void wakeup_next_waiter(struct rt next = rt_mutex_top_waiter(lock); plist_add(&next->pi_list_entry, &pendowner->pi_waiters); } - spin_unlock_irqrestore(&pendowner->pi_lock, flags); + spin_unlock(&pendowner->pi_lock); - wake_up_process(pendowner); + if (savestate) + wake_up_process_mutex(pendowner); + else + wake_up_process(pendowner); } /* @@ -532,22 +548,22 @@ static void wakeup_next_waiter(struct rt * Must be called with lock->wait_lock held */ static void remove_waiter(struct rt_mutex *lock, - struct rt_mutex_waiter *waiter) + struct rt_mutex_waiter *waiter, + unsigned long flags) { int first = (waiter == rt_mutex_top_waiter(lock)); struct task_struct *owner = rt_mutex_owner(lock); - unsigned long flags; int chain_walk = 0; - spin_lock_irqsave(¤t->pi_lock, flags); + spin_lock(¤t->pi_lock); plist_del(&waiter->list_entry, &lock->wait_list); waiter->task = NULL; current->pi_blocked_on = NULL; - spin_unlock_irqrestore(¤t->pi_lock, flags); + spin_unlock(¤t->pi_lock); if (first && owner != current) { - spin_lock_irqsave(&owner->pi_lock, flags); + spin_lock(&owner->pi_lock); plist_del(&waiter->pi_list_entry, &owner->pi_waiters); @@ -562,7 +578,7 @@ static void remove_waiter(struct rt_mute if (owner->pi_blocked_on) chain_walk = 1; - spin_unlock_irqrestore(&owner->pi_lock, flags); + spin_unlock(&owner->pi_lock); } WARN_ON(!plist_node_empty(&waiter->pi_list_entry)); @@ -573,11 +589,11 @@ static void remove_waiter(struct rt_mute /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(owner); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); } /* @@ -598,14 +614,307 @@ void rt_mutex_adjust_pi(struct task_stru return; } - spin_unlock_irqrestore(&task->pi_lock, flags); - /* gets dropped in rt_mutex_adjust_prio_chain()! */ get_task_struct(task); + spin_unlock_irqrestore(&task->pi_lock, flags); + rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task); } /* + * preemptible spin_lock functions: + */ + +#ifdef CONFIG_PREEMPT_RT + +static inline void +rt_spin_lock_fastlock(struct rt_mutex *lock, + void fastcall (*slowfn)(struct rt_mutex *lock)) +{ + if (likely(rt_mutex_cmpxchg(lock, NULL, current))) + rt_mutex_deadlock_account_lock(lock, current); + else + slowfn(lock); +} + +static inline void +rt_spin_lock_fastunlock(struct rt_mutex *lock, + void fastcall (*slowfn)(struct rt_mutex *lock)) +{ + if (likely(rt_mutex_cmpxchg(lock, current, NULL))) + rt_mutex_deadlock_account_unlock(current); + else + slowfn(lock); +} + +/* + * Slow path lock function spin_lock style: this variant is very + * careful not to miss any non-lock wakeups. + * + * The wakeup side uses wake_up_process_mutex, which, combined with + * the xchg code of this function is a transparent sleep/wakeup + * mechanism nested within any existing sleep/wakeup mechanism. This + * enables the seemless use of arbitrary (blocking) spinlocks within + * sleep/wakeup event loops. 
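/*
 * Editor's note -- illustrative sketch, not part of the patch.  The kind
 * of caller the comment above is talking about: a classic wait loop sets
 * the task state, then checks its condition under a spinlock_t.  Under
 * PREEMPT_RT that spin_lock() may itself sleep; rt_spin_lock_slowlock()
 * just below saves the caller's state, blocks in TASK_UNINTERRUPTIBLE and
 * restores the saved state afterwards (promoting it to TASK_RUNNING if a
 * real wakeup arrived in between), so the outer loop never loses a
 * wakeup.  The queue structure below is hypothetical.
 */
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/wait.h>

struct demo_queue {
	spinlock_t		lock;
	int			pending;
	wait_queue_head_t	wq;
};

static int demo_wait_for_work(struct demo_queue *q)
{
	DEFINE_WAIT(wait);
	int got = 0;

	for (;;) {
		prepare_to_wait(&q->wq, &wait, TASK_INTERRUPTIBLE);

		spin_lock(&q->lock);		/* may block under PREEMPT_RT */
		if (q->pending) {
			q->pending--;
			got = 1;
		}
		spin_unlock(&q->lock);

		if (got || signal_pending(current))
			break;
		/* A wakeup that arrived while we slept inside spin_lock()
		 * left us TASK_RUNNING, so schedule() returns at once and
		 * the condition is rechecked. */
		schedule();
	}
	finish_wait(&q->wq, &wait);
	return got;
}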
+ */ +static void fastcall noinline __sched +rt_spin_lock_slowlock(struct rt_mutex *lock) +{ + struct rt_mutex_waiter waiter; + unsigned long saved_state, state, flags; + + debug_rt_mutex_init_waiter(&waiter); + waiter.task = NULL; + + spin_lock_irqsave(&lock->wait_lock, flags); + init_lists(lock); + + /* Try to acquire the lock again: */ + if (try_to_take_rt_mutex(lock)) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + return; + } + + BUG_ON(rt_mutex_owner(lock) == current); + + /* + * Here we save whatever state the task was in originally, + * we'll restore it at the end of the function and we'll take + * any intermediate wakeup into account as well, independently + * of the lock sleep/wakeup mechanism. When we get a real + * wakeup the task->state is TASK_RUNNING and we change + * saved_state accordingly. If we did not get a real wakeup + * then we return with the saved state. + */ + saved_state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + + for (;;) { + unsigned long saved_flags; + int saved_lock_depth = current->lock_depth; + + /* Try to acquire the lock */ + if (try_to_take_rt_mutex(lock)) + break; + /* + * waiter.task is NULL the first time we come here and + * when we have been woken up by the previous owner + * but the lock got stolen by an higher prio task. + */ + if (!waiter.task) { + task_blocks_on_rt_mutex(lock, &waiter, 0, flags); + /* Wakeup during boost ? */ + if (unlikely(!waiter.task)) + continue; + } + + /* + * Prevent schedule() to drop BKL, while waiting for + * the lock ! We restore lock_depth when we come back. + */ + saved_flags = current->flags & PF_NOSCHED; + current->lock_depth = -1; + current->flags &= ~PF_NOSCHED; + spin_unlock_irqrestore(&lock->wait_lock, flags); + + debug_rt_mutex_print_deadlock(&waiter); + + schedule_rt_mutex(lock); + + spin_lock_irqsave(&lock->wait_lock, flags); + current->flags |= saved_flags; + current->lock_depth = saved_lock_depth; + state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + if (unlikely(state == TASK_RUNNING)) + saved_state = TASK_RUNNING; + } + + state = xchg(¤t->state, saved_state); + if (unlikely(state == TASK_RUNNING)) + current->state = TASK_RUNNING; + + /* + * Extremely rare case, if we got woken up by a non-mutex wakeup, + * and we managed to steal the lock despite us not being the + * highest-prio waiter (due to SCHED_OTHER changing prio), then we + * can end up with a non-NULL waiter.task: + */ + if (unlikely(waiter.task)) + remove_waiter(lock, &waiter, flags); + /* + * try_to_take_rt_mutex() sets the waiter bit + * unconditionally. 
We might have to fix that up: + */ + fixup_rt_mutex_waiters(lock); + + spin_unlock_irqrestore(&lock->wait_lock, flags); + + debug_rt_mutex_free_waiter(&waiter); +} + +/* + * Slow path to release a rt_mutex spin_lock style + */ +static void fastcall noinline __sched +rt_spin_lock_slowunlock(struct rt_mutex *lock) +{ + unsigned long flags; + + spin_lock_irqsave(&lock->wait_lock, flags); + + debug_rt_mutex_unlock(lock); + + rt_mutex_deadlock_account_unlock(current); + + if (!rt_mutex_has_waiters(lock)) { + lock->owner = NULL; + spin_unlock_irqrestore(&lock->wait_lock, flags); + return; + } + + wakeup_next_waiter(lock, 1); + + spin_unlock_irqrestore(&lock->wait_lock, flags); + + /* Undo pi boosting.when necessary */ + rt_mutex_adjust_prio(current); +} + +void __lockfunc rt_spin_lock(spinlock_t *lock) +{ + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); + spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); +} +EXPORT_SYMBOL(rt_spin_lock); + +void __lockfunc __rt_spin_lock(struct rt_mutex *lock) +{ + rt_spin_lock_fastlock(lock, rt_spin_lock_slowlock); +} +EXPORT_SYMBOL(__rt_spin_lock); + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + +void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass) +{ + rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); + spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); +} +EXPORT_SYMBOL(rt_spin_lock_nested); + +#endif + +void __lockfunc rt_spin_unlock(spinlock_t *lock) +{ + /* NOTE: we always pass in '1' for nested, for simplicity */ + spin_release(&lock->dep_map, 1, _RET_IP_); + rt_spin_lock_fastunlock(&lock->lock, rt_spin_lock_slowunlock); +} +EXPORT_SYMBOL(rt_spin_unlock); + +void __lockfunc __rt_spin_unlock(struct rt_mutex *lock) +{ + rt_spin_lock_fastunlock(lock, rt_spin_lock_slowunlock); +} +EXPORT_SYMBOL(__rt_spin_unlock); + +/* + * Wait for the lock to get unlocked: instead of polling for an unlock + * (like raw spinlocks do), we lock and unlock, to force the kernel to + * schedule if there's contention: + */ +void __lockfunc rt_spin_unlock_wait(spinlock_t *lock) +{ + spin_lock(lock); + spin_unlock(lock); +} +EXPORT_SYMBOL(rt_spin_unlock_wait); + +int __lockfunc rt_spin_trylock(spinlock_t *lock) +{ + int ret = rt_mutex_trylock(&lock->lock); + + if (ret) + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock); + +int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags) +{ + int ret; + + *flags = 0; + ret = rt_mutex_trylock(&lock->lock); + if (ret) + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + + return ret; +} +EXPORT_SYMBOL(rt_spin_trylock_irqsave); + +int _atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock) +{ + /* Subtract 1 from counter unless that drops it to 0 (ie. 
it was 1) */ + if (atomic_add_unless(atomic, -1, 1)) + return 0; + rt_spin_lock(lock); + if (atomic_dec_and_test(atomic)) + return 1; + rt_spin_unlock(lock); + return 0; +} +EXPORT_SYMBOL(_atomic_dec_and_spin_lock); + +void +__rt_spin_lock_init(spinlock_t *lock, char *name, struct lock_class_key *key) +{ +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* + * Make sure we are not reinitializing a held lock: + */ + debug_check_no_locks_freed((void *)lock, sizeof(*lock)); + lockdep_init_map(&lock->dep_map, name, key, 0); +#endif + __rt_mutex_init(&lock->lock, name); +} +EXPORT_SYMBOL(__rt_spin_lock_init); + +#endif + +#ifdef CONFIG_PREEMPT_BKL + +static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags) +{ + int saved_lock_depth = current->lock_depth; + + current->lock_depth = -1; + /* + * try_to_take_lock set the waiters, make sure it's + * still correct. + */ + fixup_rt_mutex_waiters(lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); + + up(&kernel_sem); + + spin_lock_irq(&lock->wait_lock); + + return saved_lock_depth; +} + +static inline void rt_reacquire_bkl(int saved_lock_depth) +{ + down(&kernel_sem); + current->lock_depth = saved_lock_depth; +} + +#else +# define rt_release_bkl(lock, flags) (-1) +# define rt_reacquire_bkl(depth) do { } while (0) +#endif + +/* * Slow path lock function: */ static int __sched @@ -613,20 +922,29 @@ rt_mutex_slowlock(struct rt_mutex *lock, struct hrtimer_sleeper *timeout, int detect_deadlock) { + int ret = 0, saved_lock_depth = -1; struct rt_mutex_waiter waiter; - int ret = 0; + unsigned long flags; debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; - spin_lock(&lock->wait_lock); + spin_lock_irqsave(&lock->wait_lock, flags); + init_lists(lock); /* Try to acquire the lock again: */ if (try_to_take_rt_mutex(lock)) { - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return 0; } + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. 
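/*
 * Editor's note -- illustrative sketch, not part of the patch.  The usual
 * caller of _atomic_dec_and_spin_lock() above (reached through the
 * atomic_dec_and_lock() type dispatch added to spinlock.h earlier in this
 * patch): drop a reference, and only when the count hits zero take the
 * lock that serializes lookups, unlink and free.  Object and list names
 * are hypothetical.
 */
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <asm/atomic.h>

static DEFINE_SPINLOCK(demo_list_lock);	/* protects demo_list lookups */
static LIST_HEAD(demo_list);

struct demo_obj {
	atomic_t		refcount;
	struct list_head	node;
};

static void demo_obj_put(struct demo_obj *obj)
{
	/* Returns 0 (lock not taken) while other references remain;
	 * returns 1 with demo_list_lock held when we dropped the last one. */
	if (!atomic_dec_and_lock(&obj->refcount, &demo_list_lock))
		return;

	list_del(&obj->node);		/* no lookup can revive it now */
	spin_unlock(&demo_list_lock);
	kfree(obj);
}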
+ */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(lock, flags); + set_current_state(state); /* Setup the timer, when timeout != NULL */ @@ -635,6 +953,8 @@ rt_mutex_slowlock(struct rt_mutex *lock, HRTIMER_MODE_ABS); for (;;) { + unsigned long saved_flags; + /* Try to acquire the lock: */ if (try_to_take_rt_mutex(lock)) break; @@ -660,7 +980,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, */ if (!waiter.task) { ret = task_blocks_on_rt_mutex(lock, &waiter, - detect_deadlock); + detect_deadlock, flags); /* * If we got woken up by the owner then start loop * all over without going into schedule to try @@ -679,22 +999,26 @@ rt_mutex_slowlock(struct rt_mutex *lock, if (unlikely(ret)) break; } + saved_flags = current->flags & PF_NOSCHED; + current->flags &= ~PF_NOSCHED; - spin_unlock(&lock->wait_lock); + spin_unlock_irq(&lock->wait_lock); debug_rt_mutex_print_deadlock(&waiter); if (waiter.task) schedule_rt_mutex(lock); - spin_lock(&lock->wait_lock); + spin_lock_irq(&lock->wait_lock); + + current->flags |= saved_flags; set_current_state(state); } set_current_state(TASK_RUNNING); if (unlikely(waiter.task)) - remove_waiter(lock, &waiter); + remove_waiter(lock, &waiter, flags); /* * try_to_take_rt_mutex() sets the waiter bit @@ -702,7 +1026,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, */ fixup_rt_mutex_waiters(lock); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); /* Remove pending timer: */ if (unlikely(timeout)) @@ -716,6 +1040,10 @@ rt_mutex_slowlock(struct rt_mutex *lock, if (unlikely(ret)) rt_mutex_adjust_prio(current); + /* Must we reaquire the BKL? */ + if (unlikely(saved_lock_depth >= 0)) + rt_reacquire_bkl(saved_lock_depth); + debug_rt_mutex_free_waiter(&waiter); return ret; @@ -727,12 +1055,15 @@ rt_mutex_slowlock(struct rt_mutex *lock, static inline int rt_mutex_slowtrylock(struct rt_mutex *lock) { + unsigned long flags; int ret = 0; - spin_lock(&lock->wait_lock); + spin_lock_irqsave(&lock->wait_lock, flags); if (likely(rt_mutex_owner(lock) != current)) { + init_lists(lock); + ret = try_to_take_rt_mutex(lock); /* * try_to_take_rt_mutex() sets the lock waiters @@ -741,7 +1072,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lo fixup_rt_mutex_waiters(lock); } - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return ret; } @@ -752,7 +1083,9 @@ rt_mutex_slowtrylock(struct rt_mutex *lo static void __sched rt_mutex_slowunlock(struct rt_mutex *lock) { - spin_lock(&lock->wait_lock); + unsigned long flags; + + spin_lock_irqsave(&lock->wait_lock, flags); debug_rt_mutex_unlock(lock); @@ -760,13 +1093,13 @@ rt_mutex_slowunlock(struct rt_mutex *loc if (!rt_mutex_has_waiters(lock)) { lock->owner = NULL; - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); return; } - wakeup_next_waiter(lock); + wakeup_next_waiter(lock, 0); - spin_unlock(&lock->wait_lock); + spin_unlock_irqrestore(&lock->wait_lock, flags); /* Undo pi boosting if necessary: */ rt_mutex_adjust_prio(current); Index: linux-2.6.24.7/kernel/rwsem.c =================================================================== --- linux-2.6.24.7.orig/kernel/rwsem.c +++ linux-2.6.24.7/kernel/rwsem.c @@ -16,7 +16,7 @@ /* * lock for reading */ -void __sched down_read(struct rw_semaphore *sem) +void __sched compat_down_read(struct compat_rw_semaphore *sem) { might_sleep(); rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_); @@ -24,12 +24,12 @@ void __sched down_read(struct rw_semapho LOCK_CONTENDED(sem, __down_read_trylock, 
__down_read); } -EXPORT_SYMBOL(down_read); +EXPORT_SYMBOL(compat_down_read); /* * trylock for reading -- returns 1 if successful, 0 if contention */ -int down_read_trylock(struct rw_semaphore *sem) +int compat_down_read_trylock(struct compat_rw_semaphore *sem) { int ret = __down_read_trylock(sem); @@ -38,12 +38,12 @@ int down_read_trylock(struct rw_semaphor return ret; } -EXPORT_SYMBOL(down_read_trylock); +EXPORT_SYMBOL(compat_down_read_trylock); /* * lock for writing */ -void __sched down_write(struct rw_semaphore *sem) +void __sched compat_down_write(struct compat_rw_semaphore *sem) { might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); @@ -51,12 +51,12 @@ void __sched down_write(struct rw_semaph LOCK_CONTENDED(sem, __down_write_trylock, __down_write); } -EXPORT_SYMBOL(down_write); +EXPORT_SYMBOL(compat_down_write); /* * trylock for writing -- returns 1 if successful, 0 if contention */ -int down_write_trylock(struct rw_semaphore *sem) +int compat_down_write_trylock(struct compat_rw_semaphore *sem) { int ret = __down_write_trylock(sem); @@ -65,36 +65,36 @@ int down_write_trylock(struct rw_semapho return ret; } -EXPORT_SYMBOL(down_write_trylock); +EXPORT_SYMBOL(compat_down_write_trylock); /* * release a read lock */ -void up_read(struct rw_semaphore *sem) +void compat_up_read(struct compat_rw_semaphore *sem) { rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_read(sem); } -EXPORT_SYMBOL(up_read); +EXPORT_SYMBOL(compat_up_read); /* * release a write lock */ -void up_write(struct rw_semaphore *sem) +void compat_up_write(struct compat_rw_semaphore *sem) { rwsem_release(&sem->dep_map, 1, _RET_IP_); __up_write(sem); } -EXPORT_SYMBOL(up_write); +EXPORT_SYMBOL(compat_up_write); /* * downgrade write lock to read lock */ -void downgrade_write(struct rw_semaphore *sem) +void compat_downgrade_write(struct compat_rw_semaphore *sem) { /* * lockdep: a downgraded write will live on as a write @@ -103,11 +103,11 @@ void downgrade_write(struct rw_semaphore __downgrade_write(sem); } -EXPORT_SYMBOL(downgrade_write); +EXPORT_SYMBOL(compat_downgrade_write); #ifdef CONFIG_DEBUG_LOCK_ALLOC -void down_read_nested(struct rw_semaphore *sem, int subclass) +void compat_down_read_nested(struct compat_rw_semaphore *sem, int subclass) { might_sleep(); rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_); @@ -115,18 +115,18 @@ void down_read_nested(struct rw_semaphor LOCK_CONTENDED(sem, __down_read_trylock, __down_read); } -EXPORT_SYMBOL(down_read_nested); +EXPORT_SYMBOL(compat_down_read_nested); -void down_read_non_owner(struct rw_semaphore *sem) +void compat_down_read_non_owner(struct compat_rw_semaphore *sem) { might_sleep(); __down_read(sem); } -EXPORT_SYMBOL(down_read_non_owner); +EXPORT_SYMBOL(compat_down_read_non_owner); -void down_write_nested(struct rw_semaphore *sem, int subclass) +void compat_down_write_nested(struct compat_rw_semaphore *sem, int subclass) { might_sleep(); rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_); @@ -134,14 +134,14 @@ void down_write_nested(struct rw_semapho LOCK_CONTENDED(sem, __down_write_trylock, __down_write); } -EXPORT_SYMBOL(down_write_nested); +EXPORT_SYMBOL(compat_down_write_nested); -void up_read_non_owner(struct rw_semaphore *sem) +void compat_up_read_non_owner(struct compat_rw_semaphore *sem) { __up_read(sem); } -EXPORT_SYMBOL(up_read_non_owner); +EXPORT_SYMBOL(compat_up_read_non_owner); #endif Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ 
linux-2.6.24.7/kernel/sched.c @@ -1585,7 +1585,8 @@ static int sched_balance_self(int cpu, i * * returns failure only if the task is already active. */ -static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync) +static int +try_to_wake_up(struct task_struct *p, unsigned int state, int sync, int mutex) { int cpu, orig_cpu, this_cpu, success = 0; unsigned long flags; @@ -1671,13 +1672,38 @@ out: int fastcall wake_up_process(struct task_struct *p) { return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED | - TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0); + TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE | + TASK_UNINTERRUPTIBLE, 0, 0); } EXPORT_SYMBOL(wake_up_process); +int fastcall wake_up_process_sync(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED | + TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE | + TASK_UNINTERRUPTIBLE, 1, 0); +} +EXPORT_SYMBOL(wake_up_process_sync); + +int fastcall wake_up_process_mutex(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED | + TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE | + TASK_UNINTERRUPTIBLE, 0, 1); +} +EXPORT_SYMBOL(wake_up_process_mutex); + +int fastcall wake_up_process_mutex_sync(struct task_struct * p) +{ + return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED | + TASK_RUNNING_MUTEX | TASK_INTERRUPTIBLE | + TASK_UNINTERRUPTIBLE, 1, 1); +} +EXPORT_SYMBOL(wake_up_process_mutex_sync); + int fastcall wake_up_state(struct task_struct *p, unsigned int state) { - return try_to_wake_up(p, state, 0); + return try_to_wake_up(p, state | TASK_RUNNING_MUTEX, 0, 0); } /* @@ -3877,7 +3903,8 @@ asmlinkage void __sched preempt_schedule int default_wake_function(wait_queue_t *curr, unsigned mode, int sync, void *key) { - return try_to_wake_up(curr->private, mode, sync); + return try_to_wake_up(curr->private, mode | TASK_RUNNING_MUTEX, + sync, 0); } EXPORT_SYMBOL(default_wake_function); @@ -3917,8 +3944,9 @@ void fastcall __wake_up(wait_queue_head_ unsigned long flags; spin_lock_irqsave(&q->lock, flags); - __wake_up_common(q, mode, nr_exclusive, 0, key); + __wake_up_common(q, mode, nr_exclusive, 1, key); spin_unlock_irqrestore(&q->lock, flags); + preempt_check_resched_delayed(); } EXPORT_SYMBOL(__wake_up); @@ -3968,8 +3996,9 @@ void complete(struct completion *x) spin_lock_irqsave(&x->wait.lock, flags); x->done++; __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, - 1, 0, NULL); + 1, 1, NULL); spin_unlock_irqrestore(&x->wait.lock, flags); + preempt_check_resched_delayed(); } EXPORT_SYMBOL(complete); @@ -3980,11 +4009,18 @@ void complete_all(struct completion *x) spin_lock_irqsave(&x->wait.lock, flags); x->done += UINT_MAX/2; __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, - 0, 0, NULL); + 0, 1, NULL); spin_unlock_irqrestore(&x->wait.lock, flags); + preempt_check_resched_delayed(); } EXPORT_SYMBOL(complete_all); +unsigned int fastcall completion_done(struct completion *x) +{ + return x->done; +} +EXPORT_SYMBOL(completion_done); + static inline long __sched do_wait_for_common(struct completion *x, long timeout, int state) { @@ -4735,10 +4771,7 @@ asmlinkage long sys_sched_yield(void) * Since we are going to call schedule() anyway, there's * no need to preempt or enable interrupts: */ - __release(rq->lock); - spin_release(&rq->lock.dep_map, 1, _THIS_IP_); - _raw_spin_unlock(&rq->lock); - preempt_enable_no_resched(); + spin_unlock_no_resched(&rq->lock); schedule(); @@ -4781,7 +4814,7 @@ EXPORT_SYMBOL(cond_resched); * operations here to prevent schedule() from being called 
twice (once via * spin_unlock(), once by hand). */ -int cond_resched_lock(spinlock_t *lock) +int __cond_resched_raw_spinlock(raw_spinlock_t *lock) { int ret = 0; @@ -4792,24 +4825,23 @@ int cond_resched_lock(spinlock_t *lock) spin_lock(lock); } if (need_resched() && system_state == SYSTEM_RUNNING) { - spin_release(&lock->dep_map, 1, _THIS_IP_); - _raw_spin_unlock(lock); - preempt_enable_no_resched(); + spin_unlock_no_resched(lock); __cond_resched(); ret = 1; spin_lock(lock); } return ret; } -EXPORT_SYMBOL(cond_resched_lock); +EXPORT_SYMBOL(__cond_resched_raw_spinlock); /* * Voluntarily preempt a process context that has softirqs disabled: */ int __sched cond_resched_softirq(void) { +#ifndef CONFIG_PREEMPT_RT WARN_ON_ONCE(!in_softirq()); - +#endif if (need_resched() && system_state == SYSTEM_RUNNING) { local_bh_enable(); __cond_resched(); @@ -5018,19 +5050,23 @@ static void show_task(struct task_struct unsigned state; state = p->state ? __ffs(p->state) + 1 : 0; - printk(KERN_INFO "%-13.13s %c", p->comm, - state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?'); + printk("%-13.13s %c [%p]", p->comm, + state < sizeof(stat_nam) - 1 ? stat_nam[state] : '?', p); #if BITS_PER_LONG == 32 - if (state == TASK_RUNNING) + if (0 && (state == TASK_RUNNING)) printk(KERN_CONT " running "); else printk(KERN_CONT " %08lx ", thread_saved_pc(p)); #else - if (state == TASK_RUNNING) + if (0 && (state == TASK_RUNNING)) printk(KERN_CONT " running task "); else printk(KERN_CONT " %016lx ", thread_saved_pc(p)); #endif + if (task_curr(p)) + printk("[curr] "); + else if (p->se.on_rq) + printk("[on rq #%d] ", task_cpu(p)); #ifdef CONFIG_DEBUG_STACK_USAGE { unsigned long *n = end_of_stack(p); Index: linux-2.6.24.7/kernel/spinlock.c =================================================================== --- linux-2.6.24.7.orig/kernel/spinlock.c +++ linux-2.6.24.7/kernel/spinlock.c @@ -21,7 +21,7 @@ #include <linux/debug_locks.h> #include <linux/module.h> -int __lockfunc _spin_trylock(spinlock_t *lock) +int __lockfunc __spin_trylock(raw_spinlock_t *lock) { preempt_disable(); if (_raw_spin_trylock(lock)) { @@ -32,9 +32,46 @@ int __lockfunc _spin_trylock(spinlock_t preempt_enable(); return 0; } -EXPORT_SYMBOL(_spin_trylock); +EXPORT_SYMBOL(__spin_trylock); -int __lockfunc _read_trylock(rwlock_t *lock) +int __lockfunc __spin_trylock_irq(raw_spinlock_t *lock) +{ + local_irq_disable(); + preempt_disable(); + + if (_raw_spin_trylock(lock)) { + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + return 1; + } + + __preempt_enable_no_resched(); + local_irq_enable(); + preempt_check_resched(); + + return 0; +} +EXPORT_SYMBOL(__spin_trylock_irq); + +int __lockfunc __spin_trylock_irqsave(raw_spinlock_t *lock, + unsigned long *flags) +{ + local_irq_save(*flags); + preempt_disable(); + + if (_raw_spin_trylock(lock)) { + spin_acquire(&lock->dep_map, 0, 1, _RET_IP_); + return 1; + } + + __preempt_enable_no_resched(); + local_irq_restore(*flags); + preempt_check_resched(); + + return 0; +} +EXPORT_SYMBOL(__spin_trylock_irqsave); + +int __lockfunc __read_trylock(raw_rwlock_t *lock) { preempt_disable(); if (_raw_read_trylock(lock)) { @@ -45,9 +82,9 @@ int __lockfunc _read_trylock(rwlock_t *l preempt_enable(); return 0; } -EXPORT_SYMBOL(_read_trylock); +EXPORT_SYMBOL(__read_trylock); -int __lockfunc _write_trylock(rwlock_t *lock) +int __lockfunc __write_trylock(raw_rwlock_t *lock) { preempt_disable(); if (_raw_write_trylock(lock)) { @@ -58,7 +95,21 @@ int __lockfunc _write_trylock(rwlock_t * preempt_enable(); return 0; } 
-EXPORT_SYMBOL(_write_trylock); +EXPORT_SYMBOL(__write_trylock); + +int __lockfunc __write_trylock_irqsave(raw_rwlock_t *lock, unsigned long *flags) +{ + int ret; + + local_irq_save(*flags); + ret = __write_trylock(lock); + if (ret) + return ret; + + local_irq_restore(*flags); + return 0; +} +EXPORT_SYMBOL(__write_trylock_irqsave); /* * If lockdep is enabled then we use the non-preemption spin-ops @@ -66,17 +117,17 @@ EXPORT_SYMBOL(_write_trylock); * not re-enabled during lock-acquire (which the preempt-spin-ops do): */ #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \ - defined(CONFIG_DEBUG_LOCK_ALLOC) + defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT) -void __lockfunc _read_lock(rwlock_t *lock) +void __lockfunc __read_lock(raw_rwlock_t *lock) { preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock); +EXPORT_SYMBOL(__read_lock); -unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock) +unsigned long __lockfunc __spin_lock_irqsave(raw_spinlock_t *lock) { unsigned long flags; @@ -95,27 +146,27 @@ unsigned long __lockfunc _spin_lock_irqs #endif return flags; } -EXPORT_SYMBOL(_spin_lock_irqsave); +EXPORT_SYMBOL(__spin_lock_irqsave); -void __lockfunc _spin_lock_irq(spinlock_t *lock) +void __lockfunc __spin_lock_irq(raw_spinlock_t *lock) { local_irq_disable(); preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock_irq); +EXPORT_SYMBOL(__spin_lock_irq); -void __lockfunc _spin_lock_bh(spinlock_t *lock) +void __lockfunc __spin_lock_bh(raw_spinlock_t *lock) { local_bh_disable(); preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock_bh); +EXPORT_SYMBOL(__spin_lock_bh); -unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock) +unsigned long __lockfunc __read_lock_irqsave(raw_rwlock_t *lock) { unsigned long flags; @@ -125,27 +176,27 @@ unsigned long __lockfunc _read_lock_irqs LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); return flags; } -EXPORT_SYMBOL(_read_lock_irqsave); +EXPORT_SYMBOL(__read_lock_irqsave); -void __lockfunc _read_lock_irq(rwlock_t *lock) +void __lockfunc __read_lock_irq(raw_rwlock_t *lock) { local_irq_disable(); preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock_irq); +EXPORT_SYMBOL(__read_lock_irq); -void __lockfunc _read_lock_bh(rwlock_t *lock) +void __lockfunc __read_lock_bh(raw_rwlock_t *lock) { local_bh_disable(); preempt_disable(); rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock); } -EXPORT_SYMBOL(_read_lock_bh); +EXPORT_SYMBOL(__read_lock_bh); -unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock) +unsigned long __lockfunc __write_lock_irqsave(raw_rwlock_t *lock) { unsigned long flags; @@ -155,43 +206,43 @@ unsigned long __lockfunc _write_lock_irq LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); return flags; } -EXPORT_SYMBOL(_write_lock_irqsave); +EXPORT_SYMBOL(__write_lock_irqsave); -void __lockfunc _write_lock_irq(rwlock_t *lock) +void __lockfunc __write_lock_irq(raw_rwlock_t *lock) { local_irq_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock_irq); 
+EXPORT_SYMBOL(__write_lock_irq); -void __lockfunc _write_lock_bh(rwlock_t *lock) +void __lockfunc __write_lock_bh(raw_rwlock_t *lock) { local_bh_disable(); preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock_bh); +EXPORT_SYMBOL(__write_lock_bh); -void __lockfunc _spin_lock(spinlock_t *lock) +void __lockfunc __spin_lock(raw_spinlock_t *lock) { preempt_disable(); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } -EXPORT_SYMBOL(_spin_lock); +EXPORT_SYMBOL(__spin_lock); -void __lockfunc _write_lock(rwlock_t *lock) +void __lockfunc __write_lock(raw_rwlock_t *lock) { preempt_disable(); rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock); } -EXPORT_SYMBOL(_write_lock); +EXPORT_SYMBOL(__write_lock); #else /* CONFIG_PREEMPT: */ @@ -204,7 +255,7 @@ EXPORT_SYMBOL(_write_lock); */ #define BUILD_LOCK_OPS(op, locktype) \ -void __lockfunc _##op##_lock(locktype##_t *lock) \ +void __lockfunc __##op##_lock(locktype##_t *lock) \ { \ for (;;) { \ preempt_disable(); \ @@ -214,15 +265,16 @@ void __lockfunc _##op##_lock(locktype##_ \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ - while (!op##_can_lock(lock) && (lock)->break_lock) \ - _raw_##op##_relax(&lock->raw_lock); \ + while (!__raw_##op##_can_lock(&(lock)->raw_lock) && \ + (lock)->break_lock) \ + __raw_##op##_relax(&lock->raw_lock); \ } \ (lock)->break_lock = 0; \ } \ \ -EXPORT_SYMBOL(_##op##_lock); \ +EXPORT_SYMBOL(__##op##_lock); \ \ -unsigned long __lockfunc _##op##_lock_irqsave(locktype##_t *lock) \ +unsigned long __lockfunc __##op##_lock_irqsave(locktype##_t *lock) \ { \ unsigned long flags; \ \ @@ -236,23 +288,24 @@ unsigned long __lockfunc _##op##_lock_ir \ if (!(lock)->break_lock) \ (lock)->break_lock = 1; \ - while (!op##_can_lock(lock) && (lock)->break_lock) \ - _raw_##op##_relax(&lock->raw_lock); \ + while (!__raw_##op##_can_lock(&(lock)->raw_lock) && \ + (lock)->break_lock) \ + __raw_##op##_relax(&lock->raw_lock); \ } \ (lock)->break_lock = 0; \ return flags; \ } \ \ -EXPORT_SYMBOL(_##op##_lock_irqsave); \ +EXPORT_SYMBOL(__##op##_lock_irqsave); \ \ -void __lockfunc _##op##_lock_irq(locktype##_t *lock) \ +void __lockfunc __##op##_lock_irq(locktype##_t *lock) \ { \ - _##op##_lock_irqsave(lock); \ + __##op##_lock_irqsave(lock); \ } \ \ -EXPORT_SYMBOL(_##op##_lock_irq); \ +EXPORT_SYMBOL(__##op##_lock_irq); \ \ -void __lockfunc _##op##_lock_bh(locktype##_t *lock) \ +void __lockfunc __##op##_lock_bh(locktype##_t *lock) \ { \ unsigned long flags; \ \ @@ -261,39 +314,40 @@ void __lockfunc _##op##_lock_bh(locktype /* irq-disabling. 
We use the generic preemption-aware */ \ /* function: */ \ /**/ \ - flags = _##op##_lock_irqsave(lock); \ + flags = __##op##_lock_irqsave(lock); \ local_bh_disable(); \ local_irq_restore(flags); \ } \ \ -EXPORT_SYMBOL(_##op##_lock_bh) +EXPORT_SYMBOL(__##op##_lock_bh) /* * Build preemption-friendly versions of the following * lock-spinning functions: * - * _[spin|read|write]_lock() - * _[spin|read|write]_lock_irq() - * _[spin|read|write]_lock_irqsave() - * _[spin|read|write]_lock_bh() + * __[spin|read|write]_lock() + * __[spin|read|write]_lock_irq() + * __[spin|read|write]_lock_irqsave() + * __[spin|read|write]_lock_bh() */ -BUILD_LOCK_OPS(spin, spinlock); -BUILD_LOCK_OPS(read, rwlock); -BUILD_LOCK_OPS(write, rwlock); +BUILD_LOCK_OPS(spin, raw_spinlock); +BUILD_LOCK_OPS(read, raw_rwlock); +BUILD_LOCK_OPS(write, raw_rwlock); #endif /* CONFIG_PREEMPT */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass) +void __lockfunc __spin_lock_nested(raw_spinlock_t *lock, int subclass) { preempt_disable(); spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock); } +EXPORT_SYMBOL(__spin_lock_nested); -EXPORT_SYMBOL(_spin_lock_nested); -unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass) +unsigned long __lockfunc +__spin_lock_irqsave_nested(raw_spinlock_t *lock, int subclass) { unsigned long flags; @@ -312,117 +366,130 @@ unsigned long __lockfunc _spin_lock_irqs #endif return flags; } - -EXPORT_SYMBOL(_spin_lock_irqsave_nested); +EXPORT_SYMBOL(__spin_lock_irqsave_nested); #endif -void __lockfunc _spin_unlock(spinlock_t *lock) +void __lockfunc __spin_unlock(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_spin_unlock); +EXPORT_SYMBOL(__spin_unlock); + +void __lockfunc __spin_unlock_no_resched(raw_spinlock_t *lock) +{ + spin_release(&lock->dep_map, 1, _RET_IP_); + _raw_spin_unlock(lock); + __preempt_enable_no_resched(); +} +/* not exported */ -void __lockfunc _write_unlock(rwlock_t *lock) +void __lockfunc __write_unlock(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_write_unlock); +EXPORT_SYMBOL(__write_unlock); -void __lockfunc _read_unlock(rwlock_t *lock) +void __lockfunc __read_unlock(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); preempt_enable(); } -EXPORT_SYMBOL(_read_unlock); +EXPORT_SYMBOL(__read_unlock); -void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) +void __lockfunc __spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_spin_unlock_irqrestore); +EXPORT_SYMBOL(__spin_unlock_irqrestore); -void __lockfunc _spin_unlock_irq(spinlock_t *lock) +void __lockfunc __spin_unlock_irq(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_spin_unlock_irq); +EXPORT_SYMBOL(__spin_unlock_irq); -void __lockfunc _spin_unlock_bh(spinlock_t *lock) +void __lockfunc __spin_unlock_bh(raw_spinlock_t *lock) { spin_release(&lock->dep_map, 1, _RET_IP_); _raw_spin_unlock(lock); - preempt_enable_no_resched(); + 
__preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_spin_unlock_bh); +EXPORT_SYMBOL(__spin_unlock_bh); -void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags) +void __lockfunc __read_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_read_unlock_irqrestore); +EXPORT_SYMBOL(__read_unlock_irqrestore); -void __lockfunc _read_unlock_irq(rwlock_t *lock) +void __lockfunc __read_unlock_irq(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_read_unlock_irq); +EXPORT_SYMBOL(__read_unlock_irq); -void __lockfunc _read_unlock_bh(rwlock_t *lock) +void __lockfunc __read_unlock_bh(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_read_unlock(lock); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_read_unlock_bh); +EXPORT_SYMBOL(__read_unlock_bh); -void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags) +void __lockfunc __write_unlock_irqrestore(raw_rwlock_t *lock, unsigned long flags) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); + __preempt_enable_no_resched(); local_irq_restore(flags); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_write_unlock_irqrestore); +EXPORT_SYMBOL(__write_unlock_irqrestore); -void __lockfunc _write_unlock_irq(rwlock_t *lock) +void __lockfunc __write_unlock_irq(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); + __preempt_enable_no_resched(); local_irq_enable(); - preempt_enable(); + preempt_check_resched(); } -EXPORT_SYMBOL(_write_unlock_irq); +EXPORT_SYMBOL(__write_unlock_irq); -void __lockfunc _write_unlock_bh(rwlock_t *lock) +void __lockfunc __write_unlock_bh(raw_rwlock_t *lock) { rwlock_release(&lock->dep_map, 1, _RET_IP_); _raw_write_unlock(lock); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); } -EXPORT_SYMBOL(_write_unlock_bh); +EXPORT_SYMBOL(__write_unlock_bh); -int __lockfunc _spin_trylock_bh(spinlock_t *lock) +int __lockfunc __spin_trylock_bh(raw_spinlock_t *lock) { local_bh_disable(); preempt_disable(); @@ -431,11 +498,11 @@ int __lockfunc _spin_trylock_bh(spinlock return 1; } - preempt_enable_no_resched(); + __preempt_enable_no_resched(); local_bh_enable_ip((unsigned long)__builtin_return_address(0)); return 0; } -EXPORT_SYMBOL(_spin_trylock_bh); +EXPORT_SYMBOL(__spin_trylock_bh); int in_lock_functions(unsigned long addr) { @@ -443,6 +510,17 @@ int in_lock_functions(unsigned long addr extern char __lock_text_start[], __lock_text_end[]; return addr >= (unsigned long)__lock_text_start - && addr < (unsigned long)__lock_text_end; + && addr < (unsigned long)__lock_text_end; } EXPORT_SYMBOL(in_lock_functions); + +void notrace __debug_atomic_dec_and_test(atomic_t *v) +{ + static int warn_once = 1; + + if (!atomic_read(v) && warn_once) { + warn_once = 0; + printk("BUG: atomic counter underflow!\n"); + WARN_ON(1); + } +} Index: linux-2.6.24.7/lib/dec_and_lock.c =================================================================== --- 
linux-2.6.24.7.orig/lib/dec_and_lock.c +++ linux-2.6.24.7/lib/dec_and_lock.c @@ -17,7 +17,7 @@ * because the spin-lock and the decrement must be * "atomic". */ -int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) +int __atomic_dec_and_spin_lock(atomic_t *atomic, raw_spinlock_t *lock) { #ifdef CONFIG_SMP /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */ @@ -32,4 +32,4 @@ int _atomic_dec_and_lock(atomic_t *atomi return 0; } -EXPORT_SYMBOL(_atomic_dec_and_lock); +EXPORT_SYMBOL(__atomic_dec_and_spin_lock); Index: linux-2.6.24.7/lib/kernel_lock.c =================================================================== --- linux-2.6.24.7.orig/lib/kernel_lock.c +++ linux-2.6.24.7/lib/kernel_lock.c @@ -24,7 +24,7 @@ * * Don't use in new code. */ -static DECLARE_MUTEX(kernel_sem); +DECLARE_MUTEX(kernel_sem); /* * Re-acquire the kernel semaphore. @@ -44,7 +44,7 @@ int __lockfunc __reacquire_kernel_lock(v BUG_ON(saved_lock_depth < 0); task->lock_depth = -1; - preempt_enable_no_resched(); + __preempt_enable_no_resched(); down(&kernel_sem); Index: linux-2.6.24.7/lib/locking-selftest.c =================================================================== --- linux-2.6.24.7.orig/lib/locking-selftest.c +++ linux-2.6.24.7/lib/locking-selftest.c @@ -940,6 +940,9 @@ static void dotest(void (*testcase_fn)(v { unsigned long saved_preempt_count = preempt_count(); int expected_failure = 0; +#if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_DEBUG_RT_MUTEXES) + int saved_lock_count = current->lock_count; +#endif WARN_ON(irqs_disabled()); @@ -989,6 +992,9 @@ static void dotest(void (*testcase_fn)(v #endif reset_locks(); +#if defined(CONFIG_DEBUG_PREEMPT) && defined(CONFIG_DEBUG_RT_MUTEXES) + current->lock_count = saved_lock_count; +#endif } static inline void print_testname(const char *testname) Index: linux-2.6.24.7/lib/plist.c =================================================================== --- linux-2.6.24.7.orig/lib/plist.c +++ linux-2.6.24.7/lib/plist.c @@ -53,7 +53,9 @@ static void plist_check_list(struct list static void plist_check_head(struct plist_head *head) { +#ifndef CONFIG_PREEMPT_RT WARN_ON(!head->lock); +#endif if (head->lock) WARN_ON_SMP(!spin_is_locked(head->lock)); plist_check_list(&head->prio_list); Index: linux-2.6.24.7/lib/rwsem-spinlock.c =================================================================== --- linux-2.6.24.7.orig/lib/rwsem-spinlock.c +++ linux-2.6.24.7/lib/rwsem-spinlock.c @@ -20,7 +20,7 @@ struct rwsem_waiter { /* * initialise the semaphore */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, +void __compat_init_rwsem(struct compat_rw_semaphore *sem, const char *name, struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -44,8 +44,8 @@ void __init_rwsem(struct rw_semaphore *s * - woken process blocks are discarded from the list after having task zeroed * - writers are only woken if wakewrite is non-zero */ -static inline struct rw_semaphore * -__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite) +static inline struct compat_rw_semaphore * +__rwsem_do_wake(struct compat_rw_semaphore *sem, int wakewrite) { struct rwsem_waiter *waiter; struct task_struct *tsk; @@ -103,8 +103,8 @@ __rwsem_do_wake(struct rw_semaphore *sem /* * wake a single writer */ -static inline struct rw_semaphore * -__rwsem_wake_one_writer(struct rw_semaphore *sem) +static inline struct compat_rw_semaphore * +__rwsem_wake_one_writer(struct compat_rw_semaphore *sem) { struct rwsem_waiter *waiter; struct task_struct *tsk; @@ -125,7 +125,7 @@ 
__rwsem_wake_one_writer(struct rw_semaph /* * get a read lock on the semaphore */ -void fastcall __sched __down_read(struct rw_semaphore *sem) +void fastcall __sched __down_read(struct compat_rw_semaphore *sem) { struct rwsem_waiter waiter; struct task_struct *tsk; @@ -168,7 +168,7 @@ void fastcall __sched __down_read(struct /* * trylock for reading -- returns 1 if successful, 0 if contention */ -int fastcall __down_read_trylock(struct rw_semaphore *sem) +int fastcall __down_read_trylock(struct compat_rw_semaphore *sem) { unsigned long flags; int ret = 0; @@ -191,7 +191,8 @@ int fastcall __down_read_trylock(struct * get a write lock on the semaphore * - we increment the waiting count anyway to indicate an exclusive lock */ -void fastcall __sched __down_write_nested(struct rw_semaphore *sem, int subclass) +void fastcall __sched +__down_write_nested(struct compat_rw_semaphore *sem, int subclass) { struct rwsem_waiter waiter; struct task_struct *tsk; @@ -231,7 +232,7 @@ void fastcall __sched __down_write_neste ; } -void fastcall __sched __down_write(struct rw_semaphore *sem) +void fastcall __sched __down_write(struct compat_rw_semaphore *sem) { __down_write_nested(sem, 0); } @@ -239,7 +240,7 @@ void fastcall __sched __down_write(struc /* * trylock for writing -- returns 1 if successful, 0 if contention */ -int fastcall __down_write_trylock(struct rw_semaphore *sem) +int fastcall __down_write_trylock(struct compat_rw_semaphore *sem) { unsigned long flags; int ret = 0; @@ -260,7 +261,7 @@ int fastcall __down_write_trylock(struct /* * release a read lock on the semaphore */ -void fastcall __up_read(struct rw_semaphore *sem) +void fastcall __up_read(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -275,7 +276,7 @@ void fastcall __up_read(struct rw_semaph /* * release a write lock on the semaphore */ -void fastcall __up_write(struct rw_semaphore *sem) +void fastcall __up_write(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -292,7 +293,7 @@ void fastcall __up_write(struct rw_semap * downgrade a write lock into a read lock * - just wake up any readers at the front of the queue */ -void fastcall __downgrade_write(struct rw_semaphore *sem) +void fastcall __downgrade_write(struct compat_rw_semaphore *sem) { unsigned long flags; @@ -305,7 +306,7 @@ void fastcall __downgrade_write(struct r spin_unlock_irqrestore(&sem->wait_lock, flags); } -EXPORT_SYMBOL(__init_rwsem); +EXPORT_SYMBOL(__compat_init_rwsem); EXPORT_SYMBOL(__down_read); EXPORT_SYMBOL(__down_read_trylock); EXPORT_SYMBOL(__down_write_nested); Index: linux-2.6.24.7/lib/rwsem.c =================================================================== --- linux-2.6.24.7.orig/lib/rwsem.c +++ linux-2.6.24.7/lib/rwsem.c @@ -11,8 +11,8 @@ /* * Initialize an rwsem: */ -void __init_rwsem(struct rw_semaphore *sem, const char *name, - struct lock_class_key *key) +void __compat_init_rwsem(struct rw_semaphore *sem, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -26,7 +26,7 @@ void __init_rwsem(struct rw_semaphore *s INIT_LIST_HEAD(&sem->wait_list); } -EXPORT_SYMBOL(__init_rwsem); +EXPORT_SYMBOL(__compat_init_rwsem); struct rwsem_waiter { struct list_head list; Index: linux-2.6.24.7/lib/semaphore-sleepers.c =================================================================== --- linux-2.6.24.7.orig/lib/semaphore-sleepers.c +++ linux-2.6.24.7/lib/semaphore-sleepers.c @@ -15,6 +15,7 @@ #include <linux/sched.h> #include <linux/err.h> #include <linux/init.h> +#include <linux/module.h> #include 
<asm/semaphore.h> /* @@ -48,12 +49,12 @@ * we cannot lose wakeup events. */ -fastcall void __up(struct semaphore *sem) +fastcall void __compat_up(struct compat_semaphore *sem) { wake_up(&sem->wait); } -fastcall void __sched __down(struct semaphore * sem) +fastcall void __sched __compat_down(struct compat_semaphore * sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -90,7 +91,7 @@ fastcall void __sched __down(struct sema tsk->state = TASK_RUNNING; } -fastcall int __sched __down_interruptible(struct semaphore * sem) +fastcall int __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -153,7 +154,7 @@ fastcall int __sched __down_interruptibl * single "cmpxchg" without failure cases, * but then it wouldn't work on a 386. */ -fastcall int __down_trylock(struct semaphore * sem) +fastcall int __compat_down_trylock(struct compat_semaphore * sem) { int sleepers; unsigned long flags; @@ -174,3 +175,10 @@ fastcall int __down_trylock(struct semap spin_unlock_irqrestore(&sem->wait.lock, flags); return 1; } + +int fastcall compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} + +EXPORT_SYMBOL(compat_sem_is_locked); Index: linux-2.6.24.7/lib/spinlock_debug.c =================================================================== --- linux-2.6.24.7.orig/lib/spinlock_debug.c +++ linux-2.6.24.7/lib/spinlock_debug.c @@ -13,8 +13,8 @@ #include <linux/delay.h> #include <linux/module.h> -void __spin_lock_init(spinlock_t *lock, const char *name, - struct lock_class_key *key) +void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -23,16 +23,16 @@ void __spin_lock_init(spinlock_t *lock, debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->raw_lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + lock->raw_lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; lock->magic = SPINLOCK_MAGIC; lock->owner = SPINLOCK_OWNER_INIT; lock->owner_cpu = -1; } -EXPORT_SYMBOL(__spin_lock_init); +EXPORT_SYMBOL(__raw_spin_lock_init); -void __rwlock_init(rwlock_t *lock, const char *name, - struct lock_class_key *key) +void __raw_rwlock_init(raw_rwlock_t *lock, const char *name, + struct lock_class_key *key) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* @@ -41,15 +41,15 @@ void __rwlock_init(rwlock_t *lock, const debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->raw_lock = (raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED; + lock->raw_lock = (__raw_rwlock_t) __RAW_RW_LOCK_UNLOCKED; lock->magic = RWLOCK_MAGIC; lock->owner = SPINLOCK_OWNER_INIT; lock->owner_cpu = -1; } -EXPORT_SYMBOL(__rwlock_init); +EXPORT_SYMBOL(__raw_rwlock_init); -static void spin_bug(spinlock_t *lock, const char *msg) +static void spin_bug(raw_spinlock_t *lock, const char *msg) { struct task_struct *owner = NULL; @@ -73,7 +73,7 @@ static void spin_bug(spinlock_t *lock, c #define SPIN_BUG_ON(cond, lock, msg) if (unlikely(cond)) spin_bug(lock, msg) static inline void -debug_spin_lock_before(spinlock_t *lock) +debug_spin_lock_before(raw_spinlock_t *lock) { SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic"); SPIN_BUG_ON(lock->owner == current, lock, "recursion"); @@ -81,13 +81,13 @@ debug_spin_lock_before(spinlock_t *lock) lock, "cpu recursion"); } -static inline void debug_spin_lock_after(spinlock_t *lock) +static inline void 
debug_spin_lock_after(raw_spinlock_t *lock) { lock->owner_cpu = raw_smp_processor_id(); lock->owner = current; } -static inline void debug_spin_unlock(spinlock_t *lock) +static inline void debug_spin_unlock(raw_spinlock_t *lock) { SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic"); SPIN_BUG_ON(!spin_is_locked(lock), lock, "already unlocked"); @@ -98,7 +98,7 @@ static inline void debug_spin_unlock(spi lock->owner_cpu = -1; } -static void __spin_lock_debug(spinlock_t *lock) +static void __spin_lock_debug(raw_spinlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -125,7 +125,7 @@ static void __spin_lock_debug(spinlock_t } } -void _raw_spin_lock(spinlock_t *lock) +void __lockfunc _raw_spin_lock(raw_spinlock_t *lock) { debug_spin_lock_before(lock); if (unlikely(!__raw_spin_trylock(&lock->raw_lock))) @@ -133,7 +133,7 @@ void _raw_spin_lock(spinlock_t *lock) debug_spin_lock_after(lock); } -int _raw_spin_trylock(spinlock_t *lock) +int __lockfunc _raw_spin_trylock(raw_spinlock_t *lock) { int ret = __raw_spin_trylock(&lock->raw_lock); @@ -148,13 +148,13 @@ int _raw_spin_trylock(spinlock_t *lock) return ret; } -void _raw_spin_unlock(spinlock_t *lock) +void __lockfunc _raw_spin_unlock(raw_spinlock_t *lock) { debug_spin_unlock(lock); __raw_spin_unlock(&lock->raw_lock); } -static void rwlock_bug(rwlock_t *lock, const char *msg) +static void rwlock_bug(raw_rwlock_t *lock, const char *msg) { if (!debug_locks_off()) return; @@ -167,8 +167,8 @@ static void rwlock_bug(rwlock_t *lock, c #define RWLOCK_BUG_ON(cond, lock, msg) if (unlikely(cond)) rwlock_bug(lock, msg) -#if 0 /* __write_lock_debug() can lock up - maybe this can too? */ -static void __read_lock_debug(rwlock_t *lock) +#if 1 /* __write_lock_debug() can lock up - maybe this can too? */ +static void __raw_read_lock_debug(raw_rwlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -193,13 +193,13 @@ static void __read_lock_debug(rwlock_t * } #endif -void _raw_read_lock(rwlock_t *lock) +void __lockfunc _raw_read_lock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); - __raw_read_lock(&lock->raw_lock); + __raw_read_lock_debug(lock); } -int _raw_read_trylock(rwlock_t *lock) +int __lockfunc _raw_read_trylock(raw_rwlock_t *lock) { int ret = __raw_read_trylock(&lock->raw_lock); @@ -212,13 +212,13 @@ int _raw_read_trylock(rwlock_t *lock) return ret; } -void _raw_read_unlock(rwlock_t *lock) +void __lockfunc _raw_read_unlock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); __raw_read_unlock(&lock->raw_lock); } -static inline void debug_write_lock_before(rwlock_t *lock) +static inline void debug_write_lock_before(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); RWLOCK_BUG_ON(lock->owner == current, lock, "recursion"); @@ -226,13 +226,13 @@ static inline void debug_write_lock_befo lock, "cpu recursion"); } -static inline void debug_write_lock_after(rwlock_t *lock) +static inline void debug_write_lock_after(raw_rwlock_t *lock) { lock->owner_cpu = raw_smp_processor_id(); lock->owner = current; } -static inline void debug_write_unlock(rwlock_t *lock) +static inline void debug_write_unlock(raw_rwlock_t *lock) { RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic"); RWLOCK_BUG_ON(lock->owner != current, lock, "wrong owner"); @@ -242,8 +242,8 @@ static inline void debug_write_unlock(rw lock->owner_cpu = -1; } -#if 0 /* This can cause lockups */ -static void __write_lock_debug(rwlock_t *lock) +#if 1 /* This can cause lockups */ 
+static void __raw_write_lock_debug(raw_rwlock_t *lock) { u64 i; u64 loops = loops_per_jiffy * HZ; @@ -268,14 +268,14 @@ static void __write_lock_debug(rwlock_t } #endif -void _raw_write_lock(rwlock_t *lock) +void __lockfunc _raw_write_lock(raw_rwlock_t *lock) { debug_write_lock_before(lock); - __raw_write_lock(&lock->raw_lock); + __raw_write_lock_debug(lock); debug_write_lock_after(lock); } -int _raw_write_trylock(rwlock_t *lock) +int __lockfunc _raw_write_trylock(raw_rwlock_t *lock) { int ret = __raw_write_trylock(&lock->raw_lock); @@ -290,7 +290,7 @@ int _raw_write_trylock(rwlock_t *lock) return ret; } -void _raw_write_unlock(rwlock_t *lock) +void __lockfunc _raw_write_unlock(raw_rwlock_t *lock) { debug_write_unlock(lock); __raw_write_unlock(&lock->raw_lock); �����������������������������������������������������������patches/rt-mutex-trylock-export.patch���������������������������������������������������������������0000664�0000764�0000764�00000007321�11041657732�017060� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-kernel-owner@vger.kernel.org Wed May 23 01:44:17 2007 Return-Path: <linux-kernel-owner+tglx=40linutronix.de-S1759353AbXEVXoG@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id 32C4A65C3E9 for <tglx@linutronix.de>; Wed, 23 May 2007 01:44:17 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759353AbXEVXoG (ORCPT <rfc822;tglx@linutronix.de>); Tue, 22 May 2007 19:44:06 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1757791AbXEVXn4 (ORCPT <rfc822;linux-kernel-outgoing>); Tue, 22 May 2007 19:43:56 -0400 Received: from rwcrmhc11.comcast.net ([204.127.192.81]:35206 "EHLO rwcrmhc11.comcast.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757669AbXEVXn4 (ORCPT <rfc822;Linux-kernel@vger.kernel.org>); Tue, 22 May 2007 19:43:56 -0400 Received: from sx.thebigcorporation.com ([69.181.45.228]) by comcast.net (rwcrmhc11) with ESMTP id <20070522233624m1100rg2vge>; Tue, 22 May 2007 23:36:29 +0000 Received: from sx.thebigcorporation.com (localhost.localdomain [127.0.0.1]) by sx.thebigcorporation.com (8.13.8/8.13.8) with ESMTP id l4MNaKHv029409; Tue, 22 May 2007 16:36:20 -0700 Received: (from sven@localhost) by sx.thebigcorporation.com (8.13.8/8.13.8/Submit) id l4MNaJIn029408; Tue, 22 May 2007 16:36:19 -0700 X-Authentication-Warning: sx.thebigcorporation.com: sven set sender to sven@thebigcorporation.com using -f Subject: [PATCH] 2.6.21-rt6 From: Sven-Thorsten Dietrich <sven@thebigcorporation.com> To: LKML <Linux-kernel@vger.kernel.org> Cc: Ingo Molnar <mingo@elte.hu> In-Reply-To: <1179874795.25500.40.camel@sx.thebigcorporation.com> References: <1179874795.25500.40.camel@sx.thebigcorporation.com> Content-Type: text/plain Organization: The Big Corporation Date: Tue, 22 May 2007 16:36:19 -0700 Message-Id: <1179876979.25500.54.camel@sx.thebigcorporation.com> Mime-Version: 1.0 X-Mailer: Evolution 2.8.3 (2.8.3-2.fc6) Sender: linux-kernel-owner@vger.kernel.org 
Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org X-Filter-To: .Kernel.LKML X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit On Tue, 2007-05-22 at 15:59 -0700, Sven-Thorsten Dietrich wrote: > Add <correct> > header and export for rt_write_trylock_irqsave. Disregard the last patch, flags parameter was missing in the header. --- include/linux/spinlock.h | 2 ++ kernel/rt.c | 1 + 2 files changed, 3 insertions(+) Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -294,6 +294,8 @@ do { \ extern void __lockfunc rt_write_lock(rwlock_t *rwlock); extern void __lockfunc rt_read_lock(rwlock_t *rwlock); extern int __lockfunc rt_write_trylock(rwlock_t *rwlock); +extern int __lockfunc rt_write_trylock_irqsave(rwlock_t *trylock, + unsigned long *flags); extern int __lockfunc rt_read_trylock(rwlock_t *rwlock); extern void __lockfunc rt_write_unlock(rwlock_t *rwlock); extern void __lockfunc rt_read_unlock(rwlock_t *rwlock); Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -172,6 +172,7 @@ int __lockfunc rt_write_trylock_irqsave( *flags = 0; return rt_write_trylock(rwlock); } +EXPORT_SYMBOL(rt_write_trylock_irqsave); int __lockfunc rt_read_trylock(rwlock_t *rwlock) { ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-spinlock-might-sleep.patch���������������������������������������������������������0000664�0000764�0000764�00000004471�11041657735�020116� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From rostedt@goodmis.org Sat Jun 2 00:35:54 2007 Return-Path: <rostedt@goodmis.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=ham version=3.1.7-deb Received: from ms-smtp-01.nyroc.rr.com (ms-smtp-01.nyroc.rr.com [24.24.2.55]) by mail.tglx.de (Postfix) with ESMTP id C420E65C065 for <tglx@linutronix.de>; Sat, 2 Jun 2007 00:35:54 +0200 (CEST) Received: from [192.168.23.10] (cpe-24-94-51-176.stny.res.rr.com [24.94.51.176]) by ms-smtp-01.nyroc.rr.com (8.13.6/8.13.6) with ESMTP id l51MZLun018065; Fri, 1 Jun 2007 18:35:24 -0400 (EDT) Subject: [PATCH RT] add might_sleep in rt_spin_lock_fastlock From: Steven Rostedt <rostedt@goodmis.org> To: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de>, Arnaldo Carvalho de Melo <acme@ghostprotocols.net>, LKML <linux-kernel@vger.kernel.org> Content-Type: multipart/mixed; boundary="=-jgTmng/RcFNHiVaU9w/Z" Date: Fri, 01 Jun 2007 18:35:21 -0400 Message-Id: <1180737321.21781.46.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 X-Virus-Scanned: Symantec AntiVirus Scan Engine X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ 
--=-jgTmng/RcFNHiVaU9w/Z Content-Type: text/plain Content-Transfer-Encoding: 8bit Ingo, Every so often we get bit by a bug "scheduling in atomic", and it comes from a rtmutex spin_lock. The bug only happens when that lock has contention, so we miss it a lot. This patch adds a might_sleep() to the rt_spin_lock_fastlock to find bugs where we can schedule in atomic. The one place that exists now is from do_page_fault and sending a signal. I wrote a simple crash program that segfaults (attached) and with this patch, I get the warning. -- Steve Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/rtmutex.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -631,6 +631,8 @@ static inline void rt_spin_lock_fastlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { + might_sleep(); + if (likely(rt_mutex_cmpxchg(lock, NULL, current))) rt_mutex_deadlock_account_lock(lock, current); else �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-i386.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000055354�11041657731�014553� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/Kconfig | 13 ++++++- arch/x86/kernel/apm_32.c | 2 - arch/x86/kernel/entry_32.S | 4 +- arch/x86/kernel/i386_ksyms_32.c | 12 ++++--- arch/x86/kernel/process_32.c | 10 +++--- arch/x86/lib/semaphore_32.S | 24 +++++++------- include/asm-x86/rwsem.h | 41 ++++++++++++------------ include/asm-x86/semaphore_32.h | 65 +++++++++++++++++++++++---------------- include/asm-x86/spinlock_32.h | 36 ++++++++++----------- include/asm-x86/spinlock_types.h | 4 +- include/asm-x86/thread_info_32.h | 3 + 11 files changed, 121 insertions(+), 93 deletions(-) Index: linux-2.6.24.7/arch/x86/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/x86/Kconfig +++ linux-2.6.24.7/arch/x86/Kconfig @@ -96,10 +96,19 @@ config DMI default y config RWSEM_GENERIC_SPINLOCK - def_bool !X86_XADD + bool + depends on !X86_XADD || PREEMPT_RT + default y + +config ASM_SEMAPHORES + bool + default y + config RWSEM_XCHGADD_ALGORITHM - def_bool X86_XADD + bool + depends on X86_XADD && !RWSEM_GENERIC_SPINLOCK + default y config ARCH_HAS_ILOG2_U32 def_bool n Index: linux-2.6.24.7/arch/x86/kernel/apm_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/apm_32.c +++ linux-2.6.24.7/arch/x86/kernel/apm_32.c @@ -783,7 +783,7 @@ static int apm_do_idle(void) */ smp_mb(); } - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { idled = 1; ret = apm_bios_call_simple(APM_FUNC_IDLE, 0, 0, &eax); } Index: linux-2.6.24.7/arch/x86/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_32.S +++ linux-2.6.24.7/arch/x86/kernel/entry_32.S @@ -481,7 +481,7 @@ 
ENDPROC(system_call) ALIGN RING0_PTREGS_FRAME # can't unwind into user space anyway work_pending: - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED), %ecx jz work_notifysig work_resched: call schedule @@ -494,7 +494,7 @@ work_resched: andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? jz restore_all - testb $_TIF_NEED_RESCHED, %cl + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED), %ecx jnz work_resched work_notifysig: # deal with pending signals and Index: linux-2.6.24.7/arch/x86/kernel/i386_ksyms_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i386_ksyms_32.c +++ linux-2.6.24.7/arch/x86/kernel/i386_ksyms_32.c @@ -10,10 +10,12 @@ EXPORT_SYMBOL(mcount); #endif -EXPORT_SYMBOL(__down_failed); -EXPORT_SYMBOL(__down_failed_interruptible); -EXPORT_SYMBOL(__down_failed_trylock); -EXPORT_SYMBOL(__up_wakeup); +#ifdef CONFIG_ASM_SEMAPHORES +EXPORT_SYMBOL(__compat_down_failed); +EXPORT_SYMBOL(__compat_down_failed_interruptible); +EXPORT_SYMBOL(__compat_down_failed_trylock); +EXPORT_SYMBOL(__compat_up_wakeup); +#endif /* Networking helper routines. */ EXPORT_SYMBOL(csum_partial_copy_generic); @@ -28,7 +30,7 @@ EXPORT_SYMBOL(__put_user_8); EXPORT_SYMBOL(strstr); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) && defined(CONFIG_ASM_SEMAPHORES) extern void FASTCALL( __write_lock_failed(rwlock_t *rw)); extern void FASTCALL( __read_lock_failed(rwlock_t *rw)); EXPORT_SYMBOL(__write_lock_failed); Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -113,7 +113,7 @@ void default_idle(void) smp_mb(); local_irq_disable(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) safe_halt(); /* enables interrupts racelessly */ else local_irq_enable(); @@ -178,7 +178,7 @@ void cpu_idle(void) /* endless idle loop with no priority at all */ while (1) { tick_nohz_stop_sched_tick(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); if (__get_cpu_var(cpu_idle_state)) @@ -201,7 +201,7 @@ void cpu_idle(void) start_critical_timings(); } tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); } @@ -260,10 +260,10 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait); */ void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) __mwait(eax, ecx); } } Index: linux-2.6.24.7/arch/x86/lib/semaphore_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/semaphore_32.S +++ linux-2.6.24.7/arch/x86/lib/semaphore_32.S @@ -30,7 +30,7 @@ * value or just clobbered.. 
*/ .section .sched.text -ENTRY(__down_failed) +ENTRY(__compat_down_failed) CFI_STARTPROC FRAME pushl %edx @@ -39,7 +39,7 @@ ENTRY(__down_failed) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down + call __compat_down popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -49,9 +49,9 @@ ENTRY(__down_failed) ENDFRAME ret CFI_ENDPROC - END(__down_failed) + END(__compat_down_failed) -ENTRY(__down_failed_interruptible) +ENTRY(__compat_down_failed_interruptible) CFI_STARTPROC FRAME pushl %edx @@ -60,7 +60,7 @@ ENTRY(__down_failed_interruptible) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down_interruptible + call __compat_down_interruptible popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -70,9 +70,9 @@ ENTRY(__down_failed_interruptible) ENDFRAME ret CFI_ENDPROC - END(__down_failed_interruptible) + END(__compat_down_failed_interruptible) -ENTRY(__down_failed_trylock) +ENTRY(__compat_down_failed_trylock) CFI_STARTPROC FRAME pushl %edx @@ -81,7 +81,7 @@ ENTRY(__down_failed_trylock) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __down_trylock + call __compat_down_trylock popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -91,9 +91,9 @@ ENTRY(__down_failed_trylock) ENDFRAME ret CFI_ENDPROC - END(__down_failed_trylock) + END(__compat_down_failed_trylock) -ENTRY(__up_wakeup) +ENTRY(__compat_up_wakeup) CFI_STARTPROC FRAME pushl %edx @@ -102,7 +102,7 @@ ENTRY(__up_wakeup) pushl %ecx CFI_ADJUST_CFA_OFFSET 4 CFI_REL_OFFSET ecx,0 - call __up + call __compat_up popl %ecx CFI_ADJUST_CFA_OFFSET -4 CFI_RESTORE ecx @@ -112,7 +112,7 @@ ENTRY(__up_wakeup) ENDFRAME ret CFI_ENDPROC - END(__up_wakeup) + END(__compat_up_wakeup) /* * rw spinlock fallbacks Index: linux-2.6.24.7/include/asm-x86/rwsem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/rwsem.h +++ linux-2.6.24.7/include/asm-x86/rwsem.h @@ -44,15 +44,15 @@ struct rwsem_waiter; -extern struct rw_semaphore *FASTCALL(rwsem_down_read_failed(struct rw_semaphore *sem)); -extern struct rw_semaphore *FASTCALL(rwsem_down_write_failed(struct rw_semaphore *sem)); -extern struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *)); -extern struct rw_semaphore *FASTCALL(rwsem_downgrade_wake(struct rw_semaphore *sem)); +extern struct compat_rw_semaphore *FASTCALL(rwsem_down_read_failed(struct compat_rw_semaphore *sem)); +extern struct compat_rw_semaphore *FASTCALL(rwsem_down_write_failed(struct compat_rw_semaphore *sem)); +extern struct compat_rw_semaphore *FASTCALL(rwsem_wake(struct compat_rw_semaphore *)); +extern struct compat_rw_semaphore *FASTCALL(rwsem_downgrade_wake(struct compat_rw_semaphore *sem)); /* * the semaphore definition */ -struct rw_semaphore { +struct compat_rw_semaphore { signed long count; #define RWSEM_UNLOCKED_VALUE 0x00000000 #define RWSEM_ACTIVE_BIAS 0x00000001 @@ -78,23 +78,23 @@ struct rw_semaphore { { RWSEM_UNLOCKED_VALUE, __SPIN_LOCK_UNLOCKED((name).wait_lock), \ LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __RWSEM_INITIALIZER(name) -extern void __init_rwsem(struct rw_semaphore *sem, const char *name, +extern void __compat_init_rwsem(struct rw_semaphore *sem, const char *name, struct lock_class_key *key); -#define init_rwsem(sem) \ +#define compat_init_rwsem(sem) \ do { \ static struct lock_class_key __key; \ \ - __init_rwsem((sem), #sem, 
&__key); \ + __compat_init_rwsem((sem), #sem, &__key); \ } while (0) /* * lock for reading */ -static inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct compat_rw_semaphore *sem) { __asm__ __volatile__( "# beginning down_read\n\t" @@ -111,7 +111,7 @@ LOCK_PREFIX " incl (%%eax)\n\t" /* /* * trylock for reading -- returns 1 if successful, 0 if contention */ -static inline int __down_read_trylock(struct rw_semaphore *sem) +static inline int __down_read_trylock(struct compat_rw_semaphore *sem) { __s32 result, tmp; __asm__ __volatile__( @@ -134,7 +134,8 @@ LOCK_PREFIX " cmpxchgl %2,%0\n\t" /* * lock for writing */ -static inline void __down_write_nested(struct rw_semaphore *sem, int subclass) +static inline void +__down_write_nested(struct compat_rw_semaphore *sem, int subclass) { int tmp; @@ -160,7 +161,7 @@ static inline void __down_write(struct r /* * trylock for writing -- returns 1 if successful, 0 if contention */ -static inline int __down_write_trylock(struct rw_semaphore *sem) +static inline int __down_write_trylock(struct compat_rw_semaphore *sem) { signed long ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, @@ -173,7 +174,7 @@ static inline int __down_write_trylock(s /* * unlock after reading */ -static inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct compat_rw_semaphore *sem) { __s32 tmp = -RWSEM_ACTIVE_READ_BIAS; __asm__ __volatile__( @@ -191,7 +192,7 @@ LOCK_PREFIX " xadd %%edx,(%%eax)\n /* * unlock after writing */ -static inline void __up_write(struct rw_semaphore *sem) +static inline void __up_write(struct compat_rw_semaphore *sem) { __asm__ __volatile__( "# beginning __up_write\n\t" @@ -209,7 +210,7 @@ LOCK_PREFIX " xaddl %%edx,(%%eax)\n /* * downgrade write lock to read lock */ -static inline void __downgrade_write(struct rw_semaphore *sem) +static inline void __downgrade_write(struct compat_rw_semaphore *sem) { __asm__ __volatile__( "# beginning __downgrade_write\n\t" @@ -226,7 +227,7 @@ LOCK_PREFIX " addl %2,(%%eax)\n\t" /* * implement atomic add functionality */ -static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem) +static inline void rwsem_atomic_add(int delta, struct compat_rw_semaphore *sem) { __asm__ __volatile__( LOCK_PREFIX "addl %1,%0" @@ -237,7 +238,7 @@ LOCK_PREFIX "addl %1,%0" /* * implement exchange and add functionality */ -static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem) +static inline int rwsem_atomic_update(int delta, struct compat_rw_semaphore *sem) { int tmp = delta; @@ -249,7 +250,7 @@ LOCK_PREFIX "xadd %0,%1" return tmp+delta; } -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int compat_rwsem_is_locked(struct rw_semaphore *sem) { return (sem->count != 0); } Index: linux-2.6.24.7/include/asm-x86/semaphore_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/semaphore_32.h +++ linux-2.6.24.7/include/asm-x86/semaphore_32.h @@ -3,8 +3,6 @@ #include <linux/linkage.h> -#ifdef __KERNEL__ - /* * SMP- and interrupt-safe semaphores.. 
* @@ -41,29 +39,39 @@ #include <linux/wait.h> #include <linux/rwsem.h> -struct semaphore { +/* + * On !PREEMPT_RT all semaphores are compat: + */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +struct compat_semaphore { atomic_t count; int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .sleepers = 0, \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name,count) +#define __COMPAT_MUTEX_INITIALIZER(name) \ + __COMPAT_SEMAPHORE_INITIALIZER(name,1) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -static inline void sema_init (struct semaphore *sem, int val) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,1) + +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { /* - * *sem = (struct semaphore)__SEMAPHORE_INITIALIZER((*sem),val); + * *sem = (struct compat_semaphore)__SEMAPHORE_INITIALIZER((*sem),val); * * i'd rather use the more flexible initialization above, but sadly * GCC 2.7.2.3 emits a bogus warning. EGCS doesn't. Oh well. @@ -73,27 +81,27 @@ static inline void sema_init (struct sem init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX (struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED (struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -fastcall void __down_failed(void /* special register calling convention */); -fastcall int __down_failed_interruptible(void /* params in registers */); -fastcall int __down_failed_trylock(void /* params in registers */); -fastcall void __up_wakeup(void /* special register calling convention */); +fastcall void __compat_down_failed(void /* special register calling convention */); +fastcall int __compat_down_failed_interruptible(void /* params in registers */); +fastcall int __compat_down_failed_trylock(void /* params in registers */); +fastcall void __compat_up_wakeup(void /* special register calling convention */); /* * This is ugly, but we want the default case to fall through. * "__down_failed" is a special asm handler that calls the C * routine that actually waits. See arch/i386/kernel/semaphore.c */ -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); __asm__ __volatile__( @@ -101,7 +109,7 @@ static inline void down(struct semaphore LOCK_PREFIX "decl %0\n\t" /* --sem->count */ "jns 2f\n" "\tlea %0,%%eax\n\t" - "call __down_failed\n" + "call __compat_down_failed\n" "2:" :"+m" (sem->count) : @@ -112,7 +120,7 @@ static inline void down(struct semaphore * Interruptible try to acquire a semaphore. If we obtained * it, return zero. 
If we were interrupted, returns -EINTR */ -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int result; @@ -123,7 +131,7 @@ static inline int down_interruptible(str LOCK_PREFIX "decl %1\n\t" /* --sem->count */ "jns 2f\n\t" "lea %1,%%eax\n\t" - "call __down_failed_interruptible\n" + "call __compat_down_failed_interruptible\n" "2:" :"=&a" (result), "+m" (sem->count) : @@ -135,7 +143,7 @@ static inline int down_interruptible(str * Non-blockingly attempt to down() a semaphore. * Returns zero if we acquired it */ -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { int result; @@ -145,7 +153,7 @@ static inline int down_trylock(struct se LOCK_PREFIX "decl %1\n\t" /* --sem->count */ "jns 2f\n\t" "lea %1,%%eax\n\t" - "call __down_failed_trylock\n\t" + "call __compat_down_failed_trylock\n\t" "2:\n" :"=&a" (result), "+m" (sem->count) : @@ -157,19 +165,24 @@ static inline int down_trylock(struct se * Note! This is subtle. We jump to wake people up only if * the semaphore was negative (== somebody was waiting on it). */ -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { __asm__ __volatile__( "# atomic up operation\n\t" LOCK_PREFIX "incl %0\n\t" /* ++sem->count */ "jg 1f\n\t" "lea %0,%%eax\n\t" - "call __up_wakeup\n" + "call __compat_up_wakeup\n" "1:" :"+m" (sem->count) : :"memory","ax"); } -#endif +extern int FASTCALL(compat_sem_is_locked(struct compat_semaphore *sem)); + +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +#include <linux/semaphore.h> + #endif Index: linux-2.6.24.7/include/asm-x86/spinlock_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_32.h +++ linux-2.6.24.7/include/asm-x86/spinlock_32.h @@ -27,12 +27,12 @@ * (the type definitions are in asm/spinlock_types.h) */ -static inline int __raw_spin_is_locked(raw_spinlock_t *x) +static inline int __raw_spin_is_locked(__raw_spinlock_t *x) { return *(volatile signed char *)(&(x)->slock) <= 0; } -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline void __raw_spin_lock(__raw_spinlock_t *lock) { asm volatile("\n1:\t" LOCK_PREFIX " ; decb %0\n\t" @@ -55,7 +55,7 @@ static inline void __raw_spin_lock(raw_s * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use this variant. 
*/ #ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { asm volatile( "\n1:\t" @@ -84,7 +84,7 @@ static inline void __raw_spin_lock_flags } #endif -static inline int __raw_spin_trylock(raw_spinlock_t *lock) +static inline int __raw_spin_trylock(__raw_spinlock_t *lock) { char oldval; asm volatile( @@ -103,14 +103,14 @@ static inline int __raw_spin_trylock(raw #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory"); } #else -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { char oldval = 1; @@ -121,7 +121,7 @@ static inline void __raw_spin_unlock(raw #endif -static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock) +static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) { while (__raw_spin_is_locked(lock)) cpu_relax(); @@ -152,7 +152,7 @@ static inline void __raw_spin_unlock_wai * read_can_lock - would read_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_read_can_lock(raw_rwlock_t *x) +static inline int __raw_read_can_lock(__raw_rwlock_t *x) { return (int)(x)->lock > 0; } @@ -161,12 +161,12 @@ static inline int __raw_read_can_lock(ra * write_can_lock - would write_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_write_can_lock(raw_rwlock_t *x) +static inline int __raw_write_can_lock(__raw_rwlock_t *x) { return (x)->lock == RW_LOCK_BIAS; } -static inline void __raw_read_lock(raw_rwlock_t *rw) +static inline void __raw_read_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" "jns 1f\n" @@ -175,7 +175,7 @@ static inline void __raw_read_lock(raw_r ::"a" (rw) : "memory"); } -static inline void __raw_write_lock(raw_rwlock_t *rw) +static inline void __raw_write_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" "jz 1f\n" @@ -184,7 +184,7 @@ static inline void __raw_write_lock(raw_ ::"a" (rw) : "memory"); } -static inline int __raw_read_trylock(raw_rwlock_t *lock) +static inline int __raw_read_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; atomic_dec(count); @@ -194,7 +194,7 @@ static inline int __raw_read_trylock(raw return 0; } -static inline int __raw_write_trylock(raw_rwlock_t *lock) +static inline int __raw_write_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; if (atomic_sub_and_test(RW_LOCK_BIAS, count)) @@ -203,19 +203,19 @@ static inline int __raw_write_trylock(ra return 0; } -static inline void __raw_read_unlock(raw_rwlock_t *rw) +static inline void __raw_read_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory"); } -static inline void __raw_write_unlock(raw_rwlock_t *rw) +static inline void __raw_write_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0" : "+m" (rw->lock) : : "memory"); } -#define _raw_spin_relax(lock) cpu_relax() -#define _raw_read_relax(lock) cpu_relax() -#define _raw_write_relax(lock) cpu_relax() +#define __raw_spin_relax(lock) cpu_relax() +#define __raw_read_relax(lock) cpu_relax() +#define __raw_write_relax(lock) cpu_relax() #endif /* __ASM_SPINLOCK_H */ Index: 
linux-2.6.24.7/include/asm-x86/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_types.h +++ linux-2.6.24.7/include/asm-x86/spinlock_types.h @@ -7,13 +7,13 @@ typedef struct { unsigned int slock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 1 } typedef struct { unsigned int lock; -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { RW_LOCK_BIAS } Index: linux-2.6.24.7/include/asm-x86/thread_info_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/thread_info_32.h +++ linux-2.6.24.7/include/asm-x86/thread_info_32.h @@ -132,15 +132,18 @@ static inline struct thread_info *curren #define TIF_SYSCALL_AUDIT 6 /* syscall auditing active */ #define TIF_SECCOMP 7 /* secure computing */ #define TIF_RESTORE_SIGMASK 8 /* restore signal mask in do_signal() */ +#define TIF_NEED_RESCHED_DELAYED 10 /* reschedule on return to userspace */ #define TIF_MEMDIE 16 #define TIF_DEBUG 17 /* uses debug registers */ #define TIF_IO_BITMAP 18 /* uses I/O bitmap */ #define TIF_FREEZE 19 /* is freezing for suspend */ #define TIF_NOTSC 20 /* TSC is not accessible in userland */ + #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) #define _TIF_SIGPENDING (1<<TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP) #define _TIF_IRET (1<<TIF_IRET) #define _TIF_SYSCALL_EMU (1<<TIF_SYSCALL_EMU) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-mips.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000014704�11041657733�015026� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/mips/Kconfig | 15 +++++++++++---- arch/mips/kernel/Makefile | 4 +++- include/asm-mips/atomic.h | 26 +++++++++++++++++++++----- include/asm-mips/semaphore.h | 30 +++++++++++++++++++++--------- 4 files changed, 56 insertions(+), 19 deletions(-) Index: linux-2.6.24.7/arch/mips/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/mips/Kconfig +++ linux-2.6.24.7/arch/mips/Kconfig @@ -52,6 +52,7 @@ config BCM47XX select CEVT_R4K select CSRC_R4K select DMA_NONCOHERENT + select NO_SPINLOCK select HW_HAS_PCI select IRQ_CPU select SYS_HAS_CPU_MIPS32_R1 @@ -703,10 +704,17 @@ endmenu config RWSEM_GENERIC_SPINLOCK bool + depends on !PREEMPT_RT default y config RWSEM_XCHGADD_ALGORITHM bool + depends on !PREEMPT_RT + +config ASM_SEMAPHORES + bool +# depends on !PREEMPT_RT + default y config ARCH_HAS_ILOG2_U32 bool @@ -808,6 +816,9 @@ config DMA_NONCOHERENT config DMA_NEED_PCI_MAP_STATE bool +config NO_SPINLOCK + bool + config EARLY_PRINTK bool "Early printk" if EMBEDDED && DEBUG_KERNEL depends on SYS_HAS_EARLY_PRINTK @@ -1889,10 +1900,6 @@ config SECCOMP endmenu -config RWSEM_GENERIC_SPINLOCK - bool - default y - 
config LOCKDEP_SUPPORT bool default y Index: linux-2.6.24.7/arch/mips/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/Makefile +++ linux-2.6.24.7/arch/mips/kernel/Makefile @@ -5,7 +5,7 @@ extra-y := head.o init_task.o vmlinux.lds obj-y += cpu-probe.o branch.o entry.o genex.o irq.o process.o \ - ptrace.o reset.o semaphore.o setup.o signal.o syscall.o \ + ptrace.o reset.o setup.o signal.o syscall.o \ time.o topology.o traps.o unaligned.o obj-$(CONFIG_CEVT_BCM1480) += cevt-bcm1480.o @@ -26,6 +26,8 @@ obj-$(CONFIG_MODULES) += mips_ksyms.o m obj-$(CONFIG_CPU_LOONGSON2) += r4k_fpu.o r4k_switch.o obj-$(CONFIG_CPU_MIPS32) += r4k_fpu.o r4k_switch.o obj-$(CONFIG_CPU_MIPS64) += r4k_fpu.o r4k_switch.o +obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o + obj-$(CONFIG_CPU_R3000) += r2300_fpu.o r2300_switch.o obj-$(CONFIG_CPU_R4000) += r4k_fpu.o r4k_switch.o obj-$(CONFIG_CPU_R4300) += r4k_fpu.o r4k_switch.o Index: linux-2.6.24.7/include/asm-mips/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/atomic.h +++ linux-2.6.24.7/include/asm-mips/atomic.h @@ -171,7 +171,9 @@ static __inline__ int atomic_add_return( : "=&r" (result), "=&r" (temp), "=m" (v->counter) : "Ir" (i), "m" (v->counter) : "memory"); - } else { + } +#if !defined(CONFIG_NO_SPINLOCK) && !defined(CONFIG_PREEMPT_RT) + else { unsigned long flags; raw_local_irq_save(flags); @@ -180,6 +182,7 @@ static __inline__ int atomic_add_return( v->counter = result; raw_local_irq_restore(flags); } +#endif smp_llsc_mb(); @@ -223,7 +226,9 @@ static __inline__ int atomic_sub_return( : "=&r" (result), "=&r" (temp), "=m" (v->counter) : "Ir" (i), "m" (v->counter) : "memory"); - } else { + } +#if !defined(CONFIG_NO_SPINLOCK) && !defined(CONFIG_PREEMPT_RT) + else { unsigned long flags; raw_local_irq_save(flags); @@ -232,6 +237,7 @@ static __inline__ int atomic_sub_return( v->counter = result; raw_local_irq_restore(flags); } +#endif smp_llsc_mb(); @@ -291,7 +297,9 @@ static __inline__ int atomic_sub_if_posi : "=&r" (result), "=&r" (temp), "=m" (v->counter) : "Ir" (i), "m" (v->counter) : "memory"); - } else { + } +#if !defined(CONFIG_NO_SPINLOCK) && !defined(CONFIG_PREEMPT_RT) + else { unsigned long flags; raw_local_irq_save(flags); @@ -301,6 +309,7 @@ static __inline__ int atomic_sub_if_posi v->counter = result; raw_local_irq_restore(flags); } +#endif smp_llsc_mb(); @@ -552,7 +561,9 @@ static __inline__ long atomic64_add_retu : "=&r" (result), "=&r" (temp), "=m" (v->counter) : "Ir" (i), "m" (v->counter) : "memory"); - } else { + } +#if !defined(CONFIG_NO_SPINLOCK) && !defined(CONFIG_PREEMPT_RT) + else { unsigned long flags; raw_local_irq_save(flags); @@ -561,6 +572,8 @@ static __inline__ long atomic64_add_retu v->counter = result; raw_local_irq_restore(flags); } +#endif +#endif smp_llsc_mb(); @@ -604,7 +617,9 @@ static __inline__ long atomic64_sub_retu : "=&r" (result), "=&r" (temp), "=m" (v->counter) : "Ir" (i), "m" (v->counter) : "memory"); - } else { + } +#if !defined(CONFIG_NO_SPINLOCK) && !defined(CONFIG_PREEMPT_RT) + else { unsigned long flags; raw_local_irq_save(flags); @@ -682,6 +697,7 @@ static __inline__ long atomic64_sub_if_p v->counter = result; raw_local_irq_restore(flags); } +#endif smp_llsc_mb(); Index: linux-2.6.24.7/include/asm-mips/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/semaphore.h +++ linux-2.6.24.7/include/asm-mips/semaphore.h @@ 
-24,12 +24,20 @@ #ifdef __KERNEL__ -#include <asm/atomic.h> -#include <asm/system.h> #include <linux/wait.h> #include <linux/rwsem.h> -struct semaphore { +/* + * On !PREEMPT_RT all semaphores are compat: + */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +#include <asm/atomic.h> +#include <asm/system.h> + +struct compat_semaphore { /* * Note that any negative value of count is equivalent to 0, * but additionally indicates that some process(es) might be @@ -78,31 +86,35 @@ static inline void down(struct semaphore * Try to get the semaphore, take the slow path if we fail. */ if (unlikely(atomic_dec_return(&sem->count) < 0)) - __down(sem); + __compat_down(sem); } -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int ret = 0; might_sleep(); if (unlikely(atomic_dec_return(&sem->count) < 0)) - ret = __down_interruptible(sem); + ret = __compat_down_interruptible(sem); return ret; } -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { return atomic_dec_if_positive(&sem->count) < 0; } -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { if (unlikely(atomic_inc_return(&sem->count) <= 0)) - __up(sem); + __compat_up(sem); } +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +#include <linux/semaphore.h> + #endif /* __KERNEL__ */ #endif /* __ASM_SEMAPHORE_H */ ������������������������������������������������������������patches/rt-mutex-ppc.patch��������������������������������������������������������������������������0000664�0000764�0000764�00000062460�11043037055�014631� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/Kconfig | 19 +++++++----- arch/powerpc/kernel/Makefile | 3 + arch/powerpc/kernel/ppc_ksyms.c | 1 arch/powerpc/kernel/semaphore.c | 20 ++++++++---- arch/powerpc/lib/locks.c | 4 +- arch/ppc/Kconfig | 19 +++++++----- arch/ppc/kernel/entry.S | 4 +- arch/ppc/kernel/semaphore.c | 13 +++++--- arch/ppc/lib/locks.c | 38 ++++++++++++------------ drivers/macintosh/adb.c | 10 +++--- include/asm-powerpc/rwsem.h | 38 ++++++++++++------------ include/asm-powerpc/semaphore.h | 54 ++++++++++++++++++++++------------- include/asm-powerpc/spinlock.h | 38 ++++++++++++------------ include/asm-powerpc/spinlock_types.h | 4 +- 14 files changed, 151 insertions(+), 114 deletions(-) Index: linux-2.6.24.7/arch/powerpc/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/Kconfig +++ linux-2.6.24.7/arch/powerpc/Kconfig @@ -47,13 +47,6 @@ config IRQ_PER_CPU bool default y -config RWSEM_GENERIC_SPINLOCK - bool - -config RWSEM_XCHGADD_ALGORITHM - bool - default y - config ARCH_HAS_ILOG2_U32 bool default y @@ -177,6 +170,18 @@ config HIGHMEM source kernel/time/Kconfig source kernel/Kconfig.hz source kernel/Kconfig.preempt + +config RWSEM_GENERIC_SPINLOCK + bool + default y + +config ASM_SEMAPHORES + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + bool + source "fs/Kconfig.binfmt" # We optimistically allocate largepages from the VM, so make the limit Index: 
linux-2.6.24.7/arch/powerpc/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/Makefile +++ linux-2.6.24.7/arch/powerpc/kernel/Makefile @@ -22,11 +22,12 @@ endif endif -obj-y := semaphore.o cputable.o ptrace.o syscalls.o \ +obj-y := cputable.o ptrace.o syscalls.o \ irq.o align.o signal_32.o pmc.o vdso.o \ init_task.o process.o systbl.o idle.o \ signal.o obj-y += vdso32/ +obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o obj-$(CONFIG_PPC64) += setup_64.o binfmt_elf32.o sys_ppc32.o \ signal_64.o ptrace32.o \ paca.o cpu_setup_ppc970.o \ Index: linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ppc_ksyms.c +++ linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c @@ -15,7 +15,6 @@ #include <linux/bitops.h> #include <asm/page.h> -#include <asm/semaphore.h> #include <asm/processor.h> #include <asm/cacheflush.h> #include <asm/uaccess.h> Index: linux-2.6.24.7/arch/powerpc/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/semaphore.c +++ linux-2.6.24.7/arch/powerpc/kernel/semaphore.c @@ -31,7 +31,7 @@ * sem->count = tmp; * return old_count; */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -50,7 +50,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -63,7 +63,7 @@ void __up(struct semaphore *sem) __sem_update_count(sem, 1); wake_up(&sem->wait); } -EXPORT_SYMBOL(__up); +EXPORT_SYMBOL(__compat_up); /* * Note that when we come in to __down or __down_interruptible, @@ -73,7 +73,7 @@ EXPORT_SYMBOL(__up); * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. 
*/ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -101,9 +101,9 @@ void __sched __down(struct semaphore *se */ wake_up(&sem->wait); } -EXPORT_SYMBOL(__down); +EXPORT_SYMBOL(__compat_down); -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore *sem) { int retval = 0; struct task_struct *tsk = current; @@ -132,4 +132,10 @@ int __sched __down_interruptible(struct wake_up(&sem->wait); return retval; } -EXPORT_SYMBOL(__down_interruptible); +EXPORT_SYMBOL(__compat_down_interruptible); + +int compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} +EXPORT_SYMBOL(compat_sem_is_locked); Index: linux-2.6.24.7/arch/powerpc/lib/locks.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/lib/locks.c +++ linux-2.6.24.7/arch/powerpc/lib/locks.c @@ -25,7 +25,7 @@ #include <asm/smp.h> #include <asm/firmware.h> -void __spin_yield(raw_spinlock_t *lock) +void __spin_yield(__raw_spinlock_t *lock) { unsigned int lock_value, holder_cpu, yield_count; @@ -82,7 +82,7 @@ void __rw_yield(raw_rwlock_t *rw) } #endif -void __raw_spin_unlock_wait(raw_spinlock_t *lock) +void __raw_spin_unlock_wait(__raw_spinlock_t *lock) { while (lock->slock) { HMT_low(); Index: linux-2.6.24.7/arch/ppc/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/ppc/Kconfig +++ linux-2.6.24.7/arch/ppc/Kconfig @@ -16,13 +16,6 @@ config GENERIC_HARDIRQS bool default y -config RWSEM_GENERIC_SPINLOCK - bool - -config RWSEM_XCHGADD_ALGORITHM - bool - default y - config ARCH_HAS_ILOG2_U32 bool default y @@ -979,6 +972,18 @@ config ARCH_POPULATES_NODE_MAP source kernel/Kconfig.hz source kernel/Kconfig.preempt + +config RWSEM_GENERIC_SPINLOCK + bool + default y + +config ASM_SEMAPHORES + bool + default y + +config RWSEM_XCHGADD_ALGORITHM + bool + source "mm/Kconfig" source "fs/Kconfig.binfmt" Index: linux-2.6.24.7/arch/ppc/kernel/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/ppc/kernel/entry.S +++ linux-2.6.24.7/arch/ppc/kernel/entry.S @@ -892,7 +892,7 @@ global_dbcr0: #endif /* !(CONFIG_4xx || CONFIG_BOOKE) */ do_work: /* r10 contains MSR_KERNEL here */ - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beq do_user_signal do_resched: /* r10 contains MSR_KERNEL here */ @@ -906,7 +906,7 @@ recheck: MTMSRD(r10) /* disable interrupts */ rlwinm r9,r1,0,0,18 lwz r9,TI_FLAGS(r9) - andi. r0,r9,_TIF_NEED_RESCHED + andi. r0,r9,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne- do_resched andi. 
r0,r9,_TIF_SIGPENDING beq restore_user Index: linux-2.6.24.7/arch/ppc/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/kernel/semaphore.c +++ linux-2.6.24.7/arch/ppc/kernel/semaphore.c @@ -29,7 +29,7 @@ * sem->count = tmp; * return old_count; */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -48,7 +48,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -70,7 +70,7 @@ void __up(struct semaphore *sem) * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. */ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -100,7 +100,7 @@ void __sched __down(struct semaphore *se wake_up(&sem->wait); } -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -129,3 +129,8 @@ int __sched __down_interruptible(struct wake_up(&sem->wait); return retval; } + +int compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} Index: linux-2.6.24.7/arch/ppc/lib/locks.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/lib/locks.c +++ linux-2.6.24.7/arch/ppc/lib/locks.c @@ -42,7 +42,7 @@ static inline unsigned long __spin_trylo return ret; } -void _raw_spin_lock(spinlock_t *lock) +void __raw_spin_lock(raw_spinlock_t *lock) { int cpu = smp_processor_id(); unsigned int stuck = INIT_STUCK; @@ -62,9 +62,9 @@ void _raw_spin_lock(spinlock_t *lock) lock->owner_pc = (unsigned long)__builtin_return_address(0); lock->owner_cpu = cpu; } -EXPORT_SYMBOL(_raw_spin_lock); +EXPORT_SYMBOL(__raw_spin_lock); -int _raw_spin_trylock(spinlock_t *lock) +int __raw_spin_trylock(raw_spinlock_t *lock) { if (__spin_trylock(&lock->lock)) return 0; @@ -72,9 +72,9 @@ int _raw_spin_trylock(spinlock_t *lock) lock->owner_pc = (unsigned long)__builtin_return_address(0); return 1; } -EXPORT_SYMBOL(_raw_spin_trylock); +EXPORT_SYMBOL(__raw_spin_trylock); -void _raw_spin_unlock(spinlock_t *lp) +void __raw_spin_unlock(raw_spinlock_t *lp) { if ( !lp->lock ) printk("_spin_unlock(%p): no lock cpu %d curr PC %p %s/%d\n", @@ -88,13 +88,13 @@ void _raw_spin_unlock(spinlock_t *lp) wmb(); lp->lock = 0; } -EXPORT_SYMBOL(_raw_spin_unlock); +EXPORT_SYMBOL(__raw_spin_unlock); /* * For rwlocks, zero is unlocked, -1 is write-locked, * positive is read-locked. 
*/ -static __inline__ int __read_trylock(rwlock_t *rw) +static __inline__ int __read_trylock(raw_rwlock_t *rw) { signed int tmp; @@ -114,13 +114,13 @@ static __inline__ int __read_trylock(rwl return tmp; } -int _raw_read_trylock(rwlock_t *rw) +int __raw_read_trylock(raw_rwlock_t *rw) { return __read_trylock(rw) > 0; } -EXPORT_SYMBOL(_raw_read_trylock); +EXPORT_SYMBOL(__raw_read_trylock); -void _raw_read_lock(rwlock_t *rw) +void __raw_read_lock(rwlock_t *rw) { unsigned int stuck; @@ -135,9 +135,9 @@ void _raw_read_lock(rwlock_t *rw) } } } -EXPORT_SYMBOL(_raw_read_lock); +EXPORT_SYMBOL(__raw_read_lock); -void _raw_read_unlock(rwlock_t *rw) +void __raw_read_unlock(raw_rwlock_t *rw) { if ( rw->lock == 0 ) printk("_read_unlock(): %s/%d (nip %08lX) lock %d\n", @@ -146,9 +146,9 @@ void _raw_read_unlock(rwlock_t *rw) wmb(); atomic_dec((atomic_t *) &(rw)->lock); } -EXPORT_SYMBOL(_raw_read_unlock); +EXPORT_SYMBOL(__raw_read_unlock); -void _raw_write_lock(rwlock_t *rw) +void __raw_write_lock(raw_rwlock_t *rw) { unsigned int stuck; @@ -164,18 +164,18 @@ void _raw_write_lock(rwlock_t *rw) } wmb(); } -EXPORT_SYMBOL(_raw_write_lock); +EXPORT_SYMBOL(__raw_write_lock); -int _raw_write_trylock(rwlock_t *rw) +int __raw_write_trylock(raw_rwlock_t *rw) { if (cmpxchg(&rw->lock, 0, -1) != 0) return 0; wmb(); return 1; } -EXPORT_SYMBOL(_raw_write_trylock); +EXPORT_SYMBOL(__raw_write_trylock); -void _raw_write_unlock(rwlock_t *rw) +void __raw_write_unlock(raw_rwlock_t *rw) { if (rw->lock >= 0) printk("_write_lock(): %s/%d (nip %08lX) lock %d\n", @@ -184,6 +184,6 @@ void _raw_write_unlock(rwlock_t *rw) wmb(); rw->lock = 0; } -EXPORT_SYMBOL(_raw_write_unlock); +EXPORT_SYMBOL(__raw_write_unlock); #endif Index: linux-2.6.24.7/drivers/macintosh/adb.c =================================================================== --- linux-2.6.24.7.orig/drivers/macintosh/adb.c +++ linux-2.6.24.7/drivers/macintosh/adb.c @@ -250,6 +250,8 @@ adb_probe_task(void *x) { strcpy(current->comm, "kadbprobe"); + down(&adb_probe_mutex); + printk(KERN_INFO "adb: starting probe task...\n"); do_adb_reset_bus(); printk(KERN_INFO "adb: finished probe task...\n"); @@ -276,7 +278,9 @@ adb_reset_bus(void) return 0; } - down(&adb_probe_mutex); + if (adb_got_sleep) + return 0; + schedule_work(&adb_reset_work); return 0; } @@ -339,9 +343,8 @@ adb_notify_sleep(struct pmu_sleep_notifi { switch (when) { case PBOOK_SLEEP_REQUEST: + /* Signal to discontiue probing */ adb_got_sleep = 1; - /* We need to get a lock on the probe thread */ - down(&adb_probe_mutex); /* Stop autopoll */ if (adb_controller->autopoll) adb_controller->autopoll(0); @@ -350,7 +353,6 @@ adb_notify_sleep(struct pmu_sleep_notifi break; case PBOOK_WAKE: adb_got_sleep = 0; - up(&adb_probe_mutex); adb_reset_bus(); break; } Index: linux-2.6.24.7/include/asm-powerpc/rwsem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/rwsem.h +++ linux-2.6.24.7/include/asm-powerpc/rwsem.h @@ -21,7 +21,7 @@ /* * the semaphore definition */ -struct rw_semaphore { +struct compat_rw_semaphore { /* XXX this should be able to be an atomic_t -- paulus */ signed int count; #define RWSEM_UNLOCKED_VALUE 0x00000000 @@ -30,7 +30,7 @@ struct rw_semaphore { #define RWSEM_WAITING_BIAS (-0x00010000) #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) - spinlock_t wait_lock; + raw_spinlock_t wait_lock; struct list_head wait_list; }; @@ -38,15 +38,15 @@ struct rw_semaphore { { 
RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \ LIST_HEAD_INIT((name).wait_list) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __RWSEM_INITIALIZER(name) -extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_read_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_write_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_wake(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_downgrade_wake(struct compat_rw_semaphore *sem); -static inline void init_rwsem(struct rw_semaphore *sem) +static inline void compat_init_rwsem(struct compat_rw_semaphore *sem) { sem->count = RWSEM_UNLOCKED_VALUE; spin_lock_init(&sem->wait_lock); @@ -56,13 +56,13 @@ static inline void init_rwsem(struct rw_ /* * lock for reading */ -static inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct compat_rw_semaphore *sem) { if (unlikely(atomic_inc_return((atomic_t *)(&sem->count)) <= 0)) rwsem_down_read_failed(sem); } -static inline int __down_read_trylock(struct rw_semaphore *sem) +static inline int __down_read_trylock(struct compat_rw_semaphore *sem) { int tmp; @@ -78,7 +78,7 @@ static inline int __down_read_trylock(st /* * lock for writing */ -static inline void __down_write(struct rw_semaphore *sem) +static inline void __down_write(struct compat_rw_semaphore *sem) { int tmp; @@ -88,7 +88,7 @@ static inline void __down_write(struct r rwsem_down_write_failed(sem); } -static inline int __down_write_trylock(struct rw_semaphore *sem) +static inline int __down_write_trylock(struct compat_rw_semaphore *sem) { int tmp; @@ -100,7 +100,7 @@ static inline int __down_write_trylock(s /* * unlock after reading */ -static inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct compat_rw_semaphore *sem) { int tmp; @@ -112,7 +112,7 @@ static inline void __up_read(struct rw_s /* * unlock after writing */ -static inline void __up_write(struct rw_semaphore *sem) +static inline void __up_write(struct compat_rw_semaphore *sem) { if (unlikely(atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS, (atomic_t *)(&sem->count)) < 0)) @@ -122,7 +122,7 @@ static inline void __up_write(struct rw_ /* * implement atomic add functionality */ -static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem) +static inline void rwsem_atomic_add(int delta, struct compat_rw_semaphore *sem) { atomic_add(delta, (atomic_t *)(&sem->count)); } @@ -130,7 +130,7 @@ static inline void rwsem_atomic_add(int /* * downgrade write lock to read lock */ -static inline void __downgrade_write(struct rw_semaphore *sem) +static inline void __downgrade_write(struct compat_rw_semaphore *sem) { int tmp; @@ -142,12 +142,12 @@ static inline void __downgrade_write(str /* * implement exchange and add functionality */ -static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem) +static inline int rwsem_atomic_update(int delta, struct compat_rw_semaphore *sem) { return atomic_add_return(delta, (atomic_t *)(&sem->count)); } -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int 
compat_rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->count != 0); } Index: linux-2.6.24.7/include/asm-powerpc/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/semaphore.h +++ linux-2.6.24.7/include/asm-powerpc/semaphore.h @@ -15,48 +15,58 @@ #include <linux/wait.h> #include <linux/rwsem.h> -struct semaphore { +/* + * On !PREEMPT_RT all sempahores are compat + */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +struct compat_semaphore { /* * Note that any negative value of count is equivalent to 0, * but additionally indicates that some process(es) might be * sleeping on `wait'. */ atomic_t count; + int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name, count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name,count) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name, 1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, 1) -static inline void sema_init (struct semaphore *sem, int val) +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { atomic_set(&sem->count, val); init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX (struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED (struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -extern void __down(struct semaphore * sem); -extern int __down_interruptible(struct semaphore * sem); -extern void __up(struct semaphore * sem); +extern void __compat_down(struct compat_semaphore * sem); +extern int __compat_down_interruptible(struct compat_semaphore * sem); +extern void __compat_up(struct compat_semaphore * sem); + +extern int compat_sem_is_locked(struct compat_semaphore *sem); -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); @@ -64,31 +74,35 @@ static inline void down(struct semaphore * Try to get the semaphore, take the slow path if we fail. 
*/ if (unlikely(atomic_dec_return(&sem->count) < 0)) - __down(sem); + __compat_down(sem); } -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int ret = 0; might_sleep(); if (unlikely(atomic_dec_return(&sem->count) < 0)) - ret = __down_interruptible(sem); + ret = __compat_down_interruptible(sem); return ret; } -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { return atomic_dec_if_positive(&sem->count) < 0; } -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { if (unlikely(atomic_inc_return(&sem->count) <= 0)) - __up(sem); + __compat_up(sem); } +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +#include <linux/semaphore.h> + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_SEMAPHORE_H */ Index: linux-2.6.24.7/include/asm-powerpc/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/spinlock.h +++ linux-2.6.24.7/include/asm-powerpc/spinlock.h @@ -53,7 +53,7 @@ * This returns the old value in the lock, so we succeeded * in getting the lock if the return value is 0. */ -static __inline__ unsigned long __spin_trylock(raw_spinlock_t *lock) +static __inline__ unsigned long ___raw_spin_trylock(__raw_spinlock_t *lock) { unsigned long tmp, token; @@ -72,10 +72,10 @@ static __inline__ unsigned long __spin_t return tmp; } -static int __inline__ __raw_spin_trylock(raw_spinlock_t *lock) +static int __inline__ __raw_spin_trylock(__raw_spinlock_t *lock) { CLEAR_IO_SYNC; - return __spin_trylock(lock) == 0; + return ___raw_spin_trylock(lock) == 0; } /* @@ -95,19 +95,19 @@ static int __inline__ __raw_spin_trylock #if defined(CONFIG_PPC_SPLPAR) || defined(CONFIG_PPC_ISERIES) /* We only yield to the hypervisor if we are in shared processor mode */ #define SHARED_PROCESSOR (get_lppaca()->shared_proc) -extern void __spin_yield(raw_spinlock_t *lock); -extern void __rw_yield(raw_rwlock_t *lock); +extern void __spin_yield(__raw_spinlock_t *lock); +extern void __rw_yield(__raw_rwlock_t *lock); #else /* SPLPAR || ISERIES */ #define __spin_yield(x) barrier() #define __rw_yield(x) barrier() #define SHARED_PROCESSOR 0 #endif -static void __inline__ __raw_spin_lock(raw_spinlock_t *lock) +static void __inline__ __raw_spin_lock(__raw_spinlock_t *lock) { CLEAR_IO_SYNC; while (1) { - if (likely(__spin_trylock(lock) == 0)) + if (likely(___raw_spin_trylock(lock) == 0)) break; do { HMT_low(); @@ -118,13 +118,13 @@ static void __inline__ __raw_spin_lock(r } } -static void __inline__ __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +static void __inline__ __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { unsigned long flags_dis; CLEAR_IO_SYNC; while (1) { - if (likely(__spin_trylock(lock) == 0)) + if (likely(___raw_spin_trylock(lock) == 0)) break; local_save_flags(flags_dis); local_irq_restore(flags); @@ -138,7 +138,7 @@ static void __inline__ __raw_spin_lock_f } } -static __inline__ void __raw_spin_unlock(raw_spinlock_t *lock) +static __inline__ void __raw_spin_unlock(__raw_spinlock_t *lock) { SYNC_IO; __asm__ __volatile__("# __raw_spin_unlock\n\t" @@ -147,7 +147,7 @@ static __inline__ void __raw_spin_unlock } #ifdef CONFIG_PPC64 -extern void __raw_spin_unlock_wait(raw_spinlock_t *lock); +extern void __raw_spin_unlock_wait(__raw_spinlock_t *lock); #else #define __raw_spin_unlock_wait(lock) \ do { 
while (__raw_spin_is_locked(lock)) cpu_relax(); } while (0) @@ -179,7 +179,7 @@ extern void __raw_spin_unlock_wait(raw_s * This returns the old value in the lock + 1, * so we got a read lock if the return value is > 0. */ -static long __inline__ __read_trylock(raw_rwlock_t *rw) +static long __inline__ __read_trylock(__raw_rwlock_t *rw) { long tmp; @@ -203,7 +203,7 @@ static long __inline__ __read_trylock(ra * This returns the old value in the lock, * so we got the write lock if the return value is 0. */ -static __inline__ long __write_trylock(raw_rwlock_t *rw) +static __inline__ long __write_trylock(__raw_rwlock_t *rw) { long tmp, token; @@ -223,7 +223,7 @@ static __inline__ long __write_trylock(r return tmp; } -static void __inline__ __raw_read_lock(raw_rwlock_t *rw) +static void __inline__ __raw_read_lock(__raw_rwlock_t *rw) { while (1) { if (likely(__read_trylock(rw) > 0)) @@ -237,7 +237,7 @@ static void __inline__ __raw_read_lock(r } } -static void __inline__ __raw_write_lock(raw_rwlock_t *rw) +static void __inline__ __raw_write_lock(__raw_rwlock_t *rw) { while (1) { if (likely(__write_trylock(rw) == 0)) @@ -251,17 +251,17 @@ static void __inline__ __raw_write_lock( } } -static int __inline__ __raw_read_trylock(raw_rwlock_t *rw) +static int __inline__ __raw_read_trylock(__raw_rwlock_t *rw) { return __read_trylock(rw) > 0; } -static int __inline__ __raw_write_trylock(raw_rwlock_t *rw) +static int __inline__ __raw_write_trylock(__raw_rwlock_t *rw) { return __write_trylock(rw) == 0; } -static void __inline__ __raw_read_unlock(raw_rwlock_t *rw) +static void __inline__ __raw_read_unlock(__raw_rwlock_t *rw) { long tmp; @@ -278,7 +278,7 @@ static void __inline__ __raw_read_unlock : "cr0", "memory"); } -static __inline__ void __raw_write_unlock(raw_rwlock_t *rw) +static __inline__ void __raw_write_unlock(__raw_rwlock_t *rw) { __asm__ __volatile__("# write_unlock\n\t" LWSYNC_ON_SMP: : :"memory"); Index: linux-2.6.24.7/include/asm-powerpc/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/spinlock_types.h +++ linux-2.6.24.7/include/asm-powerpc/spinlock_types.h @@ -7,13 +7,13 @@ typedef struct { volatile unsigned int slock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 0 } typedef struct { volatile signed int lock; -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { 0 } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-ppc-fix-a5.patch�������������������������������������������������������������������0000664�0000764�0000764�00000005072�11041657734�015726� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following compile error by changing names from __{read,write}_trylock to ___raw_{read,write}_trylock in asm-powerpc/spinlock.h - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - include/asm-powerpc/spinlock.h include/linux/spinlock_api_smp.h:49: error: conflicting types for '__read_trylock' include/asm/spinlock.h:183: error: previous definition of '__read_trylock' 
was here include/linux/spinlock_api_smp.h:50: error: conflicting types for '__write_trylock' include/asm/spinlock.h:207: error: previous definition of '__write_trylock' was here - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- include/asm-powerpc/spinlock.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/include/asm-powerpc/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/spinlock.h +++ linux-2.6.24.7/include/asm-powerpc/spinlock.h @@ -179,7 +179,7 @@ extern void __raw_spin_unlock_wait(__raw * This returns the old value in the lock + 1, * so we got a read lock if the return value is > 0. */ -static long __inline__ __read_trylock(__raw_rwlock_t *rw) +static long __inline__ ___raw_read_trylock(__raw_rwlock_t *rw) { long tmp; @@ -203,7 +203,7 @@ static long __inline__ __read_trylock(__ * This returns the old value in the lock, * so we got the write lock if the return value is 0. */ -static __inline__ long __write_trylock(__raw_rwlock_t *rw) +static __inline__ long ___raw_write_trylock(__raw_rwlock_t *rw) { long tmp, token; @@ -226,7 +226,7 @@ static __inline__ long __write_trylock(_ static void __inline__ __raw_read_lock(__raw_rwlock_t *rw) { while (1) { - if (likely(__read_trylock(rw) > 0)) + if (likely(___raw_read_trylock(rw) > 0)) break; do { HMT_low(); @@ -240,7 +240,7 @@ static void __inline__ __raw_read_lock(_ static void __inline__ __raw_write_lock(__raw_rwlock_t *rw) { while (1) { - if (likely(__write_trylock(rw) == 0)) + if (likely(___raw_write_trylock(rw) == 0)) break; do { HMT_low(); @@ -253,12 +253,12 @@ static void __inline__ __raw_write_lock( static int __inline__ __raw_read_trylock(__raw_rwlock_t *rw) { - return __read_trylock(rw) > 0; + return ___raw_read_trylock(rw) > 0; } static int __inline__ __raw_write_trylock(__raw_rwlock_t *rw) { - return __write_trylock(rw) == 0; + return ___raw_write_trylock(rw) == 0; } static void __inline__ __raw_read_unlock(__raw_rwlock_t *rw) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-x86-64.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000037523�11041673227�014732� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/Kconfig | 2 - arch/x86/kernel/entry_64.S | 18 +++++----- arch/x86/kernel/tsc_sync.c | 2 - arch/x86/kernel/vsyscall_64.c | 2 - arch/x86/kernel/x8664_ksyms_64.c | 10 +++-- arch/x86/lib/thunk_64.S | 12 ++++-- include/asm-x86/semaphore_64.h | 67 +++++++++++++++++++++++---------------- include/asm-x86/spinlock_64.h | 28 ++++++++-------- include/asm-x86/thread_info_64.h | 2 + 9 files changed, 81 insertions(+), 62 deletions(-) Index: 
linux-2.6.24.7/arch/x86/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/x86/Kconfig +++ linux-2.6.24.7/arch/x86/Kconfig @@ -107,7 +107,7 @@ config ASM_SEMAPHORES config RWSEM_XCHGADD_ALGORITHM bool - depends on X86_XADD && !RWSEM_GENERIC_SPINLOCK + depends on X86_XADD && !RWSEM_GENERIC_SPINLOCK && !PREEMPT_RT default y config ARCH_HAS_ILOG2_U32 Index: linux-2.6.24.7/arch/x86/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_64.S +++ linux-2.6.24.7/arch/x86/kernel/entry_64.S @@ -375,8 +375,8 @@ sysret_check: /* Handle reschedules */ /* edx: work, edi: workmask */ sysret_careful: - bt $TIF_NEED_RESCHED,%edx - jnc sysret_signal + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz sysret_signal TRACE_IRQS_ON sti pushq %rdi @@ -399,7 +399,7 @@ sysret_signal: leaq -ARGOFFSET(%rsp),%rdi # &pt_regs -> arg1 xorl %esi,%esi # oldset -> arg2 call ptregscall_common -1: movl $_TIF_NEED_RESCHED,%edi +1: movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi /* Use IRET because user could have changed frame. This works because ptregscall_common has called FIXUP_TOP_OF_STACK. */ cli @@ -456,8 +456,8 @@ int_with_check: /* First do a reschedule test. */ /* edx: work, edi: workmask */ int_careful: - bt $TIF_NEED_RESCHED,%edx - jnc int_very_careful + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz int_very_careful TRACE_IRQS_ON sti pushq %rdi @@ -492,7 +492,7 @@ int_signal: movq %rsp,%rdi # &ptregs -> arg1 xorl %esi,%esi # oldset -> arg2 call do_notify_resume -1: movl $_TIF_NEED_RESCHED,%edi +1: movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi int_restore_rest: RESTORE_REST cli @@ -698,8 +698,8 @@ bad_iret: /* edi: workmask, edx: work */ retint_careful: CFI_RESTORE_STATE - bt $TIF_NEED_RESCHED,%edx - jnc retint_signal + testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edx + jz retint_signal TRACE_IRQS_ON sti pushq %rdi @@ -725,7 +725,7 @@ retint_signal: RESTORE_REST cli TRACE_IRQS_OFF - movl $_TIF_NEED_RESCHED,%edi + movl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED),%edi GET_THREAD_INFO(%rcx) jmp retint_check Index: linux-2.6.24.7/arch/x86/kernel/tsc_sync.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/tsc_sync.c +++ linux-2.6.24.7/arch/x86/kernel/tsc_sync.c @@ -33,7 +33,7 @@ static __cpuinitdata atomic_t stop_count * we want to have the fastest, inlined, non-debug version * of a critical section, to be able to prove TSC time-warps: */ -static __cpuinitdata raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED; +static __cpuinitdata __raw_spinlock_t sync_lock = __RAW_SPIN_LOCK_UNLOCKED; static __cpuinitdata cycles_t last_tsc; static __cpuinitdata cycles_t max_warp; static __cpuinitdata int nr_warps; Index: linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c @@ -55,7 +55,7 @@ int __vgetcpu_mode __section_vgetcpu_mod struct vsyscall_gtod_data __vsyscall_gtod_data __section_vsyscall_gtod_data = { - .lock = SEQLOCK_UNLOCKED, + .lock = __RAW_SEQLOCK_UNLOCKED(__vsyscall_gtod_data.lock), .sysctl_enabled = 1, }; Index: linux-2.6.24.7/arch/x86/kernel/x8664_ksyms_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/x8664_ksyms_64.c +++ 
linux-2.6.24.7/arch/x86/kernel/x8664_ksyms_64.c @@ -17,10 +17,12 @@ EXPORT_SYMBOL(mcount); EXPORT_SYMBOL(kernel_thread); -EXPORT_SYMBOL(__down_failed); -EXPORT_SYMBOL(__down_failed_interruptible); -EXPORT_SYMBOL(__down_failed_trylock); -EXPORT_SYMBOL(__up_wakeup); +#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK +EXPORT_SYMBOL(__compat_down_failed); +EXPORT_SYMBOL(__compat_down_failed_interruptible); +EXPORT_SYMBOL(__compat_down_failed_trylock); +EXPORT_SYMBOL(__compat_up_wakeup); +#endif EXPORT_SYMBOL(__get_user_1); EXPORT_SYMBOL(__get_user_2); Index: linux-2.6.24.7/arch/x86/lib/thunk_64.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/thunk_64.S +++ linux-2.6.24.7/arch/x86/lib/thunk_64.S @@ -40,11 +40,13 @@ thunk rwsem_wake_thunk,rwsem_wake thunk rwsem_downgrade_thunk,rwsem_downgrade_wake #endif - - thunk __down_failed,__down - thunk_retrax __down_failed_interruptible,__down_interruptible - thunk_retrax __down_failed_trylock,__down_trylock - thunk __up_wakeup,__up + +#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK + thunk __compat_down_failed,__compat_down + thunk_retrax __compat_down_failed_interruptible,__compat_down_interruptible + thunk_retrax __compat_down_failed_trylock,__compat_down_trylock + thunk __compat_up_wakeup,__compat_up +#endif #ifdef CONFIG_TRACE_IRQFLAGS /* put return address in rdi (arg1) */ Index: linux-2.6.24.7/include/asm-x86/semaphore_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/semaphore_64.h +++ linux-2.6.24.7/include/asm-x86/semaphore_64.h @@ -5,6 +5,10 @@ #ifdef __KERNEL__ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + /* * SMP- and interrupt-safe semaphores.. * @@ -43,28 +47,33 @@ #include <linux/rwsem.h> #include <linux/stringify.h> -struct semaphore { +struct compat_semaphore { atomic_t count; int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .sleepers = 0, \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name,count) +#define __COMPAT_MUTEX_INITIALIZER(name) \ + __COMPAT_SEMAPHORE_INITIALIZER(name,1) + +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,1) -static inline void sema_init (struct semaphore *sem, int val) +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { /* - * *sem = (struct semaphore)__SEMAPHORE_INITIALIZER((*sem),val); + * *sem = (struct compat_semaphore)__SEMAPHORE_INITIALIZER((*sem),val); * * i'd rather use the more flexible initialization above, but sadly * GCC 2.7.2.3 emits a bogus warning. EGCS doesn't. Oh well. 
@@ -74,32 +83,33 @@ static inline void sema_init (struct sem init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX (struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED (struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -asmlinkage void __down_failed(void /* special register calling convention */); -asmlinkage int __down_failed_interruptible(void /* params in registers */); -asmlinkage int __down_failed_trylock(void /* params in registers */); -asmlinkage void __up_wakeup(void /* special register calling convention */); +asmlinkage void __compat_down_failed(void /* special register calling convention */); +asmlinkage int __compat_down_failed_interruptible(void /* params in registers */); +asmlinkage int __compat_down_failed_trylock(void /* params in registers */); +asmlinkage void __compat_up_wakeup(void /* special register calling convention */); -asmlinkage void __down(struct semaphore * sem); -asmlinkage int __down_interruptible(struct semaphore * sem); -asmlinkage int __down_trylock(struct semaphore * sem); -asmlinkage void __up(struct semaphore * sem); +asmlinkage void __compat_down(struct compat_semaphore * sem); +asmlinkage int __compat_down_interruptible(struct compat_semaphore * sem); +asmlinkage int __compat_down_trylock(struct compat_semaphore * sem); +asmlinkage void __compat_up(struct compat_semaphore * sem); +asmlinkage int compat_sem_is_locked(struct compat_semaphore *sem); /* * This is ugly, but we want the default case to fall through. * "__down_failed" is a special asm handler that calls the C * routine that actually waits. See arch/x86_64/kernel/semaphore.c */ -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); @@ -107,7 +117,7 @@ static inline void down(struct semaphore "# atomic down operation\n\t" LOCK_PREFIX "decl %0\n\t" /* --sem->count */ "jns 1f\n\t" - "call __down_failed\n" + "call __compat_down_failed\n" "1:" :"=m" (sem->count) :"D" (sem) @@ -118,7 +128,7 @@ static inline void down(struct semaphore * Interruptible try to acquire a semaphore. If we obtained * it, return zero. If we were interrupted, returns -EINTR */ -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int result; @@ -129,7 +139,7 @@ static inline int down_interruptible(str "xorl %0,%0\n\t" LOCK_PREFIX "decl %1\n\t" /* --sem->count */ "jns 2f\n\t" - "call __down_failed_interruptible\n" + "call __compat_down_failed_interruptible\n" "2:\n" :"=&a" (result), "=m" (sem->count) :"D" (sem) @@ -141,7 +151,7 @@ static inline int down_interruptible(str * Non-blockingly attempt to down() a semaphore. * Returns zero if we acquired it */ -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { int result; @@ -150,7 +160,7 @@ static inline int down_trylock(struct se "xorl %0,%0\n\t" LOCK_PREFIX "decl %1\n\t" /* --sem->count */ "jns 2f\n\t" - "call __down_failed_trylock\n\t" + "call __compat_down_failed_trylock\n\t" "2:\n" :"=&a" (result), "=m" (sem->count) :"D" (sem) @@ -164,17 +174,20 @@ static inline int down_trylock(struct se * The default case (no contention) will result in NO * jumps for both down() and up(). 
*/ -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { __asm__ __volatile__( "# atomic up operation\n\t" LOCK_PREFIX "incl %0\n\t" /* ++sem->count */ "jg 1f\n\t" - "call __up_wakeup\n" + "call __compat_up_wakeup\n" "1:" :"=m" (sem->count) :"D" (sem) :"memory"); } + +#include <linux/semaphore.h> + #endif /* __KERNEL__ */ #endif Index: linux-2.6.24.7/include/asm-x86/spinlock_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_64.h +++ linux-2.6.24.7/include/asm-x86/spinlock_64.h @@ -17,12 +17,12 @@ * (the type definitions are in asm/spinlock_types.h) */ -static inline int __raw_spin_is_locked(raw_spinlock_t *lock) +static inline int __raw_spin_is_locked(__raw_spinlock_t *lock) { return *(volatile signed int *)(&(lock)->slock) <= 0; } -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline void __raw_spin_lock(__raw_spinlock_t *lock) { asm volatile( "\n1:\t" @@ -40,7 +40,7 @@ static inline void __raw_spin_lock(raw_s * Same as __raw_spin_lock, but reenable interrupts during spinning. */ #ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) { asm volatile( "\n1:\t" @@ -65,7 +65,7 @@ static inline void __raw_spin_lock_flags } #endif -static inline int __raw_spin_trylock(raw_spinlock_t *lock) +static inline int __raw_spin_trylock(__raw_spinlock_t *lock) { int oldval; @@ -77,12 +77,12 @@ static inline int __raw_spin_trylock(raw return oldval > 0; } -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory"); } -static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock) +static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) { while (__raw_spin_is_locked(lock)) cpu_relax(); @@ -102,17 +102,17 @@ static inline void __raw_spin_unlock_wai * with the high bit (sign) being the "contended" bit. 
*/ -static inline int __raw_read_can_lock(raw_rwlock_t *lock) +static inline int __raw_read_can_lock(__raw_rwlock_t *lock) { return (int)(lock)->lock > 0; } -static inline int __raw_write_can_lock(raw_rwlock_t *lock) +static inline int __raw_write_can_lock(__raw_rwlock_t *lock) { return (lock)->lock == RW_LOCK_BIAS; } -static inline void __raw_read_lock(raw_rwlock_t *rw) +static inline void __raw_read_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t" "jns 1f\n" @@ -121,7 +121,7 @@ static inline void __raw_read_lock(raw_r ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); } -static inline void __raw_write_lock(raw_rwlock_t *rw) +static inline void __raw_write_lock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t" "jz 1f\n" @@ -130,7 +130,7 @@ static inline void __raw_write_lock(raw_ ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); } -static inline int __raw_read_trylock(raw_rwlock_t *lock) +static inline int __raw_read_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; atomic_dec(count); @@ -140,7 +140,7 @@ static inline int __raw_read_trylock(raw return 0; } -static inline int __raw_write_trylock(raw_rwlock_t *lock) +static inline int __raw_write_trylock(__raw_rwlock_t *lock) { atomic_t *count = (atomic_t *)lock; if (atomic_sub_and_test(RW_LOCK_BIAS, count)) @@ -149,12 +149,12 @@ static inline int __raw_write_trylock(ra return 0; } -static inline void __raw_read_unlock(raw_rwlock_t *rw) +static inline void __raw_read_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " ; incl %0" :"=m" (rw->lock) : : "memory"); } -static inline void __raw_write_unlock(raw_rwlock_t *rw) +static inline void __raw_write_unlock(__raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " ; addl $" RW_LOCK_BIAS_STR ",%0" : "=m" (rw->lock) : : "memory"); Index: linux-2.6.24.7/include/asm-x86/thread_info_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/thread_info_64.h +++ linux-2.6.24.7/include/asm-x86/thread_info_64.h @@ -111,6 +111,7 @@ static inline struct thread_info *stack_ #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ #define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ #define TIF_IRET 5 /* force IRET */ +#define TIF_NEED_RESCHED_DELAYED 6 /* reschedul on return to userspace */ #define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */ #define TIF_SECCOMP 8 /* secure computing */ #define TIF_RESTORE_SIGMASK 9 /* restore signal mask in do_signal */ @@ -133,6 +134,7 @@ static inline struct thread_info *stack_ #define _TIF_SECCOMP (1<<TIF_SECCOMP) #define _TIF_RESTORE_SIGMASK (1<<TIF_RESTORE_SIGMASK) #define _TIF_MCE_NOTIFY (1<<TIF_MCE_NOTIFY) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_IA32 (1<<TIF_IA32) #define _TIF_FORK (1<<TIF_FORK) #define _TIF_ABI_PENDING (1<<TIF_ABI_PENDING) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-arm.patch��������������������������������������������������������������������������0000664�0000764�0000764�00000026200�11041657732�014626� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- 
arch/arm/kernel/entry-armv.S | 4 +- arch/arm/kernel/entry-common.S | 14 +++++---- arch/arm/kernel/process.c | 10 ++++-- arch/arm/kernel/semaphore.c | 31 +++++++++++++++------ include/asm-arm/semaphore.h | 59 ++++++++++++++++++++++++++++------------- include/asm-arm/thread_info.h | 2 + 6 files changed, 82 insertions(+), 38 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/entry-armv.S =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/entry-armv.S +++ linux-2.6.24.7/arch/arm/kernel/entry-armv.S @@ -204,7 +204,7 @@ __irq_svc: irq_handler #ifdef CONFIG_PREEMPT ldr r0, [tsk, #TI_FLAGS] @ get flags - tst r0, #_TIF_NEED_RESCHED + tst r0, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED blne svc_preempt preempt_return: ldr r0, [tsk, #TI_PREEMPT] @ read preempt value @@ -235,7 +235,7 @@ svc_preempt: str r7, [tsk, #TI_PREEMPT] @ expects preempt_count == 0 1: bl preempt_schedule_irq @ irq en/disable is done inside ldr r0, [tsk, #TI_FLAGS] @ get new tasks TI_FLAGS - tst r0, #_TIF_NEED_RESCHED + tst r0, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED beq preempt_return @ go again b 1b #endif Index: linux-2.6.24.7/arch/arm/kernel/entry-common.S =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/entry-common.S +++ linux-2.6.24.7/arch/arm/kernel/entry-common.S @@ -46,7 +46,7 @@ ret_fast_syscall: fast_work_pending: str r0, [sp, #S_R0+S_OFF]! @ returned r0 work_pending: - tst r1, #_TIF_NEED_RESCHED + tst r1, #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED bne work_resched tst r1, #_TIF_SIGPENDING beq no_work_pending @@ -56,7 +56,8 @@ work_pending: b ret_slow_syscall @ Check work again work_resched: - bl schedule + bl __schedule + /* * "slow" syscall return path. "why" tells us if this was a real syscall. 
*/ @@ -396,6 +397,7 @@ ENTRY(sys_oabi_call_table) #include "calls.S" #undef ABI #undef OBSOLETE +#endif #ifdef CONFIG_FRAME_POINTER @@ -445,11 +447,13 @@ mcount: ldr ip, =mcount_enabled @ leave early, if disabled ldr ip, [ip] cmp ip, #0 - moveq pc,lr + moveq pc, lr mov ip, sp stmdb sp!, {r0 - r3, fp, ip, lr, pc} @ create stack frame + mov r2, =mcount_trace_function + ldr r1, [fp, #-4] @ get lr (the return address @ of the caller of the @ instrumented function) @@ -458,7 +462,7 @@ mcount: sub fp, ip, #4 @ point fp at this frame - bl __trace + bl r2 1: ldmdb fp, {r0 - r3, fp, sp, pc} @ pop entry frame and return @@ -504,5 +508,3 @@ arm_return_addr: #endif -#endif - Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -136,7 +136,7 @@ static void default_idle(void) cpu_relax(); else { local_irq_disable(); - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { timer_dyn_reprogram(); arch_idle(); } @@ -168,13 +168,15 @@ void cpu_idle(void) idle = default_idle; leds_event(led_idle_start); tick_nohz_stop_sched_tick(); - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) idle(); leds_event(led_idle_end); tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } Index: linux-2.6.24.7/arch/arm/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/semaphore.c +++ linux-2.6.24.7/arch/arm/kernel/semaphore.c @@ -49,14 +49,16 @@ * we cannot lose wakeup events. */ -void __up(struct semaphore *sem) +fastcall void __attribute_used__ __compat_up(struct compat_semaphore *sem) { wake_up(&sem->wait); } +EXPORT_SYMBOL(__compat_up); + static DEFINE_SPINLOCK(semaphore_lock); -void __sched __down(struct semaphore * sem) +fastcall void __attribute_used__ __sched __compat_down(struct compat_semaphore * sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -89,7 +91,9 @@ void __sched __down(struct semaphore * s wake_up(&sem->wait); } -int __sched __down_interruptible(struct semaphore * sem) +EXPORT_SYMBOL(__compat_down); + +fastcall int __attribute_used__ __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -140,6 +144,8 @@ int __sched __down_interruptible(struct return retval; } +EXPORT_SYMBOL(__compat_down_interruptible); + /* * Trylock failed - make sure we correct for * having decremented the count. @@ -148,7 +154,7 @@ int __sched __down_interruptible(struct * single "cmpxchg" without failure cases, * but then it wouldn't work on a 386. */ -int __down_trylock(struct semaphore * sem) +fastcall int __attribute_used__ __compat_down_trylock(struct compat_semaphore * sem) { int sleepers; unsigned long flags; @@ -168,6 +174,15 @@ int __down_trylock(struct semaphore * se return 1; } +EXPORT_SYMBOL(__compat_down_trylock); + +fastcall int compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} + +EXPORT_SYMBOL(compat_sem_is_locked); + /* * The semaphore operations have a special calling sequence that * allow us to do a simpler in-line version of them. 
These routines @@ -185,7 +200,7 @@ asm(" .section .sched.text,\"ax\",%progb __down_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down \n\ + bl __compat_down \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ .align 5 \n\ @@ -193,7 +208,7 @@ __down_failed: \n\ __down_interruptible_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down_interruptible \n\ + bl __compat_down_interruptible \n\ mov ip, r0 \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ @@ -202,7 +217,7 @@ __down_interruptible_failed: \n\ __down_trylock_failed: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __down_trylock \n\ + bl __compat_down_trylock \n\ mov ip, r0 \n\ ldmfd sp!, {r0 - r4, pc} \n\ \n\ @@ -211,7 +226,7 @@ __down_trylock_failed: \n\ __up_wakeup: \n\ stmfd sp!, {r0 - r4, lr} \n\ mov r0, ip \n\ - bl __up \n\ + bl __compat_up \n\ ldmfd sp!, {r0 - r4, pc} \n\ "); Index: linux-2.6.24.7/include/asm-arm/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/semaphore.h +++ linux-2.6.24.7/include/asm-arm/semaphore.h @@ -5,45 +5,65 @@ #define __ASM_ARM_SEMAPHORE_H #include <linux/linkage.h> + +#ifdef CONFIG_PREEMPT_RT +# include <linux/rt_lock.h> +#endif + #include <linux/spinlock.h> #include <linux/wait.h> #include <linux/rwsem.h> +/* + * On !PREEMPT_RT all semaphores are compat: + */ +#ifndef CONFIG_PREEMPT_RT +# define semaphore compat_semaphore +#endif + #include <asm/atomic.h> #include <asm/locks.h> -struct semaphore { +struct compat_semaphore { atomic_t count; int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INIT(name, cnt) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, cnt) \ { \ .count = ATOMIC_INIT(cnt), \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait), \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INIT(name,count) +#define __COMPAT_MUTEX_INITIALIZER(name) \ + __COMPAT_SEMAPHORE_INITIALIZER(name,1) + +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,1) -static inline void sema_init(struct semaphore *sem, int val) +static inline void compat_sema_init(struct compat_semaphore *sem, int val) { atomic_set(&sem->count, val); sem->sleepers = 0; init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX(struct semaphore *sem) +static inline void compat_init_MUTEX(struct compat_semaphore *sem) +{ + compat_sema_init(sem, 1); +} + +static inline void compat_init_MUTEX_LOCKED(struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 0); } -static inline void init_MUTEX_LOCKED(struct semaphore *sem) +static inline int compat_sema_count(struct compat_semaphore *sem) { - sema_init(sem, 0); + return atomic_read(&sem->count); } /* @@ -54,16 +74,18 @@ asmlinkage int __down_interruptible_fai asmlinkage int __down_trylock_failed(void); asmlinkage void __up_wakeup(void); -extern void __down(struct semaphore * sem); -extern int __down_interruptible(struct semaphore * sem); -extern int __down_trylock(struct semaphore * sem); -extern void __up(struct semaphore * sem); +extern void __compat_up(struct compat_semaphore *sem); +extern int __compat_down_interruptible(struct compat_semaphore * sem); +extern int __compat_down_trylock(struct compat_semaphore * sem); +extern void __compat_down(struct compat_semaphore * sem); + +extern int 
compat_sem_is_locked(struct compat_semaphore *sem); /* * This is ugly, but we want the default case to fall through. * "__down" is the actual routine that waits... */ -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); __down_op(sem, __down_failed); @@ -73,13 +95,13 @@ static inline void down(struct semaphore * This is ugly, but we want the default case to fall through. * "__down_interruptible" is the actual routine that waits... */ -static inline int down_interruptible (struct semaphore * sem) +static inline int compat_down_interruptible (struct compat_semaphore * sem) { might_sleep(); return __down_op_ret(sem, __down_interruptible_failed); } -static inline int down_trylock(struct semaphore *sem) +static inline int compat_down_trylock(struct compat_semaphore *sem) { return __down_op_ret(sem, __down_trylock_failed); } @@ -90,9 +112,10 @@ static inline int down_trylock(struct se * The default case (no contention) will result in NO * jumps for both down() and up(). */ -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { __up_op(sem, __up_wakeup); } +#include <linux/semaphore.h> #endif Index: linux-2.6.24.7/include/asm-arm/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/thread_info.h +++ linux-2.6.24.7/include/asm-arm/thread_info.h @@ -141,6 +141,7 @@ extern void iwmmxt_task_switch(struct th */ #define TIF_SIGPENDING 0 #define TIF_NEED_RESCHED 1 +#define TIF_NEED_RESCHED_DELAYED 3 #define TIF_SYSCALL_TRACE 8 #define TIF_POLLING_NRFLAG 16 #define TIF_USING_IWMMXT 17 @@ -149,6 +150,7 @@ extern void iwmmxt_task_switch(struct th #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_USING_IWMMXT (1 << TIF_USING_IWMMXT) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-arm-fix.patch����������������������������������������������������������������������0000664�0000764�0000764�00000001721�11041657732�015413� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/kernel/semaphore.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/semaphore.c +++ linux-2.6.24.7/arch/arm/kernel/semaphore.c @@ -154,7 +154,7 @@ EXPORT_SYMBOL(__compat_down_interruptibl * single "cmpxchg" without failure cases, * but then it wouldn't work on a 386. 
*/ -fastcall int __attribute_used__ __compat_down_trylock(struct compat_semaphore * sem) +fastcall int __attribute_used__ __sched __compat_down_trylock(struct compat_semaphore * sem) { int sleepers; unsigned long flags; @@ -176,7 +176,7 @@ fastcall int __attribute_used__ __compat EXPORT_SYMBOL(__compat_down_trylock); -fastcall int compat_sem_is_locked(struct compat_semaphore *sem) +fastcall int __sched compat_sem_is_locked(struct compat_semaphore *sem) { return (int) atomic_read(&sem->count) < 0; } �����������������������������������������������patches/rt-mutex-m68knommu-add-compat_semaphore.patch�����������������������������������������������0000664�0000764�0000764�00000021176�11041657735�021774� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From 6d12160b217c3a274f638f844a100ecb9d365b06 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:24 +0200 Subject: [PATCH] m68knommu: add compat_semaphore basically a rename Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/m68knommu/Kconfig | 4 ++ arch/m68knommu/kernel/Makefile | 3 + arch/m68knommu/kernel/semaphore.c | 8 ++-- include/asm-m68knommu/semaphore-helper.h | 8 ++-- include/asm-m68knommu/semaphore.h | 56 +++++++++++++++++-------------- 5 files changed, 46 insertions(+), 33 deletions(-) Index: linux-2.6.24.7/arch/m68knommu/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/Kconfig +++ linux-2.6.24.7/arch/m68knommu/Kconfig @@ -29,6 +29,10 @@ config RWSEM_XCHGADD_ALGORITHM bool default n +config ASM_SEMAPHORES + bool + default y + config ARCH_HAS_ILOG2_U32 bool default n Index: linux-2.6.24.7/arch/m68knommu/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/Makefile +++ linux-2.6.24.7/arch/m68knommu/kernel/Makefile @@ -5,8 +5,9 @@ extra-y := vmlinux.lds obj-y += dma.o entry.o init_task.o irq.o m68k_ksyms.o process.o ptrace.o \ - semaphore.o setup.o signal.o syscalltable.o sys_m68k.o time.o traps.o + setup.o signal.o syscalltable.o sys_m68k.o time.o traps.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_COMEMPCI) += comempci.o obj-$(CONFIG_STACKTRACE) += stacktrace.o +obj-$(CONFIG_ASM_SEMAPHORES) += semaphore.o Index: linux-2.6.24.7/arch/m68knommu/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/semaphore.c +++ linux-2.6.24.7/arch/m68knommu/kernel/semaphore.c @@ -42,7 +42,7 @@ spinlock_t semaphore_wake_lock; * critical part is the inline stuff in <asm/semaphore.h> * where we want to avoid any extra jumps and calls. 
*/ -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { wake_one_more(sem); wake_up(&sem->wait); @@ -96,7 +96,7 @@ void __up(struct semaphore *sem) current->state = TASK_RUNNING; \ remove_wait_queue(&sem->wait, &wait); -void __sched __down(struct semaphore * sem) +void __sched __compat_down(struct compat_semaphore * sem) { DECLARE_WAITQUEUE(wait, current); @@ -107,7 +107,7 @@ void __sched __down(struct semaphore * s DOWN_TAIL(TASK_UNINTERRUPTIBLE) } -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore * sem) { DECLARE_WAITQUEUE(wait, current); int ret = 0; @@ -127,7 +127,7 @@ int __sched __down_interruptible(struct return ret; } -int __down_trylock(struct semaphore * sem) +int __compat_down_trylock(struct compat_semaphore * sem) { return waking_non_zero_trylock(sem); } Index: linux-2.6.24.7/include/asm-m68knommu/semaphore-helper.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/semaphore-helper.h +++ linux-2.6.24.7/include/asm-m68knommu/semaphore-helper.h @@ -13,12 +13,12 @@ /* * These two _must_ execute atomically wrt each other. */ -static inline void wake_one_more(struct semaphore * sem) +static inline void wake_one_more(struct compat_semaphore * sem) { atomic_inc(&sem->waking); } -static inline int waking_non_zero(struct semaphore *sem) +static inline int waking_non_zero(struct compat_semaphore *sem) { int ret; unsigned long flags; @@ -39,7 +39,7 @@ static inline int waking_non_zero(struct * 0 go to sleep * -EINTR interrupted */ -static inline int waking_non_zero_interruptible(struct semaphore *sem, +static inline int waking_non_zero_interruptible(struct compat_semaphore *sem, struct task_struct *tsk) { int ret; @@ -63,7 +63,7 @@ static inline int waking_non_zero_interr * 1 failed to lock * 0 got the lock */ -static inline int waking_non_zero_trylock(struct semaphore *sem) +static inline int waking_non_zero_trylock(struct compat_semaphore *sem) { int ret; unsigned long flags; Index: linux-2.6.24.7/include/asm-m68knommu/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/semaphore.h +++ linux-2.6.24.7/include/asm-m68knommu/semaphore.h @@ -21,49 +21,55 @@ * m68k version by Andreas Schwab */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif -struct semaphore { +struct compat_semaphore { atomic_t count; atomic_t waking; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .waking = ATOMIC_INIT(0), \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name,count) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,1) -static inline void sema_init (struct semaphore *sem, int val) +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { - *sem = (struct semaphore)__SEMAPHORE_INITIALIZER(*sem, val); + *sem = (struct compat_semaphore)__COMPAT_SEMAPHORE_INITIALIZER(*sem, val); } -static inline void init_MUTEX (struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore 
*sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED (struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -asmlinkage void __down_failed(void /* special register calling convention */); -asmlinkage int __down_failed_interruptible(void /* params in registers */); -asmlinkage int __down_failed_trylock(void /* params in registers */); -asmlinkage void __up_wakeup(void /* special register calling convention */); +asmlinkage void __compat_down_failed(void /* special register calling convention */); +asmlinkage int __compat_down_failed_interruptible(void /* params in registers */); +asmlinkage int __compat_down_failed_trylock(void /* params in registers */); +asmlinkage void __compat_up_wakeup(void /* special register calling convention */); + +asmlinkage void __compat_down(struct compat_semaphore * sem); +asmlinkage int __compat_down_interruptible(struct compat_semaphore * sem); +asmlinkage int __compat_down_trylock(struct compat_semaphore * sem); +asmlinkage void __compat_up(struct compat_semaphore * sem); -asmlinkage void __down(struct semaphore * sem); -asmlinkage int __down_interruptible(struct semaphore * sem); -asmlinkage int __down_trylock(struct semaphore * sem); -asmlinkage void __up(struct semaphore * sem); +extern int compat_sem_is_locked(struct compat_semaphore *sem); +#define compat_sema_count(sem) atomic_read(&(sem)->count) extern spinlock_t semaphore_wake_lock; @@ -72,7 +78,7 @@ extern spinlock_t semaphore_wake_lock; * "down_failed" is a special asm handler that calls the C * routine that actually waits. See arch/m68k/lib/semaphore.S */ -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); __asm__ __volatile__( @@ -87,7 +93,7 @@ static inline void down(struct semaphore : "cc", "%a0", "%a1", "memory"); } -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int ret; @@ -106,9 +112,9 @@ static inline int down_interruptible(str return(ret); } -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { - register struct semaphore *sem1 __asm__ ("%a1") = sem; + register struct compat_semaphore *sem1 __asm__ ("%a1") = sem; register int result __asm__ ("%d0"); __asm__ __volatile__( @@ -134,7 +140,7 @@ static inline int down_trylock(struct se * The default case (no contention) will result in NO * jumps for both down() and up(). 
*/ -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { __asm__ __volatile__( "| atomic up operation\n\t" @@ -148,6 +154,8 @@ static inline void up(struct semaphore * : "cc", "%a0", "%a1", "memory"); } +#include <linux/semaphore.h> + #endif /* __ASSEMBLY__ */ #endif ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-m68knommu-consider-TIF_NEED_RESCHED_DELAYED-on-resc.patch��������������������������0000664�0000764�0000764�00000005720�11041657731�024562� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From d837c2879fb47dcca714acca57e0b677a3756372 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:26 +0200 Subject: [PATCH] m68knommu: consider TIF_NEED_RESCHED_DELAYED on resched Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/m68knommu/kernel/process.c | 4 ++-- arch/m68knommu/platform/coldfire/entry.S | 4 ++-- include/asm-m68knommu/thread_info.h | 8 ++++++++ 3 files changed, 12 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/m68knommu/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/kernel/process.c +++ linux-2.6.24.7/arch/m68knommu/kernel/process.c @@ -54,9 +54,9 @@ EXPORT_SYMBOL(pm_power_off); static void default_idle(void) { local_irq_disable(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { /* This stop will re-enable interrupts */ - __asm__("stop #0x2000" : : : "cc"); + __asm__("stop #0x2000" : : : "cc"); local_irq_disable(); } local_irq_enable(); Index: linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/m68knommu/platform/coldfire/entry.S +++ linux-2.6.24.7/arch/m68knommu/platform/coldfire/entry.S @@ -145,7 +145,7 @@ ret_from_exception: andl #-THREAD_SIZE,%d1 /* at base of kernel stack */ movel %d1,%a0 movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ - andl #_TIF_NEED_RESCHED,%d1 + andl #_TIF_RESCHED_MASK,%d1 jeq Lkernel_return movel %a0@(TI_PREEMPTCOUNT),%d1 @@ -191,7 +191,7 @@ Lreturn: Lwork_to_do: movel %a0@(TI_FLAGS),%d1 /* get thread_info->flags */ move #0x2000,%sr /* enable intrs again */ - btst #TIF_NEED_RESCHED,%d1 + andl #_TIF_RESCHED_MASK, %d1 jne reschedule /* GERG: do we need something here for TRACEing?? 
*/ Index: linux-2.6.24.7/include/asm-m68knommu/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-m68knommu/thread_info.h +++ linux-2.6.24.7/include/asm-m68knommu/thread_info.h @@ -88,12 +88,20 @@ static inline struct thread_info *curren #define TIF_POLLING_NRFLAG 3 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_MEMDIE 4 +#define TIF_NEED_RESCHED_DELAYED 6 /* reschedule on return to userspace */ /* as above, but as bit values */ #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) #define _TIF_SIGPENDING (1<<TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) + +/* + * the compiler does not accept + * #define _TIF_RESCHED_MASK (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED) + */ +#define _TIF_RESCHED_MASK (0x44) #define _TIF_WORK_MASK 0x0000FFFE /* work to do on interrupt/exception return */ ������������������������������������������������patches/rt-mutex-drop-generic-TIF_NEED_RESCHED_DELAYED.patch����������������������������������������0000664�0000764�0000764�00000001437�11041657733�022132� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������No need for a generic TIF_NEED_RESCHED_DELAYED , since all the architectures patches should be applied by now. --- include/linux/preempt.h | 9 --------- 1 file changed, 9 deletions(-) Index: linux-2.6.24.7/include/linux/preempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/preempt.h +++ linux-2.6.24.7/include/linux/preempt.h @@ -55,15 +55,6 @@ do { \ preempt_schedule(); \ } while (0) - -/* - * If the architecture doens't have TIF_NEED_RESCHED_DELAYED - * help it out and define it back to TIF_NEED_RESCHED - */ -#ifndef TIF_NEED_RESCHED_DELAYED -# define TIF_NEED_RESCHED_DELAYED TIF_NEED_RESCHED -#endif - #define preempt_check_resched_delayed() \ do { \ if (unlikely(test_thread_flag(TIF_NEED_RESCHED_DELAYED))) \ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-compat-semaphores.patch������������������������������������������������������������0000664�0000764�0000764�00000027076�11041657732�017512� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� drivers/acpi/osl.c | 12 ++++++------ drivers/media/dvb/dvb-core/dvb_frontend.c | 2 +- drivers/net/3c527.c | 2 +- drivers/net/hamradio/6pack.c | 2 +- drivers/net/hamradio/mkiss.c | 2 +- drivers/net/plip.c | 5 ++++- drivers/net/ppp_async.c | 2 +- drivers/net/ppp_synctty.c | 2 +- drivers/pci/hotplug/ibmphp_hpc.c | 2 +- drivers/scsi/aacraid/aacraid.h | 4 ++-- drivers/scsi/qla2xxx/qla_def.h | 2 +- drivers/usb/storage/usb.h | 2 +- fs/jffs2/jffs2_fs_i.h | 2 +- 
fs/xfs/linux-2.6/sema.h | 9 +++++++-- fs/xfs/linux-2.6/xfs_buf.h | 4 ++-- include/linux/parport.h | 2 +- 16 files changed, 32 insertions(+), 24 deletions(-) Index: linux-2.6.24.7/drivers/acpi/osl.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/osl.c +++ linux-2.6.24.7/drivers/acpi/osl.c @@ -775,13 +775,13 @@ void acpi_os_delete_lock(acpi_spinlock h acpi_status acpi_os_create_semaphore(u32 max_units, u32 initial_units, acpi_handle * handle) { - struct semaphore *sem = NULL; + struct compat_semaphore *sem = NULL; - sem = acpi_os_allocate(sizeof(struct semaphore)); + sem = acpi_os_allocate(sizeof(struct compat_semaphore)); if (!sem) return AE_NO_MEMORY; - memset(sem, 0, sizeof(struct semaphore)); + memset(sem, 0, sizeof(struct compat_semaphore)); sema_init(sem, initial_units); @@ -804,7 +804,7 @@ EXPORT_SYMBOL(acpi_os_create_semaphore); acpi_status acpi_os_delete_semaphore(acpi_handle handle) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem) @@ -832,7 +832,7 @@ EXPORT_SYMBOL(acpi_os_delete_semaphore); acpi_status acpi_os_wait_semaphore(acpi_handle handle, u32 units, u16 timeout) { acpi_status status = AE_OK; - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; int ret = 0; @@ -919,7 +919,7 @@ EXPORT_SYMBOL(acpi_os_wait_semaphore); */ acpi_status acpi_os_signal_semaphore(acpi_handle handle, u32 units) { - struct semaphore *sem = (struct semaphore *)handle; + struct compat_semaphore *sem = (struct compat_semaphore *)handle; if (!sem || (units < 1)) Index: linux-2.6.24.7/drivers/media/dvb/dvb-core/dvb_frontend.c =================================================================== --- linux-2.6.24.7.orig/drivers/media/dvb/dvb-core/dvb_frontend.c +++ linux-2.6.24.7/drivers/media/dvb/dvb-core/dvb_frontend.c @@ -97,7 +97,7 @@ struct dvb_frontend_private { struct dvb_device *dvbdev; struct dvb_frontend_parameters parameters; struct dvb_fe_events events; - struct semaphore sem; + struct compat_semaphore sem; struct list_head list_head; wait_queue_head_t wait_queue; struct task_struct *thread; Index: linux-2.6.24.7/drivers/net/3c527.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/3c527.c +++ linux-2.6.24.7/drivers/net/3c527.c @@ -182,7 +182,7 @@ struct mc32_local u16 rx_ring_tail; /* index to rx de-queue end */ - struct semaphore cmd_mutex; /* Serialises issuing of execute commands */ + struct compat_semaphore cmd_mutex; /* Serialises issuing of execute commands */ struct completion execution_cmd; /* Card has completed an execute command */ struct completion xceiver_cmd; /* Card has completed a tx or rx command */ }; Index: linux-2.6.24.7/drivers/net/hamradio/6pack.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/hamradio/6pack.c +++ linux-2.6.24.7/drivers/net/hamradio/6pack.c @@ -123,7 +123,7 @@ struct sixpack { struct timer_list tx_t; struct timer_list resync_t; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; spinlock_t lock; }; Index: linux-2.6.24.7/drivers/net/hamradio/mkiss.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/hamradio/mkiss.c +++ linux-2.6.24.7/drivers/net/hamradio/mkiss.c @@ -84,7 +84,7 @@ struct mkiss { #define CRC_MODE_SMACK_TEST 4 atomic_t refcnt; - struct semaphore 
dead_sem; + struct compat_semaphore dead_sem; }; /*---------------------------------------------------------------------------*/ Index: linux-2.6.24.7/drivers/net/plip.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/plip.c +++ linux-2.6.24.7/drivers/net/plip.c @@ -221,7 +221,10 @@ struct net_local { int should_relinquish; spinlock_t lock; atomic_t kill_timer; - struct semaphore killed_timer_sem; + /* + * PREEMPT_RT: this isnt a mutex, it should be struct completion. + */ + struct compat_semaphore killed_timer_sem; }; static inline void enable_parport_interrupts (struct net_device *dev) Index: linux-2.6.24.7/drivers/net/ppp_async.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/ppp_async.c +++ linux-2.6.24.7/drivers/net/ppp_async.c @@ -67,7 +67,7 @@ struct asyncppp { struct tasklet_struct tsk; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; struct ppp_channel chan; /* interface to generic ppp layer */ unsigned char obuf[OBUFSIZE]; }; Index: linux-2.6.24.7/drivers/net/ppp_synctty.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/ppp_synctty.c +++ linux-2.6.24.7/drivers/net/ppp_synctty.c @@ -70,7 +70,7 @@ struct syncppp { struct tasklet_struct tsk; atomic_t refcnt; - struct semaphore dead_sem; + struct compat_semaphore dead_sem; struct ppp_channel chan; /* interface to generic ppp layer */ }; Index: linux-2.6.24.7/drivers/pci/hotplug/ibmphp_hpc.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/hotplug/ibmphp_hpc.c +++ linux-2.6.24.7/drivers/pci/hotplug/ibmphp_hpc.c @@ -104,7 +104,7 @@ static int to_debug = 0; static struct mutex sem_hpcaccess; // lock access to HPC static struct semaphore semOperations; // lock all operations and // access to data structures -static struct semaphore sem_exit; // make sure polling thread goes away +static struct compat_semaphore sem_exit; // make sure polling thread goes away static struct task_struct *ibmphp_poll_thread; //---------------------------------------------------------------------------- // local function prototypes Index: linux-2.6.24.7/drivers/scsi/aacraid/aacraid.h =================================================================== --- linux-2.6.24.7.orig/drivers/scsi/aacraid/aacraid.h +++ linux-2.6.24.7/drivers/scsi/aacraid/aacraid.h @@ -715,7 +715,7 @@ struct aac_fib_context { u32 unique; // unique value representing this context ulong jiffies; // used for cleanup - dmb changed to ulong struct list_head next; // used to link context's into a linked list - struct semaphore wait_sem; // this is used to wait for the next fib to arrive. + struct compat_semaphore wait_sem; // this is used to wait for the next fib to arrive. int wait; // Set to true when thread is in WaitForSingleObject unsigned long count; // total number of FIBs on FibList struct list_head fib_list; // this holds fibs and their attachd hw_fibs @@ -785,7 +785,7 @@ struct fib { * This is the event the sendfib routine will wait on if the * caller did not pass one and this is synch io. 
*/ - struct semaphore event_wait; + struct compat_semaphore event_wait; spinlock_t event_lock; u32 done; /* gets set to 1 when fib is complete */ Index: linux-2.6.24.7/drivers/scsi/qla2xxx/qla_def.h =================================================================== --- linux-2.6.24.7.orig/drivers/scsi/qla2xxx/qla_def.h +++ linux-2.6.24.7/drivers/scsi/qla2xxx/qla_def.h @@ -2418,7 +2418,7 @@ typedef struct scsi_qla_host { struct semaphore mbx_cmd_sem; /* Serialialize mbx access */ struct semaphore vport_sem; /* Virtual port synchronization */ - struct semaphore mbx_intr_sem; /* Used for completion notification */ + struct compat_semaphore mbx_intr_sem; /* Used for completion notification */ uint32_t mbx_flags; #define MBX_IN_PROGRESS BIT_0 Index: linux-2.6.24.7/drivers/usb/storage/usb.h =================================================================== --- linux-2.6.24.7.orig/drivers/usb/storage/usb.h +++ linux-2.6.24.7/drivers/usb/storage/usb.h @@ -147,7 +147,7 @@ struct us_data { struct task_struct *ctl_thread; /* the control thread */ /* mutual exclusion and synchronization structures */ - struct semaphore sema; /* to sleep thread on */ + struct compat_semaphore sema; /* to sleep thread on */ struct completion notify; /* thread begin/end */ wait_queue_head_t delay_wait; /* wait during scan, reset */ struct completion scanning_done; /* wait for scan thread */ Index: linux-2.6.24.7/fs/jffs2/jffs2_fs_i.h =================================================================== --- linux-2.6.24.7.orig/fs/jffs2/jffs2_fs_i.h +++ linux-2.6.24.7/fs/jffs2/jffs2_fs_i.h @@ -24,7 +24,7 @@ struct jffs2_inode_info { before letting GC proceed. Or we'd have to put ugliness into the GC code so it didn't attempt to obtain the i_mutex for the inode(s) which are already locked */ - struct semaphore sem; + struct compat_semaphore sem; /* The highest (datanode) version number used for this ino */ uint32_t highest_version; Index: linux-2.6.24.7/fs/xfs/linux-2.6/sema.h =================================================================== --- linux-2.6.24.7.orig/fs/xfs/linux-2.6/sema.h +++ linux-2.6.24.7/fs/xfs/linux-2.6/sema.h @@ -27,7 +27,7 @@ * sema_t structure just maps to struct semaphore in Linux kernel. 
*/ -typedef struct semaphore sema_t; +typedef struct compat_semaphore sema_t; #define initnsema(sp, val, name) sema_init(sp, val) #define psema(sp, b) down(sp) @@ -36,7 +36,12 @@ typedef struct semaphore sema_t; static inline int issemalocked(sema_t *sp) { - return down_trylock(sp) || (up(sp), 0); + int rv; + + if ((rv = down_trylock(sp))) + return (rv); + up(sp); + return (0); } /* Index: linux-2.6.24.7/fs/xfs/linux-2.6/xfs_buf.h =================================================================== --- linux-2.6.24.7.orig/fs/xfs/linux-2.6/xfs_buf.h +++ linux-2.6.24.7/fs/xfs/linux-2.6/xfs_buf.h @@ -118,7 +118,7 @@ typedef int (*xfs_buf_bdstrat_t)(struct #define XB_PAGES 2 typedef struct xfs_buf { - struct semaphore b_sema; /* semaphore for lockables */ + struct compat_semaphore b_sema; /* semaphore for lockables */ unsigned long b_queuetime; /* time buffer was queued */ atomic_t b_pin_count; /* pin count */ wait_queue_head_t b_waiters; /* unpin waiters */ @@ -138,7 +138,7 @@ typedef struct xfs_buf { xfs_buf_iodone_t b_iodone; /* I/O completion function */ xfs_buf_relse_t b_relse; /* releasing function */ xfs_buf_bdstrat_t b_strat; /* pre-write function */ - struct semaphore b_iodonesema; /* Semaphore for I/O waiters */ + struct compat_semaphore b_iodonesema; /* Semaphore for I/O waiters */ void *b_fspriv; void *b_fspriv2; void *b_fspriv3; Index: linux-2.6.24.7/include/linux/parport.h =================================================================== --- linux-2.6.24.7.orig/include/linux/parport.h +++ linux-2.6.24.7/include/linux/parport.h @@ -266,7 +266,7 @@ enum ieee1284_phase { struct ieee1284_info { int mode; volatile enum ieee1284_phase phase; - struct semaphore irq; + struct compat_semaphore irq; }; /* A parallel port */ ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/percpu-locked-mm.patch����������������������������������������������������������������������0000664�0000764�0000764�00000040643�11041657731�015436� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/ppc/mm/init.c | 2 - arch/x86/mm/init_32.c | 2 - arch/x86/mm/init_64.c | 2 - include/asm-generic/percpu.h | 24 ++++++++++++ include/asm-generic/tlb.h | 9 +++- include/asm-x86/percpu_32.h | 19 +++++++++ include/asm-x86/percpu_64.h | 25 ++++++++++++ include/linux/percpu.h | 23 +++++++++++ mm/swap.c | 85 ++++++++++++++++++++++++++++++++++++------- 9 files changed, 173 insertions(+), 18 deletions(-) Index: linux-2.6.24.7/arch/ppc/mm/init.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/mm/init.c +++ linux-2.6.24.7/arch/ppc/mm/init.c @@ -55,7 +55,7 @@ #endif #define MAX_LOW_MEM CONFIG_LOWMEM_SIZE -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long total_memory; unsigned long total_lowmem; Index: 
linux-2.6.24.7/arch/x86/mm/init_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/init_32.c +++ linux-2.6.24.7/arch/x86/mm/init_32.c @@ -47,7 +47,7 @@ unsigned int __VMALLOC_RESERVE = 128 << 20; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long highstart_pfn, highend_pfn; static int noinline do_test_wp_bit(void); Index: linux-2.6.24.7/arch/x86/mm/init_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/init_64.c +++ linux-2.6.24.7/arch/x86/mm/init_64.c @@ -53,7 +53,7 @@ EXPORT_SYMBOL(dma_ops); static unsigned long dma_reserve __initdata; -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* * NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the Index: linux-2.6.24.7/include/asm-generic/percpu.h =================================================================== --- linux-2.6.24.7.orig/include/asm-generic/percpu.h +++ linux-2.6.24.7/include/asm-generic/percpu.h @@ -19,6 +19,10 @@ extern unsigned long __per_cpu_offset[NR __typeof__(type) per_cpu__##name \ ____cacheline_aligned_in_smp +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __attribute__((__section__(".data.percpu"))) __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name##_locked + /* var is in discarded region: offset to particular copy we want */ #define per_cpu(var, cpu) (*({ \ extern int simple_identifier_##var(void); \ @@ -26,6 +30,15 @@ extern unsigned long __per_cpu_offset[NR #define __get_cpu_var(var) per_cpu(var, smp_processor_id()) #define __raw_get_cpu_var(var) per_cpu(var, raw_smp_processor_id()) +#define per_cpu_lock(var, cpu) \ + (*RELOC_HIDE(&per_cpu_lock__##var##_locked, __per_cpu_offset[cpu])) +#define per_cpu_var_locked(var, cpu) \ + (*RELOC_HIDE(&per_cpu__##var##_locked, __per_cpu_offset[cpu])) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) + /* A macro to avoid #include hell... 
*/ #define percpu_modcopy(pcpudst, src, size) \ do { \ @@ -38,19 +51,30 @@ do { \ #define DEFINE_PER_CPU(type, name) \ __typeof__(type) per_cpu__##name +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __typeof__(type) per_cpu__##name##_locked #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ DEFINE_PER_CPU(type, name) #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu__##var)) +#define per_cpu_var_locked(var, cpu) (*((void)(cpu), &per_cpu__##var##_locked)) #define __get_cpu_var(var) per_cpu__##var #define __raw_get_cpu_var(var) per_cpu__##var +#define __get_cpu_lock(var, cpu) per_cpu_lock__##var##_locked +#define __get_cpu_var_locked(var, cpu) per_cpu__##var##_locked #endif /* SMP */ #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern spinlock_t per_cpu_lock__##name##_locked; \ + extern __typeof__(type) per_cpu__##name##_locked #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var) #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL(var) EXPORT_SYMBOL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL(per_cpu__##var##_locked) +#define EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL_GPL(per_cpu__##var##_locked) #endif /* _ASM_GENERIC_PERCPU_H_ */ Index: linux-2.6.24.7/include/asm-generic/tlb.h =================================================================== --- linux-2.6.24.7.orig/include/asm-generic/tlb.h +++ linux-2.6.24.7/include/asm-generic/tlb.h @@ -42,11 +42,12 @@ struct mmu_gather { unsigned int nr; /* set to ~0U means fast mode */ unsigned int need_flush;/* Really unmapped some ptes? */ unsigned int fullmm; /* non-zero means full mm flush */ + int cpu; struct page * pages[FREE_PTE_NR]; }; /* Users of the generic TLB shootdown code must declare this storage space. */ -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); +DECLARE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* tlb_gather_mmu * Return a pointer to an initialized struct mmu_gather. @@ -54,8 +55,10 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_g static inline struct mmu_gather * tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); + int cpu; + struct mmu_gather *tlb = &get_cpu_var_locked(mmu_gathers, &cpu); + tlb->cpu = cpu; tlb->mm = mm; /* Use fast mode if only one CPU is online */ @@ -91,7 +94,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u /* keep the page table cache within bounds */ check_pgt_cache(); - put_cpu_var(mmu_gathers); + put_cpu_var_locked(mmu_gathers, tlb->cpu); } /* tlb_remove_page Index: linux-2.6.24.7/include/asm-x86/percpu_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/percpu_32.h +++ linux-2.6.24.7/include/asm-x86/percpu_32.h @@ -51,6 +51,10 @@ extern unsigned long __per_cpu_offset[]; /* Separate out the type, so (int[3], foo) works. 
*/ #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern spinlock_t per_cpu_lock__##name##_locked; \ + extern __typeof__(type) per_cpu__##name##_locked + #define DEFINE_PER_CPU(type, name) \ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name @@ -59,6 +63,10 @@ extern unsigned long __per_cpu_offset[]; __typeof__(type) per_cpu__##name \ ____cacheline_aligned_in_smp +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __attribute__((__section__(".data.percpu"))) __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name##_locked + /* We can use this directly for local CPU (faster). */ DECLARE_PER_CPU(unsigned long, this_cpu_off); @@ -74,6 +82,15 @@ DECLARE_PER_CPU(unsigned long, this_cpu_ #define __get_cpu_var(var) __raw_get_cpu_var(var) +#define per_cpu_lock(var, cpu) \ + (*RELOC_HIDE(&per_cpu_lock__##var##_locked, __per_cpu_offset[cpu])) +#define per_cpu_var_locked(var, cpu) \ + (*RELOC_HIDE(&per_cpu__##var##_locked, __per_cpu_offset[cpu])) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) + /* A macro to avoid #include hell... */ #define percpu_modcopy(pcpudst, src, size) \ do { \ @@ -85,6 +102,8 @@ do { \ #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var) #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL(var) EXPORT_SYMBOL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL(per_cpu__##var##_locked) +#define EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL_GPL(per_cpu__##var##_locked) /* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */ #define __percpu_seg "%%fs:" Index: linux-2.6.24.7/include/asm-x86/percpu_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/percpu_64.h +++ linux-2.6.24.7/include/asm-x86/percpu_64.h @@ -19,6 +19,9 @@ /* Separate out the type, so (int[3], foo) works. */ #define DEFINE_PER_CPU(type, name) \ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __attribute__((__section__(".data.percpu"))) __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name##_locked #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ __attribute__((__section__(".data.percpu.shared_aligned"))) \ @@ -36,6 +39,15 @@ extern int simple_identifier_##var(void); \ RELOC_HIDE(&per_cpu__##var, __my_cpu_offset()); })) +#define per_cpu_lock(var, cpu) \ + (*RELOC_HIDE(&per_cpu_lock__##var##_locked, __per_cpu_offset(cpu))) +#define per_cpu_var_locked(var, cpu) \ + (*RELOC_HIDE(&per_cpu__##var##_locked, __per_cpu_offset(cpu))) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) + /* A macro to avoid #include hell... 
*/ #define percpu_modcopy(pcpudst, src, size) \ do { \ @@ -54,15 +66,28 @@ extern void setup_per_cpu_areas(void); #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ DEFINE_PER_CPU(type, name) +#define DEFINE_PER_CPU_LOCKED(type, name) \ + spinlock_t per_cpu_lock__##name##_locked = SPIN_LOCK_UNLOCKED; \ + __typeof__(type) per_cpu__##name##_locked + #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu__##var)) +#define per_cpu_var_locked(var, cpu) (*((void)(cpu), &per_cpu__##var##_locked)) #define __get_cpu_var(var) per_cpu__##var #define __raw_get_cpu_var(var) per_cpu__##var +#define __get_cpu_lock(var, cpu) per_cpu_lock__##var##_locked +#define __get_cpu_var_locked(var, cpu) per_cpu__##var##_locked #endif /* SMP */ #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern spinlock_t per_cpu_lock__##name##_locked; \ + extern __typeof__(type) per_cpu__##name##_locked + #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var) #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL(var) EXPORT_SYMBOL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL(per_cpu__##var##_locked) +#define EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL_GPL(per_cpu__##var##_locked) #endif /* _ASM_X8664_PERCPU_H_ */ Index: linux-2.6.24.7/include/linux/percpu.h =================================================================== --- linux-2.6.24.7.orig/include/linux/percpu.h +++ linux-2.6.24.7/include/linux/percpu.h @@ -31,6 +31,29 @@ &__get_cpu_var(var); })) #define put_cpu_var(var) preempt_enable() +/* + * Per-CPU data structures with an additional lock - useful for + * PREEMPT_RT code that wants to reschedule but also wants + * per-CPU data structures. + * + * 'cpu' gets updated with the CPU the task is currently executing on. + * + * NOTE: on normal !PREEMPT_RT kernels these per-CPU variables + * are the same as the normal per-CPU variables, so there no + * runtime overhead. + */ +#define get_cpu_var_locked(var, cpuptr) \ +(*({ \ + int __cpu = raw_smp_processor_id(); \ + \ + *(cpuptr) = __cpu; \ + spin_lock(&__get_cpu_lock(var, __cpu)); \ + &__get_cpu_var_locked(var, __cpu); \ +})) + +#define put_cpu_var_locked(var, cpu) \ + do { (void)cpu; spin_unlock(&__get_cpu_lock(var, cpu)); } while (0) + #ifdef CONFIG_SMP struct percpu_data { Index: linux-2.6.24.7/mm/swap.c =================================================================== --- linux-2.6.24.7.orig/mm/swap.c +++ linux-2.6.24.7/mm/swap.c @@ -33,10 +33,64 @@ /* How many pages do we try to swap or page in/out together? */ int page_cluster; +/* + * On PREEMPT_RT we don't want to disable preemption for cpu variables. + * We grab a cpu and then use that cpu to lock the variables accordingly. 
+ */ +#ifdef CONFIG_PREEMPT_RT +static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_add_pvecs) = { 0, }; +static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_add_active_pvecs) = { 0, }; +static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_rotate_pvecs) = { 0, }; + +#define swap_get_cpu_var_irq_save(var, flags, cpu) \ + ({ \ + (void)flags; \ + &get_cpu_var_locked(var, &cpu); \ + }) +#define swap_put_cpu_var_irq_restore(var, flags, cpu) \ + put_cpu_var_locked(var, cpu) +#define swap_get_cpu_var(var, cpu) \ + &get_cpu_var_locked(var, &cpu) +#define swap_put_cpu_var(var, cpu) \ + put_cpu_var_locked(var, cpu) +#define swap_per_cpu_lock(var, cpu) \ + ({ \ + spin_lock(&__get_cpu_lock(var, cpu)); \ + &__get_cpu_var_locked(var, cpu); \ + }) +#define swap_per_cpu_unlock(var, cpu) \ + spin_unlock(&__get_cpu_lock(var, cpu)); +#define swap_get_cpu() raw_smp_processor_id(); +#define swap_put_cpu() +#else static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, }; static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, }; static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs) = { 0, }; +#define swap_get_cpu_var_irq_save(var, flags, cpu) \ + ({ \ + (void)cpu; \ + local_irq_save(flags); \ + &__get_cpu_var(var); \ + }) +#define swap_put_cpu_var_irq_restore(var, flags, cpu) \ + local_irq_restore(flags) +#define swap_get_cpu_var(var, cpu) \ + ({ (void)cpu; &get_cpu_var(var); }) +#define swap_put_cpu_var(var, cpu) \ + do { \ + (void)cpu; \ + put_cpu_var(var); \ + } while(0) +#define swap_per_cpu_lock(var, cpu) \ + &per_cpu(lru_add_pvecs, cpu) +#define swap_per_cpu_unlock(var, cpu) \ + do { } while(0) +#define swap_get_cpu() get_cpu() +#define swap_put_cpu() put_cpu(); + +#endif /* CONFIG_PREEMPT_RT */ + /* * This path almost never happens for VM activity - pages are normally * freed via pagevecs. But it gets used by networking. 
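/*
 * Editorial sketch, not part of the patch above: a minimal, made-up user
 * of the get_cpu_var_locked()/put_cpu_var_locked() interface that the
 * swap_* wrappers above map to on PREEMPT_RT.  'demo_pvecs' and
 * demo_add_page() are illustrative names only.  The critical section is
 * protected by the per-CPU spinlock emitted by DEFINE_PER_CPU_LOCKED(),
 * not by disabling preemption, so it stays preemptible on -rt while
 * remaining safe against other CPUs.
 */
static DEFINE_PER_CPU_LOCKED(struct pagevec, demo_pvecs);

static void demo_add_page(struct page *page)
{
	int cpu;
	struct pagevec *pvec = &get_cpu_var_locked(demo_pvecs, &cpu);

	/* 'cpu' records which CPU's instance (and lock) we took, so the
	 * matching unlock hits the right lock even if the task migrates. */
	pagevec_add(pvec, page);
	put_cpu_var_locked(demo_pvecs, cpu);
}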
@@ -139,6 +193,7 @@ int rotate_reclaimable_page(struct page { struct pagevec *pvec; unsigned long flags; + int cpu; if (PageLocked(page)) return 1; @@ -150,11 +205,10 @@ int rotate_reclaimable_page(struct page return 1; page_cache_get(page); - local_irq_save(flags); - pvec = &__get_cpu_var(lru_rotate_pvecs); + pvec = swap_get_cpu_var_irq_save(lru_rotate_pvecs, flags, cpu); if (!pagevec_add(pvec, page)) pagevec_move_tail(pvec); - local_irq_restore(flags); + swap_put_cpu_var_irq_restore(lru_rotate_pvecs, flags, cpu); if (!test_clear_page_writeback(page)) BUG(); @@ -204,22 +258,24 @@ EXPORT_SYMBOL(mark_page_accessed); */ void fastcall lru_cache_add(struct page *page) { - struct pagevec *pvec = &get_cpu_var(lru_add_pvecs); + int cpu; + struct pagevec *pvec = swap_get_cpu_var(lru_add_pvecs, cpu); page_cache_get(page); if (!pagevec_add(pvec, page)) __pagevec_lru_add(pvec); - put_cpu_var(lru_add_pvecs); + swap_put_cpu_var(lru_add_pvecs, cpu); } void fastcall lru_cache_add_active(struct page *page) { - struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs); + int cpu; + struct pagevec *pvec = swap_get_cpu_var(lru_add_active_pvecs, cpu); page_cache_get(page); if (!pagevec_add(pvec, page)) __pagevec_lru_add_active(pvec); - put_cpu_var(lru_add_active_pvecs); + swap_put_cpu_var(lru_add_active_pvecs, cpu); } /* @@ -231,15 +287,17 @@ static void drain_cpu_pagevecs(int cpu) { struct pagevec *pvec; - pvec = &per_cpu(lru_add_pvecs, cpu); + pvec = swap_per_cpu_lock(lru_add_pvecs, cpu); if (pagevec_count(pvec)) __pagevec_lru_add(pvec); + swap_per_cpu_unlock(lru_add_pvecs, cpu); - pvec = &per_cpu(lru_add_active_pvecs, cpu); + pvec = swap_per_cpu_lock(lru_add_active_pvecs, cpu); if (pagevec_count(pvec)) __pagevec_lru_add_active(pvec); + swap_per_cpu_unlock(lru_add_active_pvecs, cpu); - pvec = &per_cpu(lru_rotate_pvecs, cpu); + pvec = swap_per_cpu_lock(lru_rotate_pvecs, cpu); if (pagevec_count(pvec)) { unsigned long flags; @@ -248,12 +306,15 @@ static void drain_cpu_pagevecs(int cpu) pagevec_move_tail(pvec); local_irq_restore(flags); } + swap_per_cpu_unlock(lru_rotate_pvecs, cpu); } void lru_add_drain(void) { - drain_cpu_pagevecs(get_cpu()); - put_cpu(); + int cpu; + cpu = swap_get_cpu(); + drain_cpu_pagevecs(cpu); + swap_put_cpu(); } #ifdef CONFIG_NUMA ���������������������������������������������������������������������������������������������patches/percpu-locked-netfilter.patch���������������������������������������������������������������0000664�0000764�0000764�00000010310�11041657733�017007� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� net/core/flow.c | 22 ++++++++++++++-------- net/ipv4/netfilter/arp_tables.c | 4 ++-- net/ipv4/netfilter/ip_tables.c | 2 +- 3 files changed, 17 insertions(+), 11 deletions(-) --- Index: linux-2.6.24.7/net/core/flow.c =================================================================== --- linux-2.6.24.7.orig/net/core/flow.c +++ linux-2.6.24.7/net/core/flow.c @@ -40,9 +40,10 @@ atomic_t flow_cache_genid = ATOMIC_INIT( static u32 flow_hash_shift; #define flow_hash_size (1 << flow_hash_shift) -static DEFINE_PER_CPU(struct flow_cache_entry **, flow_tables) = { NULL }; -#define flow_table(cpu) (per_cpu(flow_tables, cpu)) +static DEFINE_PER_CPU_LOCKED(struct 
flow_cache_entry **, flow_tables); + +#define flow_table(cpu) (per_cpu_var_locked(flow_tables, cpu)) static struct kmem_cache *flow_cachep __read_mostly; @@ -169,24 +170,24 @@ static int flow_key_compare(struct flowi void *flow_cache_lookup(struct flowi *key, u16 family, u8 dir, flow_resolve_t resolver) { - struct flow_cache_entry *fle, **head = NULL /* shut up GCC */; + struct flow_cache_entry **table, *fle, **head = NULL /* shut up GCC */; unsigned int hash; int cpu; local_bh_disable(); - cpu = smp_processor_id(); + table = get_cpu_var_locked(flow_tables, &cpu); fle = NULL; /* Packet really early in init? Making flow_cache_init a * pre-smp initcall would solve this. --RR */ - if (!flow_table(cpu)) + if (!table) goto nocache; if (flow_hash_rnd_recalc(cpu)) flow_new_hash_rnd(cpu); hash = flow_hash_code(key, cpu); - head = &flow_table(cpu)[hash]; + head = &table[hash]; for (fle = *head; fle; fle = fle->next) { if (fle->family == family && fle->dir == dir && @@ -196,6 +197,7 @@ void *flow_cache_lookup(struct flowi *ke if (ret) atomic_inc(fle->object_ref); + put_cpu_var_locked(flow_tables, cpu); local_bh_enable(); return ret; @@ -221,6 +223,8 @@ void *flow_cache_lookup(struct flowi *ke } nocache: + put_cpu_var_locked(flow_tables, cpu); + { int err; void *obj; @@ -250,14 +254,15 @@ nocache: static void flow_cache_flush_tasklet(unsigned long data) { struct flow_flush_info *info = (void *)data; + struct flow_cache_entry **table; int i; int cpu; - cpu = smp_processor_id(); + table = get_cpu_var_locked(flow_tables, &cpu); for (i = 0; i < flow_hash_size; i++) { struct flow_cache_entry *fle; - fle = flow_table(cpu)[i]; + fle = table[i]; for (; fle; fle = fle->next) { unsigned genid = atomic_read(&flow_cache_genid); @@ -268,6 +273,7 @@ static void flow_cache_flush_tasklet(uns atomic_dec(fle->object_ref); } } + put_cpu_var_locked(flow_tables, cpu); if (atomic_dec_and_test(&info->cpuleft)) complete(&info->completion); Index: linux-2.6.24.7/net/ipv4/netfilter/arp_tables.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/netfilter/arp_tables.c +++ linux-2.6.24.7/net/ipv4/netfilter/arp_tables.c @@ -241,7 +241,7 @@ unsigned int arpt_do_table(struct sk_buf read_lock_bh(&table->lock); private = table->private; - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); back = get_entry(table_base, private->underflow[hook]); @@ -951,7 +951,7 @@ static int do_add_counters(void __user * i = 0; /* Choose the copy that is on our node */ - loc_cpu_entry = private->entries[smp_processor_id()]; + loc_cpu_entry = private->entries[raw_smp_processor_id()]; ARPT_ENTRY_ITERATE(loc_cpu_entry, private->size, add_counter_to_entry, Index: linux-2.6.24.7/net/ipv4/netfilter/ip_tables.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/netfilter/ip_tables.c +++ linux-2.6.24.7/net/ipv4/netfilter/ip_tables.c @@ -346,7 +346,7 @@ ipt_do_table(struct sk_buff *skb, read_lock_bh(&table->lock); IP_NF_ASSERT(table->valid_hooks & (1 << hook)); private = table->private; - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); /* For return from builtin chain */ 
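/*
 * Editorial sketch, not part of the patch above: the net/core/flow.c
 * conversion condensed into one illustrative function (the name
 * flow_table_walk_sketch() is made up).  Instead of assuming that
 * local_bh_disable() pins the task to one CPU, the lookup now takes the
 * per-CPU lock and remembers which CPU's table it locked.  The
 * smp_processor_id() -> raw_smp_processor_id() changes in arp_tables.c
 * and ip_tables.c presumably exist for the same reason: with threaded
 * softirqs on -rt these sections are no longer guaranteed to run with
 * preemption disabled.
 */
static void flow_table_walk_sketch(void)
{
	struct flow_cache_entry **table;
	int cpu;

	local_bh_disable();
	table = get_cpu_var_locked(flow_tables, &cpu);
	if (table) {
		/* ... hash the key and walk table[hash] under the lock ... */
	}
	put_cpu_var_locked(flow_tables, cpu);
	local_bh_enable();
}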
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/percpu-locked-netfilter2.patch��������������������������������������������������������������0000664�0000764�0000764�00000010725�11041657731�017101� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/net/netfilter/nf_conntrack.h | 2 +- include/net/netfilter/nf_conntrack_ecache.h | 13 +++++++------ net/netfilter/nf_conntrack_ecache.c | 16 ++++++++-------- 3 files changed, 16 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/include/net/netfilter/nf_conntrack.h =================================================================== --- linux-2.6.24.7.orig/include/net/netfilter/nf_conntrack.h +++ linux-2.6.24.7/include/net/netfilter/nf_conntrack.h @@ -259,13 +259,13 @@ extern atomic_t nf_conntrack_count; extern int nf_conntrack_max; DECLARE_PER_CPU(struct ip_conntrack_stat, nf_conntrack_stat); -#define NF_CT_STAT_INC(count) (__get_cpu_var(nf_conntrack_stat).count++) #define NF_CT_STAT_INC_ATOMIC(count) \ do { \ local_bh_disable(); \ __get_cpu_var(nf_conntrack_stat).count++; \ local_bh_enable(); \ } while (0) +#define NF_CT_STAT_INC(count) (__raw_get_cpu_var(nf_conntrack_stat).count++) extern int nf_conntrack_register_cache(u_int32_t features, const char *name, size_t size); Index: linux-2.6.24.7/include/net/netfilter/nf_conntrack_ecache.h =================================================================== --- linux-2.6.24.7.orig/include/net/netfilter/nf_conntrack_ecache.h +++ linux-2.6.24.7/include/net/netfilter/nf_conntrack_ecache.h @@ -15,16 +15,15 @@ struct nf_conntrack_ecache { struct nf_conn *ct; unsigned int events; }; -DECLARE_PER_CPU(struct nf_conntrack_ecache, nf_conntrack_ecache); - -#define CONNTRACK_ECACHE(x) (__get_cpu_var(nf_conntrack_ecache).x) +DECLARE_PER_CPU_LOCKED(struct nf_conntrack_ecache, nf_conntrack_ecache); extern struct atomic_notifier_head nf_conntrack_chain; extern int nf_conntrack_register_notifier(struct notifier_block *nb); extern int nf_conntrack_unregister_notifier(struct notifier_block *nb); extern void nf_ct_deliver_cached_events(const struct nf_conn *ct); -extern void __nf_ct_event_cache_init(struct nf_conn *ct); +extern void __nf_ct_event_cache_init(struct nf_conntrack_ecache *ecache, + struct nf_conn *ct); extern void nf_ct_event_cache_flush(void); static inline void @@ -33,12 +32,14 @@ nf_conntrack_event_cache(enum ip_conntra { struct nf_conn *ct = (struct nf_conn *)skb->nfct; struct nf_conntrack_ecache *ecache; + int cpu; local_bh_disable(); - ecache = &__get_cpu_var(nf_conntrack_ecache); + ecache = &get_cpu_var_locked(nf_conntrack_ecache, &cpu); if (ct != ecache->ct) - __nf_ct_event_cache_init(ct); + __nf_ct_event_cache_init(ecache, ct); ecache->events |= event; + put_cpu_var_locked(nf_conntrack_ecache, cpu); local_bh_enable(); } Index: linux-2.6.24.7/net/netfilter/nf_conntrack_ecache.c =================================================================== --- linux-2.6.24.7.orig/net/netfilter/nf_conntrack_ecache.c +++ 
linux-2.6.24.7/net/netfilter/nf_conntrack_ecache.c @@ -29,8 +29,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_chain); ATOMIC_NOTIFIER_HEAD(nf_ct_expect_chain); EXPORT_SYMBOL_GPL(nf_ct_expect_chain); -DEFINE_PER_CPU(struct nf_conntrack_ecache, nf_conntrack_ecache); -EXPORT_PER_CPU_SYMBOL_GPL(nf_conntrack_ecache); +DEFINE_PER_CPU_LOCKED(struct nf_conntrack_ecache, nf_conntrack_ecache); +EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(nf_conntrack_ecache); /* deliver cached events and clear cache entry - must be called with locally * disabled softirqs */ @@ -52,22 +52,22 @@ __nf_ct_deliver_cached_events(struct nf_ void nf_ct_deliver_cached_events(const struct nf_conn *ct) { struct nf_conntrack_ecache *ecache; + int cpu; local_bh_disable(); - ecache = &__get_cpu_var(nf_conntrack_ecache); + ecache = &get_cpu_var_locked(nf_conntrack_ecache, &cpu); if (ecache->ct == ct) __nf_ct_deliver_cached_events(ecache); + put_cpu_var_locked(nf_conntrack_ecache, cpu); local_bh_enable(); } EXPORT_SYMBOL_GPL(nf_ct_deliver_cached_events); /* Deliver cached events for old pending events, if current conntrack != old */ -void __nf_ct_event_cache_init(struct nf_conn *ct) +void +__nf_ct_event_cache_init(struct nf_conntrack_ecache *ecache, struct nf_conn *ct) { - struct nf_conntrack_ecache *ecache; - /* take care of delivering potentially old events */ - ecache = &__get_cpu_var(nf_conntrack_ecache); BUG_ON(ecache->ct == ct); if (ecache->ct) __nf_ct_deliver_cached_events(ecache); @@ -85,7 +85,7 @@ void nf_ct_event_cache_flush(void) int cpu; for_each_possible_cpu(cpu) { - ecache = &per_cpu(nf_conntrack_ecache, cpu); + ecache = &__get_cpu_var_locked(nf_conntrack_ecache, cpu); if (ecache->ct) nf_ct_put(ecache->ct); } �������������������������������������������patches/percpu-locked-powerpc-fixups.patch����������������������������������������������������������0000664�0000764�0000764�00000002213�11041657732�020010� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/mm/init_32.c | 2 +- arch/powerpc/mm/tlb_64.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/powerpc/mm/init_32.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/init_32.c +++ linux-2.6.24.7/arch/powerpc/mm/init_32.c @@ -54,7 +54,7 @@ #endif #define MAX_LOW_MEM CONFIG_LOWMEM_SIZE -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); unsigned long total_memory; unsigned long total_lowmem; Index: linux-2.6.24.7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/tlb_64.c +++ linux-2.6.24.7/arch/powerpc/mm/tlb_64.c @@ -36,7 +36,7 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p /* This is declared as we are using the more or less generic * include/asm-powerpc/tlb.h file -- tgall */ -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); unsigned long pte_freelist_forced_free; 
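/*
 * Editorial sketch, not from these patches: roughly how the -rt
 * asm-generic/tlb.h consumes the locked mmu_gathers defined above (this
 * is what the build error quoted in the next patch refers to).
 * tlb_gather_mmu_sketch() is an illustrative stand-in, assuming the real
 * code takes the per-CPU lock instead of disabling preemption and that
 * the matching tlb_finish_mmu() later drops the same CPU's lock.
 */
static inline struct mmu_gather *
tlb_gather_mmu_sketch(struct mm_struct *mm, unsigned int full_mm_flush)
{
	int cpu = raw_smp_processor_id();
	struct mmu_gather *tlb;

	spin_lock(&__get_cpu_lock(mmu_gathers, cpu));
	tlb = &__get_cpu_var_locked(mmu_gathers, cpu);
	tlb->mm = mm;
	tlb->fullmm = full_mm_flush;
	return tlb;
}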
�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/percpu-locked-powerpc-fixups-a6.patch�������������������������������������������������������0000664�0000764�0000764�00000007543�11041657734�020331� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following compile error by adding necessary macro definitions (mostly taken from asm-generic/percpu.h). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - include/asm-powerpc/percpu.h In file included from include/asm/tlb.h:52, from arch/powerpc/mm/mem.c:44: include/asm-generic/tlb.h:49: error: expected declaration specifiers or '...' before 'mmu_gathers' include/asm-generic/tlb.h:49: warning: data definition has no type or storage class include/asm-generic/tlb.h:49: warning: type defaults to 'int' in declaration of 'DECLARE_PER_CPU_LOCKED' include/asm-generic/tlb.h: In function 'tlb_gather_mmu': include/asm-generic/tlb.h:58: warning: implicit declaration of function '__get_cpu_lock' include/asm-generic/tlb.h:58: error: 'mmu_gathers' undeclared (first use in this function) include/asm-generic/tlb.h:58: error: (Each undeclared identifier is reported only once include/asm-generic/tlb.h:58: error: for each function it appears in.) : - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- include/asm-powerpc/percpu.h | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) Index: linux-2.6.24.7/include/asm-powerpc/percpu.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/percpu.h +++ linux-2.6.24.7/include/asm-powerpc/percpu.h @@ -19,6 +19,9 @@ /* Separate out the type, so (int[3], foo) works. */ #define DEFINE_PER_CPU(type, name) \ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __attribute__((__section__(".data.percpu"))) __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name##_locked #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ __attribute__((__section__(".data.percpu.shared_aligned"))) \ @@ -30,6 +33,15 @@ #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __my_cpu_offset())) #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, local_paca->data_offset)) +#define per_cpu_lock(var, cpu) \ + (*RELOC_HIDE(&per_cpu_lock__##var##_locked, __per_cpu_offset(cpu))) +#define per_cpu_var_locked(var, cpu) \ + (*RELOC_HIDE(&per_cpu__##var##_locked, __per_cpu_offset(cpu))) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) + /* A macro to avoid #include hell... 
*/ #define percpu_modcopy(pcpudst, src, size) \ do { \ @@ -47,17 +59,27 @@ extern void setup_per_cpu_areas(void); __typeof__(type) per_cpu__##name #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ DEFINE_PER_CPU(type, name) +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __typeof__(type) per_cpu__##name##_locked #define per_cpu(var, cpu) (*((void)(cpu), &per_cpu__##var)) +#define per_cpu_var_locked(var, cpu) (*((void)(cpu), &per_cpu__##var##_locked)) + #define __get_cpu_var(var) per_cpu__##var #define __raw_get_cpu_var(var) per_cpu__##var #endif /* SMP */ #define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern spinlock_t per_cpu_lock__##name##_locked; \ + extern __typeof__(type) per_cpu__##name##_locked #define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var) #define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var) +#define EXPORT_PER_CPU_LOCKED_SYMBOL(var) EXPORT_SYMBOL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL(per_cpu__##var##_locked) +#define EXPORT_PER_CPU_LOCKED_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu_lock__##var##_locked); EXPORT_SYMBOL_GPL(per_cpu__##var##_locked) #else #include <asm-generic/percpu.h> �������������������������������������������������������������������������������������������������������������������������������������������������������������patches/net-core-preempt-fix.patch������������������������������������������������������������������0000664�0000764�0000764�00000000720�11041657730�016233� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/net/core/dev.c =================================================================== --- linux-2.6.24.7.orig/net/core/dev.c +++ linux-2.6.24.7/net/core/dev.c @@ -1801,8 +1801,8 @@ int netif_rx_ni(struct sk_buff *skb) { int err; - preempt_disable(); err = netif_rx(skb); + preempt_disable(); if (local_softirq_pending()) do_softirq(); preempt_enable(); ������������������������������������������������patches/bh-uptodate-lock.patch����������������������������������������������������������������������0000664�0000764�0000764�00000011036�11041657734�015431� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� fs/buffer.c | 20 ++++++++------------ fs/ntfs/aops.c | 9 +++------ include/linux/buffer_head.h | 5 +---- 3 files changed, 12 insertions(+), 22 deletions(-) Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -403,8 +403,7 @@ static void end_buffer_async_read(struct * decide that the page is now completely done. 
*/ first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @@ -417,8 +416,7 @@ static void end_buffer_async_read(struct } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors and they are all @@ -430,8 +428,7 @@ static void end_buffer_async_read(struct return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @@ -466,8 +463,7 @@ static void end_buffer_async_write(struc } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_write(bh); unlock_buffer(bh); @@ -479,14 +475,12 @@ static void end_buffer_async_write(struc } tmp = tmp->b_this_page; } - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); end_page_writeback(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } @@ -3172,6 +3166,7 @@ struct buffer_head *alloc_buffer_head(gf set_migrateflags(gfp_flags, __GFP_RECLAIMABLE)); if (ret) { INIT_LIST_HEAD(&ret->b_assoc_buffers); + spin_lock_init(&ret->b_uptodate_lock); get_cpu_var(bh_accounting).nr++; recalc_bh_state(); put_cpu_var(bh_accounting); @@ -3183,6 +3178,7 @@ EXPORT_SYMBOL(alloc_buffer_head); void free_buffer_head(struct buffer_head *bh) { BUG_ON(!list_empty(&bh->b_assoc_buffers)); + BUG_ON(spin_is_locked(&bh->b_uptodate_lock)); kmem_cache_free(bh_cachep, bh); get_cpu_var(bh_accounting).nr--; recalc_bh_state(); Index: linux-2.6.24.7/fs/ntfs/aops.c =================================================================== --- linux-2.6.24.7.orig/fs/ntfs/aops.c +++ linux-2.6.24.7/fs/ntfs/aops.c @@ -103,8 +103,7 @@ static void ntfs_end_buffer_async_read(s "0x%llx.", (unsigned long long)bh->b_blocknr); } first = page_buffers(page); - local_irq_save(flags); - bit_spin_lock(BH_Uptodate_Lock, &first->b_state); + spin_lock_irqsave(&first->b_uptodate_lock, flags); clear_buffer_async_read(bh); unlock_buffer(bh); tmp = bh; @@ -119,8 +118,7 @@ static void ntfs_end_buffer_async_read(s } tmp = tmp->b_this_page; } while (tmp != bh); - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); /* * If none of the buffers had errors then we can set the page uptodate, * but we first have to perform the post read mst fixups, if the @@ -155,8 +153,7 @@ static void ntfs_end_buffer_async_read(s unlock_page(page); return; still_busy: - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state); - local_irq_restore(flags); + spin_unlock_irqrestore(&first->b_uptodate_lock, flags); return; } Index: linux-2.6.24.7/include/linux/buffer_head.h =================================================================== --- linux-2.6.24.7.orig/include/linux/buffer_head.h +++ linux-2.6.24.7/include/linux/buffer_head.h @@ -21,10 +21,6 @@ enum bh_state_bits { BH_Dirty, /* Is dirty */ BH_Lock, /* Is locked */ BH_Req, /* Has been submitted for I/O */ - BH_Uptodate_Lock,/* Used by the first bh in a page, 
to serialise - * IO completion of other buffers in the page - */ - BH_Mapped, /* Has a disk mapping */ BH_New, /* Disk mapping was newly created by get_block */ BH_Async_Read, /* Is under end_buffer_async_read I/O */ @@ -73,6 +69,7 @@ struct buffer_head { struct address_space *b_assoc_map; /* mapping this buffer is associated with */ atomic_t b_count; /* users using this buffer_head */ + spinlock_t b_uptodate_lock; }; /* ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/bh-state-lock.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000006026�11041657733�014726� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� I was compiling a kernel in a shell that I set to a priority of 20, and it locked up on the bit_spin_lock crap of jbd. This patch adds another spinlock to the buffer head and uses that instead of the bit_spins. From: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> -- fs/buffer.c | 3 ++- include/linux/buffer_head.h | 1 + include/linux/jbd.h | 12 ++++++------ 3 files changed, 9 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -40,7 +40,6 @@ #include <linux/cpu.h> #include <linux/bitops.h> #include <linux/mpage.h> -#include <linux/bit_spinlock.h> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -3167,6 +3166,7 @@ struct buffer_head *alloc_buffer_head(gf if (ret) { INIT_LIST_HEAD(&ret->b_assoc_buffers); spin_lock_init(&ret->b_uptodate_lock); + spin_lock_init(&ret->b_state_lock); get_cpu_var(bh_accounting).nr++; recalc_bh_state(); put_cpu_var(bh_accounting); @@ -3179,6 +3179,7 @@ void free_buffer_head(struct buffer_head { BUG_ON(!list_empty(&bh->b_assoc_buffers)); BUG_ON(spin_is_locked(&bh->b_uptodate_lock)); + BUG_ON(spin_is_locked(&bh->b_state_lock)); kmem_cache_free(bh_cachep, bh); get_cpu_var(bh_accounting).nr--; recalc_bh_state(); Index: linux-2.6.24.7/include/linux/buffer_head.h =================================================================== --- linux-2.6.24.7.orig/include/linux/buffer_head.h +++ linux-2.6.24.7/include/linux/buffer_head.h @@ -70,6 +70,7 @@ struct buffer_head { associated with */ atomic_t b_count; /* users using this buffer_head */ spinlock_t b_uptodate_lock; + spinlock_t b_state_lock; }; /* Index: linux-2.6.24.7/include/linux/jbd.h =================================================================== --- linux-2.6.24.7.orig/include/linux/jbd.h +++ linux-2.6.24.7/include/linux/jbd.h @@ -319,32 +319,32 @@ static inline struct journal_head *bh2jh static inline void jbd_lock_bh_state(struct buffer_head *bh) { - bit_spin_lock(BH_State, &bh->b_state); + 
spin_lock(&bh->b_state_lock); } static inline int jbd_trylock_bh_state(struct buffer_head *bh) { - return bit_spin_trylock(BH_State, &bh->b_state); + return spin_trylock(&bh->b_state_lock); } static inline int jbd_is_locked_bh_state(struct buffer_head *bh) { - return bit_spin_is_locked(BH_State, &bh->b_state); + return spin_is_locked(&bh->b_state_lock); } static inline void jbd_unlock_bh_state(struct buffer_head *bh) { - bit_spin_unlock(BH_State, &bh->b_state); + spin_unlock(&bh->b_state_lock); } static inline void jbd_lock_bh_journal_head(struct buffer_head *bh) { - bit_spin_lock(BH_JournalHead, &bh->b_state); + spin_lock_irq(&bh->b_uptodate_lock); } static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh) { - bit_spin_unlock(BH_JournalHead, &bh->b_state); + spin_unlock_irq(&bh->b_uptodate_lock); } struct jbd_revoke_table_s; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/jbd_assertions_smp_only.patch���������������������������������������������������������������0000664�0000764�0000764�00000003707�11041657734�017226� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� fs/jbd/transaction.c | 6 +++--- include/linux/jbd.h | 9 +++++++++ 2 files changed, 12 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/fs/jbd/transaction.c =================================================================== --- linux-2.6.24.7.orig/fs/jbd/transaction.c +++ linux-2.6.24.7/fs/jbd/transaction.c @@ -1516,7 +1516,7 @@ static void __journal_temp_unlink_buffer transaction_t *transaction; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); transaction = jh->b_transaction; if (transaction) assert_spin_locked(&transaction->t_journal->j_list_lock); @@ -1959,7 +1959,7 @@ void __journal_file_buffer(struct journa int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); assert_spin_locked(&transaction->t_journal->j_list_lock); J_ASSERT_JH(jh, jh->b_jlist < BJ_Types); @@ -2048,7 +2048,7 @@ void __journal_refile_buffer(struct jour int was_dirty; struct buffer_head *bh = jh2bh(jh); - J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh)); + J_ASSERT_JH_SMP(jh, jbd_is_locked_bh_state(bh)); if (jh->b_transaction) assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock); Index: linux-2.6.24.7/include/linux/jbd.h =================================================================== --- linux-2.6.24.7.orig/include/linux/jbd.h +++ linux-2.6.24.7/include/linux/jbd.h @@ -264,6 +264,15 @@ void buffer_assertion_failure(struct buf #define J_ASSERT_JH(jh, expr) J_ASSERT(expr) #endif +/* + * For assertions that are only valid on SMP (e.g. 
spin_is_locked()): + */ +#ifdef CONFIG_SMP +# define J_ASSERT_JH_SMP(jh, expr) J_ASSERT_JH(jh, expr) +#else +# define J_ASSERT_JH_SMP(jh, assert) do { } while (0) +#endif + #if defined(JBD_PARANOID_IOFAIL) #define J_EXPECT(expr, why...) J_ASSERT(expr) #define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr) ���������������������������������������������������������patches/tasklet-redesign.patch����������������������������������������������������������������������0000664�0000764�0000764�00000021106�11041657732�015531� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Ingo Molnar <mingo@elte.hu> tasklet redesign: make it saner and make it easier to thread. Signed-off-by: Ingo Molnar <mingo@elte.hu> ---- include/linux/interrupt.h | 39 ++++++----- kernel/softirq.c | 155 +++++++++++++++++++++++++++++++--------------- 2 files changed, 128 insertions(+), 66 deletions(-) Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -310,8 +310,9 @@ extern void wait_for_softirq(int softirq to be executed on some cpu at least once after this. * If the tasklet is already scheduled, but its excecution is still not started, it will be executed only once. - * If this tasklet is already running on another CPU (or schedule is called - from tasklet itself), it is rescheduled for later. + * If this tasklet is already running on another CPU, it is rescheduled + for later. + * Schedule must not be called from the tasklet itself (a lockup occurs) * Tasklet is strictly serialized wrt itself, but not wrt another tasklets. If client needs some intertask synchronization, he makes it with spinlocks. 
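/*
 * Editorial sketch, not part of the patch: what the redesigned
 * disable/enable semantics look like from a driver's point of view
 * (demo_tasklet, demo_func and demo_reconfigure are made-up names).
 * tasklet_enable() is now out of line and re-raises a tasklet that went
 * PENDING while it was disabled, so the caller does not have to
 * reschedule it by hand.
 */
static void demo_func(unsigned long data)
{
	/* must not call tasklet_schedule() on itself (lockup, see above) */
}
static DECLARE_TASKLET(demo_tasklet, demo_func, 0);

static void demo_reconfigure(void)
{
	tasklet_disable(&demo_tasklet);	/* waits for a running instance */
	/* ... update data shared with demo_func() ... */
	tasklet_enable(&demo_tasklet);	/* requeues it if it went PENDING */
}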
@@ -336,15 +337,25 @@ struct tasklet_struct name = { NULL, 0, enum { TASKLET_STATE_SCHED, /* Tasklet is scheduled for execution */ - TASKLET_STATE_RUN /* Tasklet is running (SMP only) */ + TASKLET_STATE_RUN, /* Tasklet is running (SMP only) */ + TASKLET_STATE_PENDING /* Tasklet is pending */ }; -#ifdef CONFIG_SMP +#define TASKLET_STATEF_SCHED (1 << TASKLET_STATE_SCHED) +#define TASKLET_STATEF_RUN (1 << TASKLET_STATE_RUN) +#define TASKLET_STATEF_PENDING (1 << TASKLET_STATE_PENDING) + +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) static inline int tasklet_trylock(struct tasklet_struct *t) { return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); } +static inline int tasklet_tryunlock(struct tasklet_struct *t) +{ + return cmpxchg(&t->state, TASKLET_STATEF_RUN, 0) == TASKLET_STATEF_RUN; +} + static inline void tasklet_unlock(struct tasklet_struct *t) { smp_mb__before_clear_bit(); @@ -356,9 +367,10 @@ static inline void tasklet_unlock_wait(s while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); } } #else -#define tasklet_trylock(t) 1 -#define tasklet_unlock_wait(t) do { } while (0) -#define tasklet_unlock(t) do { } while (0) +# define tasklet_trylock(t) 1 +# define tasklet_tryunlock(t) 1 +# define tasklet_unlock_wait(t) do { } while (0) +# define tasklet_unlock(t) do { } while (0) #endif extern void FASTCALL(__tasklet_schedule(struct tasklet_struct *t)); @@ -391,17 +403,8 @@ static inline void tasklet_disable(struc smp_mb(); } -static inline void tasklet_enable(struct tasklet_struct *t) -{ - smp_mb__before_atomic_dec(); - atomic_dec(&t->count); -} - -static inline void tasklet_hi_enable(struct tasklet_struct *t) -{ - smp_mb__before_atomic_dec(); - atomic_dec(&t->count); -} +extern fastcall void tasklet_enable(struct tasklet_struct *t); +extern fastcall void tasklet_hi_enable(struct tasklet_struct *t); extern void tasklet_kill(struct tasklet_struct *t); extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu); Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -454,14 +454,24 @@ struct tasklet_head static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec) = { NULL }; static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec) = { NULL }; +static void inline +__tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr) +{ + if (tasklet_trylock(t)) { + WARN_ON(t->next != NULL); + t->next = head->list; + head->list = t; + raise_softirq_irqoff(nr); + tasklet_unlock(t); + } +} + void fastcall __tasklet_schedule(struct tasklet_struct *t) { unsigned long flags; local_irq_save(flags); - t->next = __get_cpu_var(tasklet_vec).list; - __get_cpu_var(tasklet_vec).list = t; - raise_softirq_irqoff(TASKLET_SOFTIRQ); + __tasklet_common_schedule(t, &__get_cpu_var(tasklet_vec), TASKLET_SOFTIRQ); local_irq_restore(flags); } @@ -472,81 +482,130 @@ void fastcall __tasklet_hi_schedule(stru unsigned long flags; local_irq_save(flags); - t->next = __get_cpu_var(tasklet_hi_vec).list; - __get_cpu_var(tasklet_hi_vec).list = t; - raise_softirq_irqoff(HI_SOFTIRQ); + __tasklet_common_schedule(t, &__get_cpu_var(tasklet_hi_vec), HI_SOFTIRQ); local_irq_restore(flags); } EXPORT_SYMBOL(__tasklet_hi_schedule); -static void tasklet_action(struct softirq_action *a) +void fastcall tasklet_enable(struct tasklet_struct *t) { - struct tasklet_struct *list; + if (!atomic_dec_and_test(&t->count)) + return; + if 
(test_and_clear_bit(TASKLET_STATE_PENDING, &t->state)) + tasklet_schedule(t); +} - local_irq_disable(); - list = __get_cpu_var(tasklet_vec).list; - __get_cpu_var(tasklet_vec).list = NULL; - local_irq_enable(); +EXPORT_SYMBOL(tasklet_enable); + +void fastcall tasklet_hi_enable(struct tasklet_struct *t) +{ + if (!atomic_dec_and_test(&t->count)) + return; + if (test_and_clear_bit(TASKLET_STATE_PENDING, &t->state)) + tasklet_hi_schedule(t); +} + +EXPORT_SYMBOL(tasklet_hi_enable); + +static void +__tasklet_action(struct softirq_action *a, struct tasklet_struct *list) +{ + int loops = 1000000; while (list) { struct tasklet_struct *t = list; list = list->next; + /* + * Should always succeed - after a tasklist got on the + * list (after getting the SCHED bit set from 0 to 1), + * nothing but the tasklet softirq it got queued to can + * lock it: + */ + if (!tasklet_trylock(t)) { + WARN_ON(1); + continue; + } + + t->next = NULL; + + /* + * If we cannot handle the tasklet because it's disabled, + * mark it as pending. tasklet_enable() will later + * re-schedule the tasklet. + */ + if (unlikely(atomic_read(&t->count))) { +out_disabled: + /* implicit unlock: */ + wmb(); + t->state = TASKLET_STATEF_PENDING; + continue; + } - if (tasklet_trylock(t)) { - if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) - BUG(); - t->func(t->data); + /* + * After this point on the tasklet might be rescheduled + * on another CPU, but it can only be added to another + * CPU's tasklet list if we unlock the tasklet (which we + * dont do yet). + */ + if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) + WARN_ON(1); + +again: + t->func(t->data); + + /* + * Try to unlock the tasklet. We must use cmpxchg, because + * another CPU might have scheduled or disabled the tasklet. + * We only allow the STATE_RUN -> 0 transition here. 
+ */ + while (!tasklet_tryunlock(t)) { + /* + * If it got disabled meanwhile, bail out: + */ + if (atomic_read(&t->count)) + goto out_disabled; + /* + * If it got scheduled meanwhile, re-execute + * the tasklet function: + */ + if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) + goto again; + if (!--loops) { + printk("hm, tasklet state: %08lx\n", t->state); + WARN_ON(1); tasklet_unlock(t); - continue; + break; } - tasklet_unlock(t); } - - local_irq_disable(); - t->next = __get_cpu_var(tasklet_vec).list; - __get_cpu_var(tasklet_vec).list = t; - __do_raise_softirq_irqoff(TASKLET_SOFTIRQ); - local_irq_enable(); } } -static void tasklet_hi_action(struct softirq_action *a) +static void tasklet_action(struct softirq_action *a) { struct tasklet_struct *list; local_irq_disable(); - list = __get_cpu_var(tasklet_hi_vec).list; - __get_cpu_var(tasklet_hi_vec).list = NULL; + list = __get_cpu_var(tasklet_vec).list; + __get_cpu_var(tasklet_vec).list = NULL; local_irq_enable(); - while (list) { - struct tasklet_struct *t = list; + __tasklet_action(a, list); +} - list = list->next; +static void tasklet_hi_action(struct softirq_action *a) +{ + struct tasklet_struct *list; - if (tasklet_trylock(t)) { - if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) - BUG(); - t->func(t->data); - tasklet_unlock(t); - continue; - } - tasklet_unlock(t); - } + local_irq_disable(); + list = __get_cpu_var(tasklet_hi_vec).list; + __get_cpu_var(tasklet_hi_vec).list = NULL; + local_irq_enable(); - local_irq_disable(); - t->next = __get_cpu_var(tasklet_hi_vec).list; - __get_cpu_var(tasklet_hi_vec).list = t; - __do_raise_softirq_irqoff(HI_SOFTIRQ); - local_irq_enable(); - } + __tasklet_action(a, list); } - void tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data) { ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/tasklet-busy-loop-hack.patch����������������������������������������������������������������0000664�0000764�0000764�00000003247�11041657730�016572� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/interrupt.h | 6 ++---- kernel/softirq.c | 20 ++++++++++++++++++++ 2 files changed, 22 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/include/linux/interrupt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/interrupt.h +++ linux-2.6.24.7/include/linux/interrupt.h @@ -362,10 +362,8 @@ static inline void tasklet_unlock(struct clear_bit(TASKLET_STATE_RUN, &(t)->state); } -static inline void tasklet_unlock_wait(struct tasklet_struct *t) -{ - while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); } -} +extern void tasklet_unlock_wait(struct tasklet_struct *t); + #else # define tasklet_trylock(t) 1 # define tasklet_tryunlock(t) 1 Index: linux-2.6.24.7/kernel/softirq.c 
=================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -19,6 +19,7 @@ #include <linux/mm.h> #include <linux/notifier.h> #include <linux/percpu.h> +#include <linux/delay.h> #include <linux/cpu.h> #include <linux/freezer.h> #include <linux/kthread.h> @@ -640,6 +641,25 @@ void __init softirq_init(void) open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL); } +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) + +void tasklet_unlock_wait(struct tasklet_struct *t) +{ + while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { + /* + * Hack for now to avoid this busy-loop: + */ +#ifdef CONFIG_PREEMPT_RT + msleep(1); +#else + barrier(); +#endif + } +} +EXPORT_SYMBOL(tasklet_unlock_wait); + +#endif + static int ksoftirqd(void * __data) { struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/tasklet-fix-preemption-race.patch�����������������������������������������������������������0000664�0000764�0000764�00000007777�11041657734�017633� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From johnstul@us.ibm.com Wed Jun 6 04:17:34 2007 Return-Path: <johnstul@us.ibm.com> Received: from e3.ny.us.ibm.com (e3.ny.us.ibm.com [32.97.182.143]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.tglx.de (Postfix) with ESMTP id 1CCC065C065 for <tglx@linutronix.de>; Wed, 6 Jun 2007 04:17:34 +0200 (CEST) Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e3.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id l561EvIT011411 for <tglx@linutronix.de>; Tue, 5 Jun 2007 21:14:57 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l562HUG6545736 for <tglx@linutronix.de>; Tue, 5 Jun 2007 22:17:30 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l562HUu0027167 for <tglx@linutronix.de>; Tue, 5 Jun 2007 22:17:30 -0400 Received: from [9.47.21.16] (cog.beaverton.ibm.com [9.47.21.16]) by d01av04.pok.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l562HTkh027139; Tue, 5 Jun 2007 22:17:29 -0400 Subject: [PATCH -rt] Fix TASKLET_STATE_SCHED WARN_ON() From: john stultz <johnstul@us.ibm.com> To: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de>, Steven Rostedt <rostedt@goodmis.org>, "Paul E. 
McKenney" <paulmck@us.ibm.com>, lkml <linux-kernel@vger.kernel.org> Content-Type: text/plain Date: Tue, 05 Jun 2007 19:17:23 -0700 Message-Id: <1181096244.6018.20.camel@localhost> Mime-Version: 1.0 X-Mailer: Evolution 2.10.1 X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Hey Ingo, So we've been seeing the following trace fairly frequently on our SMP boxes when running kernbench: BUG: at kernel/softirq.c:639 __tasklet_action() Call Trace: [<ffffffff8106d5da>] dump_trace+0xaa/0x32a [<ffffffff8106d89b>] show_trace+0x41/0x5c [<ffffffff8106d8cb>] dump_stack+0x15/0x17 [<ffffffff81094a97>] __tasklet_action+0xdf/0x12e [<ffffffff81094f76>] tasklet_action+0x27/0x29 [<ffffffff8109530a>] ksoftirqd+0x16c/0x271 [<ffffffff81033d4d>] kthread+0xf5/0x128 [<ffffffff8105ff68>] child_rip+0xa/0x12 Paul also pointed this out awhile back: http://lkml.org/lkml/2007/2/25/1 Anyway, I think I finally found the issue. Its a bit hard to explain, but the idea is while __tasklet_action is running the tasklet function on CPU1, if a call to tasklet_schedule() on CPU2 is made, and if right after we mark the TASKLET_STATE_SCHED bit we are preempted, __tasklet_action on CPU1 might be able to re-run the function, clear the bit and unlock the tasklet before CPU2 enters __tasklet_common_schedule. Once __tasklet_common_schedule locks the tasklet, we will add the tasklet to the list with the TASKLET_STATE_SCHED *unset*. I've verified this race occurs w/ a WARN_ON in __tasklet_common_schedule(). This fix avoids this race by making sure *after* we've locked the tasklet that the STATE_SCHED bit is set before adding it to the list. Does it look ok to you? thanks -john Signed-off-by: John Stultz <johnstul@us.ibm.com> --- kernel/softirq.c | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -459,10 +459,17 @@ static void inline __tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr) { if (tasklet_trylock(t)) { - WARN_ON(t->next != NULL); - t->next = head->list; - head->list = t; - raise_softirq_irqoff(nr); + /* We may have been preempted before tasklet_trylock + * and __tasklet_action may have already run. + * So double check the sched bit while the takslet + * is locked before adding it to the list. 
+ */ + if (test_bit(TASKLET_STATE_SCHED, &t->state)) { + WARN_ON(t->next != NULL); + t->next = head->list; + head->list = t; + raise_softirq_irqoff(nr); + } tasklet_unlock(t); } } �patches/tasklet-more-fixes.patch��������������������������������������������������������������������0000664�0000764�0000764�00000015741�11041657731�016016� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-kernel-owner@vger.kernel.org Thu Jun 14 23:21:31 2007 Return-Path: <linux-kernel-owner+tglx=40linutronix.de-S1756447AbXFNVVF@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id F2D8065C3D9 for <tglx@linutronix.de>; Thu, 14 Jun 2007 23:21:31 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756447AbXFNVVF (ORCPT <rfc822;tglx@linutronix.de>); Thu, 14 Jun 2007 17:21:05 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753441AbXFNVUw (ORCPT <rfc822;linux-kernel-outgoing>); Thu, 14 Jun 2007 17:20:52 -0400 Received: from e33.co.us.ibm.com ([32.97.110.151]:53331 "EHLO e33.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752693AbXFNVUv (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 14 Jun 2007 17:20:51 -0400 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e33.co.us.ibm.com (8.13.8/8.13.8) with ESMTP id l5ELKnM3030113 for <linux-kernel@vger.kernel.org>; Thu, 14 Jun 2007 17:20:49 -0400 Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v8.3) with ESMTP id l5ELKniv268710 for <linux-kernel@vger.kernel.org>; Thu, 14 Jun 2007 15:20:49 -0600 Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id l5ELKm9A010919 for <linux-kernel@vger.kernel.org>; Thu, 14 Jun 2007 15:20:49 -0600 Received: from [9.67.41.186] (wecm-9-67-41-186.wecm.ibm.com [9.67.41.186]) by d03av01.boulder.ibm.com (8.12.11.20060308/8.12.11) with ESMTP id l5ELKl3X010835; Thu, 14 Jun 2007 15:20:47 -0600 Subject: Re: [PATCH -rt] Fix TASKLET_STATE_SCHED WARN_ON() From: john stultz <johnstul@us.ibm.com> To: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de>, Steven Rostedt <rostedt@goodmis.org>, "Paul E. 
McKenney" <paulmck@us.ibm.com>, lkml <linux-kernel@vger.kernel.org> In-Reply-To: <1181096244.6018.20.camel@localhost> References: <1181096244.6018.20.camel@localhost> Content-Type: text/plain Date: Thu, 14 Jun 2007 14:20:20 -0700 Message-Id: <1181856020.6276.14.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.10.1 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org X-Filter-To: .Kernel.LKML X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit On Tue, 2007-06-05 at 19:17 -0700, john stultz wrote: > Hey Ingo, > So we've been seeing the following trace fairly frequently on our SMP > boxes when running kernbench: > > BUG: at kernel/softirq.c:639 __tasklet_action() > > Call Trace: > [<ffffffff8106d5da>] dump_trace+0xaa/0x32a > [<ffffffff8106d89b>] show_trace+0x41/0x5c > [<ffffffff8106d8cb>] dump_stack+0x15/0x17 > [<ffffffff81094a97>] __tasklet_action+0xdf/0x12e > [<ffffffff81094f76>] tasklet_action+0x27/0x29 > [<ffffffff8109530a>] ksoftirqd+0x16c/0x271 > [<ffffffff81033d4d>] kthread+0xf5/0x128 > [<ffffffff8105ff68>] child_rip+0xa/0x12 > > > Paul also pointed this out awhile back: http://lkml.org/lkml/2007/2/25/1 > > > Anyway, I think I finally found the issue. Its a bit hard to explain, > but the idea is while __tasklet_action is running the tasklet function > on CPU1, if a call to tasklet_schedule() on CPU2 is made, and if right > after we mark the TASKLET_STATE_SCHED bit we are preempted, > __tasklet_action on CPU1 might be able to re-run the function, clear the > bit and unlock the tasklet before CPU2 enters __tasklet_common_schedule. > Once __tasklet_common_schedule locks the tasklet, we will add the > tasklet to the list with the TASKLET_STATE_SCHED *unset*. > > I've verified this race occurs w/ a WARN_ON in > __tasklet_common_schedule(). > > > This fix avoids this race by making sure *after* we've locked the > tasklet that the STATE_SCHED bit is set before adding it to the list. > > Does it look ok to you? > > thanks > -john > > Signed-off-by: John Stultz <johnstul@us.ibm.com> > > Index: 2.6-rt/kernel/softirq.c > =================================================================== > --- 2.6-rt.orig/kernel/softirq.c 2007-06-05 18:30:54.000000000 -0700 > +++ 2.6-rt/kernel/softirq.c 2007-06-05 18:36:44.000000000 -0700 > @@ -544,10 +544,17 @@ static void inline > __tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr) > { > if (tasklet_trylock(t)) { > - WARN_ON(t->next != NULL); > - t->next = head->list; > - head->list = t; > - raise_softirq_irqoff(nr); > + /* We may have been preempted before tasklet_trylock > + * and __tasklet_action may have already run. > + * So double check the sched bit while the takslet > + * is locked before adding it to the list. > + */ > + if (test_bit(TASKLET_STATE_SCHED, &t->state)) { > + WARN_ON(t->next != NULL); > + t->next = head->list; > + head->list = t; > + raise_softirq_irqoff(nr); > + } > tasklet_unlock(t); > } > } So while digging on a strange OOM issue we were seeing (which actually ended up being fixed by Steven's softirq patch), I noticed that the fix above is incomplete. With only the patch above, we may no longer have unscheduled tasklets added to the list, but we may end up with scheduled tasklets that are not on the list (and will stay that way!). The following additional patch should correct this issue. 
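[Editorial aside, not part of John's mail: a sketch of the remaining
window, reconstructed from the comment in the patch below.

  CPU1: __tasklet_common_schedule() takes the tasklet lock, finds
        TASKLET_STATE_SCHED clear (the corner case above), skips the
        queueing and is preempted just before tasklet_unlock().
  CPU2: tasklet_schedule() sets TASKLET_STATE_SCHED, enters
        __tasklet_common_schedule(), but tasklet_trylock() fails because
        CPU1 still holds the lock, so it returns without queueing.
  CPU1: resumes and drops the lock.

  Result: SCHED is set but the tasklet is on nobody's list, so it never
  runs.  Hence the patch releases the lock with tasklet_tryunlock() and
  re-checks the SCHED bit whenever that fails.]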
Although since we weren't actually hitting it, the issue is a bit theoretical, so I've not been able to prove it really fixes anything. thanks -john --- kernel/softirq.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -459,6 +459,7 @@ static void inline __tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr) { if (tasklet_trylock(t)) { +again: /* We may have been preempted before tasklet_trylock * and __tasklet_action may have already run. * So double check the sched bit while the takslet @@ -469,8 +470,21 @@ __tasklet_common_schedule(struct tasklet t->next = head->list; head->list = t; raise_softirq_irqoff(nr); + tasklet_unlock(t); + } else { + /* This is subtle. If we hit the corner case above + * It is possible that we get preempted right here, + * and another task has successfully called + * tasklet_schedule(), then this function, and + * failed on the trylock. Thus we must be sure + * before releasing the tasklet lock, that the + * SCHED_BIT is clear. Otherwise the tasklet + * may get its SCHED_BIT set, but not added to the + * list + */ + if (!tasklet_tryunlock(t)) + goto again; } - tasklet_unlock(t); } } �������������������������������patches/disable-irqpoll.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000001762�11041657731�015354� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� kernel/irq/spurious.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.24.7/kernel/irq/spurious.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/spurious.c +++ linux-2.6.24.7/kernel/irq/spurious.c @@ -239,6 +239,11 @@ __setup("noirqdebug", noirqdebug_setup); static int __init irqfixup_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + printk(KERN_WARNING "irqfixup boot option not supported " + "w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 1; printk(KERN_WARNING "Misrouted IRQ fixup support enabled.\n"); printk(KERN_WARNING "This may impact system performance.\n"); @@ -250,6 +255,11 @@ __setup("irqfixup", irqfixup_setup); static int __init irqpoll_setup(char *str) { +#ifdef CONFIG_PREEMPT_RT + printk(KERN_WARNING "irqpoll boot option not supported " + "w/ CONFIG_PREEMPT_RT\n"); + return 1; +#endif irqfixup = 2; printk(KERN_WARNING "Misrouted IRQ fixup and polling support " "enabled\n"); ��������������patches/kstat-add-rt-stats.patch��������������������������������������������������������������������0000664�0000764�0000764�00000012312�11041657731�015715� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Thomas Gleixner <tglx@linutronix.de> Subject: add rt stats to /proc/stat add RT stats to /proc/stat Signed-off-by: Ingo 
Molnar <mingo@elte.hu> fs/proc/proc_misc.c | 24 ++++++++++++++++++------ include/linux/kernel_stat.h | 2 ++ kernel/sched.c | 6 +++++- 3 files changed, 25 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/fs/proc/proc_misc.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/proc_misc.c +++ linux-2.6.24.7/fs/proc/proc_misc.c @@ -455,7 +455,8 @@ static int show_stat(struct seq_file *p, { int i; unsigned long jif; - cputime64_t user, nice, system, idle, iowait, irq, softirq, steal; + cputime64_t user_rt, user, nice, system_rt, system, idle, + iowait, irq, softirq, steal; cputime64_t guest; u64 sum = 0; struct timespec boottime; @@ -465,7 +466,7 @@ static int show_stat(struct seq_file *p, if (!per_irq_sum) return -ENOMEM; - user = nice = system = idle = iowait = + user_rt = user = nice = system_rt = system = idle = iowait = irq = softirq = steal = cputime64_zero; guest = cputime64_zero; getboottime(&boottime); @@ -482,6 +483,8 @@ static int show_stat(struct seq_file *p, irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq); softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq); steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal); + user_rt = cputime64_add(user_rt, kstat_cpu(i).cpustat.user_rt); + system_rt = cputime64_add(system_rt, kstat_cpu(i).cpustat.system_rt); guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest); for (j = 0; j < NR_IRQS; j++) { unsigned int temp = kstat_cpu(i).irqs[j]; @@ -490,7 +493,10 @@ static int show_stat(struct seq_file *p, } } - seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", + user = cputime64_add(user_rt, user); + system = cputime64_add(system_rt, system); + + seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", (unsigned long long)cputime64_to_clock_t(user), (unsigned long long)cputime64_to_clock_t(nice), (unsigned long long)cputime64_to_clock_t(system), @@ -499,13 +505,17 @@ static int show_stat(struct seq_file *p, (unsigned long long)cputime64_to_clock_t(irq), (unsigned long long)cputime64_to_clock_t(softirq), (unsigned long long)cputime64_to_clock_t(steal), + (unsigned long long)cputime64_to_clock_t(user_rt), + (unsigned long long)cputime64_to_clock_t(system_rt), (unsigned long long)cputime64_to_clock_t(guest)); for_each_online_cpu(i) { /* Copy values here to work around gcc-2.95.3, gcc-2.96 */ - user = kstat_cpu(i).cpustat.user; + user_rt = kstat_cpu(i).cpustat.user_rt; + system_rt = kstat_cpu(i).cpustat.system_rt; + user = cputime64_add(user_rt, kstat_cpu(i).cpustat.user); nice = kstat_cpu(i).cpustat.nice; - system = kstat_cpu(i).cpustat.system; + system = cputime64_add(system_rt, kstat_cpu(i).cpustat.system); idle = kstat_cpu(i).cpustat.idle; iowait = kstat_cpu(i).cpustat.iowait; irq = kstat_cpu(i).cpustat.irq; @@ -513,7 +523,7 @@ static int show_stat(struct seq_file *p, steal = kstat_cpu(i).cpustat.steal; guest = kstat_cpu(i).cpustat.guest; seq_printf(p, - "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", + "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", i, (unsigned long long)cputime64_to_clock_t(user), (unsigned long long)cputime64_to_clock_t(nice), @@ -523,6 +533,8 @@ static int show_stat(struct seq_file *p, (unsigned long long)cputime64_to_clock_t(irq), (unsigned long long)cputime64_to_clock_t(softirq), (unsigned long long)cputime64_to_clock_t(steal), + (unsigned long long)cputime64_to_clock_t(user_rt), + (unsigned long long)cputime64_to_clock_t(system_rt), (unsigned long long)cputime64_to_clock_t(guest)); } 
seq_printf(p, "intr %llu", (unsigned long long)sum); Index: linux-2.6.24.7/include/linux/kernel_stat.h =================================================================== --- linux-2.6.24.7.orig/include/linux/kernel_stat.h +++ linux-2.6.24.7/include/linux/kernel_stat.h @@ -23,6 +23,8 @@ struct cpu_usage_stat { cputime64_t idle; cputime64_t iowait; cputime64_t steal; + cputime64_t user_rt; + cputime64_t system_rt; cputime64_t guest; }; Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3450,7 +3450,9 @@ void account_user_time(struct task_struc /* Add user time to cpustat. */ tmp = cputime_to_cputime64(cputime); - if (TASK_NICE(p) > 0) + if (rt_task(p)) + cpustat->user_rt = cputime64_add(cpustat->user_rt, tmp); + else if (TASK_NICE(p) > 0) cpustat->nice = cputime64_add(cpustat->nice, tmp); else cpustat->user = cputime64_add(cpustat->user, tmp); @@ -3509,6 +3511,8 @@ void account_system_time(struct task_str cpustat->irq = cputime64_add(cpustat->irq, tmp); else if (softirq_count() || (p->flags & PF_SOFTIRQ)) cpustat->softirq = cputime64_add(cpustat->softirq, tmp); + else if (rt_task(p)) + cpustat->system_rt = cputime64_add(cpustat->system_rt, tmp); else if (p != rq->idle) cpustat->system = cputime64_add(cpustat->system, tmp); else if (atomic_read(&rq->nr_iowait) > 0) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-warn-and-bug-on.patch������������������������������������������������������0000664�0000764�0000764�00000001727�11041657733�020442� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/asm-generic/bug.h | 14 ++++++++++++++ 1 file changed, 14 insertions(+) Index: linux-2.6.24.7/include/asm-generic/bug.h =================================================================== --- linux-2.6.24.7.orig/include/asm-generic/bug.h +++ linux-2.6.24.7/include/asm-generic/bug.h @@ -3,6 +3,8 @@ #include <linux/compiler.h> +extern void __WARN_ON(const char *func, const char *file, const int line); + #ifdef CONFIG_BUG #ifdef CONFIG_GENERIC_BUG @@ -76,4 +78,16 @@ struct bug_entry { # define WARN_ON_SMP(x) do { } while (0) #endif +#ifdef CONFIG_PREEMPT_RT +# define BUG_ON_RT(c) BUG_ON(c) +# define BUG_ON_NONRT(c) do { } while (0) +# define WARN_ON_RT(condition) WARN_ON(condition) +# define WARN_ON_NONRT(condition) do { } while (0) +#else +# define BUG_ON_RT(c) do { } while (0) +# define BUG_ON_NONRT(c) BUG_ON(c) +# define WARN_ON_RT(condition) do { } while (0) +# define WARN_ON_NONRT(condition) WARN_ON(condition) +#endif + #endif �����������������������������������������patches/cputimer-thread-rt_A0.patch�����������������������������������������������������������������0000664�0000764�0000764�00000021473�11041657733�016334� 0����������������������������������������������������������������������������������������������������ustar 
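The RT/NONRT variants above are used by the patches that follow to keep assertions meaningful on both configurations. For instance, the posix-cpu-timers changes below relax an interrupt-state check that only holds while the code still runs in hard-irq context. An illustrative before/after, mirroring the hunks in kernel/posix-cpu-timers.c:

	/* Before: the timer code always runs with hard interrupts disabled. */
	BUG_ON(!irqs_disabled());

	/*
	 * After: on PREEMPT_RT the same code may run from a thread with
	 * interrupts enabled, so the check is only made on !PREEMPT_RT.
	 */
	BUG_ON_NONRT(!irqs_disabled());
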
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Ingo, This patch re-adds the posix-cpu-timer functionality by running it from a per-cpu RT thread. This allows cpu rlimits to be enforced against RT processes that would otherwise starve the system. thanks -john Signed-off-by: John Stultz <johnstul@us.ibm.com> include/linux/init_task.h | 1 include/linux/posix-timers.h | 2 include/linux/sched.h | 2 init/main.c | 2 kernel/fork.c | 2 kernel/posix-cpu-timers.c | 176 ++++++++++++++++++++++++++++++++++++++++++- 6 files changed, 180 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/init_task.h =================================================================== --- linux-2.6.24.7.orig/include/linux/init_task.h +++ linux-2.6.24.7/include/linux/init_task.h @@ -166,6 +166,7 @@ extern struct group_info init_groups; .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ .fs_excl = ATOMIC_INIT(0), \ + .posix_timer_list = NULL, \ .pi_lock = RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ .pids = { \ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \ Index: linux-2.6.24.7/include/linux/posix-timers.h =================================================================== --- linux-2.6.24.7.orig/include/linux/posix-timers.h +++ linux-2.6.24.7/include/linux/posix-timers.h @@ -115,4 +115,6 @@ void set_process_cpu_timer(struct task_s long clock_nanosleep_restart(struct restart_block *restart_block); +int posix_cpu_thread_init(void); + #endif Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1070,6 +1070,8 @@ struct task_struct { unsigned long long it_sched_expires; struct list_head cpu_timers[3]; + struct task_struct* posix_timer_list; + /* process credentials */ uid_t uid,euid,suid,fsuid; gid_t gid,egid,sgid,fsgid; Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -34,6 +34,7 @@ #include <linux/workqueue.h> #include <linux/profile.h> #include <linux/rcupdate.h> +#include <linux/posix-timers.h> #include <linux/moduleparam.h> #include <linux/kallsyms.h> #include <linux/writeback.h> @@ -753,6 +754,7 @@ static void __init do_pre_smp_initcalls( extern int spawn_ksoftirqd(void); migration_init(); + posix_cpu_thread_init(); spawn_ksoftirqd(); if (!nosoftlockup) spawn_softlockup_task(); Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1081,7 +1081,7 @@ static struct task_struct *copy_process( INIT_LIST_HEAD(&p->cpu_timers[0]); INIT_LIST_HEAD(&p->cpu_timers[1]); INIT_LIST_HEAD(&p->cpu_timers[2]); - + p->posix_timer_list = NULL; p->lock_depth = -1; /* -1 = no lock */ do_posix_clock_monotonic_gettime(&p->start_time); p->real_start_time = p->start_time; Index: linux-2.6.24.7/kernel/posix-cpu-timers.c =================================================================== --- linux-2.6.24.7.orig/kernel/posix-cpu-timers.c +++ linux-2.6.24.7/kernel/posix-cpu-timers.c @@ -578,7 +578,7 @@ static void arm_timer(struct k_itimer *t p->cpu_timers : p->signal->cpu_timers); head += CPUCLOCK_WHICH(timer->it_clock); - 
BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); spin_lock(&p->sighand->siglock); listpos = head; @@ -735,7 +735,7 @@ int posix_cpu_timer_set(struct k_itimer /* * Disarm any old timer after extracting its expiry time. */ - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); ret = 0; spin_lock(&p->sighand->siglock); @@ -1287,12 +1287,11 @@ out: * already updated our counts. We need to check if any timers fire now. * Interrupts are disabled. */ -void run_posix_cpu_timers(struct task_struct *tsk) +void __run_posix_cpu_timers(struct task_struct *tsk) { LIST_HEAD(firing); struct k_itimer *timer, *next; - BUG_ON(!irqs_disabled()); #define UNEXPIRED(clock) \ (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \ @@ -1355,6 +1354,169 @@ void run_posix_cpu_timers(struct task_st } } +#include <linux/kthread.h> +#include <linux/cpu.h> +DEFINE_PER_CPU(struct task_struct *, posix_timer_task); +DEFINE_PER_CPU(struct task_struct *, posix_timer_tasklist); + +static int posix_cpu_timers_thread(void *data) +{ + int cpu = (long)data; + + BUG_ON(per_cpu(posix_timer_task,cpu) != current); + + + while (!kthread_should_stop()) { + struct task_struct *tsk = NULL; + struct task_struct *next = NULL; + + if (cpu_is_offline(cpu)) { + goto wait_to_die; + } + + /* grab task list */ + raw_local_irq_disable(); + tsk = per_cpu(posix_timer_tasklist, cpu); + per_cpu(posix_timer_tasklist, cpu) = NULL; + raw_local_irq_enable(); + + + /* its possible the list is empty, just return */ + if (!tsk) { + set_current_state(TASK_INTERRUPTIBLE); + schedule(); + __set_current_state(TASK_RUNNING); + continue; + } + + /* Process task list */ + while (1) { + /* save next */ + next = tsk->posix_timer_list; + + /* run the task timers, clear its ptr and + * unreference it + */ + __run_posix_cpu_timers(tsk); + tsk->posix_timer_list = NULL; + put_task_struct(tsk); + + /* check if this is the last on the list */ + if (next == tsk) + break; + tsk = next; + } + } + return 0; + +wait_to_die: + /* Wait for kthread_stop */ + set_current_state(TASK_INTERRUPTIBLE); + while (!kthread_should_stop()) { + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +void run_posix_cpu_timers(struct task_struct *tsk) +{ + unsigned long cpu = smp_processor_id(); + struct task_struct *tasklist; + + BUG_ON(!irqs_disabled()); + if(!per_cpu(posix_timer_task, cpu)) + return; + /* get per-cpu references */ + tasklist = per_cpu(posix_timer_tasklist, cpu); + + /* check to see if we're already queued */ + if (!tsk->posix_timer_list) { + get_task_struct(tsk); + if (tasklist) { + tsk->posix_timer_list = tasklist; + } else { + /* + * The list is terminated by a self-pointing + * task_struct + */ + tsk->posix_timer_list = tsk; + } + per_cpu(posix_timer_tasklist, cpu) = tsk; + } + /* XXX signal the thread somehow */ + wake_up_process(per_cpu(posix_timer_task,cpu)); +} + + + + +/* + * posix_cpu_thread_call - callback that gets triggered when a CPU is added. + * Here we can start up the necessary migration thread for the new CPU. 
+ */ +static int posix_cpu_thread_call(struct notifier_block *nfb, unsigned long action, + void *hcpu) +{ + int cpu = (long)hcpu; + struct task_struct *p; + struct sched_param param; + + switch (action) { + case CPU_UP_PREPARE: + p = kthread_create(posix_cpu_timers_thread, hcpu, + "posix_cpu_timers/%d",cpu); + if (IS_ERR(p)) + return NOTIFY_BAD; + p->flags |= PF_NOFREEZE; + kthread_bind(p, cpu); + /* Must be high prio to avoid getting starved */ + param.sched_priority = MAX_RT_PRIO-1; + sched_setscheduler(p, SCHED_FIFO, ¶m); + per_cpu(posix_timer_task,cpu) = p; + break; + case CPU_ONLINE: + /* Strictly unneccessary, as first user will wake it. */ + wake_up_process(per_cpu(posix_timer_task,cpu)); + break; +#ifdef CONFIG_HOTPLUG_CPU + case CPU_UP_CANCELED: + /* Unbind it from offline cpu so it can run. Fall thru. */ + kthread_bind(per_cpu(posix_timer_task,cpu), + any_online_cpu(cpu_online_map)); + kthread_stop(per_cpu(posix_timer_task,cpu)); + per_cpu(posix_timer_task,cpu) = NULL; + break; + case CPU_DEAD: + kthread_stop(per_cpu(posix_timer_task,cpu)); + per_cpu(posix_timer_task,cpu) = NULL; + break; +#endif + } + return NOTIFY_OK; +} + +/* Register at highest priority so that task migration (migrate_all_tasks) + * happens before everything else. + */ +static struct notifier_block __devinitdata posix_cpu_thread_notifier = { + .notifier_call = posix_cpu_thread_call, + .priority = 10 +}; + +int __init posix_cpu_thread_init(void) +{ + void *cpu = (void *)(long)smp_processor_id(); + /* Start one for boot CPU. */ + posix_cpu_thread_call(&posix_cpu_thread_notifier, CPU_UP_PREPARE, cpu); + posix_cpu_thread_call(&posix_cpu_thread_notifier, CPU_ONLINE, cpu); + register_cpu_notifier(&posix_cpu_thread_notifier); + return 0; +} + + + /* * Set one of the process-wide special case CPU timers. * The tasklist_lock and tsk->sighand->siglock must be held by the caller. 
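An aside on the data structure used above: the per-CPU queue is a singly linked list threaded through task_struct::posix_timer_list whose tail points at itself, so producer and consumer need no separate sentinel. A condensed sketch of the consumer loop from the kthread above, with comments added for illustration:

	/*
	 * List shape after three tasks were queued on this CPU:
	 *
	 *   per_cpu(posix_timer_tasklist) -> C -> B -> A -> A
	 *                                              (tail points to itself)
	 */
	while (1) {
		next = tsk->posix_timer_list;	/* save before clearing */
		__run_posix_cpu_timers(tsk);
		tsk->posix_timer_list = NULL;
		put_task_struct(tsk);		/* drop the producer's reference */
		if (next == tsk)		/* self-pointer marks end of list */
			break;
		tsk = next;
	}
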
@@ -1620,6 +1782,12 @@ static __init int init_posix_cpu_timers( .nsleep = thread_cpu_nsleep, .nsleep_restart = thread_cpu_nsleep_restart, }; + unsigned long cpu; + + /* init the per-cpu posix_timer_tasklets */ + for_each_cpu_mask(cpu, cpu_possible_map) { + per_cpu(posix_timer_tasklist, cpu) = NULL; + } register_posix_clock(CLOCK_PROCESS_CPUTIME_ID, &process); register_posix_clock(CLOCK_THREAD_CPUTIME_ID, &thread); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cputimer-thread-rt-fix.patch����������������������������������������������������������������0000664�0000764�0000764�00000003137�11041657734�016576� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/posix-cpu-timers.c | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/kernel/posix-cpu-timers.c =================================================================== --- linux-2.6.24.7.orig/kernel/posix-cpu-timers.c +++ linux-2.6.24.7/kernel/posix-cpu-timers.c @@ -1292,18 +1292,6 @@ void __run_posix_cpu_timers(struct task_ LIST_HEAD(firing); struct k_itimer *timer, *next; - -#define UNEXPIRED(clock) \ - (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \ - cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires)) - - if (UNEXPIRED(prof) && UNEXPIRED(virt) && - (tsk->it_sched_expires == 0 || - tsk->se.sum_exec_runtime < tsk->it_sched_expires)) - return; - -#undef UNEXPIRED - /* * Double-check with locks held. 
*/ @@ -1428,6 +1416,19 @@ void run_posix_cpu_timers(struct task_st BUG_ON(!irqs_disabled()); if(!per_cpu(posix_timer_task, cpu)) return; + + +#define UNEXPIRED(clock) \ + (cputime_eq(tsk->it_##clock##_expires, cputime_zero) || \ + cputime_lt(clock##_ticks(tsk), tsk->it_##clock##_expires)) + + if (UNEXPIRED(prof) && UNEXPIRED(virt) && + (tsk->it_sched_expires == 0 || + tsk->sum_exec_runtime < tsk->it_sched_expires)) + return; + +#undef UNEXPIRED + /* get per-cpu references */ tasklist = per_cpu(posix_timer_tasklist, cpu); @@ -1446,7 +1447,7 @@ void run_posix_cpu_timers(struct task_st per_cpu(posix_timer_tasklist, cpu) = tsk; } /* XXX signal the thread somehow */ - wake_up_process(per_cpu(posix_timer_task,cpu)); + wake_up_process(per_cpu(posix_timer_task, cpu)); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/posix-cpu-timers-fix.patch������������������������������������������������������������������0000664�0000764�0000764�00000001652�11041657734�016306� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� kernel/posix-cpu-timers.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/posix-cpu-timers.c =================================================================== --- linux-2.6.24.7.orig/kernel/posix-cpu-timers.c +++ linux-2.6.24.7/kernel/posix-cpu-timers.c @@ -1296,6 +1296,12 @@ void __run_posix_cpu_timers(struct task_ * Double-check with locks held. */ read_lock(&tasklist_lock); + /* Make sure the task doesn't exit under us. */ + if (unlikely(tsk->exit_state)) { + read_unlock(&tasklist_lock); + return; + } + if (likely(tsk->signal != NULL)) { spin_lock(&tsk->sighand->siglock); @@ -1424,7 +1430,7 @@ void run_posix_cpu_timers(struct task_st if (UNEXPIRED(prof) && UNEXPIRED(virt) && (tsk->it_sched_expires == 0 || - tsk->sum_exec_runtime < tsk->it_sched_expires)) + tsk->se.sum_exec_runtime < tsk->it_sched_expires)) return; #undef UNEXPIRED ��������������������������������������������������������������������������������������patches/vortex-fix.patch����������������������������������������������������������������������������0000664�0000764�0000764�00000005135�11041657730�014401� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� Argh, cut and paste wasn't enough... Use this patch instead. It needs an irq disable. But, believe it or not, on SMP this is actually better. If the irq is shared (as it is in Mark's case), we don't stop the irq of other devices from being handled on another CPU (unfortunately for Mark, he pinned all interrupts to one CPU). 
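Reduced to its essentials, the change in vortex_timer() swaps IRQ-line masking for the driver's own lock (sketch only; the real diff follows):

	/* Before: masks the whole, possibly shared, IRQ line. */
	disable_irq_lockdep(dev->irq);
	media_status = ioread16(ioaddr + Wn4_Media);
	enable_irq_lockdep(dev->irq);

	/*
	 * After: only this device's handler is excluded via vp->lock;
	 * other devices sharing the line keep being serviced.
	 */
	spin_lock_irqsave(&vp->lock, flags);
	media_status = ioread16(ioaddr + Wn4_Media);
	spin_unlock_irqrestore(&vp->lock, flags);
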
Andrew, should this be changed in mainline too? -- Steve Signed-off-by: Steven Rostedt <rostedt@goodmis.org> drivers/net/3c59x.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/drivers/net/3c59x.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/3c59x.c +++ linux-2.6.24.7/drivers/net/3c59x.c @@ -792,9 +792,9 @@ static void poll_vortex(struct net_devic { struct vortex_private *vp = netdev_priv(dev); unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); (vp->full_bus_master_rx ? boomerang_interrupt:vortex_interrupt)(dev->irq,dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #endif @@ -1739,6 +1739,7 @@ vortex_timer(unsigned long data) int next_tick = 60*HZ; int ok = 0; int media_status, old_window; + unsigned long flags; if (vortex_debug > 2) { printk(KERN_DEBUG "%s: Media selection timer tick happened, %s.\n", @@ -1746,7 +1747,7 @@ vortex_timer(unsigned long data) printk(KERN_DEBUG "dev->watchdog_timeo=%d\n", dev->watchdog_timeo); } - disable_irq_lockdep(dev->irq); + spin_lock_irqsave(&vp->lock, flags); old_window = ioread16(ioaddr + EL3_CMD) >> 13; EL3WINDOW(4); media_status = ioread16(ioaddr + Wn4_Media); @@ -1769,9 +1770,7 @@ vortex_timer(unsigned long data) case XCVR_MII: case XCVR_NWAY: { ok = 1; - spin_lock_bh(&vp->lock); vortex_check_media(dev, 0); - spin_unlock_bh(&vp->lock); } break; default: /* Other media types handled by Tx timeouts. */ @@ -1827,7 +1826,7 @@ leave_media_alone: dev->name, media_tbl[dev->if_port].name); EL3WINDOW(old_window); - enable_irq_lockdep(dev->irq); + spin_unlock_irqrestore(&vp->lock, flags); mod_timer(&vp->timer, RUN_AT(next_tick)); if (vp->deferred) iowrite16(FakeIntr, ioaddr + EL3_CMD); @@ -1860,13 +1859,17 @@ static void vortex_tx_timeout(struct net /* * Block interrupts because vortex_interrupt does a bare spin_lock() */ +#ifndef CONFIG_PREEMPT_RT unsigned long flags; local_irq_save(flags); +#endif if (vp->full_bus_master_tx) boomerang_interrupt(dev->irq, dev); else vortex_interrupt(dev->irq, dev); +#ifndef CONFIG_PREEMPT_RT local_irq_restore(flags); +#endif } } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/serial-locking-rt-cleanup.patch�������������������������������������������������������������0000664�0000764�0000764�00000002155�11041657733�017243� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� drivers/serial/8250.c | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/drivers/serial/8250.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/8250.c +++ linux-2.6.24.7/drivers/serial/8250.c @@ -2473,14 +2473,10 @@ serial8250_console_write(struct console touch_nmi_watchdog(); - local_irq_save(flags); - if (up->port.sysrq) { - /* 
serial8250_handle_port() already took the lock */ - locked = 0; - } else if (oops_in_progress) { - locked = spin_trylock(&up->port.lock); - } else - spin_lock(&up->port.lock); + if (up->port.sysrq || oops_in_progress) + locked = spin_trylock_irqsave(&up->port.lock, flags); + else + spin_lock_irqsave(&up->port.lock, flags); /* * First save the IER then disable the interrupts @@ -2512,8 +2508,7 @@ serial8250_console_write(struct console check_modem_status(up); if (locked) - spin_unlock(&up->port.lock); - local_irq_restore(flags); + spin_unlock_irqrestore(&up->port.lock, flags); } static int __init serial8250_console_setup(struct console *co, char *options) �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-emac-locking-2.6.16.patch���������������������������������������������������������������0000664�0000764�0000764�00000005533�11041657731�016136� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� drivers/net/ibm_emac/ibm_emac_core.c | 11 +++++++++++ drivers/net/ibm_emac/ibm_emac_core.h | 2 ++ 2 files changed, 13 insertions(+) Index: linux-2.6.24.7/drivers/net/ibm_emac/ibm_emac_core.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/ibm_emac/ibm_emac_core.c +++ linux-2.6.24.7/drivers/net/ibm_emac/ibm_emac_core.c @@ -1058,6 +1058,8 @@ static inline int emac_xmit_finish(struc ++dev->stats.tx_packets; dev->stats.tx_bytes += len; + spin_unlock(&dev->tx_lock); + return 0; } @@ -1071,6 +1073,7 @@ static int emac_start_xmit(struct sk_buf u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY | MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb); + spin_lock(&dev->tx_lock); slot = dev->tx_slot++; if (dev->tx_slot == NUM_TX_BUFF) { dev->tx_slot = 0; @@ -1133,6 +1136,8 @@ static int emac_start_xmit_sg(struct sk_ if (likely(!nr_frags && len <= MAL_MAX_TX_SIZE)) return emac_start_xmit(skb, ndev); + spin_lock(&dev->tx_lock); + len -= skb->data_len; /* Note, this is only an *estimation*, we can still run out of empty @@ -1201,6 +1206,7 @@ static int emac_start_xmit_sg(struct sk_ stop_queue: netif_stop_queue(ndev); DBG2("%d: stopped TX queue" NL, dev->def->index); + spin_unlock(&dev->tx_lock); return 1; } #else @@ -1240,6 +1246,7 @@ static void emac_poll_tx(void *param) DBG2("%d: poll_tx, %d %d" NL, dev->def->index, dev->tx_cnt, dev->ack_slot); + spin_lock(&dev->tx_lock); if (dev->tx_cnt) { u16 ctrl; int slot = dev->ack_slot, n = 0; @@ -1249,6 +1256,7 @@ static void emac_poll_tx(void *param) struct sk_buff *skb = dev->tx_skb[slot]; ++n; + spin_unlock(&dev->tx_lock); if (skb) { dev_kfree_skb(skb); dev->tx_skb[slot] = NULL; @@ -1258,6 +1266,7 @@ static void emac_poll_tx(void *param) if (unlikely(EMAC_IS_BAD_TX(ctrl))) emac_parse_tx_error(dev, ctrl); + spin_lock(&dev->tx_lock); if (--dev->tx_cnt) goto again; } @@ -1270,6 +1279,7 @@ static void emac_poll_tx(void *param) DBG2("%d: tx %d pkts" NL, 
dev->def->index, n); } } + spin_unlock(&dev->tx_lock); } static inline void emac_recycle_rx_skb(struct ocp_enet_private *dev, int slot, @@ -1964,6 +1974,7 @@ static int __init emac_probe(struct ocp_ dev->ndev = ndev; dev->ldev = &ocpdev->dev; dev->def = ocpdev->def; + spin_lock_init(&dev->tx_lock); /* Find MAL device we are connected to */ maldev = Index: linux-2.6.24.7/drivers/net/ibm_emac/ibm_emac_core.h =================================================================== --- linux-2.6.24.7.orig/drivers/net/ibm_emac/ibm_emac_core.h +++ linux-2.6.24.7/drivers/net/ibm_emac/ibm_emac_core.h @@ -193,6 +193,8 @@ struct ocp_enet_private { struct ibm_emac_error_stats estats; struct net_device_stats nstats; + spinlock_t tx_lock; + struct device* ldev; }; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/serial-slow-machines.patch������������������������������������������������������������������0000664�0000764�0000764�00000003213�11041657731�016310� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/char/tty_io.c | 4 ++++ drivers/serial/8250.c | 11 ++++++++++- 2 files changed, 14 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/char/tty_io.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/tty_io.c +++ linux-2.6.24.7/drivers/char/tty_io.c @@ -3691,10 +3691,14 @@ void tty_flip_buffer_push(struct tty_str tty->buf.tail->commit = tty->buf.tail->used; spin_unlock_irqrestore(&tty->buf.lock, flags); +#ifndef CONFIG_PREEMPT_RT if (tty->low_latency) flush_to_ldisc(&tty->buf.work.work); else schedule_delayed_work(&tty->buf.work, 1); +#else + flush_to_ldisc(&tty->buf.work.work); +#endif } EXPORT_SYMBOL(tty_flip_buffer_push); Index: linux-2.6.24.7/drivers/serial/8250.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/8250.c +++ linux-2.6.24.7/drivers/serial/8250.c @@ -1455,7 +1455,10 @@ static irqreturn_t serial8250_interrupt( { struct irq_info *i = dev_id; struct list_head *l, *end = NULL; - int pass_counter = 0, handled = 0; +#ifndef CONFIG_PREEMPT_RT + int pass_counter = 0; +#endif + int handled = 0; DEBUG_INTR("serial8250_interrupt(%d)...", irq); @@ -1493,12 +1496,18 @@ static irqreturn_t serial8250_interrupt( l = l->next; + /* + * On preempt-rt we can be preempted and run in our + * own thread. + */ +#ifndef CONFIG_PREEMPT_RT if (l == i->head && pass_counter++ > PASS_LIMIT) { /* If we hit this, we're dead. 
*/ printk(KERN_ERR "serial8250: too much work for " "irq%d\n", irq); break; } +#endif } while (l != end); spin_unlock(&i->lock); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm.patch������������������������������������������������������������������0000664�0000764�0000764�00000017132�11041657733�016322� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/kernel/dma.c | 2 +- arch/arm/kernel/irq.c | 2 +- arch/arm/kernel/process.c | 2 +- arch/arm/kernel/signal.c | 8 ++++++++ arch/arm/kernel/smp.c | 2 +- arch/arm/kernel/traps.c | 4 ++-- arch/arm/mm/consistent.c | 2 +- arch/arm/mm/copypage-v4mc.c | 2 +- arch/arm/mm/copypage-v6.c | 2 +- arch/arm/mm/copypage-xscale.c | 2 +- arch/arm/mm/mmu.c | 2 +- include/asm-arm/dma.h | 2 +- include/asm-arm/futex.h | 2 +- include/asm-arm/tlb.h | 9 ++++++--- 14 files changed, 27 insertions(+), 16 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/dma.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/dma.c +++ linux-2.6.24.7/arch/arm/kernel/dma.c @@ -20,7 +20,7 @@ #include <asm/mach/dma.h> -DEFINE_SPINLOCK(dma_spin_lock); +DEFINE_RAW_SPINLOCK(dma_spin_lock); EXPORT_SYMBOL(dma_spin_lock); static dma_t dma_chan[MAX_DMA_CHANNELS]; Index: linux-2.6.24.7/arch/arm/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/irq.c +++ linux-2.6.24.7/arch/arm/kernel/irq.c @@ -102,7 +102,7 @@ unlock: /* Handle bad interrupts */ static struct irq_desc bad_irq_desc = { .handle_irq = handle_bad_irq, - .lock = SPIN_LOCK_UNLOCKED + .lock = RAW_SPIN_LOCK_UNLOCKED(bad_irq_desc.lock) }; /* Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -37,7 +37,7 @@ #include <asm/uaccess.h> #include <asm/mach/time.h> -DEFINE_SPINLOCK(futex_atomic_lock); +DEFINE_RAW_SPINLOCK(futex_atomic_lock); static const char *processor_modes[] = { "USER_26", "FIQ_26" , "IRQ_26" , "SVC_26" , "UK4_26" , "UK5_26" , "UK6_26" , "UK7_26" , Index: linux-2.6.24.7/arch/arm/kernel/signal.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/signal.c +++ linux-2.6.24.7/arch/arm/kernel/signal.c @@ -623,6 +623,14 @@ static int do_signal(sigset_t *oldset, s siginfo_t info; int signr; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif + /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux-2.6.24.7/arch/arm/kernel/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/smp.c +++ linux-2.6.24.7/arch/arm/kernel/smp.c @@ 
-522,7 +522,7 @@ static void ipi_call_function(unsigned i cpu_clear(cpu, data->unfinished); } -static DEFINE_SPINLOCK(stop_lock); +static DEFINE_RAW_SPINLOCK(stop_lock); /* * ipi_cpu_stop - handle IPI from smp_send_stop() Index: linux-2.6.24.7/arch/arm/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/traps.c +++ linux-2.6.24.7/arch/arm/kernel/traps.c @@ -233,7 +233,7 @@ static void __die(const char *str, int e } } -DEFINE_SPINLOCK(die_lock); +DEFINE_RAW_SPINLOCK(die_lock); /* * This function is protected against re-entrancy. @@ -276,7 +276,7 @@ void arm_notify_die(const char *str, str } static LIST_HEAD(undef_hook); -static DEFINE_SPINLOCK(undef_lock); +static DEFINE_RAW_SPINLOCK(undef_lock); void register_undef_hook(struct undef_hook *hook) { Index: linux-2.6.24.7/arch/arm/mm/consistent.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/consistent.c +++ linux-2.6.24.7/arch/arm/mm/consistent.c @@ -40,7 +40,7 @@ * These are the page tables (2MB each) covering uncached, DMA consistent allocations */ static pte_t *consistent_pte[NUM_CONSISTENT_PTES]; -static DEFINE_SPINLOCK(consistent_lock); +static DEFINE_RAW_SPINLOCK(consistent_lock); /* * VM region handling support. Index: linux-2.6.24.7/arch/arm/mm/copypage-v4mc.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/copypage-v4mc.c +++ linux-2.6.24.7/arch/arm/mm/copypage-v4mc.c @@ -30,7 +30,7 @@ #define minicache_pgprot __pgprot(L_PTE_PRESENT | L_PTE_YOUNG | \ L_PTE_CACHEABLE) -static DEFINE_SPINLOCK(minicache_lock); +static DEFINE_RAW_SPINLOCK(minicache_lock); /* * ARMv4 mini-dcache optimised copy_user_page Index: linux-2.6.24.7/arch/arm/mm/copypage-v6.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/copypage-v6.c +++ linux-2.6.24.7/arch/arm/mm/copypage-v6.c @@ -26,7 +26,7 @@ #define from_address (0xffff8000) #define to_address (0xffffc000) -static DEFINE_SPINLOCK(v6_lock); +static DEFINE_RAW_SPINLOCK(v6_lock); /* * Copy the user page. 
No aliasing to deal with so we can just Index: linux-2.6.24.7/arch/arm/mm/copypage-xscale.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/copypage-xscale.c +++ linux-2.6.24.7/arch/arm/mm/copypage-xscale.c @@ -32,7 +32,7 @@ #define minicache_pgprot __pgprot(L_PTE_PRESENT | L_PTE_YOUNG | \ L_PTE_CACHEABLE) -static DEFINE_SPINLOCK(minicache_lock); +static DEFINE_RAW_SPINLOCK(minicache_lock); /* * XScale mini-dcache optimised copy_user_page Index: linux-2.6.24.7/arch/arm/mm/mmu.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/mmu.c +++ linux-2.6.24.7/arch/arm/mm/mmu.c @@ -25,7 +25,7 @@ #include "mm.h" -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); extern void _stext, _etext, __data_start, _end; extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; Index: linux-2.6.24.7/include/asm-arm/dma.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/dma.h +++ linux-2.6.24.7/include/asm-arm/dma.h @@ -27,7 +27,7 @@ typedef unsigned int dmamode_t; #define DMA_MODE_CASCADE 2 #define DMA_AUTOINIT 4 -extern spinlock_t dma_spin_lock; +extern raw_spinlock_t dma_spin_lock; static inline unsigned long claim_dma_lock(void) { Index: linux-2.6.24.7/include/asm-arm/futex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/futex.h +++ linux-2.6.24.7/include/asm-arm/futex.h @@ -7,7 +7,7 @@ #include <linux/errno.h> #include <linux/uaccess.h> -extern spinlock_t futex_atomic_lock; +extern raw_spinlock_t futex_atomic_lock; #define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \ __asm__ __volatile__ ( \ Index: linux-2.6.24.7/include/asm-arm/tlb.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/tlb.h +++ linux-2.6.24.7/include/asm-arm/tlb.h @@ -36,15 +36,18 @@ struct mmu_gather { struct mm_struct *mm; unsigned int fullmm; + int cpu; }; -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); +DECLARE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); static inline struct mmu_gather * tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); + int cpu; + struct mmu_gather *tlb = &get_cpu_var_locked(mmu_gathers, &cpu); + tlb->cpu = cpu; tlb->mm = mm; tlb->fullmm = full_mm_flush; @@ -60,7 +63,7 @@ tlb_finish_mmu(struct mmu_gather *tlb, u /* keep the page table cache within bounds */ check_pgt_cache(); - put_cpu_var(mmu_gathers); + put_cpu_var_locked(mmu_gathers, tlb->cpu); } #define tlb_remove_tlb_entry(tlb,ptep,address) do { } while (0) ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-rawlock-in-mmu_context-h.patch�����������������������������������������0000664�0000764�0000764�00000003233�11041657732�023147� 0����������������������������������������������������������������������������������������������������ustar 
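A note on the pattern applied throughout the ARM patches above and below: under CONFIG_PREEMPT_RT a plain spinlock_t becomes a sleeping, priority-inheriting lock, so locks taken in paths that cannot sleep (low-level IRQ entry, the die/DMA paths, context-switch code such as the ASID allocation in the next patch) must be declared raw to keep busy-wait semantics. The two declaration forms as used in these diffs:

	/* Ordinary lock: may sleep under PREEMPT_RT; fine for most driver code. */
	static DEFINE_SPINLOCK(leds_lock);

	/*
	 * Raw lock: keeps real spinning semantics even on PREEMPT_RT;
	 * required where sleeping is impossible.
	 */
	static DEFINE_RAW_SPINLOCK(dma_spin_lock);
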
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From khilman@mvista.com Fri Aug 31 05:09:03 2007 Return-Path: <khilman@mvista.com> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from paris.hilman.org (deeprooted.net [216.254.16.51]) by mail.tglx.de (Postfix) with ESMTP id 1F21965C003 for <tglx@linutronix.de>; Fri, 31 Aug 2007 05:09:03 +0200 (CEST) Received: by paris.hilman.org (Postfix, from userid 1000) id C5837E4C5FE; Thu, 30 Aug 2007 20:09:02 -0700 (PDT) Message-Id: <20070831030841.799694742@mvista.com> User-Agent: quilt/0.45-1 Date: Thu, 30 Aug 2007 20:08:41 -0700 From: Kevin Hilman <khilman@mvista.com> To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Cc: LKML <linux-kernel@vger.kernel.org>, RT-Users <linux-rt-users@vger.kernel.org> Subject: [PATCH 2.6.23-rc2-rt2] ARM: use raw lock in __new_context X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Mime-Version: 1.0 The ARM CPU ASID lock should be raw as it's used by schedule() when creating a new context. Signed-off-by: Kevin Hilman <khilman@mvista.com> --- arch/arm/mm/context.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/mm/context.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/context.c +++ linux-2.6.24.7/arch/arm/mm/context.c @@ -14,7 +14,7 @@ #include <asm/mmu_context.h> #include <asm/tlbflush.h> -static DEFINE_SPINLOCK(cpu_asid_lock); +static DEFINE_RAW_SPINLOCK(cpu_asid_lock); unsigned int cpu_last_asid = ASID_FIRST_VERSION; /* ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/arm-trace-preempt-idle.patch����������������������������������������������������������������0000664�0000764�0000764�00000005241�11041657734�016530� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-rt-users-owner@vger.kernel.org Fri Jul 13 20:13:14 2007 Return-Path: <linux-rt-users-owner@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id 5902865C3EB; Fri, 13 Jul 2007 20:13:14 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933095AbXGMSNN (ORCPT <rfc822;jan.altenberg@linutronix.de> + 1 other); Fri, 13 Jul 2007 14:13:13 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933031AbXGMSNM (ORCPT <rfc822;linux-rt-users-outgoing>); Fri, 13 Jul 2007 
14:13:12 -0400 Received: from deeprooted.net ([216.254.16.51]:38941 "EHLO paris.hilman.org" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1760089AbXGMSNH (ORCPT <rfc822;linux-rt-users@vger.kernel.org>); Fri, 13 Jul 2007 14:13:07 -0400 Received: by paris.hilman.org (Postfix, from userid 1000) id E61B1D2857A; Fri, 13 Jul 2007 10:52:28 -0700 (PDT) Message-Id: <20070713175228.623525155@mvista.com> References: <20070713175214.336577416@mvista.com> User-Agent: quilt/0.45-1 Date: Fri, 13 Jul 2007 10:52:18 -0700 From: Kevin Hilman <khilman@mvista.com> To: tglx@linutronix.de, mingo@elte.hu Cc: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH -rt 4/6] Add trace_preempt_*_idle() support for ARM. Content-Disposition: inline; filename=arm-trace-preempt-idle.patch Sender: linux-rt-users-owner@vger.kernel.org Precedence: bulk X-Mailing-List: linux-rt-users@vger.kernel.org X-Filter-To: .Kernel.rt-users X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Mime-Version: 1.0 Add trace functions to ARM idle loop and also move the tick_nohz_restart_sched_tick() after the local_irq_disable() as is done on x86. Signed-off-by: Kevin Hilman <khilman@mvista.com> --- arch/arm/kernel/process.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -171,11 +171,13 @@ void cpu_idle(void) while (!need_resched() && !need_resched_delayed()) idle(); leds_event(led_idle_end); - tick_nohz_restart_sched_tick(); local_irq_disable(); + trace_preempt_exit_idle(); + tick_nohz_restart_sched_tick(); __preempt_enable_no_resched(); __schedule(); preempt_disable(); + trace_preempt_enter_idle(); local_irq_enable(); } } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-bagde4.patch�����������������������������������������������������������0000664�0000764�0000764�00000002262�11041657734�017445� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/mach-sa1100/badge4.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-sa1100/badge4.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-sa1100/badge4.c +++ linux-2.6.24.7/arch/arm/mach-sa1100/badge4.c @@ -240,15 +240,22 @@ void badge4_set_5V(unsigned subsystem, i /* detect on->off and off->on transitions */ if ((!old_5V_bitmap) && (badge4_5V_bitmap)) { /* was off, now on */ - printk(KERN_INFO "%s: enabling 5V supply rail\n", __FUNCTION__); GPSR = BADGE4_GPIO_PCMEN5V; } else if ((old_5V_bitmap) && (!badge4_5V_bitmap)) { /* was on, now off */ - printk(KERN_INFO "%s: disabling 5V supply rail\n", __FUNCTION__); GPCR = BADGE4_GPIO_PCMEN5V; } local_irq_restore(flags); 
+ + /* detect on->off and off->on transitions */ + if ((!old_5V_bitmap) && (badge4_5V_bitmap)) { + /* was off, now on */ + printk(KERN_INFO "%s: enabling 5V supply rail\n", __FUNCTION__); + } else if ((old_5V_bitmap) && (!badge4_5V_bitmap)) { + /* was on, now off */ + printk(KERN_INFO "%s: disabling 5V supply rail\n", __FUNCTION__); + } } EXPORT_SYMBOL(badge4_set_5V); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-footbridge.patch�������������������������������������������������������0000664�0000764�0000764�00000002226�11041657734�020443� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/mach-footbridge/netwinder-hw.c | 2 +- arch/arm/mach-footbridge/netwinder-leds.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-footbridge/netwinder-hw.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-footbridge/netwinder-hw.c +++ linux-2.6.24.7/arch/arm/mach-footbridge/netwinder-hw.c @@ -67,7 +67,7 @@ static inline void wb977_ww(int reg, int /* * This is a lock for accessing ports GP1_IO_BASE and GP2_IO_BASE */ -DEFINE_SPINLOCK(gpio_lock); +DEFINE_RAW_SPINLOCK(gpio_lock); static unsigned int current_gpio_op; static unsigned int current_gpio_io; Index: linux-2.6.24.7/arch/arm/mach-footbridge/netwinder-leds.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-footbridge/netwinder-leds.c +++ linux-2.6.24.7/arch/arm/mach-footbridge/netwinder-leds.c @@ -32,7 +32,7 @@ static char led_state; static char hw_led_state; static DEFINE_SPINLOCK(leds_lock); -extern spinlock_t gpio_lock; +extern raw_spinlock_t gpio_lock; static void netwinder_leds_event(led_event_t evt) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-integrator.patch�������������������������������������������������������0000664�0000764�0000764�00000002075�11041657734�020477� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/mach-integrator/core.c | 2 +- arch/arm/mach-integrator/pci_v3.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/arm/mach-integrator/core.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-integrator/core.c +++ 
linux-2.6.24.7/arch/arm/mach-integrator/core.c @@ -164,7 +164,7 @@ static struct amba_pl010_data integrator #define CM_CTRL IO_ADDRESS(INTEGRATOR_HDR_BASE) + INTEGRATOR_HDR_CTRL_OFFSET -static DEFINE_SPINLOCK(cm_lock); +static DEFINE_RAW_SPINLOCK(cm_lock); /** * cm_control - update the CM_CTRL register. Index: linux-2.6.24.7/arch/arm/mach-integrator/pci_v3.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-integrator/pci_v3.c +++ linux-2.6.24.7/arch/arm/mach-integrator/pci_v3.c @@ -162,7 +162,7 @@ * 7:2 register number * */ -static DEFINE_SPINLOCK(v3_lock); +static DEFINE_RAW_SPINLOCK(v3_lock); #define PCI_BUS_NONMEM_START 0x00000000 #define PCI_BUS_NONMEM_SIZE SZ_256M �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-ixp4xx.patch�����������������������������������������������������������0000664�0000764�0000764�00000001154�11041657735�017563� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/mach-ixp4xx/common-pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/mach-ixp4xx/common-pci.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-ixp4xx/common-pci.c +++ linux-2.6.24.7/arch/arm/mach-ixp4xx/common-pci.c @@ -53,7 +53,7 @@ unsigned long ixp4xx_pci_reg_base = 0; * these transactions are atomic or we will end up * with corrupt data on the bus or in a driver. 
*/ -static DEFINE_SPINLOCK(ixp4xx_pci_lock); +static DEFINE_RAW_SPINLOCK(ixp4xx_pci_lock); /* * Read from PCI config space ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-pxa.patch��������������������������������������������������������������0000664�0000764�0000764�00000001532�11041657731�017103� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/asm-arm/arch-pxa/timex.h | 6 ++++++ 1 file changed, 6 insertions(+) Index: linux-2.6.24.7/include/asm-arm/arch-pxa/timex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/arch-pxa/timex.h +++ linux-2.6.24.7/include/asm-arm/arch-pxa/timex.h @@ -16,6 +16,8 @@ #define CLOCK_TICK_RATE 3686400 #elif defined(CONFIG_PXA27x) /* PXA27x timer base */ +#include <asm-arm/arch-pxa/hardware.h> +#include <asm-arm/arch-pxa/pxa-regs.h> #ifdef CONFIG_MACH_MAINSTONE #define CLOCK_TICK_RATE 3249600 #else @@ -24,3 +26,7 @@ #else #define CLOCK_TICK_RATE 3250000 #endif + +#define mach_read_cycles() OSCR +#define mach_cycles_to_usecs(d) (((d) * ((1000000LL << 32) / CLOCK_TICK_RATE)) >> 32) +#define mach_usecs_to_cycles(d) (((d) * (((long long)CLOCK_TICK_RATE << 32) / 1000000)) >> 32) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-arm-shark.patch������������������������������������������������������������0000664�0000764�0000764�00000001024�11041657734�017422� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/mach-shark/leds.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/mach-shark/leds.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-shark/leds.c +++ linux-2.6.24.7/arch/arm/mach-shark/leds.c @@ -32,7 +32,7 @@ static char led_state; static short hw_led_state; static short saved_state; -static DEFINE_SPINLOCK(leds_lock); +static DEFINE_RAW_SPINLOCK(leds_lock); short sequoia_read(int addr) { outw(addr,0x24); 
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-mips.patch�����������������������������������������������������������������0000664�0000764�0000764�00000107053�11041657731�016513� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� arch/mips/Kconfig | 13 ++ arch/mips/kernel/asm-offsets.c | 2 arch/mips/kernel/entry.S | 22 +++- arch/mips/kernel/i8259.c | 2 arch/mips/kernel/module.c | 2 arch/mips/kernel/process.c | 8 + arch/mips/kernel/scall32-o32.S | 2 arch/mips/kernel/scall64-64.S | 2 arch/mips/kernel/scall64-n32.S | 2 arch/mips/kernel/scall64-o32.S | 2 arch/mips/kernel/semaphore.c | 22 +++- arch/mips/kernel/signal.c | 4 arch/mips/kernel/signal32.c | 4 arch/mips/kernel/smp.c | 27 +++++ arch/mips/kernel/traps.c | 2 arch/mips/mm/init.c | 2 arch/mips/sibyte/cfe/smp.c | 4 arch/mips/sibyte/sb1250/irq.c | 6 + arch/mips/sibyte/sb1250/smp.c | 2 arch/mips/sibyte/swarm/setup.c | 6 + include/asm-mips/asmmacro.h | 8 - include/asm-mips/atomic.h | 1 include/asm-mips/bitops.h | 5 - include/asm-mips/hw_irq.h | 1 include/asm-mips/i8259.h | 2 include/asm-mips/io.h | 1 include/asm-mips/linkage.h | 5 + include/asm-mips/m48t35.h | 2 include/asm-mips/rwsem.h | 176 ++++++++++++++++++++++++++++++++++++++ include/asm-mips/semaphore.h | 31 +++--- include/asm-mips/spinlock.h | 18 +-- include/asm-mips/spinlock_types.h | 4 include/asm-mips/thread_info.h | 2 include/asm-mips/time.h | 2 include/asm-mips/timeofday.h | 5 + include/asm-mips/uaccess.h | 12 -- 36 files changed, 331 insertions(+), 80 deletions(-) Index: linux-2.6.24.7/arch/mips/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/mips/Kconfig +++ linux-2.6.24.7/arch/mips/Kconfig @@ -702,18 +702,16 @@ source "arch/mips/vr41xx/Kconfig" endmenu + config RWSEM_GENERIC_SPINLOCK bool - depends on !PREEMPT_RT default y config RWSEM_XCHGADD_ALGORITHM bool - depends on !PREEMPT_RT config ASM_SEMAPHORES bool -# depends on !PREEMPT_RT default y config ARCH_HAS_ILOG2_U32 @@ -1898,6 +1896,15 @@ config SECCOMP If unsure, say Y. Only embedded should say N here. 
+config GENERIC_TIME + bool + default y + +source "kernel/time/Kconfig" + +config CPU_SPEED + int "CPU speed used for clocksource/clockevent calculations" + default 600 endmenu config LOCKDEP_SUPPORT Index: linux-2.6.24.7/arch/mips/kernel/asm-offsets.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/asm-offsets.c +++ linux-2.6.24.7/arch/mips/kernel/asm-offsets.c @@ -10,9 +10,11 @@ */ #include <linux/compat.h> #include <linux/types.h> +#include <linux/linkage.h> #include <linux/sched.h> #include <linux/mm.h> #include <linux/interrupt.h> +#include <linux/irqflags.h> #include <asm/ptrace.h> #include <asm/processor.h> Index: linux-2.6.24.7/arch/mips/kernel/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/entry.S +++ linux-2.6.24.7/arch/mips/kernel/entry.S @@ -30,7 +30,7 @@ .align 5 #ifndef CONFIG_PREEMPT FEXPORT(ret_from_exception) - local_irq_disable # preempt stop + raw_local_irq_disable # preempt stop b __ret_from_irq #endif FEXPORT(ret_from_irq) @@ -41,7 +41,7 @@ FEXPORT(__ret_from_irq) beqz t0, resume_kernel resume_userspace: - local_irq_disable # make sure we dont miss an + raw_local_irq_disable # make sure we dont miss an # interrupt setting need_resched # between sampling and return LONG_L a2, TI_FLAGS($28) # current->work @@ -51,7 +51,9 @@ resume_userspace: #ifdef CONFIG_PREEMPT resume_kernel: - local_irq_disable + raw_local_irq_disable + lw t0, kernel_preemption + beqz t0, restore_all lw t0, TI_PRE_COUNT($28) bnez t0, restore_all need_resched: @@ -61,7 +63,9 @@ need_resched: LONG_L t0, PT_STATUS(sp) # Interrupts off? andi t0, 1 beqz t0, restore_all + raw_local_irq_disable jal preempt_schedule_irq + sw zero, TI_PRE_COUNT($28) b need_resched #endif @@ -69,7 +73,7 @@ FEXPORT(ret_from_fork) jal schedule_tail # a0 = struct task_struct *prev FEXPORT(syscall_exit) - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work @@ -142,19 +146,21 @@ FEXPORT(restore_partial) # restore part .set at work_pending: - andi t0, a2, _TIF_NEED_RESCHED # a2 is preloaded with TI_FLAGS + # a2 is preloaded with TI_FLAGS + andi t0, a2, (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beqz t0, work_notifysig work_resched: + raw_local_irq_enable t0 jal schedule - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) andi t0, a2, _TIF_WORK_MASK # is there any work to be done # other than syscall tracing? beqz t0, restore_all - andi t0, a2, _TIF_NEED_RESCHED + andi t0, a2, (_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bnez t0, work_resched work_notifysig: # deal with pending signals and @@ -170,7 +176,7 @@ syscall_exit_work: li t0, _TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT and t0, a2 # a2 is preloaded with TI_FLAGS beqz t0, work_pending # trace bit set? 
- local_irq_enable # could let do_syscall_trace() + raw_local_irq_enable # could let do_syscall_trace() # call schedule() instead move a0, sp li a1, 1 Index: linux-2.6.24.7/arch/mips/kernel/i8259.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/i8259.c +++ linux-2.6.24.7/arch/mips/kernel/i8259.c @@ -29,7 +29,7 @@ */ static int i8259A_auto_eoi = -1; -DEFINE_SPINLOCK(i8259A_lock); +DEFINE_RAW_SPINLOCK(i8259A_lock); static void disable_8259A_irq(unsigned int irq); static void enable_8259A_irq(unsigned int irq); static void mask_and_ack_8259A(unsigned int irq); Index: linux-2.6.24.7/arch/mips/kernel/module.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/module.c +++ linux-2.6.24.7/arch/mips/kernel/module.c @@ -40,7 +40,7 @@ struct mips_hi16 { static struct mips_hi16 *mips_hi16_list; static LIST_HEAD(dbe_list); -static DEFINE_SPINLOCK(dbe_lock); +static DEFINE_RAW_SPINLOCK(dbe_lock); void *module_alloc(unsigned long size) { Index: linux-2.6.24.7/arch/mips/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/process.c +++ linux-2.6.24.7/arch/mips/kernel/process.c @@ -54,7 +54,7 @@ void __noreturn cpu_idle(void) /* endless idle loop with no priority at all */ while (1) { tick_nohz_stop_sched_tick(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { #ifdef CONFIG_SMTC_IDLE_HOOK_DEBUG extern void smtc_idle_loop_hook(void); @@ -64,9 +64,11 @@ void __noreturn cpu_idle(void) (*cpu_wait)(); } tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } Index: linux-2.6.24.7/arch/mips/kernel/scall32-o32.S =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/scall32-o32.S +++ linux-2.6.24.7/arch/mips/kernel/scall32-o32.S @@ -73,7 +73,7 @@ stack_done: 1: sw v0, PT_R2(sp) # result o32_syscall_exit: - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return lw a2, TI_FLAGS($28) # current->work Index: linux-2.6.24.7/arch/mips/kernel/scall64-64.S =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/scall64-64.S +++ linux-2.6.24.7/arch/mips/kernel/scall64-64.S @@ -72,7 +72,7 @@ NESTED(handle_sys64, PT_SIZE, sp) 1: sd v0, PT_R2(sp) # result n64_syscall_exit: - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work Index: linux-2.6.24.7/arch/mips/kernel/scall64-n32.S =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/scall64-n32.S +++ linux-2.6.24.7/arch/mips/kernel/scall64-n32.S @@ -69,7 +69,7 @@ NESTED(handle_sysn32, PT_SIZE, sp) sd v0, PT_R0(sp) # set flag for syscall restarting 1: sd v0, PT_R2(sp) # result - local_irq_disable # make sure need_resched and + raw_local_irq_disable # make sure need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) # current->work Index: linux-2.6.24.7/arch/mips/kernel/scall64-o32.S =================================================================== --- 
linux-2.6.24.7.orig/arch/mips/kernel/scall64-o32.S +++ linux-2.6.24.7/arch/mips/kernel/scall64-o32.S @@ -98,7 +98,7 @@ NESTED(handle_sys, PT_SIZE, sp) 1: sd v0, PT_R2(sp) # result o32_syscall_exit: - local_irq_disable # make need_resched and + raw_local_irq_disable # make need_resched and # signals dont change between # sampling and return LONG_L a2, TI_FLAGS($28) Index: linux-2.6.24.7/arch/mips/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/semaphore.c +++ linux-2.6.24.7/arch/mips/kernel/semaphore.c @@ -36,7 +36,7 @@ * sem->count and sem->waking atomic. Scalability isn't an issue because * this lock is used on UP only so it's just an empty variable. */ -static inline int __sem_update_count(struct semaphore *sem, int incr) +static inline int __sem_update_count(struct compat_semaphore *sem, int incr) { int old_count, tmp; @@ -67,7 +67,7 @@ static inline int __sem_update_count(str : "=&r" (old_count), "=&r" (tmp), "=m" (sem->count) : "r" (incr), "m" (sem->count)); } else { - static DEFINE_SPINLOCK(semaphore_lock); + static DEFINE_RAW_SPINLOCK(semaphore_lock); unsigned long flags; spin_lock_irqsave(&semaphore_lock, flags); @@ -80,7 +80,7 @@ static inline int __sem_update_count(str return old_count; } -void __up(struct semaphore *sem) +void __compat_up(struct compat_semaphore *sem) { /* * Note that we incremented count in up() before we came here, @@ -94,7 +94,7 @@ void __up(struct semaphore *sem) wake_up(&sem->wait); } -EXPORT_SYMBOL(__up); +EXPORT_SYMBOL(__compat_up); /* * Note that when we come in to __down or __down_interruptible, @@ -104,7 +104,7 @@ EXPORT_SYMBOL(__up); * Thus it is only when we decrement count from some value > 0 * that we have actually got the semaphore. */ -void __sched __down(struct semaphore *sem) +void __sched __compat_down(struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -133,9 +133,9 @@ void __sched __down(struct semaphore *se wake_up(&sem->wait); } -EXPORT_SYMBOL(__down); +EXPORT_SYMBOL(__compat_down); -int __sched __down_interruptible(struct semaphore * sem) +int __sched __compat_down_interruptible(struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -165,4 +165,10 @@ int __sched __down_interruptible(struct return retval; } -EXPORT_SYMBOL(__down_interruptible); +EXPORT_SYMBOL(__compat_down_interruptible); + +int fastcall compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} +EXPORT_SYMBOL(compat_sem_is_locked); Index: linux-2.6.24.7/arch/mips/kernel/signal.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/signal.c +++ linux-2.6.24.7/arch/mips/kernel/signal.c @@ -629,6 +629,10 @@ static void do_signal(struct pt_regs *re siginfo_t info; int signr; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which is why we may in certain * cases get here from kernel mode. 
Just return without doing anything Index: linux-2.6.24.7/arch/mips/kernel/signal32.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/signal32.c +++ linux-2.6.24.7/arch/mips/kernel/signal32.c @@ -655,6 +655,10 @@ static int setup_rt_frame_32(struct k_si if (err) goto give_sigsegv; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif /* * Arguments to signal handler: * Index: linux-2.6.24.7/arch/mips/kernel/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/smp.c +++ linux-2.6.24.7/arch/mips/kernel/smp.c @@ -91,7 +91,22 @@ asmlinkage __cpuinit void start_secondar cpu_idle(); } -DEFINE_SPINLOCK(smp_call_lock); +DEFINE_RAW_SPINLOCK(smp_call_lock); + +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them. + */ +void smp_send_reschedule_allbutself(void) +{ + int cpu = smp_processor_id(); + int i; + + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i) && i != cpu) + core_send_ipi(i, SMP_RESCHEDULE_YOURSELF); +} struct call_data_struct *call_data; @@ -314,6 +329,8 @@ int setup_profiling_timer(unsigned int m return 0; } +static DEFINE_RAW_SPINLOCK(tlbstate_lock); + static void flush_tlb_all_ipi(void *info) { local_flush_tlb_all(); @@ -371,6 +388,7 @@ static inline void smp_on_each_tlb(void void flush_tlb_mm(struct mm_struct *mm) { preempt_disable(); + spin_lock(&tlbstate_lock); if ((atomic_read(&mm->mm_users) != 1) || (current->mm != mm)) { smp_on_other_tlbs(flush_tlb_mm_ipi, mm); @@ -383,6 +401,7 @@ void flush_tlb_mm(struct mm_struct *mm) if (cpu_context(cpu, mm)) cpu_context(cpu, mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_mm(mm); preempt_enable(); @@ -406,6 +425,8 @@ void flush_tlb_range(struct vm_area_stru struct mm_struct *mm = vma->vm_mm; preempt_disable(); + spin_lock(&tlbstate_lock); + if ((atomic_read(&mm->mm_users) != 1) || (current->mm != mm)) { struct flush_tlb_data fd = { .vma = vma, @@ -423,6 +444,7 @@ void flush_tlb_range(struct vm_area_stru if (cpu_context(cpu, mm)) cpu_context(cpu, mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_range(vma, start, end); preempt_enable(); } @@ -454,6 +476,8 @@ static void flush_tlb_page_ipi(void *inf void flush_tlb_page(struct vm_area_struct *vma, unsigned long page) { preempt_disable(); + spin_lock(&tlbstate_lock); + if ((atomic_read(&vma->vm_mm->mm_users) != 1) || (current->mm != vma->vm_mm)) { struct flush_tlb_data fd = { .vma = vma, @@ -470,6 +494,7 @@ void flush_tlb_page(struct vm_area_struc if (cpu_context(cpu, vma->vm_mm)) cpu_context(cpu, vma->vm_mm) = 0; } + spin_unlock(&tlbstate_lock); local_flush_tlb_page(vma, page); preempt_enable(); } Index: linux-2.6.24.7/arch/mips/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/traps.c +++ linux-2.6.24.7/arch/mips/kernel/traps.c @@ -320,7 +320,7 @@ void show_registers(const struct pt_regs printk("\n"); } -static DEFINE_SPINLOCK(die_lock); +static DEFINE_RAW_SPINLOCK(die_lock); void __noreturn die(const char * str, const struct pt_regs * regs) { Index: linux-2.6.24.7/arch/mips/mm/init.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/mm/init.c +++ linux-2.6.24.7/arch/mips/mm/init.c @@ -61,7 +61,7 @@ #endif /* CONFIG_MIPS_MT_SMTC */ -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); 
+DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* * We have up to 8 empty zeroed pages so we can map one of the right colour Index: linux-2.6.24.7/arch/mips/sibyte/cfe/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/sibyte/cfe/smp.c +++ linux-2.6.24.7/arch/mips/sibyte/cfe/smp.c @@ -107,4 +107,8 @@ void __cpuinit prom_smp_finish(void) */ void prom_cpus_done(void) { +#ifdef CONFIG_HIGH_RES_TIMERS + extern void sync_c0_count_master(void); + sync_c0_count_master(); +#endif } Index: linux-2.6.24.7/arch/mips/sibyte/sb1250/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/sibyte/sb1250/irq.c +++ linux-2.6.24.7/arch/mips/sibyte/sb1250/irq.c @@ -82,7 +82,7 @@ static struct irq_chip sb1250_irq_type = /* Store the CPU id (not the logical number) */ int sb1250_irq_owner[SB1250_NR_IRQS]; -DEFINE_SPINLOCK(sb1250_imr_lock); +DEFINE_RAW_SPINLOCK(sb1250_imr_lock); void sb1250_mask_irq(int cpu, int irq) { @@ -316,6 +316,10 @@ void __init arch_init_irq(void) #ifdef CONFIG_KGDB imask |= STATUSF_IP6; #endif + +#ifdef CONFIG_HIGH_RES_TIMERS + imask |= STATUSF_IP7; +#endif /* Enable necessary IPs, disable the rest */ change_c0_status(ST0_IM, imask); Index: linux-2.6.24.7/arch/mips/sibyte/sb1250/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/sibyte/sb1250/smp.c +++ linux-2.6.24.7/arch/mips/sibyte/sb1250/smp.c @@ -60,7 +60,7 @@ void __cpuinit sb1250_smp_finish(void) extern void sb1250_clockevent_init(void); sb1250_clockevent_init(); - local_irq_enable(); + raw_local_irq_enable(); } /* Index: linux-2.6.24.7/arch/mips/sibyte/swarm/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/sibyte/swarm/setup.c +++ linux-2.6.24.7/arch/mips/sibyte/swarm/setup.c @@ -136,6 +136,12 @@ void __init plat_mem_setup(void) if (m41t81_probe()) swarm_rtc_type = RTC_M4LT81; +#ifdef CONFIG_HIGH_RES_TIMERS + /* + * set the mips_hpt_frequency here + */ + mips_hpt_frequency = CONFIG_CPU_SPEED * 1000000; +#endif printk("This kernel optimized for " #ifdef CONFIG_SIMULATION "simulation" Index: linux-2.6.24.7/include/asm-mips/asmmacro.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/asmmacro.h +++ linux-2.6.24.7/include/asm-mips/asmmacro.h @@ -21,7 +21,7 @@ #endif #ifdef CONFIG_MIPS_MT_SMTC - .macro local_irq_enable reg=t0 + .macro raw_local_irq_enable reg=t0 mfc0 \reg, CP0_TCSTATUS ori \reg, \reg, TCSTATUS_IXMT xori \reg, \reg, TCSTATUS_IXMT @@ -29,21 +29,21 @@ _ehb .endm - .macro local_irq_disable reg=t0 + .macro raw_local_irq_disable reg=t0 mfc0 \reg, CP0_TCSTATUS ori \reg, \reg, TCSTATUS_IXMT mtc0 \reg, CP0_TCSTATUS _ehb .endm #else - .macro local_irq_enable reg=t0 + .macro raw_local_irq_enable reg=t0 mfc0 \reg, CP0_STATUS ori \reg, \reg, 1 mtc0 \reg, CP0_STATUS irq_enable_hazard .endm - .macro local_irq_disable reg=t0 + .macro raw_local_irq_disable reg=t0 mfc0 \reg, CP0_STATUS ori \reg, \reg, 1 xori \reg, \reg, 1 Index: linux-2.6.24.7/include/asm-mips/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/atomic.h +++ linux-2.6.24.7/include/asm-mips/atomic.h @@ -573,7 +573,6 @@ static __inline__ long atomic64_add_retu raw_local_irq_restore(flags); } #endif -#endif smp_llsc_mb(); Index: linux-2.6.24.7/include/asm-mips/bitops.h 
=================================================================== --- linux-2.6.24.7.orig/include/asm-mips/bitops.h +++ linux-2.6.24.7/include/asm-mips/bitops.h @@ -606,9 +606,6 @@ static inline unsigned long __ffs(unsign } /* - * fls - find last bit set. - * @word: The word to search - * * This is defined the same way as ffs. * Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32. */ @@ -626,6 +623,8 @@ static inline int fls64(__u64 word) return 64 - word; } +#define __bi_local_irq_save(x) raw_local_irq_save(x) +#define __bi_local_irq_restore(x) raw_local_irq_restore(x) #else #include <asm-generic/bitops/fls64.h> #endif Index: linux-2.6.24.7/include/asm-mips/hw_irq.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/hw_irq.h +++ linux-2.6.24.7/include/asm-mips/hw_irq.h @@ -9,6 +9,7 @@ #define __ASM_HW_IRQ_H #include <asm/atomic.h> +#include <linux/irqflags.h> extern atomic_t irq_err_count; Index: linux-2.6.24.7/include/asm-mips/i8259.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/i8259.h +++ linux-2.6.24.7/include/asm-mips/i8259.h @@ -35,7 +35,7 @@ #define SLAVE_ICW4_DEFAULT 0x01 #define PIC_ICW4_AEOI 2 -extern spinlock_t i8259A_lock; +extern raw_spinlock_t i8259A_lock; extern int i8259A_irq_pending(unsigned int irq); extern void make_8259A_irq(unsigned int irq); Index: linux-2.6.24.7/include/asm-mips/io.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/io.h +++ linux-2.6.24.7/include/asm-mips/io.h @@ -15,6 +15,7 @@ #include <linux/compiler.h> #include <linux/kernel.h> #include <linux/types.h> +#include <linux/irqflags.h> #include <asm/addrspace.h> #include <asm/byteorder.h> Index: linux-2.6.24.7/include/asm-mips/linkage.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/linkage.h +++ linux-2.6.24.7/include/asm-mips/linkage.h @@ -3,6 +3,11 @@ #ifdef __ASSEMBLY__ #include <asm/asm.h> + +/* FASTCALL stuff */ +#define FASTCALL(x) x +#define fastcall + #endif #define __weak __attribute__((weak)) Index: linux-2.6.24.7/include/asm-mips/m48t35.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/m48t35.h +++ linux-2.6.24.7/include/asm-mips/m48t35.h @@ -6,7 +6,7 @@ #include <linux/spinlock.h> -extern spinlock_t rtc_lock; +extern raw_spinlock_t rtc_lock; struct m48t35_rtc { volatile u8 pad[0x7ff8]; /* starts at 0x7ff8 */ Index: linux-2.6.24.7/include/asm-mips/rwsem.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-mips/rwsem.h @@ -0,0 +1,176 @@ +/* + * include/asm-mips/rwsem.h: R/W semaphores for MIPS using the stuff + * in lib/rwsem.c. 
Adapted largely from include/asm-ppc/rwsem.h + * by john.cooper@timesys.com + */ + +#ifndef _MIPS_RWSEM_H +#define _MIPS_RWSEM_H + +#ifndef _LINUX_RWSEM_H +#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead" +#endif + +#ifdef __KERNEL__ +#include <linux/list.h> +#include <linux/spinlock.h> +#include <asm/atomic.h> +#include <asm/system.h> + +/* + * the semaphore definition + */ +struct compat_rw_semaphore { + /* XXX this should be able to be an atomic_t -- paulus */ + signed long count; +#define RWSEM_UNLOCKED_VALUE 0x00000000 +#define RWSEM_ACTIVE_BIAS 0x00000001 +#define RWSEM_ACTIVE_MASK 0x0000ffff +#define RWSEM_WAITING_BIAS (-0x00010000) +#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS +#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) + raw_spinlock_t wait_lock; + struct list_head wait_list; +#if RWSEM_DEBUG + int debug; +#endif +}; + +/* + * initialisation + */ +#if RWSEM_DEBUG +#define __RWSEM_DEBUG_INIT , 0 +#else +#define __RWSEM_DEBUG_INIT /* */ +#endif + +#define __COMPAT_RWSEM_INITIALIZER(name) \ + { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \ + LIST_HEAD_INIT((name).wait_list) \ + __RWSEM_DEBUG_INIT } + +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __COMPAT_RWSEM_INITIALIZER(name) + +extern struct compat_rw_semaphore *rwsem_down_read_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_write_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_wake(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_downgrade_wake(struct compat_rw_semaphore *sem); + +static inline void compat_init_rwsem(struct compat_rw_semaphore *sem) +{ + sem->count = RWSEM_UNLOCKED_VALUE; + spin_lock_init(&sem->wait_lock); + INIT_LIST_HEAD(&sem->wait_list); +#if RWSEM_DEBUG + sem->debug = 0; +#endif +} + +/* + * lock for reading + */ +static inline void __down_read(struct compat_rw_semaphore *sem) +{ + if (atomic_inc_return((atomic_t *)(&sem->count)) > 0) + smp_wmb(); + else + rwsem_down_read_failed(sem); +} + +static inline int __down_read_trylock(struct compat_rw_semaphore *sem) +{ + int tmp; + + while ((tmp = sem->count) >= 0) { + if (tmp == cmpxchg(&sem->count, tmp, + tmp + RWSEM_ACTIVE_READ_BIAS)) { + smp_wmb(); + return 1; + } + } + return 0; +} + +/* + * lock for writing + */ +static inline void __down_write(struct compat_rw_semaphore *sem) +{ + int tmp; + + tmp = atomic_add_return(RWSEM_ACTIVE_WRITE_BIAS, + (atomic_t *)(&sem->count)); + if (tmp == RWSEM_ACTIVE_WRITE_BIAS) + smp_wmb(); + else + rwsem_down_write_failed(sem); +} + +static inline int __down_write_trylock(struct compat_rw_semaphore *sem) +{ + int tmp; + + tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE, + RWSEM_ACTIVE_WRITE_BIAS); + smp_wmb(); + return tmp == RWSEM_UNLOCKED_VALUE; +} + +/* + * unlock after reading + */ +static inline void __up_read(struct compat_rw_semaphore *sem) +{ + int tmp; + + smp_wmb(); + tmp = atomic_dec_return((atomic_t *)(&sem->count)); + if (tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0) + rwsem_wake(sem); +} + +/* + * unlock after writing + */ +static inline void __up_write(struct compat_rw_semaphore *sem) +{ + smp_wmb(); + if (atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS, + (atomic_t *)(&sem->count)) < 0) + rwsem_wake(sem); +} + +/* + * implement atomic add functionality + */ +static inline void rwsem_atomic_add(int delta, struct compat_rw_semaphore *sem) +{ + atomic_add(delta, (atomic_t *)(&sem->count)); +} + +/* + * downgrade write lock to read 
lock + */ +static inline void __downgrade_write(struct compat_rw_semaphore *sem) +{ + int tmp; + + smp_wmb(); + tmp = atomic_add_return(-RWSEM_WAITING_BIAS, (atomic_t *)(&sem->count)); + if (tmp < 0) + rwsem_downgrade_wake(sem); +} + +/* + * implement exchange and add functionality + */ +static inline int rwsem_atomic_update(int delta, struct compat_rw_semaphore *sem) +{ + smp_mb(); + return atomic_add_return(delta, (atomic_t *)(&sem->count)); +} + +#endif /* __KERNEL__ */ +#endif /* _MIPS_RWSEM_H */ Index: linux-2.6.24.7/include/asm-mips/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/semaphore.h +++ linux-2.6.24.7/include/asm-mips/semaphore.h @@ -47,38 +47,41 @@ struct compat_semaphore { wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name, count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name, count) +#define __COMPAT_MUTEX_INITIALIZER(name) \ + __COMPAT_SEMAPHORE_INITIALIZER(name, 1) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name, 1) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -static inline void sema_init(struct semaphore *sem, int val) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, 1) + +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { atomic_set(&sem->count, val); init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX(struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED(struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -extern void __down(struct semaphore * sem); -extern int __down_interruptible(struct semaphore * sem); -extern void __up(struct semaphore * sem); +extern void __compat_down(struct compat_semaphore * sem); +extern int __compat_down_interruptible(struct compat_semaphore * sem); +extern void __compat_up(struct compat_semaphore * sem); -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); @@ -111,6 +114,8 @@ static inline void compat_up(struct comp __compat_up(sem); } +extern int compat_sem_is_locked(struct compat_semaphore *sem); + #define compat_sema_count(sem) atomic_read(&(sem)->count) #include <linux/semaphore.h> Index: linux-2.6.24.7/include/asm-mips/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/spinlock.h +++ linux-2.6.24.7/include/asm-mips/spinlock.h @@ -28,7 +28,7 @@ * We make no fairness assumptions. They have a cost. 
*/ -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline void __raw_spin_lock(__raw_spinlock_t *lock) { unsigned int tmp; @@ -70,7 +70,7 @@ static inline void __raw_spin_lock(raw_s smp_llsc_mb(); } -static inline void __raw_spin_unlock(raw_spinlock_t *lock) +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) { smp_mb(); @@ -83,7 +83,7 @@ static inline void __raw_spin_unlock(raw : "memory"); } -static inline unsigned int __raw_spin_trylock(raw_spinlock_t *lock) +static inline unsigned int __raw_spin_trylock(__raw_spinlock_t *lock) { unsigned int temp, res; @@ -144,7 +144,7 @@ static inline unsigned int __raw_spin_tr */ #define __raw_write_can_lock(rw) (!(rw)->lock) -static inline void __raw_read_lock(raw_rwlock_t *rw) +static inline void __raw_read_lock(__raw_rwlock_t *rw) { unsigned int tmp; @@ -189,7 +189,7 @@ static inline void __raw_read_lock(raw_r /* Note the use of sub, not subu which will make the kernel die with an overflow exception if we ever try to unlock an rwlock that is already unlocked or is being held by a writer. */ -static inline void __raw_read_unlock(raw_rwlock_t *rw) +static inline void __raw_read_unlock(__raw_rwlock_t *rw) { unsigned int tmp; @@ -223,7 +223,7 @@ static inline void __raw_read_unlock(raw } } -static inline void __raw_write_lock(raw_rwlock_t *rw) +static inline void __raw_write_lock(__raw_rwlock_t *rw) { unsigned int tmp; @@ -265,7 +265,7 @@ static inline void __raw_write_lock(raw_ smp_llsc_mb(); } -static inline void __raw_write_unlock(raw_rwlock_t *rw) +static inline void __raw_write_unlock(__raw_rwlock_t *rw) { smp_mb(); @@ -277,7 +277,7 @@ static inline void __raw_write_unlock(ra : "memory"); } -static inline int __raw_read_trylock(raw_rwlock_t *rw) +static inline int __raw_read_trylock(__raw_rwlock_t *rw) { unsigned int tmp; int ret; @@ -321,7 +321,7 @@ static inline int __raw_read_trylock(raw return ret; } -static inline int __raw_write_trylock(raw_rwlock_t *rw) +static inline int __raw_write_trylock(__raw_rwlock_t *rw) { unsigned int tmp; int ret; Index: linux-2.6.24.7/include/asm-mips/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/spinlock_types.h +++ linux-2.6.24.7/include/asm-mips/spinlock_types.h @@ -7,13 +7,13 @@ typedef struct { volatile unsigned int lock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 0 } typedef struct { volatile unsigned int lock; -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { 0 } Index: linux-2.6.24.7/include/asm-mips/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/thread_info.h +++ linux-2.6.24.7/include/asm-mips/thread_info.h @@ -112,6 +112,7 @@ register struct thread_info *__current_t #define TIF_NEED_RESCHED 2 /* rescheduling necessary */ #define TIF_SYSCALL_AUDIT 3 /* syscall auditing active */ #define TIF_SECCOMP 4 /* secure computing */ +#define TIF_NEED_RESCHED_DELAYED 6 /* reschedule on return to userspace */ #define TIF_RESTORE_SIGMASK 9 /* restore signal mask in do_signal() */ #define TIF_USEDFPU 16 /* FPU was used by this task this quantum (SMP) */ #define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling TIF_NEED_RESCHED */ @@ -129,6 +130,7 @@ register struct thread_info *__current_t #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) #define _TIF_SYSCALL_AUDIT (1<<TIF_SYSCALL_AUDIT) #define _TIF_SECCOMP (1<<TIF_SECCOMP) +#define _TIF_NEED_RESCHED_DELAYED 
(1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_RESTORE_SIGMASK (1<<TIF_RESTORE_SIGMASK) #define _TIF_USEDFPU (1<<TIF_USEDFPU) #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) Index: linux-2.6.24.7/include/asm-mips/time.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/time.h +++ linux-2.6.24.7/include/asm-mips/time.h @@ -19,7 +19,7 @@ #include <linux/clockchips.h> #include <linux/clocksource.h> -extern spinlock_t rtc_lock; +extern raw_spinlock_t rtc_lock; /* * RTC ops. By default, they point to weak no-op RTC functions. Index: linux-2.6.24.7/include/asm-mips/timeofday.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-mips/timeofday.h @@ -0,0 +1,5 @@ +#ifndef _ASM_MIPS_TIMEOFDAY_H +#define _ASM_MIPS_TIMEOFDAY_H +#include <asm-generic/timeofday.h> +#endif + Index: linux-2.6.24.7/include/asm-mips/uaccess.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/uaccess.h +++ linux-2.6.24.7/include/asm-mips/uaccess.h @@ -427,7 +427,6 @@ extern size_t __copy_user(void *__to, co const void *__cu_from; \ long __cu_len; \ \ - might_sleep(); \ __cu_to = (to); \ __cu_from = (from); \ __cu_len = (n); \ @@ -483,7 +482,6 @@ extern size_t __copy_user_inatomic(void const void *__cu_from; \ long __cu_len; \ \ - might_sleep(); \ __cu_to = (to); \ __cu_from = (from); \ __cu_len = (n); \ @@ -562,7 +560,6 @@ extern size_t __copy_user_inatomic(void const void __user *__cu_from; \ long __cu_len; \ \ - might_sleep(); \ __cu_to = (to); \ __cu_from = (from); \ __cu_len = (n); \ @@ -593,7 +590,6 @@ extern size_t __copy_user_inatomic(void const void __user *__cu_from; \ long __cu_len; \ \ - might_sleep(); \ __cu_to = (to); \ __cu_from = (from); \ __cu_len = (n); \ @@ -611,7 +607,6 @@ extern size_t __copy_user_inatomic(void const void __user *__cu_from; \ long __cu_len; \ \ - might_sleep(); \ __cu_to = (to); \ __cu_from = (from); \ __cu_len = (n); \ @@ -638,7 +633,6 @@ __clear_user(void __user *addr, __kernel { __kernel_size_t res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" "move\t$5, $0\n\t" @@ -687,7 +681,6 @@ __strncpy_from_user(char *__to, const ch { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" "move\t$5, %2\n\t" @@ -724,7 +717,6 @@ strncpy_from_user(char *__to, const char { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" "move\t$5, %2\n\t" @@ -743,7 +735,6 @@ static inline long __strlen_user(const c { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" __MODULE_JAL(__strlen_user_nocheck_asm) @@ -773,7 +764,6 @@ static inline long strlen_user(const cha { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" __MODULE_JAL(__strlen_user_asm) @@ -790,7 +780,6 @@ static inline long __strnlen_user(const { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" "move\t$5, %2\n\t" @@ -821,7 +810,6 @@ static inline long strnlen_user(const ch { long res; - might_sleep(); __asm__ __volatile__( "move\t$4, %1\n\t" "move\t$5, %2\n\t" 
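The new include/asm-mips/rwsem.h above packs the whole reader/writer state into one signed counter: each reader adds RWSEM_ACTIVE_BIAS, a writer additionally adds RWSEM_WAITING_BIAS, so a negative count signals a writer or queued waiters. The small user-space model below only illustrates that bias arithmetic with the constants copied from the header; it is not kernel code, and main()/printf() are obviously not part of the patch.

#include <stdio.h>

#define RWSEM_UNLOCKED_VALUE	0x00000000
#define RWSEM_ACTIVE_BIAS	0x00000001
#define RWSEM_ACTIVE_MASK	0x0000ffff
#define RWSEM_WAITING_BIAS	(-0x00010000)
#define RWSEM_ACTIVE_READ_BIAS	RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

int main(void)
{
	long count = RWSEM_UNLOCKED_VALUE;

	/* Readers: each adds +1, so the low 16 bits count active holders. */
	count += RWSEM_ACTIVE_READ_BIAS;
	count += RWSEM_ACTIVE_READ_BIAS;
	printf("two readers: count=%ld active=%ld\n", count, count & RWSEM_ACTIVE_MASK);

	/* A writer also adds the waiting bias, driving the count negative;
	 * a non-positive result on the fast paths sends callers into the
	 * rwsem_down_*_failed() slow paths declared in the header. */
	count = RWSEM_UNLOCKED_VALUE + RWSEM_ACTIVE_WRITE_BIAS;
	printf("one writer:  count=%ld\n", count);

	return 0;
}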
patches/preempt-realtime-x86_64.patch

 arch/x86/kernel/early_printk.c |    2 +-
 arch/x86/kernel/head64.c       |    6 +++++-
 arch/x86/kernel/i8259_64.c     |    2 +-
 arch/x86/kernel/io_apic_64.c   |   13 +++++++------
 arch/x86/kernel/nmi_64.c       |    2 ++
 arch/x86/kernel/process_64.c   |   21 ++++++++++++---------
 arch/x86/kernel/signal_64.c    |    7 +++++++
 arch/x86/kernel/smp_64.c       |   14 ++++++++++++--
 arch/x86/kernel/traps_64.c     |   13 ++++++-------
 include/asm-x86/acpi_64.h      |    4 ++--
 include/asm-x86/hw_irq_64.h    |    2 +-
 include/asm-x86/io_apic_64.h   |    2 +-
 include/asm-x86/spinlock_64.h  |    6 +++---
 include/asm-x86/tlbflush_64.h  |    8 +++++++-
 include/asm-x86/vgtod.h        |    2 +-
 15 files changed, 68 insertions(+), 36 deletions(-)

Index: linux-2.6.24.7/arch/x86/kernel/early_printk.c
===================================================================
--- linux-2.6.24.7.orig/arch/x86/kernel/early_printk.c
+++ linux-2.6.24.7/arch/x86/kernel/early_printk.c
@@ -198,7 +198,7 @@ static int early_console_initialized = 0
 
 void early_printk(const char *fmt, ...)
 {
-	char buf[512];
+	static char buf[512];
 	int n;
 	va_list ap;
 
Index: linux-2.6.24.7/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.24.7.orig/arch/x86/kernel/head64.c
+++ linux-2.6.24.7/arch/x86/kernel/head64.c
@@ -24,7 +24,11 @@ static void __init zap_identity_mappings
 {
 	pgd_t *pgd = pgd_offset_k(0UL);
 	pgd_clear(pgd);
-	__flush_tlb();
+	/*
+	 * preempt_disable/enable does not work this early in the
+	 * bootup yet:
+	 */
+	write_cr3(read_cr3());
 }
 
 /* Don't add a printk in there.
printk relies on the PDA which is not initialized Index: linux-2.6.24.7/arch/x86/kernel/i8259_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i8259_64.c +++ linux-2.6.24.7/arch/x86/kernel/i8259_64.c @@ -96,8 +96,8 @@ static void (*interrupt[NR_VECTORS - FIR */ static int i8259A_auto_eoi; -DEFINE_SPINLOCK(i8259A_lock); static void mask_and_ack_8259A(unsigned int); +DEFINE_RAW_SPINLOCK(i8259A_lock); static struct irq_chip i8259A_chip = { .name = "XT-PIC", Index: linux-2.6.24.7/arch/x86/kernel/io_apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_64.c @@ -91,8 +91,8 @@ int timer_over_8254 __initdata = 1; /* Where if anywhere is the i8259 connect in external int mode */ static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; -static DEFINE_SPINLOCK(ioapic_lock); -DEFINE_SPINLOCK(vector_lock); +static DEFINE_RAW_SPINLOCK(ioapic_lock); +DEFINE_RAW_SPINLOCK(vector_lock); /* * # of IRQ routing registers @@ -205,6 +205,9 @@ static inline void io_apic_sync(unsigned reg ACTION; \ io_apic_modify(entry->apic, reg); \ FINAL; \ + /* Force POST flush by reading: */ \ + reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \ + \ if (!entry->next) \ break; \ entry = irq_2_pin + entry->next; \ @@ -349,10 +352,8 @@ static void add_pin_to_irq(unsigned int static void name##_IO_APIC_irq (unsigned int irq) \ __DO_ACTION(R, ACTION, FINAL) -DO_ACTION( __mask, 0, |= 0x00010000, io_apic_sync(entry->apic) ) - /* mask = 1 */ -DO_ACTION( __unmask, 0, &= 0xfffeffff, ) - /* mask = 0 */ +DO_ACTION( __mask, 0, |= 0x00010000, ) /* mask = 1 */ +DO_ACTION( __unmask, 0, &= 0xfffeffff, ) /* mask = 0 */ DO_ACTION( __pcix_mask, 0, &= 0xffff7fff, ) /* edge */ DO_ACTION( __pcix_unmask, 0, = (reg & 0xfffeffff) | 0x00008000, ) /* level */ Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -68,7 +68,9 @@ static int endflag __initdata = 0; */ static __init void nmi_cpu_busy(void *data) { +#ifndef CONFIG_PREEMPT_RT local_irq_enable_in_hardirq(); +#endif /* Intentionally don't use cpu_relax here. This is to make sure that the performance counter really ticks, even if there is a simulator or similar that catches the Index: linux-2.6.24.7/arch/x86/kernel/process_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_64.c +++ linux-2.6.24.7/arch/x86/kernel/process_64.c @@ -115,7 +115,7 @@ static void default_idle(void) */ smp_mb(); local_irq_disable(); - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { /* Enables interrupts one instruction before HLT. x86 special cases this so there is no race. 
*/ safe_halt(); @@ -213,7 +213,7 @@ void cpu_idle (void) /* endless idle loop with no priority at all */ while (1) { tick_nohz_stop_sched_tick(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); if (__get_cpu_var(cpu_idle_state)) @@ -243,9 +243,11 @@ void cpu_idle (void) } tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); } } @@ -261,10 +263,10 @@ void cpu_idle (void) */ void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) { - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) __mwait(eax, ecx); } } @@ -272,10 +274,10 @@ void mwait_idle_with_hints(unsigned long /* Default MONITOR/MWAIT with no hints, used for default C1 state */ static void mwait_idle(void) { - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { __monitor((void *)¤t_thread_info()->flags, 0, 0); smp_mb(); - if (!need_resched()) + if (!need_resched() && !need_resched_delayed()) __sti_mwait(0, 0); else local_irq_enable(); @@ -393,7 +395,7 @@ void exit_thread(void) struct thread_struct *t = &me->thread; if (me->thread.io_bitmap_ptr) { - struct tss_struct *tss = &per_cpu(init_tss, get_cpu()); + struct tss_struct *tss; kfree(t->io_bitmap_ptr); t->io_bitmap_ptr = NULL; @@ -401,6 +403,7 @@ void exit_thread(void) /* * Careful, clear this in the TSS too: */ + tss = &per_cpu(init_tss, get_cpu()); memset(tss->io_bitmap, 0xff, t->io_bitmap_max); t->io_bitmap_max = 0; put_cpu(); Index: linux-2.6.24.7/arch/x86/kernel/signal_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/signal_64.c +++ linux-2.6.24.7/arch/x86/kernel/signal_64.c @@ -423,6 +423,13 @@ static void do_signal(struct pt_regs *re int signr; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux-2.6.24.7/arch/x86/kernel/smp_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smp_64.c +++ linux-2.6.24.7/arch/x86/kernel/smp_64.c @@ -56,7 +56,7 @@ union smp_flush_state { struct mm_struct *flush_mm; unsigned long flush_va; #define FLUSH_ALL -1ULL - spinlock_t tlbstate_lock; + raw_spinlock_t tlbstate_lock; }; char pad[SMP_CACHE_BYTES]; } ____cacheline_aligned; @@ -296,10 +296,20 @@ void smp_send_reschedule(int cpu) } /* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + send_IPI_allbutself(RESCHEDULE_VECTOR); +} + +/* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. 
*/ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); struct call_data_struct { void (*func) (void *info); Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -220,7 +220,7 @@ void dump_trace(struct task_struct *tsk, unsigned long *stack, const struct stacktrace_ops *ops, void *data) { - const unsigned cpu = get_cpu(); + const unsigned cpu = raw_smp_processor_id(); unsigned long *irqstack_end = (unsigned long*)cpu_pda(cpu)->irqstackptr; unsigned used = 0; struct thread_info *tinfo; @@ -311,7 +311,6 @@ void dump_trace(struct task_struct *tsk, tinfo = task_thread_info(tsk); HANDLE_STACK (valid_stack_ptr(tinfo, stack)); #undef HANDLE_STACK - put_cpu(); } EXPORT_SYMBOL(dump_trace); @@ -361,7 +360,7 @@ _show_stack(struct task_struct *tsk, str { unsigned long *stack; int i; - const int cpu = smp_processor_id(); + const int cpu = raw_smp_processor_id(); unsigned long *irqstack_end = (unsigned long *) (cpu_pda(cpu)->irqstackptr); unsigned long *irqstack = (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE); @@ -473,7 +472,7 @@ void out_of_line_bug(void) EXPORT_SYMBOL(out_of_line_bug); #endif -static raw_spinlock_t die_lock = __RAW_SPIN_LOCK_UNLOCKED; +static raw_spinlock_t die_lock = RAW_SPIN_LOCK_UNLOCKED(die_lock); static int die_owner = -1; static unsigned int die_nest_count; @@ -487,11 +486,11 @@ unsigned __kprobes long oops_begin(void) /* racy, but better than risking deadlock. */ raw_local_irq_save(flags); cpu = smp_processor_id(); - if (!__raw_spin_trylock(&die_lock)) { + if (!spin_trylock(&die_lock)) { if (cpu == die_owner) /* nested oops. should stop eventually */; else - __raw_spin_lock(&die_lock); + spin_lock(&die_lock); } die_nest_count++; die_owner = cpu; @@ -507,7 +506,7 @@ void __kprobes oops_end(unsigned long fl die_nest_count--; if (!die_nest_count) /* Nest count reaches zero, release the lock. 
*/ - __raw_spin_unlock(&die_lock); + spin_unlock(&die_lock); raw_local_irq_restore(flags); if (panic_on_oops) panic("Fatal exception"); Index: linux-2.6.24.7/include/asm-x86/acpi_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/acpi_64.h +++ linux-2.6.24.7/include/asm-x86/acpi_64.h @@ -51,8 +51,8 @@ #define ACPI_ASM_MACROS #define BREAKPOINT3 -#define ACPI_DISABLE_IRQS() local_irq_disable() -#define ACPI_ENABLE_IRQS() local_irq_enable() +#define ACPI_DISABLE_IRQS() local_irq_disable_nort() +#define ACPI_ENABLE_IRQS() local_irq_enable_nort() #define ACPI_FLUSH_CPU_CACHE() wbinvd() int __acpi_acquire_global_lock(unsigned int *lock); Index: linux-2.6.24.7/include/asm-x86/hw_irq_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/hw_irq_64.h +++ linux-2.6.24.7/include/asm-x86/hw_irq_64.h @@ -118,7 +118,7 @@ void i8254_timer_resume(void); typedef int vector_irq_t[NR_VECTORS]; DECLARE_PER_CPU(vector_irq_t, vector_irq); extern void __setup_vector_irq(int cpu); -extern spinlock_t vector_lock; +extern raw_spinlock_t vector_lock; /* * Various low-level irq details needed by irq.c, process.c, Index: linux-2.6.24.7/include/asm-x86/io_apic_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/io_apic_64.h +++ linux-2.6.24.7/include/asm-x86/io_apic_64.h @@ -131,7 +131,7 @@ extern int sis_apic_bug; /* dummy */ void enable_NMI_through_LVT0(void); -extern spinlock_t i8259A_lock; +extern raw_spinlock_t i8259A_lock; extern int timer_over_8254; Index: linux-2.6.24.7/include/asm-x86/spinlock_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_64.h +++ linux-2.6.24.7/include/asm-x86/spinlock_64.h @@ -160,8 +160,8 @@ static inline void __raw_write_unlock(__ : "=m" (rw->lock) : : "memory"); } -#define _raw_spin_relax(lock) cpu_relax() -#define _raw_read_relax(lock) cpu_relax() -#define _raw_write_relax(lock) cpu_relax() +#define __raw_spin_relax(lock) cpu_relax() +#define __raw_read_relax(lock) cpu_relax() +#define __raw_write_relax(lock) cpu_relax() #endif /* __ASM_SPINLOCK_H */ Index: linux-2.6.24.7/include/asm-x86/tlbflush_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/tlbflush_64.h +++ linux-2.6.24.7/include/asm-x86/tlbflush_64.h @@ -8,14 +8,20 @@ static inline void __flush_tlb(void) { + preempt_disable(); write_cr3(read_cr3()); + preempt_enable(); } static inline void __flush_tlb_all(void) { - unsigned long cr4 = read_cr4(); + unsigned long cr4; + + preempt_disable(); + cr4 = read_cr4(); write_cr4(cr4 & ~X86_CR4_PGE); /* clear PGE */ write_cr4(cr4); /* write old PGE again and flush TLBs */ + preempt_enable(); } #define __flush_tlb_one(addr) \ Index: linux-2.6.24.7/include/asm-x86/vgtod.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/vgtod.h +++ linux-2.6.24.7/include/asm-x86/vgtod.h @@ -5,7 +5,7 @@ #include <linux/clocksource.h> struct vsyscall_gtod_data { - seqlock_t lock; + raw_seqlock_t lock; /* open coded 'struct timespec' */ time_t wall_time_sec; 
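The tlbflush_64.h hunk above shows why these mechanical-looking changes matter: __flush_tlb() is a read-modify-write of the current CPU's CR3, and on a fully preemptible kernel the task could migrate between read_cr3() and write_cr3() and end up flushing the wrong CPU's TLB (head64.c open-codes the same sequence because preempt_disable() is not usable that early in boot). The fragment below, with the hypothetical name demo_flush_tlb(), merely restates that pattern in isolation as a sketch.

static inline void demo_flush_tlb(void)
{
	preempt_disable();		/* no migration between the read ...      */
	write_cr3(read_cr3());		/* ... and the write that flushes the TLB */
	preempt_enable();
}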
patches/preempt-realtime-ia64.patch

Hi,

This is a first version of my port of Ingo's -rt kernel to the IA64 arch.

So far the kernel boots with PREEMPT_RT enabled (on a 4-CPU tiger), and
that's about it. I have not done extensive tests (only scripts/rt-tester),
nor any measurements of any kind. There are very probably many bugs I am
not aware of.

But there is already one thing I know should be fixed: I have changed the
declaration of (struct zone).lock (in include/linux/mmzone.h) from
spinlock_t to raw_spinlock_t. I did this because on IA64, cpu_idle(),
which is not allowed to call schedule(), calls check_pgt_cache(). I guess
this could be fixed by moving this call to another kernel thread... ideas
are welcome.

	Simon.

Signed-off-by: Simon.Derr@bull.net

 arch/ia64/Kconfig                 |   64 +++++++++++++++++++++++++
 arch/ia64/kernel/asm-offsets.c    |    2 
 arch/ia64/kernel/entry.S          |   25 +++++-----
 arch/ia64/kernel/fsys.S           |   21 ++++++++
 arch/ia64/kernel/iosapic.c        |   33 ++++++++++++-
 arch/ia64/kernel/mca.c            |    2 
 arch/ia64/kernel/perfmon.c        |    6 +-
 arch/ia64/kernel/process.c        |   14 +++--
 arch/ia64/kernel/sal.c            |    2 
 arch/ia64/kernel/salinfo.c        |    6 +-
 arch/ia64/kernel/semaphore.c      |    8 +--
 arch/ia64/kernel/signal.c         |    8 +++
 arch/ia64/kernel/smp.c            |   16 ++++++
 arch/ia64/kernel/smpboot.c        |    3 +
 arch/ia64/kernel/time.c           |   74 +++++++++++++++++++----------
 arch/ia64/kernel/traps.c          |   10 ++--
 arch/ia64/kernel/unwind.c         |    4 -
 arch/ia64/kernel/unwind_i.h       |    2 
 arch/ia64/mm/init.c               |    2 
 arch/ia64/mm/tlb.c                |    2 
 include/asm-ia64/irqflags.h       |   95 ++++++++++++++++++++++++++++++++++++++
 include/asm-ia64/mmu_context.h    |    2 
 include/asm-ia64/percpu.h         |   21 +++++++-
 include/asm-ia64/processor.h      |    6 +-
 include/asm-ia64/rtc.h            |    7 ++
 include/asm-ia64/rwsem.h          |   32 ++++++------
 include/asm-ia64/sal.h            |    2 
 include/asm-ia64/semaphore.h      |   51 ++++++++++++--------
 include/asm-ia64/spinlock.h       |   26 ++++------
 include/asm-ia64/spinlock_types.h |    4 -
 include/asm-ia64/system.h         |   67 --------------------------
 include/asm-ia64/thread_info.h    |    1 
 include/asm-ia64/tlb.h            |   10 ++--
 33 files changed, 436 insertions(+), 192 deletions(-)

Index: linux-2.6.24.7/arch/ia64/Kconfig
===================================================================
--- linux-2.6.24.7.orig/arch/ia64/Kconfig
+++ linux-2.6.24.7/arch/ia64/Kconfig
@@ -44,6 +44,7 @@ config SWIOTLB
 
 config RWSEM_XCHGADD_ALGORITHM
 	bool
+	depends on !PREEMPT_RT
 	default y
 
 config ARCH_HAS_ILOG2_U32
@@ -280,6 +281,69 @@ config SMP
 
 	  If you don't know what to do here, say N.
 
+
+config GENERIC_TIME
+	bool
+	default y
+
+config HIGH_RES_TIMERS
+	bool "High-Resolution Timers"
+	help
+
+	  POSIX timers are available by default. This option enables
+	  high-resolution POSIX timers. With this option the resolution
+	  is at least 1 microsecond. High resolution is not free.
If + enabled this option will add a small overhead each time a + timer expires that is not on a 1/HZ tick boundary. If no such + timers are used the overhead is nil. + + This option enables two additional POSIX CLOCKS, + CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR. Note that this + option does not change the resolution of CLOCK_REALTIME or + CLOCK_MONOTONIC which remain at 1/HZ resolution. + +config HIGH_RES_RESOLUTION + int "High-Resolution-Timer resolution (nanoseconds)" + depends on HIGH_RES_TIMERS + default 1000 + help + + This sets the resolution of timers accessed with + CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR. Too + fine a resolution (small a number) will usually not + be observable due to normal system latencies. For an + 800 MHZ processor about 10,000 is the recommended maximum + (smallest number). If you don't need that sort of resolution, + higher numbers may generate less overhead. + +choice + prompt "Clock source" + depends on HIGH_RES_TIMERS + default HIGH_RES_TIMER_ITC + help + This option allows you to choose the hardware source in charge + of generating high precision interruptions on your system. + On IA-64 these are: + + <timer> <resolution> + ITC Interval Time Counter 1/CPU clock + HPET High Precision Event Timer ~ (XXX:have to check the spec) + + The ITC timer is available on all the ia64 computers because + it is integrated directly into the processor. However it may not + give correct results on MP machines with processors running + at different clock rates. In this case you may want to use + the HPET if available on your machine. + + +config HIGH_RES_TIMER_ITC + bool "Interval Time Counter/ITC" + +config HIGH_RES_TIMER_HPET + bool "High Precision Event Timer/HPET" + +endchoice + config NR_CPUS int "Maximum number of CPUs (2-1024)" range 2 1024 Index: linux-2.6.24.7/arch/ia64/kernel/asm-offsets.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/asm-offsets.c +++ linux-2.6.24.7/arch/ia64/kernel/asm-offsets.c @@ -257,6 +257,7 @@ void foo(void) offsetof (struct pal_min_state_area_s, pmsa_xip)); BLANK(); +#ifdef CONFIG_TIME_INTERPOLATION /* used by fsys_gettimeofday in arch/ia64/kernel/fsys.S */ DEFINE(IA64_GTOD_LOCK_OFFSET, offsetof (struct fsyscall_gtod_data_t, lock)); @@ -278,4 +279,5 @@ void foo(void) offsetof (struct itc_jitter_data_t, itc_jitter)); DEFINE(IA64_ITC_LASTCYCLE_OFFSET, offsetof (struct itc_jitter_data_t, itc_lastcycle)); +#endif } Index: linux-2.6.24.7/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/entry.S +++ linux-2.6.24.7/arch/ia64/kernel/entry.S @@ -1098,23 +1098,24 @@ skip_rbs_switch: st8 [r2]=r8 st8 [r3]=r10 .work_pending: - tbit.z p6,p0=r31,TIF_NEED_RESCHED // current_thread_info()->need_resched==0? + tbit.nz p6,p0=r31,TIF_NEED_RESCHED // current_thread_info()->need_resched==0? +(p6) br.cond.sptk.few .needresched + tbit.z p6,p0=r31,TIF_NEED_RESCHED_DELAYED // current_thread_info()->need_resched_delayed==0? 
(p6) br.cond.sptk.few .notify -#ifdef CONFIG_PREEMPT -(pKStk) dep r21=-1,r0,PREEMPT_ACTIVE_BIT,1 + +.needresched: + +(pKStk) br.cond.sptk.many .fromkernel ;; -(pKStk) st4 [r20]=r21 ssm psr.i // enable interrupts -#endif br.call.spnt.many rp=schedule -.ret9: cmp.eq p6,p0=r0,r0 // p6 <- 1 - rsm psr.i // disable interrupts - ;; -#ifdef CONFIG_PREEMPT -(pKStk) adds r20=TI_PRE_COUNT+IA64_TASK_SIZE,r13 +.ret9a: rsm psr.i // disable interrupts ;; -(pKStk) st4 [r20]=r0 // preempt_count() <- 0 -#endif + br.cond.sptk.many .endpreemptdep +.fromkernel: + br.call.spnt.many rp=preempt_schedule_irq +.ret9b: rsm psr.i // disable interrupts +.endpreemptdep: (pLvSys)br.cond.sptk.few .work_pending_syscall_end br.cond.sptk.many .work_processed_kernel // re-check Index: linux-2.6.24.7/arch/ia64/kernel/fsys.S =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/fsys.S +++ linux-2.6.24.7/arch/ia64/kernel/fsys.S @@ -26,6 +26,7 @@ #include "entry.h" +#ifdef CONFIG_TIME_INTERPOLATION /* * See Documentation/ia64/fsys.txt for details on fsyscalls. * @@ -349,6 +350,26 @@ ENTRY(fsys_clock_gettime) br.many .gettime END(fsys_clock_gettime) + +#else // !CONFIG_TIME_INTERPOLATION + +# define fsys_gettimeofday 0 +# define fsys_clock_gettime 0 + +.fail_einval: + mov r8 = EINVAL + mov r10 = -1 + FSYS_RETURN + +.fail_efault: + mov r8 = EFAULT + mov r10 = -1 + FSYS_RETURN + +#endif + + + /* * long fsys_rt_sigprocmask (int how, sigset_t *set, sigset_t *oset, size_t sigsetsize). */ Index: linux-2.6.24.7/arch/ia64/kernel/iosapic.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/iosapic.c +++ linux-2.6.24.7/arch/ia64/kernel/iosapic.c @@ -111,7 +111,7 @@ (PAGE_SIZE / sizeof(struct iosapic_rte_info)) #define RTE_PREALLOCATED (1) -static DEFINE_SPINLOCK(iosapic_lock); +static DEFINE_RAW_SPINLOCK(iosapic_lock); /* * These tables map IA-64 vectors to the IOSAPIC pin that generates this @@ -390,6 +390,34 @@ iosapic_startup_level_irq (unsigned int return 0; } +/* + * In the preemptible case mask the IRQ first then handle it and ack it. 
+ */ +#ifdef CONFIG_PREEMPT_HARDIRQS + +static void +iosapic_ack_level_irq (unsigned int irq) +{ + ia64_vector vec = irq_to_vector(irq); + struct iosapic_rte_info *rte; + + move_irq(irq); + mask_irq(irq); + list_for_each_entry(rte, &iosapic_intr_info[vec].rtes, rte_list) + iosapic_eoi(rte->addr, vec); +} + +static void +iosapic_end_level_irq (unsigned int irq) +{ + if (!(irq_desc[irq].status & IRQ_INPROGRESS)) + unmask_irq(irq); +} + +#else /* !CONFIG_PREEMPT_HARDIRQS */ + +#define iosapic_ack_level_irq nop + static void iosapic_end_level_irq (unsigned int irq) { @@ -411,10 +439,11 @@ iosapic_end_level_irq (unsigned int irq) } } +#endif + #define iosapic_shutdown_level_irq mask_irq #define iosapic_enable_level_irq unmask_irq #define iosapic_disable_level_irq mask_irq -#define iosapic_ack_level_irq nop static struct irq_chip irq_type_iosapic_level = { .name = "IO-SAPIC-level", Index: linux-2.6.24.7/arch/ia64/kernel/mca.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/mca.c +++ linux-2.6.24.7/arch/ia64/kernel/mca.c @@ -323,7 +323,7 @@ ia64_mca_spin(const char *func) typedef struct ia64_state_log_s { - spinlock_t isl_lock; + raw_spinlock_t isl_lock; int isl_index; unsigned long isl_count; ia64_err_rec_t *isl_log[IA64_MAX_LOGS]; /* need space to store header + error log */ Index: linux-2.6.24.7/arch/ia64/kernel/perfmon.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/perfmon.c +++ linux-2.6.24.7/arch/ia64/kernel/perfmon.c @@ -280,7 +280,7 @@ typedef struct { */ typedef struct pfm_context { - spinlock_t ctx_lock; /* context protection */ + raw_spinlock_t ctx_lock; /* context protection */ pfm_context_flags_t ctx_flags; /* bitmask of flags (block reason incl.) */ unsigned int ctx_state; /* state: active/inactive (no bitfield) */ @@ -369,7 +369,7 @@ typedef struct pfm_context { * mostly used to synchronize between system wide and per-process */ typedef struct { - spinlock_t pfs_lock; /* lock the structure */ + raw_spinlock_t pfs_lock; /* lock the structure */ unsigned int pfs_task_sessions; /* number of per task sessions */ unsigned int pfs_sys_sessions; /* number of per system wide sessions */ @@ -510,7 +510,7 @@ static pfm_intr_handler_desc_t *pfm_alt static struct proc_dir_entry *perfmon_dir; static pfm_uuid_t pfm_null_uuid = {0,}; -static spinlock_t pfm_buffer_fmt_lock; +static raw_spinlock_t pfm_buffer_fmt_lock; static LIST_HEAD(pfm_buffer_fmt_list); static pmu_config_t *pmu_conf; Index: linux-2.6.24.7/arch/ia64/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/process.c +++ linux-2.6.24.7/arch/ia64/kernel/process.c @@ -95,6 +95,9 @@ show_stack (struct task_struct *task, un void dump_stack (void) { + if (irqs_disabled()) { + printk("Uh oh.. 
entering dump_stack() with irqs disabled.\n"); + } show_stack(NULL, NULL); } @@ -200,7 +203,7 @@ void default_idle (void) { local_irq_enable(); - while (!need_resched()) { + while (!need_resched() && !need_resched_delayed()) { if (can_do_pal_halt) { local_irq_disable(); if (!need_resched()) { @@ -288,7 +291,7 @@ cpu_idle (void) current_thread_info()->status |= TS_POLLING; } - if (!need_resched()) { + if (!need_resched() && !need_resched_delayed()) { void (*idle)(void); #ifdef CONFIG_SMP min_xtp(); @@ -310,10 +313,11 @@ cpu_idle (void) normal_xtp(); #endif } - preempt_enable_no_resched(); - schedule(); + __preempt_enable_no_resched(); + __schedule(); + preempt_disable(); - check_pgt_cache(); + if (cpu_is_offline(cpu)) play_dead(); } Index: linux-2.6.24.7/arch/ia64/kernel/sal.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/sal.c +++ linux-2.6.24.7/arch/ia64/kernel/sal.c @@ -18,7 +18,7 @@ #include <asm/sal.h> #include <asm/pal.h> - __cacheline_aligned DEFINE_SPINLOCK(sal_lock); + __cacheline_aligned DEFINE_RAW_SPINLOCK(sal_lock); unsigned long sal_platform_features; unsigned short sal_revision; Index: linux-2.6.24.7/arch/ia64/kernel/salinfo.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/salinfo.c +++ linux-2.6.24.7/arch/ia64/kernel/salinfo.c @@ -140,7 +140,7 @@ enum salinfo_state { struct salinfo_data { cpumask_t cpu_event; /* which cpus have outstanding events */ - struct semaphore mutex; + struct compat_semaphore mutex; u8 *log_buffer; u64 log_size; u8 *oemdata; /* decoded oem data */ @@ -156,8 +156,8 @@ struct salinfo_data { static struct salinfo_data salinfo_data[ARRAY_SIZE(salinfo_log_name)]; -static DEFINE_SPINLOCK(data_lock); -static DEFINE_SPINLOCK(data_saved_lock); +static DEFINE_RAW_SPINLOCK(data_lock); +static DEFINE_RAW_SPINLOCK(data_saved_lock); /** salinfo_platform_oemdata - optional callback to decode oemdata from an error * record. Index: linux-2.6.24.7/arch/ia64/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/semaphore.c +++ linux-2.6.24.7/arch/ia64/kernel/semaphore.c @@ -40,12 +40,12 @@ */ void -__up (struct semaphore *sem) +__up (struct compat_semaphore *sem) { wake_up(&sem->wait); } -void __sched __down (struct semaphore *sem) +void __sched __down (struct compat_semaphore *sem) { struct task_struct *tsk = current; DECLARE_WAITQUEUE(wait, tsk); @@ -82,7 +82,7 @@ void __sched __down (struct semaphore *s tsk->state = TASK_RUNNING; } -int __sched __down_interruptible (struct semaphore * sem) +int __sched __down_interruptible (struct compat_semaphore * sem) { int retval = 0; struct task_struct *tsk = current; @@ -142,7 +142,7 @@ int __sched __down_interruptible (struct * count. */ int -__down_trylock (struct semaphore *sem) +__down_trylock (struct compat_semaphore *sem) { unsigned long flags; int sleepers; Index: linux-2.6.24.7/arch/ia64/kernel/signal.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/signal.c +++ linux-2.6.24.7/arch/ia64/kernel/signal.c @@ -438,6 +438,14 @@ ia64_do_signal (struct sigscratch *scr, long errno = scr->pt.r8; # define ERR_CODE(c) (IS_IA32_PROCESS(&scr->pt) ? 
-(c) : (c)) +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif + /* * In the ia64_leave_kernel code path, we want the common case to go fast, which * is why we may in certain cases get here from kernel mode. Just return without Index: linux-2.6.24.7/arch/ia64/kernel/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/smp.c +++ linux-2.6.24.7/arch/ia64/kernel/smp.c @@ -261,6 +261,22 @@ smp_send_reschedule (int cpu) } /* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + unsigned int cpu; + + for_each_online_cpu(cpu) { + if (cpu != smp_processor_id()) + platform_send_ipi(cpu, IA64_IPI_RESCHEDULE, + IA64_IPI_DM_INT, 0); + } +} + +/* * Called with preemption disabled. */ static void Index: linux-2.6.24.7/arch/ia64/kernel/smpboot.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/smpboot.c +++ linux-2.6.24.7/arch/ia64/kernel/smpboot.c @@ -372,6 +372,8 @@ smp_setup_percpu_timer (void) { } +extern void register_itc_clockevent(void); + static void __cpuinit smp_callin (void) { @@ -450,6 +452,7 @@ smp_callin (void) #ifdef CONFIG_IA32_SUPPORT ia32_gdt_init(); #endif + register_itc_clockevent(); /* * Allow the master to continue. Index: linux-2.6.24.7/arch/ia64/kernel/time.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/time.c +++ linux-2.6.24.7/arch/ia64/kernel/time.c @@ -70,6 +70,7 @@ timer_interrupt (int irq, void *dev_id) platform_timer_interrupt(irq, dev_id); +#if 0 new_itm = local_cpu_data->itm_next; if (!time_after(ia64_get_itc(), new_itm)) @@ -77,29 +78,48 @@ timer_interrupt (int irq, void *dev_id) ia64_get_itc(), new_itm); profile_tick(CPU_PROFILING); +#endif + + if (time_after(ia64_get_itc(), local_cpu_data->itm_tick_next)) { - while (1) { - update_process_times(user_mode(get_irq_regs())); + unsigned long new_tick_itm; + new_tick_itm = local_cpu_data->itm_tick_next; - new_itm += local_cpu_data->itm_delta; + profile_tick(CPU_PROFILING, get_irq_regs()); - if (smp_processor_id() == time_keeper_id) { - /* - * Here we are in the timer irq handler. We have irqs locally - * disabled, but we don't know if the timer_bh is running on - * another CPU. We need to avoid to SMP race by acquiring the - * xtime_lock. - */ - write_seqlock(&xtime_lock); - do_timer(1); - local_cpu_data->itm_next = new_itm; - write_sequnlock(&xtime_lock); - } else - local_cpu_data->itm_next = new_itm; + while (1) { + update_process_times(user_mode(get_irq_regs())); + + new_tick_itm += local_cpu_data->itm_tick_delta; + + if (smp_processor_id() == time_keeper_id) { + /* + * Here we are in the timer irq handler. We have irqs locally + * disabled, but we don't know if the timer_bh is running on + * another CPU. We need to avoid to SMP race by acquiring the + * xtime_lock. 
+ */ + write_seqlock(&xtime_lock); + do_timer(get_irq_regs()); + local_cpu_data->itm_tick_next = new_tick_itm; + write_sequnlock(&xtime_lock); + } else + local_cpu_data->itm_tick_next = new_tick_itm; + + if (time_after(new_tick_itm, ia64_get_itc())) + break; + } + } - if (time_after(new_itm, ia64_get_itc())) - break; + if (time_after(ia64_get_itc(), local_cpu_data->itm_timer_next)) { + if (itc_clockevent.event_handler) + itc_clockevent.event_handler(get_irq_regs()); + // FIXME, really, please + new_itm = local_cpu_data->itm_tick_next; + + if (time_after(new_itm, local_cpu_data->itm_timer_next)) + new_itm = local_cpu_data->itm_timer_next; /* * Allow IPIs to interrupt the timer loop. */ @@ -117,8 +137,8 @@ timer_interrupt (int irq, void *dev_id) * too fast (with the potentially devastating effect * of losing monotony of time). */ - while (!time_after(new_itm, ia64_get_itc() + local_cpu_data->itm_delta/2)) - new_itm += local_cpu_data->itm_delta; + while (!time_after(new_itm, ia64_get_itc() + local_cpu_data->itm_tick_delta/2)) + new_itm += local_cpu_data->itm_tick_delta; ia64_set_itm(new_itm); /* double check, in case we got hit by a (slow) PMI: */ } while (time_after_eq(ia64_get_itc(), new_itm)); @@ -137,7 +157,7 @@ ia64_cpu_local_tick (void) /* arrange for the cycle counter to generate a timer interrupt: */ ia64_set_itv(IA64_TIMER_VECTOR); - delta = local_cpu_data->itm_delta; + delta = local_cpu_data->itm_tick_delta; /* * Stagger the timer tick for each CPU so they don't occur all at (almost) the * same time: @@ -146,8 +166,8 @@ ia64_cpu_local_tick (void) unsigned long hi = 1UL << ia64_fls(cpu); shift = (2*(cpu - hi) + 1) * delta/hi/2; } - local_cpu_data->itm_next = ia64_get_itc() + delta + shift; - ia64_set_itm(local_cpu_data->itm_next); + local_cpu_data->itm_tick_next = ia64_get_itc() + delta + shift; + ia64_set_itm(local_cpu_data->itm_tick_next); } static int nojitter; @@ -205,7 +225,7 @@ ia64_init_itm (void) itc_freq = (platform_base_freq*itc_ratio.num)/itc_ratio.den; - local_cpu_data->itm_delta = (itc_freq + HZ/2) / HZ; + local_cpu_data->itm_tick_delta = (itc_freq + HZ/2) / HZ; printk(KERN_DEBUG "CPU %d: base freq=%lu.%03luMHz, ITC ratio=%u/%u, " "ITC freq=%lu.%03luMHz", smp_processor_id(), platform_base_freq / 1000000, (platform_base_freq / 1000) % 1000, @@ -225,6 +245,7 @@ ia64_init_itm (void) local_cpu_data->nsec_per_cyc = ((NSEC_PER_SEC<<IA64_NSEC_PER_CYC_SHIFT) + itc_freq/2)/itc_freq; +#ifdef CONFIG_TIME_INTERPOLATION if (!(sal_platform_features & IA64_SAL_PLATFORM_FEATURE_ITC_DRIFT)) { #ifdef CONFIG_SMP /* On IA64 in an SMP configuration ITCs are never accurately synchronized. @@ -297,7 +318,7 @@ static cycle_t itc_get_cycles(void) static struct irqaction timer_irqaction = { .handler = timer_interrupt, - .flags = IRQF_DISABLED | IRQF_IRQPOLL, + .flags = IRQF_DISABLED | IRQF_IRQPOLL | IRQF_NODELAY, .name = "timer" }; @@ -318,6 +339,8 @@ time_init (void) * tv_nsec field must be normalized (i.e., 0 <= nsec < NSEC_PER_SEC). 
*/ set_normalized_timespec(&wall_to_monotonic, -xtime.tv_sec, -xtime.tv_nsec); + register_itc_clocksource(); + register_itc_clockevent(); } /* @@ -402,6 +425,7 @@ void update_vsyscall(struct timespec *wa fsyscall_gtod_data.monotonic_time.tv_nsec -= NSEC_PER_SEC; fsyscall_gtod_data.monotonic_time.tv_sec++; } +#endif write_sequnlock_irqrestore(&fsyscall_gtod_data.lock, flags); } Index: linux-2.6.24.7/arch/ia64/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/traps.c +++ linux-2.6.24.7/arch/ia64/kernel/traps.c @@ -39,11 +39,11 @@ void die (const char *str, struct pt_regs *regs, long err) { static struct { - spinlock_t lock; + raw_spinlock_t lock; u32 lock_owner; int lock_owner_depth; } die = { - .lock = __SPIN_LOCK_UNLOCKED(die.lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(die.lock), .lock_owner = -1, .lock_owner_depth = 0 }; @@ -181,7 +181,7 @@ __kprobes ia64_bad_break (unsigned long * access to fph by the time we get here, as the IVT's "Disabled FP-Register" handler takes * care of clearing psr.dfh. */ -static inline void +void disabled_fph_fault (struct pt_regs *regs) { struct ia64_psr *psr = ia64_psr(regs); @@ -200,7 +200,7 @@ disabled_fph_fault (struct pt_regs *regs = (struct task_struct *)ia64_get_kr(IA64_KR_FPU_OWNER); if (ia64_is_local_fpu_owner(current)) { - preempt_enable_no_resched(); + __preempt_enable_no_resched(); return; } @@ -220,7 +220,7 @@ disabled_fph_fault (struct pt_regs *regs */ psr->mfh = 1; } - preempt_enable_no_resched(); + __preempt_enable_no_resched(); } static inline int Index: linux-2.6.24.7/arch/ia64/kernel/unwind.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/unwind.c +++ linux-2.6.24.7/arch/ia64/kernel/unwind.c @@ -82,7 +82,7 @@ typedef unsigned long unw_word; typedef unsigned char unw_hash_index_t; static struct { - spinlock_t lock; /* spinlock for unwind data */ + raw_spinlock_t lock; /* spinlock for unwind data */ /* list of unwind tables (one per load-module) */ struct unw_table *tables; @@ -146,7 +146,7 @@ static struct { # endif } unw = { .tables = &unw.kernel_table, - .lock = __SPIN_LOCK_UNLOCKED(unw.lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(unw.lock), .save_order = { UNW_REG_RP, UNW_REG_PFS, UNW_REG_PSP, UNW_REG_PR, UNW_REG_UNAT, UNW_REG_LC, UNW_REG_FPSR, UNW_REG_PRI_UNAT_GR Index: linux-2.6.24.7/arch/ia64/kernel/unwind_i.h =================================================================== --- linux-2.6.24.7.orig/arch/ia64/kernel/unwind_i.h +++ linux-2.6.24.7/arch/ia64/kernel/unwind_i.h @@ -154,7 +154,7 @@ struct unw_script { unsigned long ip; /* ip this script is for */ unsigned long pr_mask; /* mask of predicates script depends on */ unsigned long pr_val; /* predicate values this script is for */ - rwlock_t lock; + raw_rwlock_t lock; unsigned int flags; /* see UNW_FLAG_* in unwind.h */ unsigned short lru_chain; /* used for least-recently-used chain */ unsigned short coll_chain; /* used for hash collisions */ Index: linux-2.6.24.7/arch/ia64/mm/init.c =================================================================== --- linux-2.6.24.7.orig/arch/ia64/mm/init.c +++ linux-2.6.24.7/arch/ia64/mm/init.c @@ -37,7 +37,7 @@ #include <asm/unistd.h> #include <asm/mca.h> -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); extern void ia64_tlb_init (void); Index: linux-2.6.24.7/arch/ia64/mm/tlb.c =================================================================== --- 
linux-2.6.24.7.orig/arch/ia64/mm/tlb.c +++ linux-2.6.24.7/arch/ia64/mm/tlb.c @@ -33,7 +33,7 @@ static struct { } purge; struct ia64_ctx ia64_ctx = { - .lock = __SPIN_LOCK_UNLOCKED(ia64_ctx.lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(ia64_ctx.lock), .next = 1, .max_ctx = ~0U }; Index: linux-2.6.24.7/include/asm-ia64/irqflags.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-ia64/irqflags.h @@ -0,0 +1,95 @@ + +/* + * include/asm-i64/irqflags.h + * + * IRQ flags handling + * + * This file gets included from lowlevel asm headers too, to provide + * wrapped versions of the local_irq_*() APIs, based on the + * raw_local_irq_*() macros from the lowlevel headers. + */ +#ifndef _ASM_IRQFLAGS_H +#define _ASM_IRQFLAGS_H + +/* For spinlocks etc */ + +/* + * - clearing psr.i is implicitly serialized (visible by next insn) + * - setting psr.i requires data serialization + * - we need a stop-bit before reading PSR because we sometimes + * write a floating-point register right before reading the PSR + * and that writes to PSR.mfl + */ +#define __local_irq_save(x) \ +do { \ + ia64_stop(); \ + (x) = ia64_getreg(_IA64_REG_PSR); \ + ia64_stop(); \ + ia64_rsm(IA64_PSR_I); \ +} while (0) + +#define __local_irq_disable() \ +do { \ + ia64_stop(); \ + ia64_rsm(IA64_PSR_I); \ +} while (0) + +#define __local_irq_restore(x) ia64_intrin_local_irq_restore((x) & IA64_PSR_I) + +#ifdef CONFIG_IA64_DEBUG_IRQ + + extern unsigned long last_cli_ip; + +# define __save_ip() last_cli_ip = ia64_getreg(_IA64_REG_IP) + +# define raw_local_irq_save(x) \ +do { \ + unsigned long psr; \ + \ + __local_irq_save(psr); \ + if (psr & IA64_PSR_I) \ + __save_ip(); \ + (x) = psr; \ +} while (0) + +# define raw_local_irq_disable() do { unsigned long x; local_irq_save(x); } while (0) + +# define raw_local_irq_restore(x) \ +do { \ + unsigned long old_psr, psr = (x); \ + \ + local_save_flags(old_psr); \ + __local_irq_restore(psr); \ + if ((old_psr & IA64_PSR_I) && !(psr & IA64_PSR_I)) \ + __save_ip(); \ +} while (0) + +#else /* !CONFIG_IA64_DEBUG_IRQ */ +# define raw_local_irq_save(x) __local_irq_save(x) +# define raw_local_irq_disable() __local_irq_disable() +# define raw_local_irq_restore(x) __local_irq_restore(x) +#endif /* !CONFIG_IA64_DEBUG_IRQ */ + +#define raw_local_irq_enable() ({ ia64_stop(); ia64_ssm(IA64_PSR_I); ia64_srlz_d(); }) +#define raw_local_save_flags(flags) ({ ia64_stop(); (flags) = ia64_getreg(_IA64_REG_PSR); }) + +#define raw_irqs_disabled() \ +({ \ + unsigned long __ia64_id_flags; \ + local_save_flags(__ia64_id_flags); \ + (__ia64_id_flags & IA64_PSR_I) == 0; \ +}) + +#define raw_irqs_disabled_flags(flags) ((flags & IA64_PSR_I) == 0) + + +#define raw_safe_halt() ia64_pal_halt_light() /* PAL_HALT_LIGHT */ + +/* TBD... */ +# define TRACE_IRQS_ON +# define TRACE_IRQS_OFF +# define TRACE_IRQS_ON_STR +# define TRACE_IRQS_OFF_STR + +#endif + Index: linux-2.6.24.7/include/asm-ia64/mmu_context.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/mmu_context.h +++ linux-2.6.24.7/include/asm-ia64/mmu_context.h @@ -32,7 +32,7 @@ #include <asm-generic/mm_hooks.h> struct ia64_ctx { - spinlock_t lock; + raw_spinlock_t lock; unsigned int next; /* next context number to use */ unsigned int limit; /* available free range */ unsigned int max_ctx; /* max. 
context value supported by all CPUs */ Index: linux-2.6.24.7/include/asm-ia64/percpu.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/percpu.h +++ linux-2.6.24.7/include/asm-ia64/percpu.h @@ -24,10 +24,17 @@ #define DECLARE_PER_CPU(type, name) \ extern __SMALL_ADDR_AREA __typeof__(type) per_cpu__##name +#define DECLARE_PER_CPU_LOCKED(type, name) \ + extern spinlock_t per_cpu_lock__##name##_locked; \ + extern __SMALL_ADDR_AREA __typeof__(type) per_cpu__##name##_locked + /* Separate out the type, so (int[3], foo) works. */ #define DEFINE_PER_CPU(type, name) \ - __attribute__((__section__(".data.percpu"))) \ - __SMALL_ADDR_AREA __typeof__(type) per_cpu__##name + __attribute__((__section__(".data.percpu"))) __SMALL_ADDR_AREA __typeof__(type) per_cpu__##name + +#define DEFINE_PER_CPU_LOCKED(type, name) \ + __attribute__((__section__(".data.percpu"))) __SMALL_ADDR_AREA __DEFINE_SPINLOCK(per_cpu_lock__##name##_locked); \ + __attribute__((__section__(".data.percpu"))) __SMALL_ADDR_AREA __typeof__(type) per_cpu__##name##_locked #ifdef CONFIG_SMP #define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \ @@ -55,6 +62,16 @@ DECLARE_PER_CPU(unsigned long, local_per #define __get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset))) #define __raw_get_cpu_var(var) (*RELOC_HIDE(&per_cpu__##var, __ia64_per_cpu_var(local_per_cpu_offset))) +#define per_cpu_lock(var, cpu) \ + (*RELOC_HIDE(&per_cpu_lock__##var##_locked, __per_cpu_offset[cpu])) +#define per_cpu_var_locked(var, cpu) \ + (*RELOC_HIDE(&per_cpu__##var##_locked, __per_cpu_offset[cpu])) +#define __get_cpu_lock(var, cpu) \ + per_cpu_lock(var, cpu) +#define __get_cpu_var_locked(var, cpu) \ + per_cpu_var_locked(var, cpu) + + extern void percpu_modcopy(void *pcpudst, const void *src, unsigned long size); extern void setup_per_cpu_areas (void); extern void *per_cpu_init(void); Index: linux-2.6.24.7/include/asm-ia64/processor.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/processor.h +++ linux-2.6.24.7/include/asm-ia64/processor.h @@ -124,8 +124,10 @@ struct ia64_psr { */ struct cpuinfo_ia64 { __u32 softirq_pending; - __u64 itm_delta; /* # of clock cycles between clock ticks */ - __u64 itm_next; /* interval timer mask value to use for next clock tick */ + __u64 itm_tick_delta; /* # of clock cycles between clock ticks */ + __u64 itm_tick_next; /* interval timer mask value to use for next clock tick */ + __u64 itm_timer_next; + __u64 __itm_next; __u64 nsec_per_cyc; /* (1000000000<<IA64_NSEC_PER_CYC_SHIFT)/itc_freq */ __u64 unimpl_va_mask; /* mask of unimplemented virtual address bits (from PAL) */ __u64 unimpl_pa_mask; /* mask of unimplemented physical address bits (from PAL) */ Index: linux-2.6.24.7/include/asm-ia64/rtc.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-ia64/rtc.h @@ -0,0 +1,7 @@ +#ifndef _IA64_RTC_H +#define _IA64_RTC_H + +#error "no asm/rtc.h on IA64 !" 
+ +#endif + Index: linux-2.6.24.7/include/asm-ia64/rwsem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/rwsem.h +++ linux-2.6.24.7/include/asm-ia64/rwsem.h @@ -33,7 +33,7 @@ /* * the semaphore definition */ -struct rw_semaphore { +struct compat_rw_semaphore { signed long count; spinlock_t wait_lock; struct list_head wait_list; @@ -50,16 +50,16 @@ struct rw_semaphore { { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \ LIST_HEAD_INIT((name).wait_list) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __RWSEM_INITIALIZER(name) -extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_read_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_write_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_wake(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_downgrade_wake(struct compat_rw_semaphore *sem); static inline void -init_rwsem (struct rw_semaphore *sem) +compat_init_rwsem (struct compat_rw_semaphore *sem) { sem->count = RWSEM_UNLOCKED_VALUE; spin_lock_init(&sem->wait_lock); @@ -70,7 +70,7 @@ init_rwsem (struct rw_semaphore *sem) * lock for reading */ static inline void -__down_read (struct rw_semaphore *sem) +__down_read (struct compat_rw_semaphore *sem) { long result = ia64_fetchadd8_acq((unsigned long *)&sem->count, 1); @@ -82,7 +82,7 @@ __down_read (struct rw_semaphore *sem) * lock for writing */ static inline void -__down_write (struct rw_semaphore *sem) +__down_write (struct compat_rw_semaphore *sem) { long old, new; @@ -99,7 +99,7 @@ __down_write (struct rw_semaphore *sem) * unlock after reading */ static inline void -__up_read (struct rw_semaphore *sem) +__up_read (struct compat_rw_semaphore *sem) { long result = ia64_fetchadd8_rel((unsigned long *)&sem->count, -1); @@ -111,7 +111,7 @@ __up_read (struct rw_semaphore *sem) * unlock after writing */ static inline void -__up_write (struct rw_semaphore *sem) +__up_write (struct compat_rw_semaphore *sem) { long old, new; @@ -128,7 +128,7 @@ __up_write (struct rw_semaphore *sem) * trylock for reading -- returns 1 if successful, 0 if contention */ static inline int -__down_read_trylock (struct rw_semaphore *sem) +__down_read_trylock (struct compat_rw_semaphore *sem) { long tmp; while ((tmp = sem->count) >= 0) { @@ -143,7 +143,7 @@ __down_read_trylock (struct rw_semaphore * trylock for writing -- returns 1 if successful, 0 if contention */ static inline int -__down_write_trylock (struct rw_semaphore *sem) +__down_write_trylock (struct compat_rw_semaphore *sem) { long tmp = cmpxchg_acq(&sem->count, RWSEM_UNLOCKED_VALUE, RWSEM_ACTIVE_WRITE_BIAS); @@ -154,7 +154,7 @@ __down_write_trylock (struct rw_semaphor * downgrade write lock to read lock */ static inline void -__downgrade_write (struct rw_semaphore *sem) +__downgrade_write (struct compat_rw_semaphore *sem) { long old, new; @@ -174,7 +174,7 @@ __downgrade_write (struct rw_semaphore * #define rwsem_atomic_add(delta, sem) atomic64_add(delta, (atomic64_t *)(&(sem)->count)) #define rwsem_atomic_update(delta, sem) atomic64_add_return(delta, 
(atomic64_t *)(&(sem)->count)) -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int compat_rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->count != 0); } Index: linux-2.6.24.7/include/asm-ia64/sal.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/sal.h +++ linux-2.6.24.7/include/asm-ia64/sal.h @@ -43,7 +43,7 @@ #include <asm/system.h> #include <asm/fpu.h> -extern spinlock_t sal_lock; +extern raw_spinlock_t sal_lock; /* SAL spec _requires_ eight args for each call. */ #define __IA64_FW_CALL(entry,result,a0,a1,a2,a3,a4,a5,a6,a7) \ Index: linux-2.6.24.7/include/asm-ia64/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/semaphore.h +++ linux-2.6.24.7/include/asm-ia64/semaphore.h @@ -11,53 +11,64 @@ #include <asm/atomic.h> -struct semaphore { +/* + * On !PREEMPT_RT all semaphores are compat: + */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +struct compat_semaphore { atomic_t count; int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .sleepers = 0, \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name, count) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name, count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name, 1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name, 1) + +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +asmlinkage int compat_sem_is_locked(struct compat_semaphore *sem); static inline void -sema_init (struct semaphore *sem, int val) +compat_sema_init (struct compat_semaphore *sem, int val) { - *sem = (struct semaphore) __SEMAPHORE_INITIALIZER(*sem, val); + *sem = (struct compat_semaphore) __COMPAT_SEMAPHORE_INITIALIZER(*sem, val); } static inline void -init_MUTEX (struct semaphore *sem) +compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } static inline void -init_MUTEX_LOCKED (struct semaphore *sem) +compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } -extern void __down (struct semaphore * sem); -extern int __down_interruptible (struct semaphore * sem); -extern int __down_trylock (struct semaphore * sem); -extern void __up (struct semaphore * sem); +extern void __down (struct compat_semaphore * sem); +extern int __down_interruptible (struct compat_semaphore * sem); +extern int __down_trylock (struct compat_semaphore * sem); +extern void __up (struct compat_semaphore * sem); /* * Atomically decrement the semaphore's count. If it goes negative, * block the calling thread in the TASK_UNINTERRUPTIBLE state. */ static inline void -down (struct semaphore *sem) +compat_down (struct compat_semaphore *sem) { might_sleep(); if (ia64_fetchadd(-1, &sem->count.counter, acq) < 1) @@ -69,7 +80,7 @@ down (struct semaphore *sem) * block the calling thread in the TASK_INTERRUPTIBLE state. 
*/ static inline int -down_interruptible (struct semaphore * sem) +compat_down_interruptible (struct compat_semaphore * sem) { int ret = 0; @@ -80,7 +91,7 @@ down_interruptible (struct semaphore * s } static inline int -down_trylock (struct semaphore *sem) +compat_down_trylock (struct compat_semaphore *sem) { int ret = 0; @@ -90,10 +101,12 @@ down_trylock (struct semaphore *sem) } static inline void -up (struct semaphore * sem) +compat_up (struct compat_semaphore * sem) { if (ia64_fetchadd(1, &sem->count.counter, rel) <= -1) __up(sem); } +#include <linux/semaphore.h> + #endif /* _ASM_IA64_SEMAPHORE_H */ Index: linux-2.6.24.7/include/asm-ia64/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/spinlock.h +++ linux-2.6.24.7/include/asm-ia64/spinlock.h @@ -17,8 +17,6 @@ #include <asm/intrinsics.h> #include <asm/system.h> -#define __raw_spin_lock_init(x) ((x)->lock = 0) - #ifdef ASM_SUPPORTED /* * Try to get the lock. If we fail to get the lock, make a non-standard call to @@ -30,7 +28,7 @@ #define IA64_SPINLOCK_CLOBBERS "ar.ccv", "ar.pfs", "p14", "p15", "r27", "r28", "r29", "r30", "b6", "memory" static inline void -__raw_spin_lock_flags (raw_spinlock_t *lock, unsigned long flags) +__raw_spin_lock_flags (__raw_spinlock_t *lock, unsigned long flags) { register volatile unsigned int *ptr asm ("r31") = &lock->lock; @@ -89,7 +87,7 @@ __raw_spin_lock_flags (raw_spinlock_t *l #define __raw_spin_lock(lock) __raw_spin_lock_flags(lock, 0) /* Unlock by doing an ordered store and releasing the cacheline with nta */ -static inline void __raw_spin_unlock(raw_spinlock_t *x) { +static inline void __raw_spin_unlock(__raw_spinlock_t *x) { barrier(); asm volatile ("st4.rel.nta [%0] = r0\n\t" :: "r"(x)); } @@ -109,7 +107,7 @@ do { \ } while (ia64_spinlock_val); \ } \ } while (0) -#define __raw_spin_unlock(x) do { barrier(); ((raw_spinlock_t *) x)->lock = 0; } while (0) +#define __raw_spin_unlock(x) do { barrier(); ((__raw_spinlock_t *) x)->lock = 0; } while (0) #endif /* !ASM_SUPPORTED */ #define __raw_spin_is_locked(x) ((x)->lock != 0) @@ -122,7 +120,7 @@ do { \ #define __raw_read_lock(rw) \ do { \ - raw_rwlock_t *__read_lock_ptr = (rw); \ + __raw_rwlock_t *__read_lock_ptr = (rw); \ \ while (unlikely(ia64_fetchadd(1, (int *) __read_lock_ptr, acq) < 0)) { \ ia64_fetchadd(-1, (int *) __read_lock_ptr, rel); \ @@ -133,7 +131,7 @@ do { \ #define __raw_read_unlock(rw) \ do { \ - raw_rwlock_t *__read_lock_ptr = (rw); \ + __raw_rwlock_t *__read_lock_ptr = (rw); \ ia64_fetchadd(-1, (int *) __read_lock_ptr, rel); \ } while (0) @@ -165,7 +163,7 @@ do { \ (result == 0); \ }) -static inline void __raw_write_unlock(raw_rwlock_t *x) +static inline void __raw_write_unlock(__raw_rwlock_t *x) { u8 *y = (u8 *)x; barrier(); @@ -193,7 +191,7 @@ static inline void __raw_write_unlock(ra (ia64_val == 0); \ }) -static inline void __raw_write_unlock(raw_rwlock_t *x) +static inline void __raw_write_unlock(__raw_rwlock_t *x) { barrier(); x->write_lock = 0; @@ -201,10 +199,10 @@ static inline void __raw_write_unlock(ra #endif /* !ASM_SUPPORTED */ -static inline int __raw_read_trylock(raw_rwlock_t *x) +static inline int __raw_read_trylock(__raw_rwlock_t *x) { union { - raw_rwlock_t lock; + __raw_rwlock_t lock; __u32 word; } old, new; old.lock = new.lock = *x; @@ -213,8 +211,8 @@ static inline int __raw_read_trylock(raw return (u32)ia64_cmpxchg4_acq((__u32 *)(x), new.word, old.word) == old.word; } -#define _raw_spin_relax(lock) cpu_relax() -#define _raw_read_relax(lock) 
cpu_relax() -#define _raw_write_relax(lock) cpu_relax() +#define __raw_spin_relax(lock) cpu_relax() +#define __raw_read_relax(lock) cpu_relax() +#define __raw_write_relax(lock) cpu_relax() #endif /* _ASM_IA64_SPINLOCK_H */ Index: linux-2.6.24.7/include/asm-ia64/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/spinlock_types.h +++ linux-2.6.24.7/include/asm-ia64/spinlock_types.h @@ -7,14 +7,14 @@ typedef struct { volatile unsigned int lock; -} raw_spinlock_t; +} __raw_spinlock_t; #define __RAW_SPIN_LOCK_UNLOCKED { 0 } typedef struct { volatile unsigned int read_counter : 31; volatile unsigned int write_lock : 1; -} raw_rwlock_t; +} __raw_rwlock_t; #define __RAW_RW_LOCK_UNLOCKED { 0, 0 } Index: linux-2.6.24.7/include/asm-ia64/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/system.h +++ linux-2.6.24.7/include/asm-ia64/system.h @@ -106,81 +106,16 @@ extern struct ia64_boot_param { */ #define set_mb(var, value) do { (var) = (value); mb(); } while (0) -#define safe_halt() ia64_pal_halt_light() /* PAL_HALT_LIGHT */ /* * The group barrier in front of the rsm & ssm are necessary to ensure * that none of the previous instructions in the same group are * affected by the rsm/ssm. */ -/* For spinlocks etc */ -/* - * - clearing psr.i is implicitly serialized (visible by next insn) - * - setting psr.i requires data serialization - * - we need a stop-bit before reading PSR because we sometimes - * write a floating-point register right before reading the PSR - * and that writes to PSR.mfl - */ -#define __local_irq_save(x) \ -do { \ - ia64_stop(); \ - (x) = ia64_getreg(_IA64_REG_PSR); \ - ia64_stop(); \ - ia64_rsm(IA64_PSR_I); \ -} while (0) - -#define __local_irq_disable() \ -do { \ - ia64_stop(); \ - ia64_rsm(IA64_PSR_I); \ -} while (0) - -#define __local_irq_restore(x) ia64_intrin_local_irq_restore((x) & IA64_PSR_I) - -#ifdef CONFIG_IA64_DEBUG_IRQ - extern unsigned long last_cli_ip; - -# define __save_ip() last_cli_ip = ia64_getreg(_IA64_REG_IP) - -# define local_irq_save(x) \ -do { \ - unsigned long psr; \ - \ - __local_irq_save(psr); \ - if (psr & IA64_PSR_I) \ - __save_ip(); \ - (x) = psr; \ -} while (0) - -# define local_irq_disable() do { unsigned long x; local_irq_save(x); } while (0) - -# define local_irq_restore(x) \ -do { \ - unsigned long old_psr, psr = (x); \ - \ - local_save_flags(old_psr); \ - __local_irq_restore(psr); \ - if ((old_psr & IA64_PSR_I) && !(psr & IA64_PSR_I)) \ - __save_ip(); \ -} while (0) +#include <linux/trace_irqflags.h> -#else /* !CONFIG_IA64_DEBUG_IRQ */ -# define local_irq_save(x) __local_irq_save(x) -# define local_irq_disable() __local_irq_disable() -# define local_irq_restore(x) __local_irq_restore(x) -#endif /* !CONFIG_IA64_DEBUG_IRQ */ - -#define local_irq_enable() ({ ia64_stop(); ia64_ssm(IA64_PSR_I); ia64_srlz_d(); }) -#define local_save_flags(flags) ({ ia64_stop(); (flags) = ia64_getreg(_IA64_REG_PSR); }) - -#define irqs_disabled() \ -({ \ - unsigned long __ia64_id_flags; \ - local_save_flags(__ia64_id_flags); \ - (__ia64_id_flags & IA64_PSR_I) == 0; \ -}) #ifdef __KERNEL__ Index: linux-2.6.24.7/include/asm-ia64/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/thread_info.h +++ linux-2.6.24.7/include/asm-ia64/thread_info.h @@ -91,6 +91,7 @@ struct thread_info { #define TIF_MCA_INIT 18 /* this task is processing MCA or INIT */ #define 
TIF_DB_DISABLED 19 /* debug trap disabled for fsyscall */ #define TIF_FREEZE 20 /* is freezing for suspend */ +#define TIF_NEED_RESCHED_DELAYED 20 /* reschedule on return to userspace */ #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) Index: linux-2.6.24.7/include/asm-ia64/tlb.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ia64/tlb.h +++ linux-2.6.24.7/include/asm-ia64/tlb.h @@ -40,6 +40,7 @@ #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/swap.h> +#include <linux/percpu.h> #include <asm/pgalloc.h> #include <asm/processor.h> @@ -61,11 +62,12 @@ struct mmu_gather { unsigned char need_flush; /* really unmapped some PTEs? */ unsigned long start_addr; unsigned long end_addr; + int cpu; struct page *pages[FREE_PTE_NR]; }; /* Users of the generic TLB shootdown code must declare this storage space. */ -DECLARE_PER_CPU(struct mmu_gather, mmu_gathers); +DECLARE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); /* * Flush the TLB for address range START to END and, if not in fast mode, release the @@ -127,8 +129,10 @@ ia64_tlb_flush_mmu (struct mmu_gather *t static inline struct mmu_gather * tlb_gather_mmu (struct mm_struct *mm, unsigned int full_mm_flush) { - struct mmu_gather *tlb = &get_cpu_var(mmu_gathers); + int cpu; + struct mmu_gather *tlb = &get_cpu_var_locked(mmu_gathers, &cpu); + tlb->cpu = cpu; tlb->mm = mm; /* * Use fast mode if only 1 CPU is online. @@ -165,7 +169,7 @@ tlb_finish_mmu (struct mmu_gather *tlb, /* keep the page table cache within bounds */ check_pgt_cache(); - put_cpu_var(mmu_gathers); + put_cpu_var_locked(mmu_gathers, tlb->cpu); } /* �������������������������������������patches/preempt-realtime-ppc-need-resched-delayed.patch���������������������������������������������0000664�0000764�0000764�00000002154�11041657734�022255� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Mon May 14 15:29:17 2007 Date: Mon, 14 May 2007 15:29:17 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [patch 3/4] powerpc 2.6.21-rt1: add a need_resched_delayed() check Add a need_resched_delayed() check. 
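In short, the idle loop's power-save gate gains one extra condition; a condensed restatement of the hunk below (illustrative only, not an additional change):

	/* Enter power save only when no immediate or delayed reschedule is
	 * pending and the CPU is not about to be offlined. */
	if (!need_resched() && !need_resched_delayed() && !cpu_should_die())
		ppc_md.power_save();
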
This was pointed by Sergei Shtylyov; http://ozlabs.org/pipermail/linuxppc-dev/2007-March/033148.html Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/kernel/idle.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/kernel/idle.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/idle.c +++ linux-2.6.24.7/arch/powerpc/kernel/idle.c @@ -74,7 +74,9 @@ void cpu_idle(void) local_irq_disable(); /* check again after disabling irqs */ - if (!need_resched() && !cpu_should_die()) + if (!need_resched() && + !need_resched_delayed() && + !cpu_should_die()) ppc_md.power_save(); local_irq_enable(); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-ppc-more-resched-fixups.patch����������������������������������������������0000664�0000764�0000764�00000005403�11043075216�022202� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/entry_64.S | 16 +++++++++++----- arch/powerpc/kernel/idle.c | 3 ++- include/asm-powerpc/thread_info.h | 3 ++- 3 files changed, 15 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -471,7 +471,8 @@ _GLOBAL(ret_from_except_lite) #ifdef CONFIG_PREEMPT clrrdi r9,r1,THREAD_SHIFT /* current_thread_info() */ - li r0,_TIF_NEED_RESCHED /* bits to check */ + li r0,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) + /* bits to check */ ld r3,_MSR(r1) ld r4,TI_FLAGS(r9) /* Move MSR_PR bit in r3 to _TIF_SIGPENDING position in r0 */ @@ -579,16 +580,21 @@ do_work: cmpdi r0,0 crandc eq,cr1*4+eq,eq bne restore + /* here we are preempting the current task */ 1: - /* preempt_schedule_irq() expects interrupts disabled. */ - bl .preempt_schedule_irq + li r0,1 + stb r0,PACASOFTIRQEN(r13) + stb r0,PACAHARDIRQEN(r13) + ori r10,r10,MSR_EE + mtmsrd r10,1 /* reenable interrupts */ + bl .preempt_schedule mfmsr r10 clrrdi r9,r1,THREAD_SHIFT rldicl r10,r10,48,1 /* disable interrupts again */ rotldi r10,r10,16 mtmsrd r10,1 ld r4,TI_FLAGS(r9) - andi. r0,r4,_TIF_NEED_RESCHED + andi. r0,r4,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) bne 1b b restore @@ -603,7 +609,7 @@ user_work: ori r10,r10,MSR_EE mtmsrd r10,1 - andi. r0,r4,_TIF_NEED_RESCHED + andi. 
r0,r4,(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED) beq 1f bl .schedule b .ret_from_except_lite Index: linux-2.6.24.7/arch/powerpc/kernel/idle.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/idle.c +++ linux-2.6.24.7/arch/powerpc/kernel/idle.c @@ -61,7 +61,8 @@ void cpu_idle(void) set_thread_flag(TIF_POLLING_NRFLAG); while (1) { tick_nohz_stop_sched_tick(); - while (!need_resched() && !cpu_should_die()) { + while (!need_resched() && !need_resched_delayed() && + !cpu_should_die()) { ppc64_runlatch_off(); if (ppc_md.power_save) { Index: linux-2.6.24.7/include/asm-powerpc/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/thread_info.h +++ linux-2.6.24.7/include/asm-powerpc/thread_info.h @@ -150,7 +150,8 @@ static inline struct thread_info *curren #define _TIF_SYSCALL_T_OR_A (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP) #define _TIF_USER_WORK_MASK ( _TIF_SIGPENDING | \ - _TIF_NEED_RESCHED | _TIF_RESTORE_SIGMASK) + _TIF_NEED_RESCHED | _TIF_RESTORE_SIGMASK | \ + _TIF_NEED_RESCHED_DELAYED) #define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR) /* Bits in local_flags */ �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc.patch��������������������������������������������������������������0000664�0000764�0000764�00000040237�11043075215�017213� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/smp.c | 12 ++++++++- arch/powerpc/kernel/traps.c | 9 +++++- arch/powerpc/platforms/cell/smp.c | 2 - arch/powerpc/platforms/chrp/smp.c | 2 - arch/powerpc/platforms/chrp/time.c | 2 - arch/powerpc/platforms/powermac/feature.c | 2 - arch/powerpc/platforms/powermac/nvram.c | 2 - arch/powerpc/platforms/powermac/pic.c | 2 - arch/powerpc/platforms/pseries/smp.c | 2 - arch/ppc/8260_io/enet.c | 2 - arch/ppc/8260_io/fcc_enet.c | 2 - arch/ppc/8xx_io/commproc.c | 2 - arch/ppc/8xx_io/enet.c | 2 - arch/ppc/8xx_io/fec.c | 2 - arch/ppc/kernel/smp.c | 12 ++++++++- arch/ppc/kernel/traps.c | 6 +++- arch/ppc/platforms/hdpu.c | 2 - arch/ppc/platforms/sbc82xx.c | 2 - arch/ppc/syslib/cpm2_common.c | 2 - arch/ppc/syslib/open_pic.c | 2 - arch/ppc/syslib/open_pic2.c | 2 - include/asm-powerpc/hw_irq.h | 40 ++++++++++++++++++------------ 22 files changed, 76 insertions(+), 37 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/smp.c +++ linux-2.6.24.7/arch/powerpc/kernel/smp.c @@ -126,6 +126,16 @@ void smp_send_reschedule(int cpu) smp_ops->message_pass(cpu, PPC_MSG_RESCHEDULE); } +/* + * this function sends a 'reschedule' IPI to all other CPUs. 
+ * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + smp_ops->message_pass(MSG_ALL_BUT_SELF, PPC_MSG_RESCHEDULE); +} + #ifdef CONFIG_DEBUGGER void smp_send_debugger_break(int cpu) { @@ -157,7 +167,7 @@ static void stop_this_cpu(void *dummy) * static memory requirements. It also looks cleaner. * Stolen from the i386 version. */ -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_lock); +static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(call_lock); static struct call_data_struct { void (*func) (void *info); Index: linux-2.6.24.7/arch/powerpc/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/traps.c +++ linux-2.6.24.7/arch/powerpc/kernel/traps.c @@ -98,11 +98,11 @@ static inline void pmac_backlight_unblan int die(const char *str, struct pt_regs *regs, long err) { static struct { - spinlock_t lock; + raw_spinlock_t lock; u32 lock_owner; int lock_owner_depth; } die = { - .lock = __SPIN_LOCK_UNLOCKED(die.lock), + .lock = _RAW_SPIN_LOCK_UNLOCKED(die.lock), .lock_owner = -1, .lock_owner_depth = 0 }; @@ -191,6 +191,11 @@ void _exception(int signr, struct pt_reg addr, regs->nip, regs->link, code); } +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif + memset(&info, 0, sizeof(info)); info.si_signo = signr; info.si_code = code; Index: linux-2.6.24.7/arch/powerpc/platforms/cell/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/cell/smp.c +++ linux-2.6.24.7/arch/powerpc/platforms/cell/smp.c @@ -134,7 +134,7 @@ static void __devinit smp_iic_setup_cpu( iic_setup_cpu(); } -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned long timebase = 0; static void __devinit cell_give_timebase(void) Index: linux-2.6.24.7/arch/powerpc/platforms/chrp/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/chrp/smp.c +++ linux-2.6.24.7/arch/powerpc/platforms/chrp/smp.c @@ -42,7 +42,7 @@ static void __devinit smp_chrp_setup_cpu mpic_setup_this_cpu(); } -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned int timebase_upper = 0, timebase_lower = 0; void __devinit smp_chrp_give_timebase(void) Index: linux-2.6.24.7/arch/powerpc/platforms/chrp/time.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/chrp/time.c +++ linux-2.6.24.7/arch/powerpc/platforms/chrp/time.c @@ -27,7 +27,7 @@ #include <asm/sections.h> #include <asm/time.h> -extern spinlock_t rtc_lock; +extern raw_spinlock_t rtc_lock; static int nvram_as1 = NVRAM_AS1; static int nvram_as0 = NVRAM_AS0; Index: linux-2.6.24.7/arch/powerpc/platforms/powermac/feature.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/powermac/feature.c +++ linux-2.6.24.7/arch/powerpc/platforms/powermac/feature.c @@ -59,7 +59,7 @@ extern struct device_node *k2_skiplist[2 * We use a single global lock to protect accesses. 
Each driver has * to take care of its own locking */ -DEFINE_SPINLOCK(feature_lock); +DEFINE_RAW_SPINLOCK(feature_lock); #define LOCK(flags) spin_lock_irqsave(&feature_lock, flags); #define UNLOCK(flags) spin_unlock_irqrestore(&feature_lock, flags); Index: linux-2.6.24.7/arch/powerpc/platforms/powermac/nvram.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/powermac/nvram.c +++ linux-2.6.24.7/arch/powerpc/platforms/powermac/nvram.c @@ -80,7 +80,7 @@ static int is_core_99; static int core99_bank = 0; static int nvram_partitions[3]; // XXX Turn that into a sem -static DEFINE_SPINLOCK(nv_lock); +static DEFINE_RAW_SPINLOCK(nv_lock); static int (*core99_write_bank)(int bank, u8* datas); static int (*core99_erase_bank)(int bank); Index: linux-2.6.24.7/arch/powerpc/platforms/powermac/pic.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/powermac/pic.c +++ linux-2.6.24.7/arch/powerpc/platforms/powermac/pic.c @@ -63,7 +63,7 @@ static int max_irqs; static int max_real_irqs; static u32 level_mask[4]; -static DEFINE_SPINLOCK(pmac_pic_lock); +static DEFINE_RAW_SPINLOCK(pmac_pic_lock); #define NR_MASK_WORDS ((NR_IRQS + 31) / 32) static unsigned long ppc_lost_interrupts[NR_MASK_WORDS]; Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/smp.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/smp.c @@ -154,7 +154,7 @@ static void __devinit smp_xics_setup_cpu } #endif /* CONFIG_XICS */ -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned long timebase = 0; static void __devinit pSeries_give_timebase(void) Index: linux-2.6.24.7/arch/ppc/8260_io/enet.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/8260_io/enet.c +++ linux-2.6.24.7/arch/ppc/8260_io/enet.c @@ -115,7 +115,7 @@ struct scc_enet_private { scc_t *sccp; struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; }; static int scc_enet_open(struct net_device *dev); Index: linux-2.6.24.7/arch/ppc/8260_io/fcc_enet.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/8260_io/fcc_enet.c +++ linux-2.6.24.7/arch/ppc/8260_io/fcc_enet.c @@ -375,7 +375,7 @@ struct fcc_enet_private { volatile fcc_enet_t *ep; struct net_device_stats stats; uint tx_free; - spinlock_t lock; + raw_spinlock_t lock; #ifdef CONFIG_USE_MDIO uint phy_id; Index: linux-2.6.24.7/arch/ppc/8xx_io/commproc.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/8xx_io/commproc.c +++ linux-2.6.24.7/arch/ppc/8xx_io/commproc.c @@ -370,7 +370,7 @@ cpm_setbrg(uint brg, uint rate) /* * dpalloc / dpfree bits. */ -static spinlock_t cpm_dpmem_lock; +static raw_spinlock_t cpm_dpmem_lock; /* * 16 blocks should be enough to satisfy all requests * until the memory subsystem goes up... 
Index: linux-2.6.24.7/arch/ppc/8xx_io/enet.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/8xx_io/enet.c +++ linux-2.6.24.7/arch/ppc/8xx_io/enet.c @@ -143,7 +143,7 @@ struct scc_enet_private { unsigned char *rx_vaddr[RX_RING_SIZE]; struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; }; static int scc_enet_open(struct net_device *dev); Index: linux-2.6.24.7/arch/ppc/8xx_io/fec.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/8xx_io/fec.c +++ linux-2.6.24.7/arch/ppc/8xx_io/fec.c @@ -164,7 +164,7 @@ struct fec_enet_private { struct net_device_stats stats; uint tx_full; - spinlock_t lock; + raw_spinlock_t lock; #ifdef CONFIG_USE_MDIO uint phy_id; Index: linux-2.6.24.7/arch/ppc/kernel/smp.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/kernel/smp.c +++ linux-2.6.24.7/arch/ppc/kernel/smp.c @@ -136,6 +136,16 @@ void smp_send_reschedule(int cpu) smp_message_pass(cpu, PPC_MSG_RESCHEDULE); } +/* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + smp_message_pass(MSG_ALL_BUT_SELF, PPC_MSG_RESCHEDULE, 0, 0); +} + #ifdef CONFIG_XMON void smp_send_xmon_break(int cpu) { @@ -160,7 +170,7 @@ void smp_send_stop(void) * static memory requirements. It also looks cleaner. * Stolen from the i386 version. */ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); static struct call_data_struct { void (*func) (void *info); Index: linux-2.6.24.7/arch/ppc/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/kernel/traps.c +++ linux-2.6.24.7/arch/ppc/kernel/traps.c @@ -72,7 +72,7 @@ void (*debugger_fault_handler)(struct pt * Trap & Exception support */ -DEFINE_SPINLOCK(die_lock); +DEFINE_RAW_SPINLOCK(die_lock); int die(const char * str, struct pt_regs * fp, long err) { @@ -108,6 +108,10 @@ void _exception(int signr, struct pt_reg debugger(regs); die("Exception in kernel mode", regs, signr); } +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif info.si_signo = signr; info.si_errno = 0; info.si_code = code; Index: linux-2.6.24.7/arch/ppc/platforms/hdpu.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/platforms/hdpu.c +++ linux-2.6.24.7/arch/ppc/platforms/hdpu.c @@ -55,7 +55,7 @@ static void parse_bootinfo(unsigned long static void hdpu_set_l1pe(void); static void hdpu_cpustate_set(unsigned char new_state); #ifdef CONFIG_SMP -static DEFINE_SPINLOCK(timebase_lock); +static DEFINE_RAW_SPINLOCK(timebase_lock); static unsigned int timebase_upper = 0, timebase_lower = 0; extern int smp_tb_synchronized; Index: linux-2.6.24.7/arch/ppc/platforms/sbc82xx.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/platforms/sbc82xx.c +++ linux-2.6.24.7/arch/ppc/platforms/sbc82xx.c @@ -65,7 +65,7 @@ static void sbc82xx_time_init(void) static volatile char *sbc82xx_i8259_map; static char sbc82xx_i8259_mask = 0xff; -static DEFINE_SPINLOCK(sbc82xx_i8259_lock); +static DEFINE_RAW_SPINLOCK(sbc82xx_i8259_lock); static void sbc82xx_i8259_mask_and_ack_irq(unsigned int irq_nr) { Index: linux-2.6.24.7/arch/ppc/syslib/cpm2_common.c 
=================================================================== --- linux-2.6.24.7.orig/arch/ppc/syslib/cpm2_common.c +++ linux-2.6.24.7/arch/ppc/syslib/cpm2_common.c @@ -114,7 +114,7 @@ cpm2_fastbrg(uint brg, uint rate, int di /* * dpalloc / dpfree bits. */ -static spinlock_t cpm_dpmem_lock; +static raw_spinlock_t cpm_dpmem_lock; /* 16 blocks should be enough to satisfy all requests * until the memory subsystem goes up... */ static rh_block_t cpm_boot_dpmem_rh_block[16]; Index: linux-2.6.24.7/arch/ppc/syslib/open_pic.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/syslib/open_pic.c +++ linux-2.6.24.7/arch/ppc/syslib/open_pic.c @@ -526,7 +526,7 @@ void openpic_reset_processor_phys(u_int } #if defined(CONFIG_SMP) || defined(CONFIG_PM) -static DEFINE_SPINLOCK(openpic_setup_lock); +static DEFINE_RAW_SPINLOCK(openpic_setup_lock); #endif #ifdef CONFIG_SMP Index: linux-2.6.24.7/arch/ppc/syslib/open_pic2.c =================================================================== --- linux-2.6.24.7.orig/arch/ppc/syslib/open_pic2.c +++ linux-2.6.24.7/arch/ppc/syslib/open_pic2.c @@ -380,7 +380,7 @@ static void openpic2_set_spurious(u_int vec); } -static DEFINE_SPINLOCK(openpic2_setup_lock); +static DEFINE_RAW_SPINLOCK(openpic2_setup_lock); /* * Initialize a timer interrupt (and disable it) Index: linux-2.6.24.7/include/asm-powerpc/hw_irq.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/hw_irq.h +++ linux-2.6.24.7/include/asm-powerpc/hw_irq.h @@ -20,8 +20,8 @@ static inline unsigned long local_get_fl { unsigned long flags; - __asm__ __volatile__("lbz %0,%1(13)" - : "=r" (flags) +<<<<<<< delete extern unsigned long local_get_flags(void); +<<<<<<< delete extern unsigned long local_irq_disable(void); : "i" (offsetof(struct paca_struct, soft_enabled))); return flags; @@ -39,14 +39,19 @@ static inline unsigned long local_irq_di return flags; } -extern void local_irq_restore(unsigned long); + extern void iseries_handle_interrupts(void); +extern unsigned long raw_local_get_flags(void); +extern unsigned long raw_local_irq_disable(void); +extern void raw_local_irq_restore(unsigned long); + +#define raw_local_irq_enable() raw_local_irq_restore(1) +#define raw_local_save_flags(flags) ((flags) = raw_local_get_flags()) +#define raw_local_irq_save(flags) ((flags) = raw_local_irq_disable()) -#define local_irq_enable() local_irq_restore(1) -#define local_save_flags(flags) ((flags) = local_get_flags()) -#define local_irq_save(flags) ((flags) = local_irq_disable()) +#define raw_irqs_disabled() (raw_local_get_flags() == 0) +#define raw_irqs_disabled_flags(flags) ((flags) == 0) -#define irqs_disabled() (local_get_flags() == 0) #define __hard_irq_enable() __mtmsrd(mfmsr() | MSR_EE, 1) #define __hard_irq_disable() __mtmsrd(mfmsr() & ~MSR_EE, 1) @@ -62,13 +67,15 @@ extern void iseries_handle_interrupts(vo #if defined(CONFIG_BOOKE) #define SET_MSR_EE(x) mtmsr(x) -#define local_irq_restore(flags) __asm__ __volatile__("wrtee %0" : : "r" (flags) : "memory") +#define raw_local_irq_restore(flags) __asm__ __volatile__("wrtee %0" : : "r" (flags) : "memory") +<<<<<<< delete #define local_irq_restore(flags) do { \ +#define raw_local_irq_restore(flags) do { \ #else #define SET_MSR_EE(x) mtmsr(x) -#define local_irq_restore(flags) mtmsr(flags) +#define raw_local_irq_restore(flags) mtmsr(flags) #endif -static inline void local_irq_disable(void) +static inline void raw_local_irq_disable(void) { #ifdef CONFIG_BOOKE __asm__ 
__volatile__("wrteei 0": : :"memory"); @@ -80,7 +87,7 @@ static inline void local_irq_disable(voi #endif } -static inline void local_irq_enable(void) +static inline void raw_local_irq_enable(void) { #ifdef CONFIG_BOOKE __asm__ __volatile__("wrteei 1": : :"memory"); @@ -92,7 +99,7 @@ static inline void local_irq_enable(void #endif } -static inline void local_irq_save_ptr(unsigned long *flags) +static inline void raw_local_irq_save_ptr(unsigned long *flags) { unsigned long msr; msr = mfmsr(); @@ -105,13 +112,16 @@ static inline void local_irq_save_ptr(un __asm__ __volatile__("": : :"memory"); } -#define local_save_flags(flags) ((flags) = mfmsr()) -#define local_irq_save(flags) local_irq_save_ptr(&flags) -#define irqs_disabled() ((mfmsr() & MSR_EE) == 0) +#define raw_local_save_flags(flags) ((flags) = mfmsr()) +#define raw_local_irq_save(flags) raw_local_irq_save_ptr(&flags) +#define raw_irqs_disabled() ((mfmsr() & MSR_EE) == 0) +#define raw_irqs_disabled_flags(flags) ((flags & MSR_EE) == 0) #define hard_irq_enable() local_irq_enable() #define hard_irq_disable() local_irq_disable() +#include <linux/trace_irqflags.h> + #endif /* CONFIG_PPC64 */ /* �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-update.patch�������������������������������������������������������0000664�0000764�0000764�00000004177�11041657734�020510� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/Kconfig.debug | 4 ++++ arch/powerpc/kernel/idle.c | 2 +- include/asm-powerpc/hw_irq.h | 2 +- include/asm-powerpc/pmac_feature.h | 2 +- 4 files changed, 7 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/arch/powerpc/Kconfig.debug =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/Kconfig.debug +++ linux-2.6.24.7/arch/powerpc/Kconfig.debug @@ -2,6 +2,10 @@ menu "Kernel hacking" source "lib/Kconfig.debug" +config TRACE_IRQFLAGS_SUPPORT + bool + default y + config DEBUG_STACKOVERFLOW bool "Check for stack overflows" depends on DEBUG_KERNEL Index: linux-2.6.24.7/arch/powerpc/kernel/idle.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/idle.c +++ linux-2.6.24.7/arch/powerpc/kernel/idle.c @@ -98,7 +98,7 @@ void cpu_idle(void) tick_nohz_restart_sched_tick(); if (cpu_should_die()) cpu_die(); - preempt_enable_no_resched(); + __preempt_enable_no_resched(); schedule(); preempt_disable(); } Index: linux-2.6.24.7/include/asm-powerpc/hw_irq.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/hw_irq.h +++ linux-2.6.24.7/include/asm-powerpc/hw_irq.h @@ -120,7 +120,7 @@ static inline void raw_local_irq_save_pt #define hard_irq_enable() local_irq_enable() #define hard_irq_disable() local_irq_disable() -#include <linux/trace_irqflags.h> +#include <linux/irqflags.h> #endif /* CONFIG_PPC64 */ Index: 
linux-2.6.24.7/include/asm-powerpc/pmac_feature.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/pmac_feature.h +++ linux-2.6.24.7/include/asm-powerpc/pmac_feature.h @@ -378,7 +378,7 @@ extern struct macio_chip* macio_find(str * Those are exported by pmac feature for internal use by arch code * only like the platform function callbacks, do not use directly in drivers */ -extern spinlock_t feature_lock; +extern raw_spinlock_t feature_lock; extern struct device_node *uninorth_node; extern u32 __iomem *uninorth_base; �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-a7.patch�����������������������������������������������������������0000664�0000764�0000764�00000010336�11043037123�017511� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following compile error by changing local_irq_restore() to raw_local_irq_restore(). - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - include/asm-powerpc/hw_irq.h In file included from include/asm/system.h:9, from include/linux/list.h:9, from include/linux/signal.h:8, from arch/powerpc/kernel/asm-offsets.c:16: include/asm/hw_irq.h: In function 'local_get_flags': include/asm/hw_irq.h:23: error: expected expression before '<<' token include/asm/hw_irq.h:24: error: expected expression before '<<' token include/asm/hw_irq.h:25: error: expected expression before ':' token include/asm/hw_irq.h:25: error: expected statement before ')' token - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/kernel/head_64.S | 2 +- arch/powerpc/kernel/irq.c | 2 +- arch/powerpc/kernel/ppc_ksyms.c | 2 +- include/asm-powerpc/hw_irq.h | 18 ++++++++---------- 4 files changed, 11 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/head_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/head_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/head_64.S @@ -878,7 +878,7 @@ END_FW_FTR_SECTION_IFCLR(FW_FEATURE_ISER * handles any interrupts pending at this point. */ ld r3,SOFTE(r1) - bl .local_irq_restore + bl .raw_local_irq_restore b 11f /* Here we have a page fault that hash_page can't handle. 
*/ Index: linux-2.6.24.7/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/irq.c +++ linux-2.6.24.7/arch/powerpc/kernel/irq.c @@ -112,7 +112,7 @@ static inline notrace void set_soft_enab : : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled))); } -notrace void local_irq_restore(unsigned long en) +notrace void raw_local_irq_restore(unsigned long en) { /* * get_paca()->soft_enabled = en; Index: linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/ppc_ksyms.c +++ linux-2.6.24.7/arch/powerpc/kernel/ppc_ksyms.c @@ -46,7 +46,7 @@ #include <asm/ftrace.h> #ifdef CONFIG_PPC64 -EXPORT_SYMBOL(local_irq_restore); +EXPORT_SYMBOL(raw_local_irq_restore); #endif #ifdef CONFIG_PPC32 Index: linux-2.6.24.7/include/asm-powerpc/hw_irq.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/hw_irq.h +++ linux-2.6.24.7/include/asm-powerpc/hw_irq.h @@ -16,18 +16,18 @@ extern void timer_interrupt(struct pt_re #ifdef CONFIG_PPC64 #include <asm/paca.h> -static inline unsigned long local_get_flags(void) +static inline unsigned long raw_local_get_flags(void) { unsigned long flags; -<<<<<<< delete extern unsigned long local_get_flags(void); -<<<<<<< delete extern unsigned long local_irq_disable(void); + __asm__ __volatile__("lbz %0,%1(13)" + : "=r" (flags) : "i" (offsetof(struct paca_struct, soft_enabled))); return flags; } -static inline unsigned long local_irq_disable(void) +static inline unsigned long raw_local_irq_disable(void) { unsigned long flags, zero; @@ -53,8 +53,8 @@ extern void raw_local_irq_restore(unsign #define raw_irqs_disabled_flags(flags) ((flags) == 0) -#define __hard_irq_enable() __mtmsrd(mfmsr() | MSR_EE, 1) -#define __hard_irq_disable() __mtmsrd(mfmsr() & ~MSR_EE, 1) +#define __hard_irq_enable() __mtmsrd(mfmsr() | MSR_EE, 1) +#define __hard_irq_disable() __mtmsrd(mfmsr() & ~MSR_EE, 1) #define hard_irq_disable() \ do { \ @@ -63,17 +63,15 @@ extern void raw_local_irq_restore(unsign get_paca()->hard_enabled = 0; \ } while(0) -#else +#else /* CONFIG_PPC64 */ #if defined(CONFIG_BOOKE) #define SET_MSR_EE(x) mtmsr(x) #define raw_local_irq_restore(flags) __asm__ __volatile__("wrtee %0" : : "r" (flags) : "memory") -<<<<<<< delete #define local_irq_restore(flags) do { \ -#define raw_local_irq_restore(flags) do { \ #else #define SET_MSR_EE(x) mtmsr(x) #define raw_local_irq_restore(flags) mtmsr(flags) -#endif +#endif /* CONFIG_BOOKE */ static inline void raw_local_irq_disable(void) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-b2.patch�����������������������������������������������������������0000664�0000764�0000764�00000005027�11041657734�017524� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To convert the spinlocks into the raw onces to fix the following warnings/errors. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Badness at arch/powerpc/kernel/entry_64.S:651 Call Trace: [C0000000006133E0] [C00000000000FAAC] show_stack+0x68/0x1b0 (unreliable) [C000000000613480] [C0000000001EF004] .repor000001EF004] .report_bug+0x94/0xe8 [C000000000613510] [C0000000003EAD58] .program_check_exception+0x170/0x5a8 [C00000000000487C] program_check_common+0xfc/0x100 --- arch/powerpc/kernel/irq.c | 2 +- arch/powerpc/kernel/rtas.c | 2 +- arch/powerpc/mm/hash_native_64.c | 2 +- include/asm-powerpc/rtas.h | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/irq.c +++ linux-2.6.24.7/arch/powerpc/kernel/irq.c @@ -403,7 +403,7 @@ void do_softirq(void) #ifdef CONFIG_PPC_MERGE static LIST_HEAD(irq_hosts); -static DEFINE_SPINLOCK(irq_big_lock); +static DEFINE_RAW_SPINLOCK(irq_big_lock); static DEFINE_PER_CPU(unsigned int, irq_radix_reader); static unsigned int irq_radix_writer; struct irq_map_entry irq_map[NR_IRQS]; Index: linux-2.6.24.7/arch/powerpc/kernel/rtas.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/rtas.c +++ linux-2.6.24.7/arch/powerpc/kernel/rtas.c @@ -41,7 +41,7 @@ #include <asm/atomic.h> struct rtas_t rtas = { - .lock = SPIN_LOCK_UNLOCKED + .lock = RAW_SPIN_LOCK_UNLOCKED(lock) }; EXPORT_SYMBOL(rtas); Index: linux-2.6.24.7/arch/powerpc/mm/hash_native_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/hash_native_64.c +++ linux-2.6.24.7/arch/powerpc/mm/hash_native_64.c @@ -36,7 +36,7 @@ #define HPTE_LOCK_BIT 3 -static DEFINE_SPINLOCK(native_tlbie_lock); +static DEFINE_RAW_SPINLOCK(native_tlbie_lock); static inline void __tlbie(unsigned long va, int psize, int ssize) { Index: linux-2.6.24.7/include/asm-powerpc/rtas.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/rtas.h +++ linux-2.6.24.7/include/asm-powerpc/rtas.h @@ -58,7 +58,7 @@ struct rtas_t { unsigned long entry; /* physical address pointer */ unsigned long base; /* physical address pointer */ unsigned long size; - spinlock_t lock; + raw_spinlock_t lock; struct rtas_args args; struct device_node *dev; /* virtual address pointer */ }; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-b3.patch�����������������������������������������������������������0000664�0000764�0000764�00000003756�11041657732�017532� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following runtime warning. 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - BUG: using smp_processor_id() in preemptible [00000000] code: init/371 caller is .pgtable_free_tlb+0x2c/0x14c Call Trace: [C00000000FF6B770] [C00000000000FAAC] .show_stack+0x68/0x1b0 (unreliable) [C00000000FF6B810] [C0000000001F7190] .debug_smp_processor_id+0xc8/0xf8 [C00000000FF6B8A0] [C00000000002C52C] .pgtable_free_tlb+0x2c/0x14c [C00000000FF6B940] [C0000000000B6528] .free_pgd_range+0x234/0x3bc [C00000000FF6BA40] [C0000000000B6AB8] .free_pgtables+0x224/0x260 [C00000000FF6BB00] [C0000000000B7FE8] .exit_mmap+0x100/0x208 [C00000000FF6BBC0] [C000000000055FB0] .mmput+0x70/0x12c [C00000000FF6BC50] [C00000000005B728] .exit_mm+0x150/0x170 [C00000000FF6BCE0] [C00000000005D80C] .do_exit+0x28c/0x9bc [C00000000FF6BDA0] [C00000000005DFF0] .sys_exit_group+0x0/0x8 [C00000000FF6BE30] [C000000000008634] syscall_exit+0x0/0x40 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Would it be better to just use raw_smp_processor_id() rather than tlb->cpu? Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/mm/tlb_64.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/tlb_64.c +++ linux-2.6.24.7/arch/powerpc/mm/tlb_64.c @@ -91,8 +91,11 @@ static void pte_free_submit(struct pte_f void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf) { - /* This is safe since tlb_gather_mmu has disabled preemption */ - cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); + /* + * This is safe since tlb_gather_mmu has disabled preemption. + * tlb->cpu is set by tlb_gather_mmu as well. + */ + cpumask_t local_cpumask = cpumask_of_cpu(tlb->cpu); struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); if (atomic_read(&tlb->mm->mm_users) < 2 || ������������������patches/preempt-realtime-powerpc-b4.patch�����������������������������������������������������������0000664�0000764�0000764�00000007256�11041657734�017534� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� At Wed, 07 Mar 2007 11:10:59 +0100, Benjamin Herrenschmidt wrote: > > On Wed, 2007-03-07 at 10:16 +0100, Ingo Molnar wrote: > > * Tsutomu OWA <tsutomu.owa@toshiba.co.jp> wrote: > > > > > @@ -342,6 +342,7 @@ static int xmon_core(struct pt_regs *reg > > > > > > msr = mfmsr(); > > > mtmsr(msr & ~MSR_EE); /* disable interrupts */ > > > + preempt_disable(); > > > > i'm not an xmon expert, but maybe it might make more sense to first > > disable preemption, then interrupts - otherwise you could be preempted > > right after having disabled these interrupts (and be scheduled to > > another CPU, etc.). What is the difference between local_irq_save() and > > the above 'disable interrupts' sequence? If it's not the same and > > xmon_core() relied on having hardirqs disabled then it might make sense > > to do a local_irq_save() there, instead of a preempt_disable(). > > powerpc 64 bits nowadays does lazy HW masking, so local_irq_disable() > will not actually switch MSR_EE off. 
However, xmon needs that to happen > (though we have a nicer accessor to do it, I suspect some bitrot need > fixing in there, possibly already fixed in .21) > > I agree that preempt_disable() should be put before the MSR tweaking > though. As all of you said, I'm resending the patch here. To fix the following runtime warnings when entering xmon. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Entering xmon BUG: using smp_processor_id() in preemptible [00000000] code: khvcd/280 caller is .xmon_core+0xb8/0x8ec Call Trace: [C00000000FD737C0] [C00000000000FAAC] .show_stack+0x68/0x1b0 (unreliable) [C00000000FD73860] [C0000000001F71F0] .debug_smp_processor_id+0xc8/0xf8 [C00000000FD738F0] [C00000000004AF30] .xmon_core+0xb8/0x8ec [C00000000FD73A80] [C00000000004B918] .xmon+0x38/0x4c [C00000000FD73C60] [C00000000004BA8C] .sysrq_handle_xmon+0x48/0x5c [C00000000FD73CD0] [C000000000243A68] .__handle_sysrq+0xe0/0x1b0 [C00000000FD73D70] [C000000000244974] .hvc_poll+0x18c/0x2b4 [C00000000FD73E50] [C000000000244E80] .khvcd+0x88/0x164 [C00000000FD73EE0] [C000000000075014] .kthread+0x124/0x174 [C00000000FD73F90] [C000000000023D48] .kernel_thread+0x4c/0x68 BUG: khvcd:280 task might have lost a preemption check! Call Trace: [C00000000FD73740] [C00000000000FAAC] .show_stack+0x68/0x1b0 (unreliable) [C00000000FD737E0] [C000000000054920] .preempt_enable_no_resched+0x64/0x7c [C00000000FD73860] [C0000000001F71F8] .debug_smp_processor_id+0xd0/0xf8 [C00000000FD738F0] [C00000000004AF30] .xmon_core+0xb8/0x8ec [C00000000FD73A80] [C00000000004B918] .xmon+0x38/0x4c [C00000000FD73C60] [C00000000004BA8C] .sysrq_handle_xmon+0x48/0x5c [C00000000FD73CD0] [C000000000243A68] .__handle_sysrq+0xe0/0x1b0 [C00000000FD73D70] [C000000000244974] .hvc_poll+0x18c/0x2b4 [C00000000FD73E50] [C000000000244E80] .khvcd+0x88/0x164 [C00000000FD73EE0] [C000000000075014] .kthread+0x124/0x174 [C00000000FD73F90] [C000000000023D48] .kernel_thread+0x4c/0x68 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - thanks a lot! 
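For reference, the ordering this thread converges on boils down to the minimal sketch below. It assumes ordinary kernel context with <linux/preempt.h> and <linux/irqflags.h>; the function name xmon_core_sketch() and its elided body are illustrative only, the actual change being just the two added lines in the hunk that follows.

#include <linux/preempt.h>
#include <linux/irqflags.h>

/*
 * Disable preemption before touching the interrupt state, so the task
 * cannot be preempted and migrated to another CPU in between, then mask
 * interrupts; undo in the reverse order on the way out, mirroring the
 * two lines added to xmon_core() below.
 */
static int xmon_core_sketch(void)
{
	unsigned long flags;
	int cmd = 0;

	preempt_disable();	/* pin to this CPU first */
	local_irq_save(flags);	/* lazy disable on ppc64; xmon hard-disables MSR_EE separately */

	/* ... breakpoint handling and the xmon command loop elided ... */

	local_irq_restore(flags);
	preempt_enable();	/* re-enable preemption last */

	return cmd;
}
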
Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/xmon/xmon.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/arch/powerpc/xmon/xmon.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/xmon/xmon.c +++ linux-2.6.24.7/arch/powerpc/xmon/xmon.c @@ -340,6 +340,7 @@ static int xmon_core(struct pt_regs *reg unsigned long timeout; #endif + preempt_disable(); local_irq_save(flags); bp = in_breakpoint_table(regs->nip, &offset); @@ -516,6 +517,7 @@ static int xmon_core(struct pt_regs *reg insert_cpu_bpts(); local_irq_restore(flags); + preempt_enable(); return cmd != 'X' && cmd != EOF; } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-add-raw-relax-macros.patch�����������������������������������������0000664�0000764�0000764�00000002146�11041657734�023132� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Mon May 14 15:26:25 2007 Date: Mon, 14 May 2007 15:26:25 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [patch 1/4] powerpc 2.6.21-rt1: fix a build breakage by adding __raw_*_relax() macros Add missing macros to fix a build breakage for PREEMPT_DESKTOP. 
Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- include/asm-powerpc/spinlock.h | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/include/asm-powerpc/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/spinlock.h +++ linux-2.6.24.7/include/asm-powerpc/spinlock.h @@ -289,5 +289,9 @@ static __inline__ void __raw_write_unloc #define _raw_read_relax(lock) __rw_yield(lock) #define _raw_write_relax(lock) __rw_yield(lock) +#define __raw_spin_relax(lock) cpu_relax() +#define __raw_read_relax(lock) cpu_relax() +#define __raw_write_relax(lock) cpu_relax() + #endif /* __KERNEL__ */ #endif /* __ASM_SPINLOCK_H */ ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-tlb-batching.patch�������������������������������������������������0000664�0000764�0000764�00000003761�11041657734�021562� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Tue May 15 15:27:26 2007 Date: Tue, 15 May 2007 15:27:26 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: Arnd Bergmann <arnd@arndb.de> Cc: linuxppc-dev@ozlabs.org, Thomas Gleixner <tglx@linutronix.de>, mingo@elte.hu, linux-kernel@vger.kernel.org Subject: Re: [patch 4/4] powerpc 2.6.21-rt1: reduce scheduling latency by changing tlb flush size At Mon, 14 May 2007 16:40:02 +0200, Arnd Bergmann wrote: > > +#if defined(CONFIG_PPC_CELLEB) && defined(CONFIG_PREEMPT_RT) > > +/* Since tlb flush takes long time on Celleb, reduce it to 1 when Celleb && RT */ > > +#define PPC64_TLB_BATCH_NR 1 > With this code, you get silent side-effects of enabling PPC_CELLEB > along with another platform. > Maybe instead you should change the hpte_need_flush() to always flush > when running on the celleb platform and PREEMPT_RT is enabled. OK, how about this one? thanks a lot! Since flushing tlb needs expensive hypervisor call(s) on celleb, always flush it on RT to reduce scheduling latency. Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/mm/tlb_64.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) Index: linux-2.6.24.7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/tlb_64.c +++ linux-2.6.24.7/arch/powerpc/mm/tlb_64.c @@ -30,6 +30,7 @@ #include <asm/tlbflush.h> #include <asm/tlb.h> #include <asm/bug.h> +#include <asm/machdep.h> DEFINE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); @@ -207,6 +208,18 @@ void hpte_need_flush(struct mm_struct *m batch->pte[i] = rpte; batch->vaddr[i] = vaddr; batch->index = ++i; + +#ifdef CONFIG_PREEMPT_RT + /* + * Since flushing tlb needs expensive hypervisor call(s) on celleb, + * always flush it on RT to reduce scheduling latency. 
+ */ + if (machine_is(celleb)) { + flush_tlb_pending(); + return; + } +#endif /* CONFIG_PREEMPT_RT */ + if (i >= PPC64_TLB_BATCH_NR) __flush_tlb_pending(batch); } ���������������patches/preempt-realtime-powerpc-celleb-raw-spinlocks.patch�����������������������������������������0000664�0000764�0000764�00000003170�11041657732�023234� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tsutomu.owa@toshiba.co.jp Mon May 14 15:28:23 2007 Date: Mon, 14 May 2007 15:28:23 +0900 From: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> To: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org Cc: mingo@elte.hu, tglx@linutronix.de Subject: Re: [patch 2/4] powerpc 2.6.21-rt1: convert spinlocks to raw ones for Celleb. Convert more spinlocks to raw ones for Celleb. Signed-off-by: Tsutomu OWA <tsutomu.owa@toshiba.co.jp> -- owa --- arch/powerpc/platforms/celleb/htab.c | 2 +- arch/powerpc/platforms/celleb/interrupt.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/powerpc/platforms/celleb/htab.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/celleb/htab.c +++ linux-2.6.24.7/arch/powerpc/platforms/celleb/htab.c @@ -40,7 +40,7 @@ #define DBG_LOW(fmt...) do { } while(0) #endif -static DEFINE_SPINLOCK(beat_htab_lock); +static DEFINE_RAW_SPINLOCK(beat_htab_lock); static inline unsigned int beat_read_mask(unsigned hpte_group) { Index: linux-2.6.24.7/arch/powerpc/platforms/celleb/interrupt.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/celleb/interrupt.c +++ linux-2.6.24.7/arch/powerpc/platforms/celleb/interrupt.c @@ -34,7 +34,7 @@ extern int hardirq_preemption; #endif /* CONFIG_PREEMPT_HARDIRQS */ #define MAX_IRQS NR_IRQS -static DEFINE_SPINLOCK(beatic_irq_mask_lock); +static DEFINE_RAW_SPINLOCK(beatic_irq_mask_lock); static uint64_t beatic_irq_mask_enable[(MAX_IRQS+255)/64]; static uint64_t beatic_irq_mask_ack[(MAX_IRQS+255)/64]; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-powerpc-missing-raw-spinlocks.patch����������������������������������������0000664�0000764�0000764�00000007525�11041657732�023467� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From sshtylyov@ru.mvista.com Thu Jun 21 22:24:22 2007 Return-Path: <sshtylyov@ru.mvista.com> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from imap.sh.mvista.com (unknown [63.81.120.155]) by 
mail.tglx.de (Postfix) with ESMTP id 2149065C065 for <tglx@linutronix.de>; Thu, 21 Jun 2007 22:24:22 +0200 (CEST) Received: from wasted.dev.rtsoft.ru (unknown [10.150.0.9]) by imap.sh.mvista.com (Postfix) with ESMTP id D27113EC9; Thu, 21 Jun 2007 13:24:15 -0700 (PDT) From: Sergei Shtylyov <sshtylyov@ru.mvista.com> Organization: MontaVista Software Inc. To: tglx@linutronix.de, bruce.ashfield@gmail.com, rostedt@goodmis.org Subject: [PATCH] (2.6.20-rt3) PowerPC: convert spinlocks into raw Date: Thu, 21 Jun 2007 23:25:58 +0300 User-Agent: KMail/1.5 MIME-Version: 1.0 Content-Disposition: inline Content-Type: text/plain; charset="iso-8859-1" Message-Id: <200706220025.58799.sshtylyov@ru.mvista.com> X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Convert the spinlocks in the PowerPC interrupt related code into the raw ones, also convert the PURR and PMC related spinlocks... Signed-off-by: Mark A. Greer <mgreer@mvista.com> Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> --- Resending in hopes it still can apply -- if it doesn't, bug me again... :-) --- arch/powerpc/kernel/pmc.c | 2 +- arch/powerpc/sysdev/i8259.c | 2 +- arch/powerpc/sysdev/ipic.c | 2 +- arch/powerpc/sysdev/mpic.c | 2 +- include/asm-powerpc/mpic.h | 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/pmc.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/pmc.c +++ linux-2.6.24.7/arch/powerpc/kernel/pmc.c @@ -37,7 +37,7 @@ static void dummy_perf(struct pt_regs *r } -static DEFINE_SPINLOCK(pmc_owner_lock); +static DEFINE_RAW_SPINLOCK(pmc_owner_lock); static void *pmc_owner_caller; /* mostly for debugging */ perf_irq_t perf_irq = dummy_perf; Index: linux-2.6.24.7/arch/powerpc/sysdev/i8259.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/sysdev/i8259.c +++ linux-2.6.24.7/arch/powerpc/sysdev/i8259.c @@ -23,7 +23,7 @@ static unsigned char cached_8259[2] = { #define cached_A1 (cached_8259[0]) #define cached_21 (cached_8259[1]) -static DEFINE_SPINLOCK(i8259_lock); +static DEFINE_RAW_SPINLOCK(i8259_lock); static struct irq_host *i8259_host; Index: linux-2.6.24.7/arch/powerpc/sysdev/ipic.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/sysdev/ipic.c +++ linux-2.6.24.7/arch/powerpc/sysdev/ipic.c @@ -30,7 +30,7 @@ #include "ipic.h" static struct ipic * primary_ipic; -static DEFINE_SPINLOCK(ipic_lock); +static DEFINE_RAW_SPINLOCK(ipic_lock); static struct ipic_info ipic_info[] = { [9] = { Index: linux-2.6.24.7/arch/powerpc/sysdev/mpic.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/sysdev/mpic.c +++ linux-2.6.24.7/arch/powerpc/sysdev/mpic.c @@ -46,7 +46,7 @@ static struct mpic *mpics; static struct mpic *mpic_primary; -static DEFINE_SPINLOCK(mpic_lock); +static DEFINE_RAW_SPINLOCK(mpic_lock); #ifdef CONFIG_PPC32 /* XXX for now */ #ifdef CONFIG_IRQ_ALL_CPUS Index: linux-2.6.24.7/include/asm-powerpc/mpic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/mpic.h +++ linux-2.6.24.7/include/asm-powerpc/mpic.h @@ -275,7 +275,7 @@ struct mpic #ifdef CONFIG_MPIC_U3_HT_IRQS /* The fixup table */ struct mpic_irq_fixup *fixups; - spinlock_t fixup_lock; + raw_spinlock_t fixup_lock; #endif /* Register access method */ 
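The PowerPC spinlock conversions above (irq_big_lock, rtas.lock, native_tlbie_lock, beat_htab_lock, beatic_irq_mask_lock, i8259_lock, ipic_lock, mpic_lock, pmc_owner_lock, the fixup_lock field of struct mpic) all follow one mechanical pattern, collected in the sketch below. These locks sit in interrupt-disabled, low-level paths, so on PREEMPT_RT they must stay spinning locks instead of becoming sleeping rt locks, which is what the "Badness at arch/powerpc/kernel/entry_64.S" traces earlier in the series complain about. The names demo_lock and struct demo are made up for illustration, and the include is an assumption about where this series keeps the raw lock definitions; note that the hunks change only declarations and initializers, never the lock/unlock call sites.

#include <linux/spinlock.h>	/* assumed to provide raw_spinlock_t & friends in this -rt series */

/* 1) Static lock definition: */
static DEFINE_RAW_SPINLOCK(demo_lock);		/* was: DEFINE_SPINLOCK(demo_lock) */

/* 2) Lock embedded in a structure (cf. fixup_lock in struct mpic, lock in struct rtas_t): */
struct demo {
	raw_spinlock_t	lock;			/* was: spinlock_t lock */
	int		value;
};

/* 3) Static initializer (cf. the rtas change): */
static struct demo demo = {
	.lock	= RAW_SPIN_LOCK_UNLOCKED(lock),	/* was: SPIN_LOCK_UNLOCKED */
	.value	= 0,
};
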
���������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-sh.patch�������������������������������������������������������������������0000664�0000764�0000764�00000077763�11041657735�016177� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From lethal@linux-sh.org Fri Apr 27 10:21:47 2007 Date: Fri, 27 Apr 2007 10:21:47 +0900 From: Paul Mundt <lethal@linux-sh.org> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu> Subject: [PATCH] preempt-rt: Preliminary SH support Hi Thomas, Ingo, Here's preliminary preempt-rt support for SH. It was written against 2.6.21-rc5, but still applies cleanly. I've kept the clock events stuff out of this patch, since I'm planning on overhauling the timer stuff on SH first, but this should trickle in through 2.6.22-rc. Feel free to either merge this in to preempt-rt or hold off until the timer stuff gets done. Patch from Matsubara-san. Signed-off-by: Katsuya MATSUBARA <matsu@igel.co.jp> Signed-off-by: Paul Mundt <lethal@linux-sh.org> -- arch/sh/kernel/cpu/clock.c | 2 - arch/sh/kernel/cpu/sh4/sq.c | 2 - arch/sh/kernel/entry-common.S | 8 ++--- arch/sh/kernel/irq.c | 2 - arch/sh/kernel/process.c | 10 +++--- arch/sh/kernel/semaphore.c | 14 ++++++--- arch/sh/kernel/sh_ksyms.c | 9 ++--- arch/sh/kernel/signal.c | 7 ++++ arch/sh/kernel/time.c | 2 - arch/sh/kernel/traps.c | 2 - arch/sh/mm/cache-sh4.c | 12 +++---- arch/sh/mm/init.c | 2 - arch/sh/mm/pg-sh4.c | 4 +- arch/sh/mm/tlb-flush.c | 20 ++++++------ arch/sh/mm/tlb-sh4.c | 4 +- include/asm-sh/atomic-irq.h | 24 +++++++-------- include/asm-sh/atomic.h | 8 ++--- include/asm-sh/bitops.h | 24 +++++++-------- include/asm-sh/pgalloc.h | 2 - include/asm-sh/rwsem.h | 46 ++++++++++++++--------------- include/asm-sh/semaphore-helper.h | 8 ++--- include/asm-sh/semaphore.h | 59 +++++++++++++++++++++++--------------- include/asm-sh/system.h | 12 +++---- include/asm-sh/thread_info.h | 2 + 24 files changed, 157 insertions(+), 128 deletions(-) Index: linux-2.6.24.7/arch/sh/kernel/cpu/clock.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/cpu/clock.c +++ linux-2.6.24.7/arch/sh/kernel/cpu/clock.c @@ -28,7 +28,7 @@ #include <asm/timer.h> static LIST_HEAD(clock_list); -static DEFINE_SPINLOCK(clock_lock); +static DEFINE_RAW_SPINLOCK(clock_lock); static DEFINE_MUTEX(clock_list_sem); /* Index: linux-2.6.24.7/arch/sh/kernel/cpu/sh4/sq.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/cpu/sh4/sq.c +++ linux-2.6.24.7/arch/sh/kernel/cpu/sh4/sq.c @@ -37,7 +37,7 @@ struct sq_mapping { }; static struct sq_mapping *sq_mapping_list; -static DEFINE_SPINLOCK(sq_mapping_lock); +static DEFINE_RAW_SPINLOCK(sq_mapping_lock); static struct kmem_cache *sq_cache; static unsigned long *sq_bitmap; Index: linux-2.6.24.7/arch/sh/kernel/entry-common.S =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/entry-common.S +++ linux-2.6.24.7/arch/sh/kernel/entry-common.S @@ -157,7 +157,7 @@ ENTRY(resume_userspace) mov.l @(TI_FLAGS,r8), r0 ! 
current_thread_info->flags tst #_TIF_WORK_MASK, r0 bt/s __restore_all - tst #_TIF_NEED_RESCHED, r0 + tst #_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED, r0 .align 2 work_pending: @@ -209,10 +209,10 @@ work_resched: tst #_TIF_WORK_MASK, r0 bt __restore_all bra work_pending - tst #_TIF_NEED_RESCHED, r0 + tst #_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_DELAYED, r0 .align 2 -1: .long schedule +1: .long __schedule 2: .long do_notify_resume 3: .long restore_all #ifdef CONFIG_TRACE_IRQFLAGS @@ -226,7 +226,7 @@ syscall_exit_work: ! r8: current_thread_info tst #_TIF_SYSCALL_TRACE | _TIF_SINGLESTEP, r0 bt/s work_pending - tst #_TIF_NEED_RESCHED, r0 + tst #_TIF_NEED_RESCHED| _TIF_NEED_RESCHED_DELAYED, r0 #ifdef CONFIG_TRACE_IRQFLAGS mov.l 5f, r0 jsr @r0 Index: linux-2.6.24.7/arch/sh/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/irq.c +++ linux-2.6.24.7/arch/sh/kernel/irq.c @@ -81,7 +81,7 @@ static union irq_ctx *hardirq_ctx[NR_CPU static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; #endif -asmlinkage int do_IRQ(unsigned int irq, struct pt_regs *regs) +asmlinkage notrace int do_IRQ(unsigned int irq, struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); #ifdef CONFIG_IRQSTACKS Index: linux-2.6.24.7/arch/sh/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/process.c +++ linux-2.6.24.7/arch/sh/kernel/process.c @@ -65,7 +65,7 @@ void default_idle(void) clear_thread_flag(TIF_POLLING_NRFLAG); smp_mb__after_clear_bit(); set_bl_bit(); - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) cpu_sleep(); clear_bl_bit(); set_thread_flag(TIF_POLLING_NRFLAG); @@ -86,13 +86,15 @@ void cpu_idle(void) idle = default_idle; tick_nohz_stop_sched_tick(); - while (!need_resched()) + while (!need_resched() && !need_resched_delayed()) idle(); tick_nohz_restart_sched_tick(); - preempt_enable_no_resched(); - schedule(); + local_irq_disable(); + __preempt_enable_no_resched(); + __schedule(); preempt_disable(); + local_irq_enable(); check_pgt_cache(); } } Index: linux-2.6.24.7/arch/sh/kernel/semaphore.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/semaphore.c +++ linux-2.6.24.7/arch/sh/kernel/semaphore.c @@ -46,7 +46,7 @@ DEFINE_SPINLOCK(semaphore_wake_lock); * critical part is the inline stuff in <asm/semaphore.h> * where we want to avoid any extra jumps and calls. 
*/ -void __up(struct semaphore *sem) +void __attribute_used__ __compat_up(struct compat_semaphore *sem) { wake_one_more(sem); wake_up(&sem->wait); @@ -104,7 +104,7 @@ void __up(struct semaphore *sem) tsk->state = TASK_RUNNING; \ remove_wait_queue(&sem->wait, &wait); -void __sched __down(struct semaphore * sem) +void __attribute_used__ __sched __compat_down(struct compat_semaphore * sem) { DOWN_VAR DOWN_HEAD(TASK_UNINTERRUPTIBLE) @@ -114,7 +114,7 @@ void __sched __down(struct semaphore * s DOWN_TAIL(TASK_UNINTERRUPTIBLE) } -int __sched __down_interruptible(struct semaphore * sem) +int __attribute_used__ __sched __compat_down_interruptible(struct compat_semaphore * sem) { int ret = 0; DOWN_VAR @@ -133,7 +133,13 @@ int __sched __down_interruptible(struct return ret; } -int __down_trylock(struct semaphore * sem) +int __attribute_used__ __compat_down_trylock(struct compat_semaphore * sem) { return waking_non_zero_trylock(sem); } + +fastcall int __sched compat_sem_is_locked(struct compat_semaphore *sem) +{ + return (int) atomic_read(&sem->count) < 0; +} + Index: linux-2.6.24.7/arch/sh/kernel/sh_ksyms.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/sh_ksyms.c +++ linux-2.6.24.7/arch/sh/kernel/sh_ksyms.c @@ -26,7 +26,6 @@ EXPORT_SYMBOL(sh_mv); /* platform dependent support */ EXPORT_SYMBOL(dump_fpu); EXPORT_SYMBOL(kernel_thread); -EXPORT_SYMBOL(irq_desc); EXPORT_SYMBOL(no_irq_type); EXPORT_SYMBOL(strlen); @@ -49,10 +48,10 @@ EXPORT_SYMBOL(get_vm_area); #endif /* semaphore exports */ -EXPORT_SYMBOL(__up); -EXPORT_SYMBOL(__down); -EXPORT_SYMBOL(__down_interruptible); -EXPORT_SYMBOL(__down_trylock); +EXPORT_SYMBOL(__compat_up); +EXPORT_SYMBOL(__compat_down); +EXPORT_SYMBOL(__compat_down_interruptible); +EXPORT_SYMBOL(__compat_down_trylock); EXPORT_SYMBOL(__udelay); EXPORT_SYMBOL(__ndelay); Index: linux-2.6.24.7/arch/sh/kernel/signal.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/signal.c +++ linux-2.6.24.7/arch/sh/kernel/signal.c @@ -564,6 +564,13 @@ static void do_signal(struct pt_regs *re struct k_sigaction ka; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + raw_local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux-2.6.24.7/arch/sh/kernel/time.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/time.c +++ linux-2.6.24.7/arch/sh/kernel/time.c @@ -24,7 +24,7 @@ struct sys_timer *sys_timer; /* Move this somewhere more sensible.. 
*/ -DEFINE_SPINLOCK(rtc_lock); +DEFINE_RAW_SPINLOCK(rtc_lock); EXPORT_SYMBOL(rtc_lock); /* Dummy RTC ops */ Index: linux-2.6.24.7/arch/sh/kernel/traps.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/kernel/traps.c +++ linux-2.6.24.7/arch/sh/kernel/traps.c @@ -77,7 +77,7 @@ static void dump_mem(const char *str, un } } -static DEFINE_SPINLOCK(die_lock); +static DEFINE_RAW_SPINLOCK(die_lock); void die(const char * str, struct pt_regs * regs, long err) { Index: linux-2.6.24.7/arch/sh/mm/cache-sh4.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/mm/cache-sh4.c +++ linux-2.6.24.7/arch/sh/mm/cache-sh4.c @@ -204,7 +204,7 @@ void flush_cache_sigtramp(unsigned long index = CACHE_IC_ADDRESS_ARRAY | (v & boot_cpu_data.icache.entry_mask); - local_irq_save(flags); + raw_local_irq_save(flags); jump_to_P2(); for (i = 0; i < boot_cpu_data.icache.ways; @@ -213,7 +213,7 @@ void flush_cache_sigtramp(unsigned long back_to_P1(); wmb(); - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline void flush_cache_4096(unsigned long start, @@ -229,10 +229,10 @@ static inline void flush_cache_4096(unsi (start < CACHE_OC_ADDRESS_ARRAY)) exec_offset = 0x20000000; - local_irq_save(flags); + raw_local_irq_save(flags); __flush_cache_4096(start | SH_CACHE_ASSOC, P1SEGADDR(phys), exec_offset); - local_irq_restore(flags); + raw_local_irq_restore(flags); } /* @@ -260,7 +260,7 @@ static inline void flush_icache_all(void { unsigned long flags, ccr; - local_irq_save(flags); + raw_local_irq_save(flags); jump_to_P2(); /* Flush I-cache */ @@ -274,7 +274,7 @@ static inline void flush_icache_all(void */ back_to_P1(); - local_irq_restore(flags); + raw_local_irq_restore(flags); } void flush_dcache_all(void) Index: linux-2.6.24.7/arch/sh/mm/init.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/mm/init.c +++ linux-2.6.24.7/arch/sh/mm/init.c @@ -21,7 +21,7 @@ #include <asm/sections.h> #include <asm/cache.h> -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers); +DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); pgd_t swapper_pg_dir[PTRS_PER_PGD]; void (*copy_page)(void *from, void *to); Index: linux-2.6.24.7/arch/sh/mm/pg-sh4.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/mm/pg-sh4.c +++ linux-2.6.24.7/arch/sh/mm/pg-sh4.c @@ -28,9 +28,9 @@ static inline void *kmap_coherent(struct vaddr = __fix_to_virt(FIX_CMAP_END - idx); pte = mk_pte(page, PAGE_KERNEL); - local_irq_save(flags); + raw_local_irq_save(flags); flush_tlb_one(get_asid(), vaddr); - local_irq_restore(flags); + raw_local_irq_restore(flags); update_mmu_cache(NULL, vaddr, pte); Index: linux-2.6.24.7/arch/sh/mm/tlb-flush.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/mm/tlb-flush.c +++ linux-2.6.24.7/arch/sh/mm/tlb-flush.c @@ -24,7 +24,7 @@ void local_flush_tlb_page(struct vm_area asid = cpu_asid(cpu, vma->vm_mm); page &= PAGE_MASK; - local_irq_save(flags); + raw_local_irq_save(flags); if (vma->vm_mm != current->mm) { saved_asid = get_asid(); set_asid(asid); @@ -32,7 +32,7 @@ void local_flush_tlb_page(struct vm_area local_flush_tlb_one(asid, page); if (saved_asid != MMU_NO_ASID) set_asid(saved_asid); - local_irq_restore(flags); + raw_local_irq_restore(flags); } } @@ -46,7 +46,7 @@ void local_flush_tlb_range(struct vm_are unsigned long flags; int size; - local_irq_save(flags); + 
raw_local_irq_save(flags); size = (end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT; if (size > (MMU_NTLB_ENTRIES/4)) { /* Too many TLB to flush */ cpu_context(cpu, mm) = NO_CONTEXT; @@ -71,7 +71,7 @@ void local_flush_tlb_range(struct vm_are if (saved_asid != MMU_NO_ASID) set_asid(saved_asid); } - local_irq_restore(flags); + raw_local_irq_restore(flags); } } @@ -81,7 +81,7 @@ void local_flush_tlb_kernel_range(unsign unsigned long flags; int size; - local_irq_save(flags); + raw_local_irq_save(flags); size = (end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT; if (size > (MMU_NTLB_ENTRIES/4)) { /* Too many TLB to flush */ local_flush_tlb_all(); @@ -100,7 +100,7 @@ void local_flush_tlb_kernel_range(unsign } set_asid(saved_asid); } - local_irq_restore(flags); + raw_local_irq_restore(flags); } void local_flush_tlb_mm(struct mm_struct *mm) @@ -112,11 +112,11 @@ void local_flush_tlb_mm(struct mm_struct if (cpu_context(cpu, mm) != NO_CONTEXT) { unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); cpu_context(cpu, mm) = NO_CONTEXT; if (mm == current->mm) activate_context(mm, cpu); - local_irq_restore(flags); + raw_local_irq_restore(flags); } } @@ -131,10 +131,10 @@ void local_flush_tlb_all(void) * TF-bit for SH-3, TI-bit for SH-4. * It's same position, bit #2. */ - local_irq_save(flags); + raw_local_irq_save(flags); status = ctrl_inl(MMUCR); status |= 0x04; ctrl_outl(status, MMUCR); ctrl_barrier(); - local_irq_restore(flags); + raw_local_irq_restore(flags); } Index: linux-2.6.24.7/arch/sh/mm/tlb-sh4.c =================================================================== --- linux-2.6.24.7.orig/arch/sh/mm/tlb-sh4.c +++ linux-2.6.24.7/arch/sh/mm/tlb-sh4.c @@ -43,7 +43,7 @@ void update_mmu_cache(struct vm_area_str } #endif - local_irq_save(flags); + raw_local_irq_save(flags); /* Set PTEH register */ vpn = (address & MMU_VPN_MASK) | get_asid(); @@ -76,7 +76,7 @@ void update_mmu_cache(struct vm_area_str /* Load the TLB */ asm volatile("ldtlb": /* no output */ : /* no input */ : "memory"); - local_irq_restore(flags); + raw_local_irq_restore(flags); } void local_flush_tlb_one(unsigned long asid, unsigned long page) Index: linux-2.6.24.7/include/asm-sh/atomic-irq.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/atomic-irq.h +++ linux-2.6.24.7/include/asm-sh/atomic-irq.h @@ -10,29 +10,29 @@ static inline void atomic_add(int i, ato { unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); *(long *)v += i; - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline void atomic_sub(int i, atomic_t *v) { unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); *(long *)v -= i; - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline int atomic_add_return(int i, atomic_t *v) { unsigned long temp, flags; - local_irq_save(flags); + raw_local_irq_save(flags); temp = *(long *)v; temp += i; *(long *)v = temp; - local_irq_restore(flags); + raw_local_irq_restore(flags); return temp; } @@ -41,11 +41,11 @@ static inline int atomic_sub_return(int { unsigned long temp, flags; - local_irq_save(flags); + raw_local_irq_save(flags); temp = *(long *)v; temp -= i; *(long *)v = temp; - local_irq_restore(flags); + raw_local_irq_restore(flags); return temp; } @@ -54,18 +54,18 @@ static inline void atomic_clear_mask(uns { unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); *(long *)v &= ~mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline 
void atomic_set_mask(unsigned int mask, atomic_t *v) { unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); *(long *)v |= mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); } #endif /* __ASM_SH_ATOMIC_IRQ_H */ Index: linux-2.6.24.7/include/asm-sh/atomic.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/atomic.h +++ linux-2.6.24.7/include/asm-sh/atomic.h @@ -49,11 +49,11 @@ static inline int atomic_cmpxchg(atomic_ int ret; unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); ret = v->counter; if (likely(ret == old)) v->counter = new; - local_irq_restore(flags); + raw_local_irq_restore(flags); return ret; } @@ -65,11 +65,11 @@ static inline int atomic_add_unless(atom int ret; unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); ret = v->counter; if (ret != u) v->counter += a; - local_irq_restore(flags); + raw_local_irq_restore(flags); return ret != u; } Index: linux-2.6.24.7/include/asm-sh/bitops.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/bitops.h +++ linux-2.6.24.7/include/asm-sh/bitops.h @@ -19,9 +19,9 @@ static inline void set_bit(int nr, volat a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); *a |= mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); } /* @@ -37,9 +37,9 @@ static inline void clear_bit(int nr, vol a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); *a &= ~mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline void change_bit(int nr, volatile void * addr) @@ -50,9 +50,9 @@ static inline void change_bit(int nr, vo a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); *a ^= mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); } static inline int test_and_set_bit(int nr, volatile void * addr) @@ -63,10 +63,10 @@ static inline int test_and_set_bit(int n a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); retval = (mask & *a) != 0; *a |= mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); return retval; } @@ -79,10 +79,10 @@ static inline int test_and_clear_bit(int a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); retval = (mask & *a) != 0; *a &= ~mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); return retval; } @@ -95,10 +95,10 @@ static inline int test_and_change_bit(in a += nr >> 5; mask = 1 << (nr & 0x1f); - local_irq_save(flags); + raw_local_irq_save(flags); retval = (mask & *a) != 0; *a ^= mask; - local_irq_restore(flags); + raw_local_irq_restore(flags); return retval; } Index: linux-2.6.24.7/include/asm-sh/pgalloc.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/pgalloc.h +++ linux-2.6.24.7/include/asm-sh/pgalloc.h @@ -13,7 +13,7 @@ static inline void pmd_populate_kernel(s set_pmd(pmd, __pmd((unsigned long)pte)); } -static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, +static inline void notrace pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte) { set_pmd(pmd, __pmd((unsigned long)page_address(pte))); Index: linux-2.6.24.7/include/asm-sh/rwsem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/rwsem.h +++ 
linux-2.6.24.7/include/asm-sh/rwsem.h @@ -19,7 +19,7 @@ /* * the semaphore definition */ -struct rw_semaphore { +struct compat_rw_semaphore { long count; #define RWSEM_UNLOCKED_VALUE 0x00000000 #define RWSEM_ACTIVE_BIAS 0x00000001 @@ -27,7 +27,7 @@ struct rw_semaphore { #define RWSEM_WAITING_BIAS (-0x00010000) #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS) - spinlock_t wait_lock; + raw_spinlock_t wait_lock; struct list_head wait_list; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; @@ -45,25 +45,25 @@ struct rw_semaphore { LIST_HEAD_INIT((name).wait_list) \ __RWSEM_DEP_MAP_INIT(name) } -#define DECLARE_RWSEM(name) \ - struct rw_semaphore name = __RWSEM_INITIALIZER(name) +#define COMPAT_DECLARE_RWSEM(name) \ + struct compat_rw_semaphore name = __RWSEM_INITIALIZER(name) -extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem); -extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_read_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_down_write_failed(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_wake(struct compat_rw_semaphore *sem); +extern struct compat_rw_semaphore *rwsem_downgrade_wake(struct compat_rw_semaphore *sem); -extern void __init_rwsem(struct rw_semaphore *sem, const char *name, +extern void __compat_init_rwsem(struct rw_semaphore *sem, const char *name, struct lock_class_key *key); -#define init_rwsem(sem) \ +#define compat_init_rwsem(sem) \ do { \ static struct lock_class_key __key; \ \ - __init_rwsem((sem), #sem, &__key); \ + __compat_init_rwsem((sem), #sem, &__key); \ } while (0) -static inline void init_rwsem(struct rw_semaphore *sem) +static inline void compat_init_rwsem(struct rw_semaphore *sem) { sem->count = RWSEM_UNLOCKED_VALUE; spin_lock_init(&sem->wait_lock); @@ -73,7 +73,7 @@ static inline void init_rwsem(struct rw_ /* * lock for reading */ -static inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct compat_rw_semaphore *sem) { if (atomic_inc_return((atomic_t *)(&sem->count)) > 0) smp_wmb(); @@ -81,7 +81,7 @@ static inline void __down_read(struct rw rwsem_down_read_failed(sem); } -static inline int __down_read_trylock(struct rw_semaphore *sem) +static inline int __down_read_trylock(struct compat_rw_semaphore *sem) { int tmp; @@ -98,7 +98,7 @@ static inline int __down_read_trylock(st /* * lock for writing */ -static inline void __down_write(struct rw_semaphore *sem) +static inline void __down_write(struct compat_rw_semaphore *sem) { int tmp; @@ -110,7 +110,7 @@ static inline void __down_write(struct r rwsem_down_write_failed(sem); } -static inline int __down_write_trylock(struct rw_semaphore *sem) +static inline int __down_write_trylock(struct compat_rw_semaphore *sem) { int tmp; @@ -123,7 +123,7 @@ static inline int __down_write_trylock(s /* * unlock after reading */ -static inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct compat_rw_semaphore *sem) { int tmp; @@ -136,7 +136,7 @@ static inline void __up_read(struct rw_s /* * unlock after writing */ -static inline void __up_write(struct rw_semaphore *sem) +static inline void __up_write(struct compat_rw_semaphore *sem) { smp_wmb(); if 
(atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS, @@ -147,7 +147,7 @@ static inline void __up_write(struct rw_ /* * implement atomic add functionality */ -static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem) +static inline void rwsem_atomic_add(int delta, struct compat_rw_semaphore *sem) { atomic_add(delta, (atomic_t *)(&sem->count)); } @@ -155,7 +155,7 @@ static inline void rwsem_atomic_add(int /* * downgrade write lock to read lock */ -static inline void __downgrade_write(struct rw_semaphore *sem) +static inline void __downgrade_write(struct compat_rw_semaphore *sem) { int tmp; @@ -165,7 +165,7 @@ static inline void __downgrade_write(str rwsem_downgrade_wake(sem); } -static inline void __down_write_nested(struct rw_semaphore *sem, int subclass) +static inline void __down_write_nested(struct compat_rw_semaphore *sem, int subclass) { __down_write(sem); } @@ -173,13 +173,13 @@ static inline void __down_write_nested(s /* * implement exchange and add functionality */ -static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem) +static inline int rwsem_atomic_update(int delta, struct compat_rw_semaphore *sem) { smp_mb(); return atomic_add_return(delta, (atomic_t *)(&sem->count)); } -static inline int rwsem_is_locked(struct rw_semaphore *sem) +static inline int rwsem_is_locked(struct compat_rw_semaphore *sem) { return (sem->count != 0); } Index: linux-2.6.24.7/include/asm-sh/semaphore-helper.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/semaphore-helper.h +++ linux-2.6.24.7/include/asm-sh/semaphore-helper.h @@ -14,12 +14,12 @@ * This is trivially done with load_locked/store_cond, * which we have. Let the rest of the losers suck eggs. */ -static __inline__ void wake_one_more(struct semaphore * sem) +static __inline__ void wake_one_more(struct compat_semaphore * sem) { atomic_inc((atomic_t *)&sem->sleepers); } -static __inline__ int waking_non_zero(struct semaphore *sem) +static __inline__ int waking_non_zero(struct compat_semaphore *sem) { unsigned long flags; int ret = 0; @@ -43,7 +43,7 @@ static __inline__ int waking_non_zero(st * protected by the spinlock in order to make atomic this atomic_inc() with the * atomic_read() in wake_one_more(), otherwise we can race. -arca */ -static __inline__ int waking_non_zero_interruptible(struct semaphore *sem, +static __inline__ int waking_non_zero_interruptible(struct compat_semaphore *sem, struct task_struct *tsk) { unsigned long flags; @@ -70,7 +70,7 @@ static __inline__ int waking_non_zero_in * protected by the spinlock in order to make atomic this atomic_inc() with the * atomic_read() in wake_one_more(), otherwise we can race. 
-arca */ -static __inline__ int waking_non_zero_trylock(struct semaphore *sem) +static __inline__ int waking_non_zero_trylock(struct compat_semaphore *sem) { unsigned long flags; int ret = 1; Index: linux-2.6.24.7/include/asm-sh/semaphore.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/semaphore.h +++ linux-2.6.24.7/include/asm-sh/semaphore.h @@ -20,28 +20,35 @@ #include <asm/system.h> #include <asm/atomic.h> -struct semaphore { +/* + * On !PREEMPT_RT all semaphores are compat: + */ +#ifndef CONFIG_PREEMPT_RT +# define compat_semaphore semaphore +#endif + +struct compat_semaphore { atomic_t count; int sleepers; wait_queue_head_t wait; }; -#define __SEMAPHORE_INITIALIZER(name, n) \ +#define __COMPAT_SEMAPHORE_INITIALIZER(name, n) \ { \ .count = ATOMIC_INIT(n), \ .sleepers = 0, \ .wait = __WAIT_QUEUE_HEAD_INITIALIZER((name).wait) \ } -#define __DECLARE_SEMAPHORE_GENERIC(name,count) \ - struct semaphore name = __SEMAPHORE_INITIALIZER(name,count) +#define __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,count) \ + struct compat_semaphore name = __COMPAT_SEMAPHORE_INITIALIZER(name,count) -#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1) +#define COMPAT_DECLARE_MUTEX(name) __COMPAT_DECLARE_SEMAPHORE_GENERIC(name,1) -static inline void sema_init (struct semaphore *sem, int val) +static inline void compat_sema_init (struct compat_semaphore *sem, int val) { /* - * *sem = (struct semaphore)__SEMAPHORE_INITIALIZER((*sem),val); + * *sem = (struct compat_semaphore)__SEMAPHORE_INITIALIZER((*sem),val); * * i'd rather use the more flexible initialization above, but sadly * GCC 2.7.2.3 emits a bogus warning. EGCS doesn't. Oh well. @@ -51,14 +58,14 @@ static inline void sema_init (struct sem init_waitqueue_head(&sem->wait); } -static inline void init_MUTEX (struct semaphore *sem) +static inline void compat_init_MUTEX (struct compat_semaphore *sem) { - sema_init(sem, 1); + compat_sema_init(sem, 1); } -static inline void init_MUTEX_LOCKED (struct semaphore *sem) +static inline void compat_init_MUTEX_LOCKED (struct compat_semaphore *sem) { - sema_init(sem, 0); + compat_sema_init(sem, 0); } #if 0 @@ -68,36 +75,36 @@ asmlinkage int __down_failed_trylock(vo asmlinkage void __up_wakeup(void /* special register calling convention */); #endif -asmlinkage void __down(struct semaphore * sem); -asmlinkage int __down_interruptible(struct semaphore * sem); -asmlinkage int __down_trylock(struct semaphore * sem); -asmlinkage void __up(struct semaphore * sem); +asmlinkage void __compat_down(struct compat_semaphore * sem); +asmlinkage int __compat_down_interruptible(struct compat_semaphore * sem); +asmlinkage int __compat_down_trylock(struct compat_semaphore * sem); +asmlinkage void __compat_up(struct compat_semaphore * sem); extern spinlock_t semaphore_wake_lock; -static inline void down(struct semaphore * sem) +static inline void compat_down(struct compat_semaphore * sem) { might_sleep(); if (atomic_dec_return(&sem->count) < 0) - __down(sem); + __compat_down(sem); } -static inline int down_interruptible(struct semaphore * sem) +static inline int compat_down_interruptible(struct compat_semaphore * sem) { int ret = 0; might_sleep(); if (atomic_dec_return(&sem->count) < 0) - ret = __down_interruptible(sem); + ret = __compat_down_interruptible(sem); return ret; } -static inline int down_trylock(struct semaphore * sem) +static inline int compat_down_trylock(struct compat_semaphore * sem) { int ret = 0; if (atomic_dec_return(&sem->count) < 0) - ret = 
__down_trylock(sem); + ret = __compat_down_trylock(sem); return ret; } @@ -105,11 +112,17 @@ static inline int down_trylock(struct se * Note! This is subtle. We jump to wake people up only if * the semaphore was negative (== somebody was waiting on it). */ -static inline void up(struct semaphore * sem) +static inline void compat_up(struct compat_semaphore * sem) { if (atomic_inc_return(&sem->count) <= 0) - __up(sem); + __compat_up(sem); } +extern int compat_sem_is_locked(struct compat_semaphore *sem); + +#define compat_sema_count(sem) atomic_read(&(sem)->count) + +#include <linux/semaphore.h> + #endif #endif /* __ASM_SH_SEMAPHORE_H */ Index: linux-2.6.24.7/include/asm-sh/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/system.h +++ linux-2.6.24.7/include/asm-sh/system.h @@ -159,10 +159,10 @@ static inline unsigned long xchg_u32(vol { unsigned long flags, retval; - local_irq_save(flags); + raw_local_irq_save(flags); retval = *m; *m = val; - local_irq_restore(flags); + raw_local_irq_restore(flags); return retval; } @@ -170,10 +170,10 @@ static inline unsigned long xchg_u8(vola { unsigned long flags, retval; - local_irq_save(flags); + raw_local_irq_save(flags); retval = *m; *m = val & 0xff; - local_irq_restore(flags); + raw_local_irq_restore(flags); return retval; } @@ -208,11 +208,11 @@ static inline unsigned long __cmpxchg_u3 __u32 retval; unsigned long flags; - local_irq_save(flags); + raw_local_irq_save(flags); retval = *m; if (retval == old) *m = new; - local_irq_restore(flags); /* implies memory barrier */ + raw_local_irq_restore(flags); /* implies memory barrier */ return retval; } Index: linux-2.6.24.7/include/asm-sh/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-sh/thread_info.h +++ linux-2.6.24.7/include/asm-sh/thread_info.h @@ -111,6 +111,7 @@ static inline struct thread_info *curren #define TIF_NEED_RESCHED 2 /* rescheduling necessary */ #define TIF_RESTORE_SIGMASK 3 /* restore signal mask in do_signal() */ #define TIF_SINGLESTEP 4 /* singlestepping active */ +#define TIF_NEED_RESCHED_DELAYED 6 /* reschedule on return to userspace */ #define TIF_USEDFPU 16 /* FPU was used by this task this quantum (SMP) */ #define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_MEMDIE 18 @@ -121,6 +122,7 @@ static inline struct thread_info *curren #define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED) #define _TIF_RESTORE_SIGMASK (1<<TIF_RESTORE_SIGMASK) #define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_USEDFPU (1<<TIF_USEDFPU) #define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG) #define _TIF_FREEZE (1<<TIF_FREEZE) �������������patches/preempt-realtime-i386.patch�����������������������������������������������������������������0000664�0000764�0000764�00000067355�11041657731�016246� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/Kconfig.debug | 2 + arch/x86/kernel/cpu/mtrr/generic.c | 2 - arch/x86/kernel/head_32.S | 1 arch/x86/kernel/i8253.c | 2 - arch/x86/kernel/i8259_32.c | 2 - arch/x86/kernel/io_apic_32.c | 4 +-- arch/x86/kernel/irq_32.c | 4 
++- arch/x86/kernel/microcode.c | 2 - arch/x86/kernel/nmi_32.c | 5 +++ arch/x86/kernel/process_32.c | 19 ++++++++++---- arch/x86/kernel/signal_32.c | 14 ++++++++++ arch/x86/kernel/smp_32.c | 19 ++++++++++---- arch/x86/kernel/traps_32.c | 18 +++++++++++--- arch/x86/kernel/vm86_32.c | 1 arch/x86/mm/fault_32.c | 1 arch/x86/mm/highmem_32.c | 37 ++++++++++++++++++++++------- arch/x86/mm/pgtable_32.c | 2 - arch/x86/pci/common.c | 2 - arch/x86/pci/direct.c | 29 ++++++++++++++-------- arch/x86/pci/pci.h | 2 - include/asm-x86/acpi_32.h | 4 +-- include/asm-x86/dma_32.h | 2 - include/asm-x86/highmem.h | 27 +++++++++++++++++++++ include/asm-x86/i8253.h | 2 - include/asm-x86/i8259.h | 2 - include/asm-x86/mach-default/irq_vectors.h | 2 - include/asm-x86/mc146818rtc_32.h | 2 - include/asm-x86/pgtable_32.h | 2 - include/asm-x86/tlbflush_32.h | 26 ++++++++++++++++++++ include/asm-x86/xor_32.h | 21 ++++++++++++++-- kernel/Kconfig.instrumentation | 5 +++ 31 files changed, 211 insertions(+), 52 deletions(-) Index: linux-2.6.24.7/arch/x86/Kconfig.debug =================================================================== --- linux-2.6.24.7.orig/arch/x86/Kconfig.debug +++ linux-2.6.24.7/arch/x86/Kconfig.debug @@ -50,6 +50,7 @@ config DEBUG_PAGEALLOC config DEBUG_RODATA bool "Write protect kernel read-only data structures" depends on DEBUG_KERNEL + default y help Mark the kernel read-only data as write-protected in the pagetables, in order to catch accidental (and incorrect) writes to such const @@ -61,6 +62,7 @@ config 4KSTACKS bool "Use 4Kb for kernel stacks instead of 8Kb" depends on DEBUG_KERNEL depends on X86_32 + default y help If you say Y here the kernel will use a 4Kb stacksize for the kernel stack attached to each process/thread. This facilitates Index: linux-2.6.24.7/arch/x86/kernel/cpu/mtrr/generic.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/cpu/mtrr/generic.c +++ linux-2.6.24.7/arch/x86/kernel/cpu/mtrr/generic.c @@ -330,7 +330,7 @@ static unsigned long set_mtrr_state(void static unsigned long cr4 = 0; -static DEFINE_SPINLOCK(set_atomicity_lock); +static DEFINE_RAW_SPINLOCK(set_atomicity_lock); /* * Since we are disabling the cache don't allow any interrupts - they Index: linux-2.6.24.7/arch/x86/kernel/head_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/head_32.S +++ linux-2.6.24.7/arch/x86/kernel/head_32.S @@ -533,6 +533,7 @@ ignore_int: call printk #endif addl $(5*4),%esp + call dump_stack popl %ds popl %es popl %edx Index: linux-2.6.24.7/arch/x86/kernel/i8253.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i8253.c +++ linux-2.6.24.7/arch/x86/kernel/i8253.c @@ -14,7 +14,7 @@ #include <asm/i8253.h> #include <asm/io.h> -DEFINE_SPINLOCK(i8253_lock); +DEFINE_RAW_SPINLOCK(i8253_lock); EXPORT_SYMBOL(i8253_lock); /* Index: linux-2.6.24.7/arch/x86/kernel/i8259_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/i8259_32.c +++ linux-2.6.24.7/arch/x86/kernel/i8259_32.c @@ -33,7 +33,7 @@ */ static int i8259A_auto_eoi; -DEFINE_SPINLOCK(i8259A_lock); +DEFINE_RAW_SPINLOCK(i8259A_lock); static void mask_and_ack_8259A(unsigned int); static struct irq_chip i8259A_chip = { Index: linux-2.6.24.7/arch/x86/kernel/io_apic_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_32.c +++ 
linux-2.6.24.7/arch/x86/kernel/io_apic_32.c @@ -56,8 +56,8 @@ atomic_t irq_mis_count; /* Where if anywhere is the i8259 connect in external int mode */ static struct { int pin, apic; } ioapic_i8259 = { -1, -1 }; -static DEFINE_SPINLOCK(ioapic_lock); -static DEFINE_SPINLOCK(vector_lock); +static DEFINE_RAW_SPINLOCK(ioapic_lock); +static DEFINE_RAW_SPINLOCK(vector_lock); int timer_over_8254 __initdata = 1; Index: linux-2.6.24.7/arch/x86/kernel/irq_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/irq_32.c +++ linux-2.6.24.7/arch/x86/kernel/irq_32.c @@ -79,6 +79,8 @@ fastcall unsigned int do_IRQ(struct pt_r u32 *isp; #endif + irq_show_regs_callback(smp_processor_id(), regs); + if (unlikely((unsigned)irq >= NR_IRQS)) { printk(KERN_EMERG "%s: cannot handle IRQ %d\n", __FUNCTION__, irq); @@ -96,7 +98,7 @@ fastcall unsigned int do_IRQ(struct pt_r __asm__ __volatile__("andl %%esp,%0" : "=r" (esp) : "0" (THREAD_SIZE - 1)); if (unlikely(esp < (sizeof(struct thread_info) + STACK_WARN))) { - printk("do_IRQ: stack overflow: %ld\n", + printk("BUG: do_IRQ: stack overflow: %ld\n", esp - sizeof(struct thread_info)); dump_stack(); } Index: linux-2.6.24.7/arch/x86/kernel/microcode.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/microcode.c +++ linux-2.6.24.7/arch/x86/kernel/microcode.c @@ -117,7 +117,7 @@ MODULE_LICENSE("GPL"); #define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE) /* serialize access to the physical write to MSR 0x79 */ -static DEFINE_SPINLOCK(microcode_update_lock); +static DEFINE_RAW_SPINLOCK(microcode_update_lock); /* no concurrent ->write()s are allowed on /dev/cpu/microcode */ static DEFINE_MUTEX(microcode_mutex); Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -59,7 +59,12 @@ static int endflag __initdata = 0; static __init void nmi_cpu_busy(void *data) { #ifdef CONFIG_SMP + /* + * avoid a warning, on PREEMPT_RT this wont run in hardirq context: + */ +#ifndef CONFIG_PREEMPT_RT local_irq_enable_in_hardirq(); +#endif /* Intentionally don't use cpu_relax here. 
This is to make sure that the performance counter really ticks, even if there is a simulator or similar that catches the Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -342,9 +342,10 @@ void __show_registers(struct pt_regs *re regs->eax, regs->ebx, regs->ecx, regs->edx); printk("ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n", regs->esi, regs->edi, regs->ebp, esp); - printk(" DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x\n", + printk(" DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x" + " preempt:%08x\n", regs->xds & 0xffff, regs->xes & 0xffff, - regs->xfs & 0xffff, gs, ss); + regs->xfs & 0xffff, gs, ss, preempt_count()); if (!all) return; @@ -416,15 +417,23 @@ void exit_thread(void) if (unlikely(test_thread_flag(TIF_IO_BITMAP))) { struct task_struct *tsk = current; struct thread_struct *t = &tsk->thread; - int cpu = get_cpu(); - struct tss_struct *tss = &per_cpu(init_tss, cpu); + void *io_bitmap_ptr = t->io_bitmap_ptr; + int cpu; + struct tss_struct *tss; - kfree(t->io_bitmap_ptr); + /* + * On PREEMPT_RT we must not call kfree() with + * preemption disabled, so we first zap the pointer: + */ t->io_bitmap_ptr = NULL; + kfree(io_bitmap_ptr); + clear_thread_flag(TIF_IO_BITMAP); /* * Careful, clear this in the TSS too: */ + cpu = get_cpu(); + tss = &per_cpu(init_tss, cpu); memset(tss->io_bitmap, 0xff, tss->io_bitmap_max); t->io_bitmap_max = 0; tss->io_bitmap_owner = NULL; Index: linux-2.6.24.7/arch/x86/kernel/signal_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/signal_32.c +++ linux-2.6.24.7/arch/x86/kernel/signal_32.c @@ -536,6 +536,13 @@ handle_signal(unsigned long sig, siginfo } } +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * If TF is set due to a debugger (PT_DTRACE), clear the TF flag so * that register information in the sigcontext is correct. @@ -576,6 +583,13 @@ static void fastcall do_signal(struct pt struct k_sigaction ka; sigset_t *oldset; +#ifdef CONFIG_PREEMPT_RT + /* + * Fully-preemptible kernel does not need interrupts disabled: + */ + local_irq_enable(); + preempt_check_resched(); +#endif /* * We want the common case to go fast, which * is why we may in certain cases get here from Index: linux-2.6.24.7/arch/x86/kernel/smp_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smp_32.c +++ linux-2.6.24.7/arch/x86/kernel/smp_32.c @@ -247,7 +247,7 @@ void send_IPI_mask_sequence(cpumask_t ma static cpumask_t flush_cpumask; static struct mm_struct * flush_mm; static unsigned long flush_va; -static DEFINE_SPINLOCK(tlbstate_lock); +static DEFINE_RAW_SPINLOCK(tlbstate_lock); /* * We cannot call mmdrop() because we are in interrupt context, @@ -476,10 +476,20 @@ static void native_smp_send_reschedule(i } /* + * this function sends a 'reschedule' IPI to all other CPUs. + * This is used when RT tasks are starving and other CPUs + * might be able to run them: + */ +void smp_send_reschedule_allbutself(void) +{ + send_IPI_allbutself(RESCHEDULE_VECTOR); +} + +/* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. 
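/*
 * The exit_thread() change above is a pattern worth spelling out: on
 * PREEMPT_RT, kfree() may take sleeping locks, so it must not run inside a
 * preempt-disabled (get_cpu()) region.  A minimal sketch of the same
 * ordering, with a made-up example_ctx structure:
 */
#include <linux/slab.h>
#include <linux/preempt.h>

struct example_ctx {
	void *bitmap;			/* heap allocation owned by the task */
};

static void example_ctx_release(struct example_ctx *ctx)
{
	void *bitmap = ctx->bitmap;

	ctx->bitmap = NULL;		/* detach while still preemptible */
	kfree(bitmap);			/* may sleep on PREEMPT_RT, so free it here */

	preempt_disable();		/* only now touch per-CPU state */
	/* ... clear any per-CPU copies of the released data ... */
	preempt_enable();
}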
*/ -static DEFINE_SPINLOCK(call_lock); +static DEFINE_RAW_SPINLOCK(call_lock); struct call_data_struct { void (*func) (void *info); @@ -634,9 +644,8 @@ static void native_smp_send_stop(void) } /* - * Reschedule call back. Nothing to do, - * all the work is done automatically when - * we return from the interrupt. + * Reschedule call back. Trigger a reschedule pass so that + * RT-overload balancing can pass tasks around. */ fastcall void smp_reschedule_interrupt(struct pt_regs *regs) { Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -297,6 +297,12 @@ void dump_stack(void) EXPORT_SYMBOL(dump_stack); +#if defined(CONFIG_DEBUG_STACKOVERFLOW) && defined(CONFIG_EVENT_TRACE) +extern unsigned long worst_stack_left; +#else +# define worst_stack_left -1L +#endif + void show_registers(struct pt_regs *regs) { int i; @@ -366,7 +372,7 @@ void die(const char * str, struct pt_reg u32 lock_owner; int lock_owner_depth; } die = { - .lock = __RAW_SPIN_LOCK_UNLOCKED, + .lock = RAW_SPIN_LOCK_UNLOCKED(die.lock), .lock_owner = -1, .lock_owner_depth = 0 }; @@ -378,7 +384,7 @@ void die(const char * str, struct pt_reg if (die.lock_owner != raw_smp_processor_id()) { console_verbose(); raw_local_irq_save(flags); - __raw_spin_lock(&die.lock); + spin_lock(&die.lock); die.lock_owner = smp_processor_id(); die.lock_owner_depth = 0; bust_spinlocks(1); @@ -427,7 +433,7 @@ void die(const char * str, struct pt_reg bust_spinlocks(0); die.lock_owner = -1; add_taint(TAINT_DIE); - __raw_spin_unlock(&die.lock); + spin_unlock(&die.lock); raw_local_irq_restore(flags); if (!regs) @@ -467,6 +473,11 @@ static void __kprobes do_trap(int trapnr if (!user_mode(regs)) goto kernel_trap; +#ifdef CONFIG_PREEMPT_RT + local_irq_enable(); + preempt_check_resched(); +#endif + trap_signal: { /* * We want error_code and trap_no set for userspace faults and @@ -724,6 +735,7 @@ void __kprobes die_nmi(struct pt_regs *r crash_kexec(regs); } + nmi_exit(); do_exit(SIGSEGV); } Index: linux-2.6.24.7/arch/x86/kernel/vm86_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vm86_32.c +++ linux-2.6.24.7/arch/x86/kernel/vm86_32.c @@ -135,6 +135,7 @@ struct pt_regs * fastcall save_v86_state local_irq_enable(); if (!current->thread.vm86_info) { + local_irq_disable(); printk("no vm86_info: BAD\n"); do_exit(SIGSEGV); } Index: linux-2.6.24.7/arch/x86/mm/fault_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/fault_32.c +++ linux-2.6.24.7/arch/x86/mm/fault_32.c @@ -502,6 +502,7 @@ bad_area_nosemaphore: nr = (address - idt_descr.address) >> 3; if (nr == 6) { + zap_rt_locks(); do_invalid_op(regs, 0); return; } Index: linux-2.6.24.7/arch/x86/mm/highmem_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/highmem_32.c +++ linux-2.6.24.7/arch/x86/mm/highmem_32.c @@ -18,6 +18,26 @@ void kunmap(struct page *page) kunmap_high(page); } +void kunmap_virt(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return; + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + kunmap(page); +} + +struct page *kmap_to_page(void *ptr) +{ + struct page *page; + + if ((unsigned long)ptr < PKMAP_ADDR(0)) + return virt_to_page(ptr); + page = pte_page(pkmap_page_table[PKMAP_NR((unsigned long)ptr)]); + return page; 
+} + /* * kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because * no global lock is needed and because the kmap code must perform a global TLB @@ -26,7 +46,7 @@ void kunmap(struct page *page) * However when holding an atomic kmap is is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. */ -void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot) +void *__kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot) { enum fixed_addresses idx; unsigned long vaddr; @@ -46,12 +66,12 @@ void *kmap_atomic_prot(struct page *page return (void *)vaddr; } -void *kmap_atomic(struct page *page, enum km_type type) +void *__kmap_atomic(struct page *page, enum km_type type) { return kmap_atomic_prot(page, type, kmap_prot); } -void kunmap_atomic(void *kvaddr, enum km_type type) +void __kunmap_atomic(void *kvaddr, enum km_type type) { unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK; enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id(); @@ -78,7 +98,7 @@ void kunmap_atomic(void *kvaddr, enum km /* This is the same as kmap_atomic() but can map memory that doesn't * have a struct page associated with it. */ -void *kmap_atomic_pfn(unsigned long pfn, enum km_type type) +void *__kmap_atomic_pfn(unsigned long pfn, enum km_type type) { enum fixed_addresses idx; unsigned long vaddr; @@ -93,7 +113,7 @@ void *kmap_atomic_pfn(unsigned long pfn, return (void*) vaddr; } -struct page *kmap_atomic_to_page(void *ptr) +struct page *__kmap_atomic_to_page(void *ptr) { unsigned long idx, vaddr = (unsigned long)ptr; pte_t *pte; @@ -108,6 +128,7 @@ struct page *kmap_atomic_to_page(void *p EXPORT_SYMBOL(kmap); EXPORT_SYMBOL(kunmap); -EXPORT_SYMBOL(kmap_atomic); -EXPORT_SYMBOL(kunmap_atomic); -EXPORT_SYMBOL(kmap_atomic_to_page); +EXPORT_SYMBOL(kunmap_virt); +EXPORT_SYMBOL(__kmap_atomic); +EXPORT_SYMBOL(__kunmap_atomic); +EXPORT_SYMBOL(__kmap_atomic_to_page); Index: linux-2.6.24.7/arch/x86/mm/pgtable_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/pgtable_32.c +++ linux-2.6.24.7/arch/x86/mm/pgtable_32.c @@ -210,7 +210,7 @@ void pmd_ctor(struct kmem_cache *cache, * vmalloc faults work because attached pagetables are never freed. * -- wli */ -DEFINE_SPINLOCK(pgd_lock); +DEFINE_RAW_SPINLOCK(pgd_lock); struct page *pgd_list; static inline void pgd_list_add(pgd_t *pgd) Index: linux-2.6.24.7/arch/x86/pci/common.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/pci/common.c +++ linux-2.6.24.7/arch/x86/pci/common.c @@ -54,7 +54,7 @@ int pcibios_scanned; * This interrupt-safe spinlock protects all accesses to PCI * configuration space. 
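/*
 * The DEFINE_RAW_SPINLOCK() conversions around here (pgd_lock above,
 * pci_config_lock below) follow one rule: a lock taken in hard-IRQ or other
 * non-preemptible low-level paths must remain a spinning lock on
 * PREEMPT_RT, where plain spinlock_t becomes a sleeping rt_mutex.  On this
 * tree the spin_lock_*() calls are type-aware, so only the definition and
 * any extern declaration change.  A minimal sketch with a made-up lock and
 * register accessor:
 */
#include <linux/spinlock.h>
#include <asm/io.h>

static DEFINE_RAW_SPINLOCK(example_hw_lock);	/* still spins on PREEMPT_RT */

static u32 example_read_reg(unsigned int port)
{
	unsigned long flags;
	u32 val;

	spin_lock_irqsave(&example_hw_lock, flags);	/* hard-disables interrupts */
	val = inl(port);
	spin_unlock_irqrestore(&example_hw_lock, flags);

	return val;
}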
*/ -DEFINE_SPINLOCK(pci_config_lock); +DEFINE_RAW_SPINLOCK(pci_config_lock); /* * Several buggy motherboards address only 16 devices and mirror Index: linux-2.6.24.7/arch/x86/pci/direct.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/pci/direct.c +++ linux-2.6.24.7/arch/x86/pci/direct.c @@ -220,16 +220,23 @@ static int __init pci_check_type1(void) unsigned int tmp; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x01, 0xCFB); tmp = inl(0xCF8); outl(0x80000000, 0xCF8); - if (inl(0xCF8) == 0x80000000 && pci_sanity_check(&pci_direct_conf1)) { - works = 1; + + if (inl(0xCF8) == 0x80000000) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf1)) + works = 1; + + spin_lock_irqsave(&pci_config_lock, flags); } outl(tmp, 0xCF8); - local_irq_restore(flags); + + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } @@ -239,17 +246,19 @@ static int __init pci_check_type2(void) unsigned long flags; int works = 0; - local_irq_save(flags); + spin_lock_irqsave(&pci_config_lock, flags); outb(0x00, 0xCFB); outb(0x00, 0xCF8); outb(0x00, 0xCFA); - if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00 && - pci_sanity_check(&pci_direct_conf2)) { - works = 1; - } - local_irq_restore(flags); + if (inb(0xCF8) == 0x00 && inb(0xCFA) == 0x00) { + spin_unlock_irqrestore(&pci_config_lock, flags); + + if (pci_sanity_check(&pci_direct_conf2)) + works = 1; + } else + spin_unlock_irqrestore(&pci_config_lock, flags); return works; } Index: linux-2.6.24.7/arch/x86/pci/pci.h =================================================================== --- linux-2.6.24.7.orig/arch/x86/pci/pci.h +++ linux-2.6.24.7/arch/x86/pci/pci.h @@ -80,7 +80,7 @@ struct irq_routing_table { extern unsigned int pcibios_irq_mask; extern int pcibios_scanned; -extern spinlock_t pci_config_lock; +extern raw_spinlock_t pci_config_lock; extern int (*pcibios_enable_irq)(struct pci_dev *dev); extern void (*pcibios_disable_irq)(struct pci_dev *dev); Index: linux-2.6.24.7/include/asm-x86/acpi_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/acpi_32.h +++ linux-2.6.24.7/include/asm-x86/acpi_32.h @@ -52,8 +52,8 @@ #define ACPI_ASM_MACROS #define BREAKPOINT3 -#define ACPI_DISABLE_IRQS() local_irq_disable() -#define ACPI_ENABLE_IRQS() local_irq_enable() +#define ACPI_DISABLE_IRQS() local_irq_disable_nort() +#define ACPI_ENABLE_IRQS() local_irq_enable_nort() #define ACPI_FLUSH_CPU_CACHE() wbinvd() int __acpi_acquire_global_lock(unsigned int *lock); Index: linux-2.6.24.7/include/asm-x86/dma_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/dma_32.h +++ linux-2.6.24.7/include/asm-x86/dma_32.h @@ -134,7 +134,7 @@ #define DMA_AUTOINIT 0x10 -extern spinlock_t dma_spin_lock; +extern spinlock_t dma_spin_lock; static __inline__ unsigned long claim_dma_lock(void) { Index: linux-2.6.24.7/include/asm-x86/highmem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/highmem.h +++ linux-2.6.24.7/include/asm-x86/highmem.h @@ -67,6 +67,16 @@ extern void * FASTCALL(kmap_high(struct extern void FASTCALL(kunmap_high(struct page *page)); void *kmap(struct page *page); +extern void kunmap_virt(void *ptr); +extern struct page *kmap_to_page(void *ptr); +void kunmap(struct page *page); + +void *__kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot); 
+void *__kmap_atomic(struct page *page, enum km_type type); +void __kunmap_atomic(void *kvaddr, enum km_type type); +void *__kmap_atomic_pfn(unsigned long pfn, enum km_type type); +struct page *__kmap_atomic_to_page(void *ptr); + void kunmap(struct page *page); void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot); void *kmap_atomic(struct page *page, enum km_type type); @@ -80,6 +90,23 @@ struct page *kmap_atomic_to_page(void *p #define flush_cache_kmaps() do { } while (0) +/* + * on PREEMPT_RT kmap_atomic() is a wrapper that uses kmap(): + */ +#ifdef CONFIG_PREEMPT_RT +# define kmap_atomic_prot(page, type, prot) kmap(page) +# define kmap_atomic(page, type) kmap(page) +# define kmap_atomic_pfn(pfn, type) kmap(pfn_to_page(pfn)) +# define kunmap_atomic(kvaddr, type) kunmap_virt(kvaddr) +# define kmap_atomic_to_page(kvaddr) kmap_to_page(kvaddr) +#else +# define kmap_atomic_prot(page, type, prot) __kmap_atomic_prot(page, type, prot) +# define kmap_atomic(page, type) __kmap_atomic(page, type) +# define kmap_atomic_pfn(pfn, type) __kmap_atomic_pfn(pfn, type) +# define kunmap_atomic(kvaddr, type) __kunmap_atomic(kvaddr, type) +# define kmap_atomic_to_page(kvaddr) __kmap_atomic_to_page(kvaddr) +#endif + #endif /* __KERNEL__ */ #endif /* _ASM_HIGHMEM_H */ Index: linux-2.6.24.7/include/asm-x86/i8253.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/i8253.h +++ linux-2.6.24.7/include/asm-x86/i8253.h @@ -6,7 +6,7 @@ #define PIT_CH0 0x40 #define PIT_CH2 0x42 -extern spinlock_t i8253_lock; +extern raw_spinlock_t i8253_lock; extern struct clock_event_device *global_clock_event; Index: linux-2.6.24.7/include/asm-x86/i8259.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/i8259.h +++ linux-2.6.24.7/include/asm-x86/i8259.h @@ -7,7 +7,7 @@ extern unsigned int cached_irq_mask; #define cached_master_mask (__byte(0, cached_irq_mask)) #define cached_slave_mask (__byte(1, cached_irq_mask)) -extern spinlock_t i8259A_lock; +extern raw_spinlock_t i8259A_lock; extern void init_8259A(int auto_eoi); extern void enable_8259A_irq(unsigned int irq); Index: linux-2.6.24.7/include/asm-x86/mach-default/irq_vectors.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/mach-default/irq_vectors.h +++ linux-2.6.24.7/include/asm-x86/mach-default/irq_vectors.h @@ -63,7 +63,7 @@ * levels. 
(0x80 is the syscall vector) */ #define FIRST_DEVICE_VECTOR 0x31 -#define FIRST_SYSTEM_VECTOR 0xef +#define FIRST_SYSTEM_VECTOR 0xee #define TIMER_IRQ 0 Index: linux-2.6.24.7/include/asm-x86/mc146818rtc_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/mc146818rtc_32.h +++ linux-2.6.24.7/include/asm-x86/mc146818rtc_32.h @@ -72,7 +72,7 @@ static inline unsigned char current_lock lock_cmos(reg) #define lock_cmos_suffix(reg) \ unlock_cmos(); \ - local_irq_restore(cmos_flags); \ + local_irq_restore(cmos_flags); \ } while (0) #else #define lock_cmos_prefix(reg) do {} while (0) Index: linux-2.6.24.7/include/asm-x86/pgtable_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/pgtable_32.h +++ linux-2.6.24.7/include/asm-x86/pgtable_32.h @@ -33,7 +33,7 @@ struct vm_area_struct; extern unsigned long empty_zero_page[1024]; extern pgd_t swapper_pg_dir[1024]; extern struct kmem_cache *pmd_cache; -extern spinlock_t pgd_lock; +extern raw_spinlock_t pgd_lock; extern struct page *pgd_list; void check_pgt_cache(void); Index: linux-2.6.24.7/include/asm-x86/tlbflush_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/tlbflush_32.h +++ linux-2.6.24.7/include/asm-x86/tlbflush_32.h @@ -4,6 +4,21 @@ #include <linux/mm.h> #include <asm/processor.h> +/* + * TLB-flush needs to be nonpreemptible on PREEMPT_RT due to the + * following complex race scenario: + * + * if the current task is lazy-TLB and does a TLB flush and + * gets preempted after the movl %%r3, %0 but before the + * movl %0, %%cr3 then its ->active_mm might change and it will + * install the wrong cr3 when it switches back. This is not a + * problem for the lazy-TLB task itself, but if the next task it + * switches to has an ->mm that is also the lazy-TLB task's + * new ->active_mm, then the scheduler will assume that cr3 is + * the new one, while we overwrote it with the old one. The result + * is the wrong cr3 in the new (non-lazy-TLB) task, which typically + * causes an infinite pagefault upon the next userspace access. 
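/*
 * Roughly what the guarded __native_flush_tlb() below boils down to once
 * the preempt_disable()/preempt_enable() pair is added (the helper name is
 * made up for illustration).  Keeping the cr3 read and the cr3 write in one
 * non-preemptible region closes the lazy-TLB ->active_mm race described
 * above.
 */
#include <linux/preempt.h>

static inline void example_flush_tlb_nonpreemptible(void)
{
	unsigned int tmpreg;

	preempt_disable();
	__asm__ __volatile__(
		"movl %%cr3, %0;  \n"
		"movl %0, %%cr3;  # flush TLB \n"
		: "=r" (tmpreg)
		:: "memory");
	preempt_enable();
}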
+ */ #ifdef CONFIG_PARAVIRT #include <asm/paravirt.h> #else @@ -16,11 +31,13 @@ do { \ unsigned int tmpreg; \ \ + preempt_disable(); \ __asm__ __volatile__( \ "movl %%cr3, %0; \n" \ "movl %0, %%cr3; # flush TLB \n" \ : "=r" (tmpreg) \ :: "memory"); \ + preempt_enable(); \ } while (0) /* @@ -31,6 +48,7 @@ do { \ unsigned int tmpreg, cr4, cr4_orig; \ \ + preempt_disable(); \ __asm__ __volatile__( \ "movl %%cr4, %2; # turn off PGE \n" \ "movl %2, %1; \n" \ @@ -42,6 +60,7 @@ : "=&r" (tmpreg), "=&r" (cr4), "=&r" (cr4_orig) \ : "i" (~X86_CR4_PGE) \ : "memory"); \ + preempt_enable(); \ } while (0) #define __native_flush_tlb_single(addr) \ @@ -97,6 +116,13 @@ static inline void flush_tlb_mm(struct mm_struct *mm) { + /* + * This is safe on PREEMPT_RT because if we preempt + * right after the check but before the __flush_tlb(), + * and if ->active_mm changes, then we might miss a + * TLB flush, but that TLB flush happened already when + * ->active_mm was changed: + */ if (mm == current->active_mm) __flush_tlb(); } Index: linux-2.6.24.7/include/asm-x86/xor_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/xor_32.h +++ linux-2.6.24.7/include/asm-x86/xor_32.h @@ -862,7 +862,21 @@ static struct xor_block_template xor_blo #include <asm-generic/xor.h> #undef XOR_TRY_TEMPLATES -#define XOR_TRY_TEMPLATES \ +/* + * MMX/SSE ops disable preemption for long periods of time, + * so on PREEMPT_RT use the register-based ops only: + */ +#ifdef CONFIG_PREEMPT_RT +# define XOR_TRY_TEMPLATES \ + do { \ + xor_speed(&xor_block_8regs); \ + xor_speed(&xor_block_8regs_p); \ + xor_speed(&xor_block_32regs); \ + xor_speed(&xor_block_32regs_p); \ + } while (0) +# define XOR_SELECT_TEMPLATE(FASTEST) (FASTEST) +#else +# define XOR_TRY_TEMPLATES \ do { \ xor_speed(&xor_block_8regs); \ xor_speed(&xor_block_8regs_p); \ @@ -875,9 +889,10 @@ static struct xor_block_template xor_blo xor_speed(&xor_block_p5_mmx); \ } \ } while (0) - /* We force the use of the SSE xor block because it can write around L2. We may also be able to load into the L1 only depending on how the cpu deals with a load to a line that is being prefetched. */ -#define XOR_SELECT_TEMPLATE(FASTEST) \ +# define XOR_SELECT_TEMPLATE(FASTEST) \ (cpu_has_xmm ? &xor_block_pIII_sse : FASTEST) +#endif + Index: linux-2.6.24.7/kernel/Kconfig.instrumentation =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.instrumentation +++ linux-2.6.24.7/kernel/Kconfig.instrumentation @@ -29,6 +29,11 @@ config OPROFILE If unsure, say N. 
+config PROFILE_NMI + bool + depends on OPROFILE + default y + config KPROBES bool "Kprobes" depends on KALLSYMS && MODULES && !UML �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/remove-check-pgt-cache-calls.patch����������������������������������������������������������0000664�0000764�0000764�00000000730�11041657734�017563� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/process_32.c | 1 - 1 file changed, 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -184,7 +184,6 @@ void cpu_idle(void) if (__get_cpu_var(cpu_idle_state)) __get_cpu_var(cpu_idle_state) = 0; - check_pgt_cache(); rmb(); idle = pm_idle; ����������������������������������������patches/preempt-irqs-i386-idle-poll-loop-fix.patch��������������������������������������������������0000664�0000764�0000764�00000001020�11041657734�021010� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/process_32.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -134,7 +134,9 @@ EXPORT_SYMBOL(default_idle); */ static void poll_idle (void) { - cpu_relax(); + do { + cpu_relax(); + } while (!need_resched() && !need_resched_delayed()); } #ifdef CONFIG_HOTPLUG_CPU ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-ftrace.patch���������������������������������������������������������������0000664�0000764�0000764�00000010751�11041657732�017006� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/trace/ftrace.c | 4 ++-- kernel/trace/trace.c | 10 +++++----- kernel/trace/trace.h | 2 +- 
kernel/trace/trace_hist.c | 2 +- kernel/trace/trace_irqsoff.c | 2 +- kernel/trace/trace_sched_wakeup.c | 2 +- 6 files changed, 11 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -39,7 +39,7 @@ static int last_ftrace_enabled; */ static int ftrace_disabled __read_mostly; -static DEFINE_SPINLOCK(ftrace_lock); +static DEFINE_RAW_SPINLOCK(ftrace_lock); static DEFINE_MUTEX(ftrace_sysctl_lock); static struct ftrace_ops ftrace_list_end __read_mostly = @@ -166,7 +166,7 @@ static struct hlist_head ftrace_hash[FTR static DEFINE_PER_CPU(int, ftrace_shutdown_disable_cpu); -static DEFINE_SPINLOCK(ftrace_shutdown_lock); +static DEFINE_RAW_SPINLOCK(ftrace_shutdown_lock); static DEFINE_MUTEX(ftraced_lock); static DEFINE_MUTEX(ftrace_regex_lock); Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -223,8 +223,8 @@ static const char *trace_options[] = { * This is defined as a raw_spinlock_t in order to help * with performance when lockdep debugging is enabled. */ -static raw_spinlock_t ftrace_max_lock = - (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; +static __raw_spinlock_t ftrace_max_lock = + (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; /* * Copy the new maximum trace into the separate maximum-trace @@ -654,7 +654,7 @@ static unsigned map_pid_to_cmdline[PID_M static unsigned map_cmdline_to_pid[SAVED_CMDLINES]; static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN]; static int cmdline_idx; -static DEFINE_SPINLOCK(trace_cmdline_lock); +static DEFINE_RAW_SPINLOCK(trace_cmdline_lock); /* temporary disable recording */ atomic_t trace_record_cmdline_disabled __read_mostly; @@ -3355,8 +3355,8 @@ __init static int tracer_alloc_buffers(v /* use the LRU flag to differentiate the two buffers */ ClearPageLRU(page); - data->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; - max_tr.data[i]->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + data->lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; + max_tr.data[i]->lock = (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; /* Only allocate if we are actually using the max trace */ #ifdef CONFIG_TRACER_MAX_TRACE Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -162,7 +162,7 @@ struct trace_entry { struct trace_array_cpu { struct list_head trace_pages; atomic_t disabled; - raw_spinlock_t lock; + __raw_spinlock_t lock; struct lock_class_key lock_key; /* these fields get copied into max-trace: */ Index: linux-2.6.24.7/kernel/trace/trace_hist.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_hist.c +++ linux-2.6.24.7/kernel/trace/trace_hist.c @@ -412,7 +412,7 @@ int tracing_wakeup_hist __read_mostly = static unsigned wakeup_prio = (unsigned)-1 ; static struct task_struct *wakeup_task; static cycle_t wakeup_start; -static DEFINE_SPINLOCK(wakeup_lock); +static DEFINE_RAW_SPINLOCK(wakeup_lock); notrace void tracing_hist_wakeup_start(struct task_struct *p, struct task_struct *curr) Index: linux-2.6.24.7/kernel/trace/trace_irqsoff.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_irqsoff.c +++ 
linux-2.6.24.7/kernel/trace/trace_irqsoff.c @@ -24,7 +24,7 @@ static int tracer_enabled __read_most static DEFINE_PER_CPU(int, tracing_cpu); -static DEFINE_SPINLOCK(max_trace_lock); +static DEFINE_RAW_SPINLOCK(max_trace_lock); enum { TRACER_IRQS_OFF = (1 << 1), Index: linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_sched_wakeup.c +++ linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c @@ -26,7 +26,7 @@ static struct task_struct *wakeup_task; static int wakeup_cpu; static unsigned wakeup_prio = -1; -static DEFINE_SPINLOCK(wakeup_lock); +static DEFINE_RAW_SPINLOCK(wakeup_lock); static void __wakeup_reset(struct trace_array *tr); �����������������������patches/preempt-realtime-ftrace-disable-ftraced.patch�����������������������������������������������0000664�0000764�0000764�00000001265�11041657733�022016� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- init/main.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -59,6 +59,7 @@ #include <linux/device.h> #include <linux/kthread.h> #include <linux/sched.h> +#include <linux/ftrace.h> #include <asm/io.h> #include <asm/bugs.h> @@ -783,6 +784,9 @@ static int noinline init_post(void) (void) sys_dup(0); (void) sys_dup(0); +#ifdef CONFIG_PREEMPT_RT + ftrace_disable_daemon(); +#endif if (ramdisk_execute_command) { run_init_process(ramdisk_execute_command); printk(KERN_WARNING "Failed to execute %s\n", �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-sched.patch����������������������������������������������������������������0000664�0000764�0000764�00000064747�11041657731�016645� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/sched.h | 47 ++++++ kernel/sched.c | 375 +++++++++++++++++++++++++++++++++++++++++++------- kernel/sched_rt.c | 60 +++++++- 3 files changed, 431 insertions(+), 51 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -91,6 +91,16 @@ struct sched_param { #include <asm/processor.h> +#ifdef CONFIG_PREEMPT +extern int kernel_preemption; +#else +# define kernel_preemption 0 +#endif +#ifdef CONFIG_PREEMPT_VOLUNTARY +extern int voluntary_preemption; +#else +# define voluntary_preemption 0 +#endif #ifdef CONFIG_PREEMPT_SOFTIRQS extern int softirq_preemption; #else @@ -200,6 +210,28 @@ 
extern struct semaphore kernel_sem; #define set_task_state(tsk, state_value) \ set_mb((tsk)->state, (state_value)) +// #define PREEMPT_DIRECT + +#ifdef CONFIG_X86_LOCAL_APIC +extern void nmi_show_all_regs(void); +#else +# define nmi_show_all_regs() do { } while (0) +#endif + +#include <linux/smp.h> +#include <linux/sem.h> +#include <linux/signal.h> +#include <linux/securebits.h> +#include <linux/fs_struct.h> +#include <linux/compiler.h> +#include <linux/completion.h> +#include <linux/pid.h> +#include <linux/percpu.h> +#include <linux/topology.h> +#include <linux/seccomp.h> + +struct exec_domain; + /* * set_current_state() includes a barrier so that the write of current->state * is correctly serialised wrt the caller's subsequent test of whether to @@ -319,6 +351,11 @@ extern signed long FASTCALL(schedule_tim extern signed long schedule_timeout_interruptible(signed long timeout); extern signed long schedule_timeout_uninterruptible(signed long timeout); asmlinkage void schedule(void); +/* + * This one can be called with interrupts disabled, only + * to be used by lowlevel arch code! + */ +asmlinkage void __sched __schedule(void); struct nsproxy; struct user_namespace; @@ -1419,6 +1456,15 @@ extern struct pid *cad_pid; extern void free_task(struct task_struct *tsk); #define get_task_struct(tsk) do { atomic_inc(&(tsk)->usage); } while(0) +#ifdef CONFIG_PREEMPT_RT +extern void __put_task_struct_cb(struct rcu_head *rhp); + +static inline void put_task_struct(struct task_struct *t) +{ + if (atomic_dec_and_test(&t->usage)) + call_rcu(&t->rcu, __put_task_struct_cb); +} +#else extern void __put_task_struct(struct task_struct *t); static inline void put_task_struct(struct task_struct *t) @@ -1426,6 +1472,7 @@ static inline void put_task_struct(struc if (atomic_dec_and_test(&t->usage)) __put_task_struct(t); } +#endif /* * Per process flags Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -4,6 +4,7 @@ * Kernel scheduler and related syscalls * * Copyright (C) 1991-2002 Linus Torvalds + * Copyright (C) 2004 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> * * 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and * make semaphores SMP safe @@ -16,6 +17,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2004-10-13 Real-Time Preemption support by Ingo Molnar * 2007-04-15 Work begun on replacing all interactivity tuning with a * fair scheduling design by Con Kolivas. 
* 2007-05-05 Load balancing (smp-nice) and other improvements @@ -59,6 +61,7 @@ #include <linux/sysctl.h> #include <linux/syscalls.h> #include <linux/times.h> +#include <linux/kallsyms.h> #include <linux/tsacct_kern.h> #include <linux/kprobes.h> #include <linux/delayacct.h> @@ -114,6 +117,20 @@ unsigned long long __attribute__((weak)) #define NICE_0_LOAD SCHED_LOAD_SCALE #define NICE_0_SHIFT SCHED_LOAD_SHIFT +#if (BITS_PER_LONG < 64) +#define JIFFIES_TO_NS64(TIME) \ + ((unsigned long long)(TIME) * ((unsigned long) (1000000000 / HZ))) + +#define NS64_TO_JIFFIES(TIME) \ + ((((unsigned long long)((TIME)) >> BITS_PER_LONG) * \ + (1 + NS_TO_JIFFIES(~0UL))) + NS_TO_JIFFIES((unsigned long)(TIME))) +#else /* BITS_PER_LONG < 64 */ + +#define NS64_TO_JIFFIES(TIME) NS_TO_JIFFIES(TIME) +#define JIFFIES_TO_NS64(TIME) JIFFIES_TO_NS(TIME) + +#endif /* BITS_PER_LONG < 64 */ + /* * These are the 'tuning knobs' of the scheduler: * @@ -143,6 +160,32 @@ static inline void sg_inc_cpu_power(stru } #endif +#define TASK_PREEMPTS_CURR(p, rq) \ + ((p)->prio < (rq)->curr->prio) + +/* + * Tweaks for current + */ + +#ifdef CURRENT_PTR +struct task_struct * const ___current = &init_task; +struct task_struct ** const current_ptr = (struct task_struct ** const)&___current; +struct thread_info * const current_ti = &init_thread_union.thread_info; +struct thread_info ** const current_ti_ptr = (struct thread_info ** const)¤t_ti; + +EXPORT_SYMBOL(___current); +EXPORT_SYMBOL(current_ti); + +/* + * The scheduler itself doesnt want 'current' to be cached + * during context-switches: + */ +# undef current +# define current __current() +# undef current_thread_info +# define current_thread_info() __current_thread_info() +#endif + static inline int rt_policy(int policy) { if (unlikely(policy == SCHED_FIFO) || unlikely(policy == SCHED_RR)) @@ -278,6 +321,7 @@ struct rt_rq { struct list_head *rt_load_balance_head, *rt_load_balance_curr; unsigned long rt_nr_running; unsigned long rt_nr_migratory; + unsigned long rt_nr_uninterruptible; /* highest queued rt task prio */ int highest_prio; int overloaded; @@ -324,7 +368,7 @@ static struct root_domain def_root_domai */ struct rq { /* runqueue lock: */ - spinlock_t lock; + raw_spinlock_t lock; /* * nr_running and cpu_load should be in the same cacheline because @@ -357,6 +401,8 @@ struct rq { */ unsigned long nr_uninterruptible; + unsigned long switch_timestamp; + unsigned long slice_avg; struct task_struct *curr, *idle; unsigned long next_balance; struct mm_struct *prev_mm; @@ -406,6 +452,13 @@ struct rq { /* BKL stats */ unsigned int bkl_count; + + /* RT-overload stats: */ + unsigned long rto_schedule; + unsigned long rto_schedule_tail; + unsigned long rto_wakeup; + unsigned long rto_pulled; + unsigned long rto_pushed; #endif struct lock_class_key rq_lock_key; }; @@ -569,11 +622,23 @@ unsigned long long notrace cpu_clock(int } EXPORT_SYMBOL_GPL(cpu_clock); +/* + * We really dont want to do anything complex within switch_to() + * on PREEMPT_RT - this check enforces this. 
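/*
 * The prepare_arch_switch check introduced below acts as a build-time
 * guard: if an architecture supplies such a hook, the PREEMPT_RT
 * configuration refuses to build rather than risk heavyweight work inside
 * switch_to().  The same pattern, with a made-up hook name, looks like
 * this:
 */
#ifdef example_arch_switch_hook
# ifdef CONFIG_PREEMPT_RT
#  error example_arch_switch_hook() is too heavy for switch_to() on PREEMPT_RT
# endif
#endif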
+ */ +#ifdef prepare_arch_switch +# ifdef CONFIG_PREEMPT_RT +# error FIXME +# else +# define _finish_arch_switch finish_arch_switch +# endif +#endif + #ifndef prepare_arch_switch # define prepare_arch_switch(next) do { } while (0) #endif #ifndef finish_arch_switch -# define finish_arch_switch(prev) do { } while (0) +# define _finish_arch_switch(prev) do { } while (0) #endif static inline int task_current(struct rq *rq, struct task_struct *p) @@ -604,7 +669,7 @@ static inline void finish_lock_switch(st */ spin_acquire(&rq->lock.dep_map, 0, 0, _THIS_IP_); - spin_unlock_irq(&rq->lock); + spin_unlock(&rq->lock); } #else /* __ARCH_WANT_UNLOCKED_CTXSW */ @@ -645,8 +710,8 @@ static inline void finish_lock_switch(st smp_wmb(); prev->oncpu = 0; #endif -#ifndef __ARCH_WANT_INTERRUPTS_ON_CTXSW - local_irq_enable(); +#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW + local_irq_disable(); #endif } #endif /* __ARCH_WANT_UNLOCKED_CTXSW */ @@ -1093,6 +1158,8 @@ static inline int normal_prio(struct tas prio = MAX_RT_PRIO-1 - p->rt_priority; else prio = __normal_prio(p); + +// trace_special_pid(p->pid, PRIO(p), __PRIO(prio)); return prio; } @@ -1593,6 +1660,13 @@ try_to_wake_up(struct task_struct *p, un long old_state; struct rq *rq; +#ifdef CONFIG_PREEMPT_RT + /* + * sync wakeups can increase wakeup latencies: + */ + if (rt_task(p)) + sync = 0; +#endif rq = task_rq_lock(p, &flags); old_state = p->state; if (!(old_state & state)) @@ -1658,7 +1732,10 @@ out_activate: out_running: trace_kernel_sched_wakeup(rq, p); - p->state = TASK_RUNNING; + if (mutex) + p->state = TASK_RUNNING_MUTEX; + else + p->state = TASK_RUNNING; #ifdef CONFIG_SMP if (p->sched_class->task_wake_up) p->sched_class->task_wake_up(rq, p); @@ -1962,7 +2039,7 @@ static void finish_task_switch(struct rq * Manfred Spraul <manfred@colorfullife.com> */ prev_state = prev->state; - finish_arch_switch(prev); + _finish_arch_switch(prev); finish_lock_switch(rq, prev); #ifdef CONFIG_SMP if (current->sched_class->post_schedule) @@ -1989,12 +2066,15 @@ static void finish_task_switch(struct rq asmlinkage void schedule_tail(struct task_struct *prev) __releases(rq->lock) { - struct rq *rq = this_rq(); - - finish_task_switch(rq, prev); + preempt_disable(); // TODO: move this to fork setup + finish_task_switch(this_rq(), prev); + __preempt_enable_no_resched(); + local_irq_enable(); #ifdef __ARCH_WANT_UNLOCKED_CTXSW /* In this case, finish_task_switch does not reenable preemption */ preempt_enable(); +#else + preempt_check_resched(); #endif if (current->set_child_tid) put_user(task_pid_vnr(current), current->set_child_tid); @@ -2043,6 +2123,11 @@ context_switch(struct rq *rq, struct tas spin_release(&rq->lock.dep_map, 1, _THIS_IP_); #endif +#ifdef CURRENT_PTR + barrier(); + *current_ptr = next; + *current_ti_ptr = next->thread_info; +#endif /* Here we just switch the register state and the stack. 
*/ switch_to(prev, next, prev); @@ -2089,6 +2174,11 @@ unsigned long nr_uninterruptible(void) return sum; } +unsigned long nr_uninterruptible_cpu(int cpu) +{ + return cpu_rq(cpu)->nr_uninterruptible; +} + unsigned long long nr_context_switches(void) { int i; @@ -3569,6 +3659,8 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; u64 next_tick = rq->tick_timestamp + TICK_NSEC; + BUG_ON(!irqs_disabled()); + spin_lock(&rq->lock); __update_rq_clock(rq); /* @@ -3666,8 +3758,8 @@ static noinline void __schedule_bug(stru { struct pt_regs *regs = get_irq_regs(); - printk(KERN_ERR "BUG: scheduling while atomic: %s/%d/0x%08x\n", - prev->comm, prev->pid, preempt_count()); + printk(KERN_ERR "BUG: scheduling while atomic: %s/0x%08x/%d, CPU#%d\n", + prev->comm, preempt_count(), prev->pid, smp_processor_id()); debug_show_held_locks(prev); if (irqs_disabled()) @@ -3684,6 +3776,8 @@ static noinline void __schedule_bug(stru */ static inline void schedule_debug(struct task_struct *prev) { + WARN_ON(system_state == SYSTEM_BOOTING); + /* * Test if we are atomic. Since do_exit() needs to call into * schedule() atomically, we ignore that path for now. @@ -3738,14 +3832,13 @@ pick_next_task(struct rq *rq, struct tas /* * schedule() is the main scheduler function. */ -asmlinkage void __sched schedule(void) +asmlinkage void __sched __schedule(void) { struct task_struct *prev, *next; long *switch_count; struct rq *rq; int cpu; -need_resched: preempt_disable(); cpu = smp_processor_id(); rq = cpu_rq(cpu); @@ -3754,7 +3847,6 @@ need_resched: switch_count = &prev->nivcsw; release_kernel_lock(prev); -need_resched_nonpreemptible: schedule_debug(prev); @@ -3764,19 +3856,25 @@ need_resched_nonpreemptible: local_irq_disable(); __update_rq_clock(rq); spin_lock(&rq->lock); + cpu = smp_processor_id(); clear_tsk_need_resched(prev); clear_tsk_need_resched_delayed(prev); - if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) { + if ((prev->state & ~TASK_RUNNING_MUTEX) && + !(preempt_count() & PREEMPT_ACTIVE)) { if (unlikely((prev->state & TASK_INTERRUPTIBLE) && unlikely(signal_pending(prev)))) { prev->state = TASK_RUNNING; } else { + touch_softlockup_watchdog(); deactivate_task(rq, prev, 1); } switch_count = &prev->nvcsw; } + if (preempt_count() & PREEMPT_ACTIVE) + sub_preempt_count(PREEMPT_ACTIVE); + #ifdef CONFIG_SMP if (prev->sched_class->pre_schedule) prev->sched_class->pre_schedule(rq, prev); @@ -3796,22 +3894,90 @@ need_resched_nonpreemptible: ++*switch_count; context_switch(rq, prev, next); /* unlocks the rq */ - } else - spin_unlock_irq(&rq->lock); + __preempt_enable_no_resched(); + } else { + __preempt_enable_no_resched(); + spin_unlock(&rq->lock); + } - if (unlikely(reacquire_kernel_lock(current) < 0)) { - cpu = smp_processor_id(); - rq = cpu_rq(cpu); - goto need_resched_nonpreemptible; + reacquire_kernel_lock(current); + if (!irqs_disabled()) { + static int once = 1; + if (once) { + once = 0; + print_irqtrace_events(current); + WARN_ON(1); + } } - __preempt_enable_no_resched(); - if (unlikely(test_thread_flag(TIF_NEED_RESCHED) || - test_thread_flag(TIF_NEED_RESCHED_DELAYED))) - goto need_resched; +} + +/* + * schedule() is the main scheduler function. + */ +asmlinkage void __sched schedule(void) +{ + WARN_ON(system_state == SYSTEM_BOOTING); + /* + * Test if we have interrupts disabled. 
+ */ + if (unlikely(irqs_disabled())) { + printk(KERN_ERR "BUG: scheduling with irqs disabled: " + "%s/0x%08x/%d\n", current->comm, preempt_count(), + current->pid); + print_symbol("caller is %s\n", + (long)__builtin_return_address(0)); + dump_stack(); + } + + if (unlikely(current->flags & PF_NOSCHED)) { + current->flags &= ~PF_NOSCHED; + printk(KERN_ERR "%s:%d userspace BUG: scheduling in " + "user-atomic context!\n", current->comm, current->pid); + dump_stack(); + send_sig(SIGUSR2, current, 1); + } + + local_irq_disable(); + + do { + __schedule(); + } while (unlikely(test_thread_flag(TIF_NEED_RESCHED) || + test_thread_flag(TIF_NEED_RESCHED_DELAYED))); + + local_irq_enable(); } EXPORT_SYMBOL(schedule); #ifdef CONFIG_PREEMPT + +/* + * Global flag to turn preemption off on a CONFIG_PREEMPT kernel: + */ +int kernel_preemption = 1; + +static int __init preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) { + if (kernel_preemption) { + printk(KERN_INFO "turning off kernel preemption!\n"); + kernel_preemption = 0; + } + return 1; + } + if (!strncmp(str, "on", 2)) { + if (!kernel_preemption) { + printk(KERN_INFO "turning on kernel preemption!\n"); + kernel_preemption = 1; + } + return 1; + } + get_option(&str, &kernel_preemption); + + return 1; +} + +__setup("preempt=", preempt_setup); + /* * this is the entry point to schedule() from in-kernel preemption * off of preempt_enable. Kernel preemptions off return from interrupt @@ -3824,6 +3990,8 @@ asmlinkage void __sched preempt_schedule struct task_struct *task = current; int saved_lock_depth; #endif + if (!kernel_preemption) + return; /* * If there is a non-zero preempt_count or interrupts are disabled, * we do not want to preempt the current task. Just return.. @@ -3832,6 +4000,7 @@ asmlinkage void __sched preempt_schedule return; do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); /* @@ -3843,11 +4012,11 @@ asmlinkage void __sched preempt_schedule saved_lock_depth = task->lock_depth; task->lock_depth = -1; #endif - schedule(); + __schedule(); #ifdef CONFIG_PREEMPT_BKL task->lock_depth = saved_lock_depth; #endif - sub_preempt_count(PREEMPT_ACTIVE); + local_irq_enable(); /* * Check again in case we missed a preemption opportunity @@ -3859,10 +4028,10 @@ asmlinkage void __sched preempt_schedule EXPORT_SYMBOL(preempt_schedule); /* - * this is the entry point to schedule() from kernel preemption - * off of irq context. - * Note, that this is called and return with irqs disabled. This will - * protect us against recursive calling from irq. + * this is is the entry point for the IRQ return path. Called with + * interrupts disabled. To avoid infinite irq-entry recursion problems + * with fast-paced IRQ sources we do all of this carefully to never + * enable interrupts again. */ asmlinkage void __sched preempt_schedule_irq(void) { @@ -3871,10 +4040,18 @@ asmlinkage void __sched preempt_schedule struct task_struct *task = current; int saved_lock_depth; #endif - /* Catch callers which need to be fixed */ - WARN_ON_ONCE(ti->preempt_count || !irqs_disabled()); + + if (!kernel_preemption) + return; + /* + * If there is a non-zero preempt_count then just return. 
+ * (interrupts are disabled) + */ + if (unlikely(ti->preempt_count)) + return; do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); /* @@ -3886,13 +4063,12 @@ asmlinkage void __sched preempt_schedule saved_lock_depth = task->lock_depth; task->lock_depth = -1; #endif - local_irq_enable(); - schedule(); + __schedule(); + local_irq_disable(); #ifdef CONFIG_PREEMPT_BKL task->lock_depth = saved_lock_depth; #endif - sub_preempt_count(PREEMPT_ACTIVE); /* * Check again in case we missed a preemption opportunity @@ -4156,7 +4332,7 @@ EXPORT_SYMBOL(sleep_on_timeout); void rt_mutex_setprio(struct task_struct *p, int prio) { unsigned long flags; - int oldprio, on_rq, running; + int oldprio, prev_resched, on_rq, running; struct rq *rq; const struct sched_class *prev_class = p->sched_class; @@ -4180,12 +4356,17 @@ void rt_mutex_setprio(struct task_struct p->prio = prio; +// trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p)); + prev_resched = _need_resched(); + if (running) p->sched_class->set_curr_task(rq); if (on_rq) { enqueue_task(rq, p, 0); check_class_changed(rq, p, prev_class, oldprio, running); } +// trace_special(prev_resched, _need_resched(), 0); + task_rq_unlock(rq, &flags); } @@ -4777,14 +4958,17 @@ asmlinkage long sys_sched_yield(void) */ spin_unlock_no_resched(&rq->lock); - schedule(); + __schedule(); + + local_irq_enable(); + preempt_check_resched(); return 0; } static void __cond_resched(void) { -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) __might_sleep(__FILE__, __LINE__); #endif /* @@ -4793,10 +4977,11 @@ static void __cond_resched(void) * cond_resched() call. */ do { + local_irq_disable(); add_preempt_count(PREEMPT_ACTIVE); - schedule(); - sub_preempt_count(PREEMPT_ACTIVE); + __schedule(); } while (need_resched()); + local_irq_enable(); } int __sched cond_resched(void) @@ -4822,7 +5007,7 @@ int __cond_resched_raw_spinlock(raw_spin { int ret = 0; - if (need_lockbreak(lock)) { + if (need_lockbreak_raw(lock)) { spin_unlock(lock); cpu_relax(); ret = 1; @@ -4838,6 +5023,25 @@ int __cond_resched_raw_spinlock(raw_spin } EXPORT_SYMBOL(__cond_resched_raw_spinlock); +#ifdef CONFIG_PREEMPT_RT + +int __cond_resched_spinlock(spinlock_t *lock) +{ +#if (defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)) || defined(CONFIG_PREEMPT_RT) + if (lock->break_lock) { + lock->break_lock = 0; + spin_unlock_no_resched(lock); + __cond_resched(); + spin_lock(lock); + return 1; + } +#endif + return 0; +} +EXPORT_SYMBOL(__cond_resched_spinlock); + +#endif + /* * Voluntarily preempt a process context that has softirqs disabled: */ @@ -4884,11 +5088,15 @@ int cond_resched_hardirq_context(void) WARN_ON_ONCE(!irqs_disabled()); if (hardirq_need_resched()) { +#ifndef CONFIG_PREEMPT_RT irq_exit(); +#endif local_irq_enable(); __cond_resched(); +#ifndef CONFIG_PREEMPT_RT local_irq_disable(); __irq_enter(); +#endif return 1; } @@ -4896,17 +5104,58 @@ int cond_resched_hardirq_context(void) } EXPORT_SYMBOL(cond_resched_hardirq_context); +#ifdef CONFIG_PREEMPT_VOLUNTARY + +int voluntary_preemption = 1; + +EXPORT_SYMBOL(voluntary_preemption); + +static int __init voluntary_preempt_setup (char *str) +{ + if (!strncmp(str, "off", 3)) + voluntary_preemption = 0; + else + get_option(&str, &voluntary_preemption); + if (!voluntary_preemption) + printk("turning off voluntary preemption!\n"); + + return 1; +} + +__setup("voluntary-preempt=", voluntary_preempt_setup); + +#endif + /** * yield - yield the current processor to other threads. 
* * This is a shortcut for kernel-space yielding - it marks the * thread runnable and calls sys_sched_yield(). */ -void __sched yield(void) +void __sched __yield(void) { set_current_state(TASK_RUNNING); sys_sched_yield(); } + +void __sched yield(void) +{ + static int once = 1; + + /* + * it's a bug to rely on yield() with RT priorities. We print + * the first occurance after bootup ... this will still give + * us an idea about the scope of the problem, without spamming + * the syslog: + */ + if (once && rt_task(current)) { + once = 0; + printk(KERN_ERR "BUG: %s:%d RT task yield()-ing!\n", + current->comm, current->pid); + dump_stack(); + } + __yield(); +} EXPORT_SYMBOL(yield); /* @@ -5089,6 +5338,7 @@ static void show_task(struct task_struct void show_state_filter(unsigned long state_filter) { struct task_struct *g, *p; + int do_unlock = 1; #if BITS_PER_LONG == 32 printk(KERN_INFO @@ -5097,7 +5347,16 @@ void show_state_filter(unsigned long sta printk(KERN_INFO " task PC stack pid father\n"); #endif +#ifdef CONFIG_PREEMPT_RT + if (!read_trylock(&tasklist_lock)) { + printk("hm, tasklist_lock write-locked.\n"); + printk("ignoring ...\n"); + do_unlock = 0; + } +#else read_lock(&tasklist_lock); +#endif + do_each_thread(g, p) { /* * reset the NMI-timeout, listing all files on a slow @@ -5113,7 +5372,8 @@ void show_state_filter(unsigned long sta #ifdef CONFIG_SCHED_DEBUG sysrq_sched_debug_show(); #endif - read_unlock(&tasklist_lock); + if (do_unlock) + read_unlock(&tasklist_lock); /* * Only show locks if all tasks are dumped: */ @@ -5154,7 +5414,9 @@ void __cpuinit init_idle(struct task_str spin_unlock_irqrestore(&rq->lock, flags); /* Set the preempt count _outside_ the spinlocks! */ -#if defined(CONFIG_PREEMPT) && !defined(CONFIG_PREEMPT_BKL) +#if defined(CONFIG_PREEMPT) && \ + !defined(CONFIG_PREEMPT_BKL) && \ + !defined(CONFIG_PREEMPT_RT) task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0); #else task_thread_info(idle)->preempt_count = 0; @@ -5279,11 +5541,18 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed); static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu) { struct rq *rq_dest, *rq_src; + unsigned long flags; int ret = 0, on_rq; if (unlikely(cpu_is_offline(dest_cpu))) return ret; + /* + * PREEMPT_RT: this relies on write_lock_irq(&tasklist_lock) + * disabling interrupts - which on PREEMPT_RT does not do: + */ + local_irq_save(flags); + rq_src = cpu_rq(src_cpu); rq_dest = cpu_rq(dest_cpu); @@ -5307,6 +5576,8 @@ static int __migrate_task(struct task_st ret = 1; out: double_rq_unlock(rq_src, rq_dest); + local_irq_restore(flags); + return ret; } @@ -7100,6 +7371,9 @@ void __init sched_init(void) atomic_inc(&init_mm.mm_count); enter_lazy_tlb(&init_mm, current); +#ifdef CONFIG_PREEMPT_RT + printk("Real-Time Preemption Support (C) 2004-2007 Ingo Molnar\n"); +#endif /* * Make us the idle thread. 
Technically, schedule() should not be * called from this thread, however somewhere below it might be, @@ -7121,13 +7395,16 @@ void __might_sleep(char *file, int line) if ((in_atomic() || irqs_disabled()) && system_state == SYSTEM_RUNNING && !oops_in_progress) { + if (debug_direct_keyboard && hardirq_count()) + return; if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) return; prev_jiffy = jiffies; printk(KERN_ERR "BUG: sleeping function called from invalid" - " context at %s:%d\n", file, line); - printk("in_atomic():%d, irqs_disabled():%d\n", - in_atomic(), irqs_disabled()); + " context %s(%d) at %s:%d\n", + current->comm, current->pid, file, line); + printk("in_atomic():%d [%08x], irqs_disabled():%d\n", + in_atomic(), preempt_count(), irqs_disabled()); debug_show_held_locks(current); if (irqs_disabled()) print_irqtrace_events(current); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -115,6 +115,48 @@ static inline void dec_rt_tasks(struct t #endif /* CONFIG_SMP */ } +static inline void incr_rt_nr_uninterruptible(struct task_struct *p, + struct rq *rq) +{ + rq->rt.rt_nr_uninterruptible++; +} + +static inline void decr_rt_nr_uninterruptible(struct task_struct *p, + struct rq *rq) +{ + rq->rt.rt_nr_uninterruptible--; +} + +unsigned long rt_nr_running(void) +{ + unsigned long i, sum = 0; + + for_each_online_cpu(i) + sum += cpu_rq(i)->rt.rt_nr_running; + + return sum; +} + +unsigned long rt_nr_running_cpu(int cpu) +{ + return cpu_rq(cpu)->rt.rt_nr_running; +} + +unsigned long rt_nr_uninterruptible(void) +{ + unsigned long i, sum = 0; + + for_each_online_cpu(i) + sum += cpu_rq(i)->rt.rt_nr_uninterruptible; + + return sum; +} + +unsigned long rt_nr_uninterruptible_cpu(int cpu) +{ + return cpu_rq(cpu)->rt.rt_nr_uninterruptible; +} + static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup) { struct rt_prio_array *array = &rq->rt.active; @@ -122,6 +164,9 @@ static void enqueue_task_rt(struct rq *r list_add_tail(&p->run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); inc_rt_tasks(p, rq); + + if (p->state == TASK_UNINTERRUPTIBLE) + decr_rt_nr_uninterruptible(p, rq); } /* @@ -133,6 +178,9 @@ static void dequeue_task_rt(struct rq *r update_curr_rt(rq); + if (p->state == TASK_UNINTERRUPTIBLE) + incr_rt_nr_uninterruptible(p, rq); + list_del(&p->run_list); if (list_empty(array->queue + p->prio)) __clear_bit(p->prio, array->bitmap); @@ -500,6 +548,8 @@ static int push_rt_task(struct rq *rq) resched_task(lowest_rq->curr); + schedstat_inc(rq, rto_pushed); + spin_unlock(&lowest_rq->lock); ret = 1; @@ -606,6 +656,7 @@ static int pull_rt_task(struct rq *this_ */ next = p; + schedstat_inc(src_rq, rto_pulled); } out: spin_unlock(&src_rq->lock); @@ -617,8 +668,10 @@ static int pull_rt_task(struct rq *this_ static void pre_schedule_rt(struct rq *rq, struct task_struct *prev) { /* Try to pull RT tasks here if we lower this rq's prio */ - if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio) + if (unlikely(rt_task(prev)) && rq->rt.highest_prio > prev->prio) { pull_rt_task(rq); + schedstat_inc(rq, rto_schedule); + } } static void post_schedule_rt(struct rq *rq) @@ -633,6 +686,7 @@ static void post_schedule_rt(struct rq * if (unlikely(rq->rt.overloaded)) { spin_lock_irq(&rq->lock); push_rt_tasks(rq); + schedstat_inc(rq, rto_schedule_tail); spin_unlock_irq(&rq->lock); } } @@ -642,8 +696,10 @@ static void 
task_wake_up_rt(struct rq *r { if (!task_running(rq, p) && (p->prio >= rq->rt.highest_prio) && - rq->rt.overloaded) + rq->rt.overloaded) { push_rt_tasks(rq); + schedstat_inc(rq, rto_wakeup); + } } static unsigned long �������������������������patches/preempt-realtime-mmdrop-delayed.patch�������������������������������������������������������0000664�0000764�0000764�00000016310�11041657731�020441� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/mm_types.h | 3 + include/linux/sched.h | 8 ++ kernel/fork.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched.c | 6 +- 4 files changed, 155 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/mm_types.h =================================================================== --- linux-2.6.24.7.orig/include/linux/mm_types.h +++ linux-2.6.24.7/include/linux/mm_types.h @@ -199,6 +199,9 @@ struct mm_struct { /* Architecture-specific MM context */ mm_context_t context; + /* realtime bits */ + struct list_head delayed_drop; + /* Swap token stuff */ /* * Last value of global fault stamp as seen by this process. Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1789,12 +1789,20 @@ extern struct mm_struct * mm_alloc(void) /* mmdrop drops the mm and the page tables */ extern void FASTCALL(__mmdrop(struct mm_struct *)); +extern void FASTCALL(__mmdrop_delayed(struct mm_struct *)); + static inline void mmdrop(struct mm_struct * mm) { if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } +static inline void mmdrop_delayed(struct mm_struct * mm) +{ + if (atomic_dec_and_test(&mm->mm_count)) + __mmdrop_delayed(mm); +} + /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); /* Grab a reference to a task's mm, if it is not already going away */ Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -34,6 +34,7 @@ #include <linux/swap.h> #include <linux/syscalls.h> #include <linux/jiffies.h> +#include <linux/interrupt.h> #include <linux/futex.h> #include <linux/task_io_accounting_ops.h> #include <linux/rcupdate.h> @@ -41,6 +42,8 @@ #include <linux/mount.h> #include <linux/audit.h> #include <linux/profile.h> +#include <linux/kthread.h> +#include <linux/notifier.h> #include <linux/rmap.h> #include <linux/acct.h> #include <linux/tsacct_kern.h> @@ -71,6 +74,15 @@ DEFINE_PER_CPU(unsigned long, process_co __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +/* + * Delayed mmdrop. In the PREEMPT_RT case we + * dont want to do this from the scheduling + * context. 
+ */ +static DEFINE_PER_CPU(struct task_struct *, desched_task); + +static DEFINE_PER_CPU(struct list_head, delayed_drop_list); + int nr_processes(void) { int cpu; @@ -132,6 +144,8 @@ void __put_task_struct(struct task_struc void __init fork_init(unsigned long mempages) { + int i; + #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR #ifndef ARCH_MIN_TASKALIGN #define ARCH_MIN_TASKALIGN L1_CACHE_BYTES @@ -159,6 +173,9 @@ void __init fork_init(unsigned long memp init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2; init_task.signal->rlim[RLIMIT_SIGPENDING] = init_task.signal->rlim[RLIMIT_NPROC]; + + for (i = 0; i < NR_CPUS; i++) + INIT_LIST_HEAD(&per_cpu(delayed_drop_list, i)); } static struct task_struct *dup_task_struct(struct task_struct *orig) @@ -354,6 +371,7 @@ static struct mm_struct * mm_init(struct spin_lock_init(&mm->page_table_lock); rwlock_init(&mm->ioctx_list_lock); mm->ioctx_list = NULL; + INIT_LIST_HEAD(&mm->delayed_drop); mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; @@ -1312,7 +1330,9 @@ static struct task_struct *copy_process( attach_pid(p, PIDTYPE_PGID, task_pgrp(current)); attach_pid(p, PIDTYPE_SID, task_session(current)); list_add_tail_rcu(&p->tasks, &init_task.tasks); + preempt_disable(); __get_cpu_var(process_counts)++; + preempt_enable(); } attach_pid(p, PIDTYPE_PID, pid); nr_threads++; @@ -1743,3 +1763,122 @@ bad_unshare_cleanup_thread: bad_unshare_out: return err; } + +static int mmdrop_complete(void) +{ + struct list_head *head; + int ret = 0; + + head = &get_cpu_var(delayed_drop_list); + while (!list_empty(head)) { + struct mm_struct *mm = list_entry(head->next, + struct mm_struct, delayed_drop); + list_del(&mm->delayed_drop); + put_cpu_var(delayed_drop_list); + + __mmdrop(mm); + ret = 1; + + head = &get_cpu_var(delayed_drop_list); + } + put_cpu_var(delayed_drop_list); + + return ret; +} + +/* + * We dont want to do complex work from the scheduler, thus + * we delay the work to a per-CPU worker thread: + */ +void fastcall __mmdrop_delayed(struct mm_struct *mm) +{ + struct task_struct *desched_task; + struct list_head *head; + + head = &get_cpu_var(delayed_drop_list); + list_add_tail(&mm->delayed_drop, head); + desched_task = __get_cpu_var(desched_task); + if (desched_task) + wake_up_process(desched_task); + put_cpu_var(delayed_drop_list); +} + +static int desched_thread(void * __bind_cpu) +{ + set_user_nice(current, -10); + current->flags |= PF_NOFREEZE | PF_SOFTIRQ; + + set_current_state(TASK_INTERRUPTIBLE); + + while (!kthread_should_stop()) { + + if (mmdrop_complete()) + continue; + schedule(); + + /* This must be called from time to time on ia64, and is a no-op on other archs. + * Used to be in cpu_idle(), but with the new -rt semantics it can't stay there. 
+ */ + check_pgt_cache(); + + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +static int __devinit cpu_callback(struct notifier_block *nfb, + unsigned long action, + void *hcpu) +{ + int hotcpu = (unsigned long)hcpu; + struct task_struct *p; + + switch (action) { + case CPU_UP_PREPARE: + + BUG_ON(per_cpu(desched_task, hotcpu)); + INIT_LIST_HEAD(&per_cpu(delayed_drop_list, hotcpu)); + p = kthread_create(desched_thread, hcpu, "desched/%d", hotcpu); + if (IS_ERR(p)) { + printk("desched_thread for %i failed\n", hotcpu); + return NOTIFY_BAD; + } + per_cpu(desched_task, hotcpu) = p; + kthread_bind(p, hotcpu); + break; + case CPU_ONLINE: + + wake_up_process(per_cpu(desched_task, hotcpu)); + break; +#ifdef CONFIG_HOTPLUG_CPU + case CPU_UP_CANCELED: + + /* Unbind so it can run. Fall thru. */ + kthread_bind(per_cpu(desched_task, hotcpu), smp_processor_id()); + case CPU_DEAD: + + p = per_cpu(desched_task, hotcpu); + per_cpu(desched_task, hotcpu) = NULL; + kthread_stop(p); + takeover_tasklets(hotcpu); + break; +#endif /* CONFIG_HOTPLUG_CPU */ + } + return NOTIFY_OK; +} + +static struct notifier_block __devinitdata cpu_nfb = { + .notifier_call = cpu_callback +}; + +__init int spawn_desched_task(void) +{ + void *cpu = (void *)(long)smp_processor_id(); + + cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu); + cpu_callback(&cpu_nfb, CPU_ONLINE, cpu); + register_cpu_notifier(&cpu_nfb); + return 0; +} + Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -2047,8 +2047,12 @@ static void finish_task_switch(struct rq #endif fire_sched_in_preempt_notifiers(current); + /* + * Delay the final freeing of the mm or task, so that we dont have + * to do complex work from within the scheduler: + */ if (mm) - mmdrop(mm); + mmdrop_delayed(mm); if (unlikely(prev_state == TASK_DEAD)) { /* * Remove function-return probe instances associated with this ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-sched-i386.patch�����������������������������������������������������������0000664�0000764�0000764�00000003601�11041657734�017315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/entry_32.S | 11 +++++++---- arch/x86/kernel/process_32.c | 4 +++- 2 files changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/entry_32.S =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/entry_32.S +++ linux-2.6.24.7/arch/x86/kernel/entry_32.S @@ -265,14 +265,18 @@ END(ret_from_exception) #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) DISABLE_INTERRUPTS(CLBR_ANY) + cmpl $0, kernel_preemption + jz restore_nocheck cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? jnz restore_nocheck need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? 
testb $_TIF_NEED_RESCHED, %cl - jz restore_all + jz restore_nocheck testl $IF_MASK,PT_EFLAGS(%esp) # interrupts off (exception path) ? - jz restore_all + jz restore_nocheck + DISABLE_INTERRUPTS(CLBR_ANY) + call preempt_schedule_irq jmp need_resched END(resume_kernel) @@ -484,12 +488,11 @@ work_pending: testl $(_TIF_NEED_RESCHED|_TIF_NEED_RESCHED_DELAYED), %ecx jz work_notifysig work_resched: - call schedule LOCKDEP_SYS_EXIT DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt + call __schedule # setting need_resched or sigpending # between sampling and the iret - TRACE_IRQS_OFF movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -201,10 +201,12 @@ void cpu_idle(void) idle(); start_critical_timings(); } + local_irq_disable(); tick_nohz_restart_sched_tick(); __preempt_enable_no_resched(); - schedule(); + __schedule(); preempt_disable(); + local_irq_enable(); } } �������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-prevent-idle-boosting.patch������������������������������������������������0000664�0000764�0000764�00000003421�11041657731�021755� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Premmpt-RT: Preevent boosting of idle task Idle task boosting is a nono in general. There is one exception, when NOHZ is active: The idle task calls get_next_timer_interrupt() and holds the timer wheel base->lock on the CPU and another CPU wants to access the timer (probably to cancel it). We can safely ignore the boosting request, as the idle CPU runs this code with interrupts disabled and will complete the lock protected section without being interrupted. So there is no real need to boost. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -4343,6 +4343,25 @@ void rt_mutex_setprio(struct task_struct BUG_ON(prio < 0 || prio > MAX_PRIO); rq = task_rq_lock(p, &flags); + + /* + * Idle task boosting is a nono in general. There is one + * exception, when NOHZ is active: + * + * The idle task calls get_next_timer_interrupt() and holds + * the timer wheel base->lock on the CPU and another CPU wants + * to access the timer (probably to cancel it). We can safely + * ignore the boosting request, as the idle CPU runs this code + * with interrupts disabled and will complete the lock + * protected section without being interrupted. So there is no + * real need to boost. 
+ */ + if (unlikely(p == rq->idle)) { + WARN_ON(p != rq->curr); + WARN_ON(p->pi_blocked_on); + goto out_unlock; + } + update_rq_clock(rq); oldprio = p->prio; @@ -4371,6 +4390,7 @@ void rt_mutex_setprio(struct task_struct } // trace_special(prev_resched, _need_resched(), 0); +out_unlock: task_rq_unlock(rq, &flags); } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/schedule-tail-balance-disable-irqs.patch����������������������������������������������������0000664�0000764�0000764�00000001157�11041657733�020754� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/sched_rt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -684,10 +684,10 @@ static void post_schedule_rt(struct rq * * first via finish_lock_switch and then reaquire it here. */ if (unlikely(rq->rt.overloaded)) { - spin_lock_irq(&rq->lock); + spin_lock(&rq->lock); push_rt_tasks(rq); schedstat_inc(rq, rto_schedule_tail); - spin_unlock_irq(&rq->lock); + spin_unlock(&rq->lock); } } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-sched-cpupri.patch���������������������������������������������������������0000664�0000764�0000764�00000000705�11041657732�020126� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/sched_cpupri.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_cpupri.h =================================================================== --- linux-2.6.24.7.orig/kernel/sched_cpupri.h +++ linux-2.6.24.7/kernel/sched_cpupri.h @@ -12,7 +12,7 @@ /* values 2-101 are RT priorities 0-99 */ struct cpupri_vec { - spinlock_t lock; + raw_spinlock_t lock; int count; cpumask_t mask; }; �����������������������������������������������������������patches/preempt-realtime-core.patch�����������������������������������������������������������������0000664�0000764�0000764�00000112231�11041673205�016457� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/completion.h | 1 include/linux/hardirq.h | 42 ++++++------ include/linux/kernel.h | 15 ++++ include/linux/profile.h | 12 ++- include/linux/radix-tree.h | 13 +++ include/linux/smp.h | 11 +++ include/linux/smp_lock.h | 4 - include/linux/workqueue.h | 3 kernel/Kconfig.preempt | 148 +++++++++++++++++++++++++++++---------------- kernel/exit.c | 20 ++++-- kernel/fork.c | 26 +++++++ kernel/futex.c | 10 ++- kernel/notifier.c | 4 - kernel/signal.c | 9 ++ kernel/softirq.c | 14 +++- kernel/stop_machine.c | 4 - kernel/sys.c | 1 kernel/user.c | 4 - kernel/workqueue.c | 52 ++++++++++++++- lib/Kconfig.debug | 4 - lib/Makefile | 3 lib/kernel_lock.c | 27 +++++--- lib/locking-selftest.c | 29 +++++--- lib/radix-tree.c | 7 +- lib/smp_processor_id.c | 2 25 files changed, 343 insertions(+), 122 deletions(-) Index: linux-2.6.24.7/include/linux/completion.h =================================================================== --- linux-2.6.24.7.orig/include/linux/completion.h +++ linux-2.6.24.7/include/linux/completion.h @@ -48,6 +48,7 @@ extern unsigned long wait_for_completion unsigned long timeout); extern unsigned long wait_for_completion_interruptible_timeout( struct completion *x, unsigned long timeout); +extern unsigned int completion_done(struct completion *x); extern void complete(struct completion *); extern void complete_all(struct completion *); Index: linux-2.6.24.7/include/linux/hardirq.h =================================================================== --- linux-2.6.24.7.orig/include/linux/hardirq.h +++ linux-2.6.24.7/include/linux/hardirq.h @@ -41,23 +41,25 @@ # error HARDIRQ_BITS is too low! #endif #endif +#define PREEMPT_ACTIVE_BITS 1 -#define PREEMPT_SHIFT 0 -#define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) -#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) - -#define __IRQ_MASK(x) ((1UL << (x))-1) - -#define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) -#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) -#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) - -#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) -#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) -#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) +#define PREEMPT_SHIFT 0 +#define SOFTIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS) +#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS) +#define PREEMPT_ACTIVE_SHIFT (HARDIRQ_SHIFT + HARDIRQ_BITS) + +#define __IRQ_MASK(x) ((1UL << (x))-1) + +#define PREEMPT_MASK (__IRQ_MASK(PREEMPT_BITS) << PREEMPT_SHIFT) +#define SOFTIRQ_MASK (__IRQ_MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT) +#define HARDIRQ_MASK (__IRQ_MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT) + +#define PREEMPT_OFFSET (1UL << PREEMPT_SHIFT) +#define SOFTIRQ_OFFSET (1UL << SOFTIRQ_SHIFT) +#define HARDIRQ_OFFSET (1UL << HARDIRQ_SHIFT) #if PREEMPT_ACTIVE < (1 << (HARDIRQ_SHIFT + HARDIRQ_BITS)) -#error PREEMPT_ACTIVE is too low! +# error PREEMPT_ACTIVE is too low! #endif #define hardirq_count() (preempt_count() & HARDIRQ_MASK) @@ -68,11 +70,13 @@ * Are we doing bottom half or hardware interrupt processing? * Are we in a softirq context? Interrupt context? 
*/ -#define in_irq() (hardirq_count()) -#define in_softirq() (softirq_count()) -#define in_interrupt() (irq_count()) - -#if defined(CONFIG_PREEMPT) && !defined(CONFIG_PREEMPT_BKL) +#define in_irq() (hardirq_count() || (current->flags & PF_HARDIRQ)) +#define in_softirq() (softirq_count() || (current->flags & PF_SOFTIRQ)) +#define in_interrupt() (irq_count()) + +#if defined(CONFIG_PREEMPT) && \ + !defined(CONFIG_PREEMPT_BKL) && \ + !defined(CONFIG_PREEMPT_RT) # define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked()) #else # define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0) Index: linux-2.6.24.7/include/linux/kernel.h =================================================================== --- linux-2.6.24.7.orig/include/linux/kernel.h +++ linux-2.6.24.7/include/linux/kernel.h @@ -111,7 +111,7 @@ extern int cond_resched(void); # define might_resched() do { } while (0) #endif -#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP +#if defined(CONFIG_DEBUG_SPINLOCK_SLEEP) || defined(CONFIG_DEBUG_PREEMPT) void __might_sleep(char *file, int line); # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0) @@ -194,6 +194,18 @@ static inline int log_buf_read(int idx) static inline int log_buf_copy(char *dest, int idx, int len) { return 0; } #endif +#ifdef CONFIG_PREEMPT_RT +extern void zap_rt_locks(void); +#else +# define zap_rt_locks() do { } while (0) +#endif + +#ifdef CONFIG_PREEMPT_RT +extern void zap_rt_locks(void); +#else +# define zap_rt_locks() do { } while (0) +#endif + unsigned long int_sqrt(unsigned long); extern int printk_ratelimit(void); @@ -225,6 +237,7 @@ extern void add_taint(unsigned); /* Values used for system_state */ extern enum system_states { SYSTEM_BOOTING, + SYSTEM_BOOTING_SCHEDULER_OK, SYSTEM_RUNNING, SYSTEM_HALT, SYSTEM_POWER_OFF, Index: linux-2.6.24.7/include/linux/profile.h =================================================================== --- linux-2.6.24.7.orig/include/linux/profile.h +++ linux-2.6.24.7/include/linux/profile.h @@ -6,16 +6,18 @@ #include <linux/kernel.h> #include <linux/init.h> #include <linux/cpumask.h> +#include <linux/kernel_stat.h> #include <linux/cache.h> #include <asm/errno.h> extern int prof_on __read_mostly; -#define CPU_PROFILING 1 -#define SCHED_PROFILING 2 -#define SLEEP_PROFILING 3 -#define KVM_PROFILING 4 +#define CPU_PROFILING 1 +#define SCHED_PROFILING 2 +#define SLEEP_PROFILING 3 +#define KVM_PROFILING 4 +#define PREEMPT_PROFILING 5 struct proc_dir_entry; struct pt_regs; @@ -54,6 +56,8 @@ enum profile_type { PROFILE_MUNMAP }; +extern int prof_pid; + #ifdef CONFIG_PROFILING struct task_struct; Index: linux-2.6.24.7/include/linux/radix-tree.h =================================================================== --- linux-2.6.24.7.orig/include/linux/radix-tree.h +++ linux-2.6.24.7/include/linux/radix-tree.h @@ -161,7 +161,18 @@ radix_tree_gang_lookup(struct radix_tree unsigned long first_index, unsigned int max_items); unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); +/* + * On a mutex based kernel we can freely schedule within the radix code: + */ +#ifdef CONFIG_PREEMPT_RT +static inline int radix_tree_preload(gfp_t gfp_mask) +{ + return 0; +} +#else int radix_tree_preload(gfp_t gfp_mask); +#endif + void radix_tree_init(void); void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag); @@ -177,7 +188,9 @@ int radix_tree_tagged(struct radix_tree_ static inline void radix_tree_preload_end(void) { 
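+	/*
+	 * Note: under PREEMPT_RT the matching radix_tree_preload() above is a
+	 * no-op that never disables preemption, so the preempt_enable() below
+	 * is compiled out as well.
+	 */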
+#ifndef CONFIG_PREEMPT_RT preempt_enable(); +#endif } #endif /* _LINUX_RADIX_TREE_H */ Index: linux-2.6.24.7/include/linux/smp.h =================================================================== --- linux-2.6.24.7.orig/include/linux/smp.h +++ linux-2.6.24.7/include/linux/smp.h @@ -33,6 +33,16 @@ extern void smp_send_stop(void); */ extern void smp_send_reschedule(int cpu); +/* + * trigger a reschedule on all other CPUs: + */ +extern void smp_send_reschedule_allbutself(void); + +/* + * trigger a reschedule on all other CPUs: + */ +extern void smp_send_reschedule_allbutself(void); + /* * Prepare machine for booting other CPUs. @@ -98,6 +108,7 @@ static inline int up_smp_call_function(v 0; \ }) static inline void smp_send_reschedule(int cpu) { } +static inline void smp_send_reschedule_allbutself(void) { } #define num_booting_cpus() 1 #define smp_prepare_boot_cpu() do {} while (0) #define smp_call_function_single(cpuid, func, info, retry, wait) \ Index: linux-2.6.24.7/include/linux/smp_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/smp_lock.h +++ linux-2.6.24.7/include/linux/smp_lock.h @@ -17,6 +17,8 @@ extern void __lockfunc __release_kernel_ __release_kernel_lock(); \ } while (0) + + /* * Non-SMP kernels will never block on the kernel lock, * so we are better off returning a constant zero from @@ -44,7 +46,7 @@ extern void __lockfunc unlock_kernel(voi #define lock_kernel() do { } while(0) #define unlock_kernel() do { } while(0) #define release_kernel_lock(task) do { } while(0) -#define reacquire_kernel_lock(task) 0 +#define reacquire_kernel_lock(task) do { } while(0) #define kernel_locked() 1 #endif /* CONFIG_LOCK_KERNEL */ Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -176,6 +176,9 @@ __create_workqueue_key(const char *name, #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1) #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0) +extern void set_workqueue_prio(struct workqueue_struct *wq, int policy, + int rt_priority, int nice); + extern void destroy_workqueue(struct workqueue_struct *wq); extern int FASTCALL(queue_work(struct workqueue_struct *wq, struct work_struct *work)); Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -1,14 +1,13 @@ - choice - prompt "Preemption Model" - default PREEMPT_NONE + prompt "Preemption Mode" + default PREEMPT_RT config PREEMPT_NONE bool "No Forced Preemption (Server)" help - This is the traditional Linux preemption model, geared towards + This is the traditional Linux preemption model geared towards throughput. It will still provide good latencies most of the - time, but there are no guarantees and occasional longer delays + time but there are no guarantees and occasional long delays are possible. Select this option if you are building a kernel for a server or @@ -21,7 +20,7 @@ config PREEMPT_VOLUNTARY help This option reduces the latency of the kernel by adding more "explicit preemption points" to the kernel code. 
These new - preemption points have been selected to reduce the maximum + preemption points have been selected to minimize the maximum latency of rescheduling, providing faster application reactions, at the cost of slightly lower throughput. @@ -33,31 +32,109 @@ config PREEMPT_VOLUNTARY Select this if you are building a kernel for a desktop system. -config PREEMPT +config PREEMPT_DESKTOP bool "Preemptible Kernel (Low-Latency Desktop)" help This option reduces the latency of the kernel by making - all kernel code (that is not executing in a critical section) + all kernel code that is not executing in a critical section preemptible. This allows reaction to interactive events by permitting a low priority process to be preempted involuntarily even if it is in kernel mode executing a system call and would - otherwise not be about to reach a natural preemption point. - This allows applications to run more 'smoothly' even when the - system is under load, at the cost of slightly lower throughput - and a slight runtime overhead to kernel code. + otherwise not about to reach a preemption point. This allows + applications to run more 'smoothly' even when the system is + under load, at the cost of slighly lower throughput and a + slight runtime overhead to kernel code. + + (According to profiles, when this mode is selected then even + during kernel-intense workloads the system is in an immediately + preemptible state more than 50% of the time.) Select this if you are building a kernel for a desktop or embedded system with latency requirements in the milliseconds range. +config PREEMPT_RT + bool "Complete Preemption (Real-Time)" + select PREEMPT_SOFTIRQS + select PREEMPT_HARDIRQS + select PREEMPT_RCU + select RT_MUTEXES + help + This option further reduces the scheduling latency of the + kernel by replacing almost every spinlock used by the kernel + with preemptible mutexes and thus making all but the most + critical kernel code involuntarily preemptible. The remaining + handful of lowlevel non-preemptible codepaths are short and + have a deterministic latency of a couple of tens of + microseconds (depending on the hardware). This also allows + applications to run more 'smoothly' even when the system is + under load, at the cost of lower throughput and runtime + overhead to kernel code. + + (According to profiles, when this mode is selected then even + during kernel-intense workloads the system is in an immediately + preemptible state more than 95% of the time.) + + Select this if you are building a kernel for a desktop, + embedded or real-time system with guaranteed latency + requirements of 100 usecs or lower. + endchoice +config PREEMPT + bool + default y + depends on PREEMPT_DESKTOP || PREEMPT_RT + +config PREEMPT_SOFTIRQS + bool "Thread Softirqs" + default n +# depends on PREEMPT + help + This option reduces the latency of the kernel by 'threading' + soft interrupts. This means that all softirqs will execute + in softirqd's context. While this helps latency, it can also + reduce performance. + + The threading of softirqs can also be controlled via + /proc/sys/kernel/softirq_preemption runtime flag and the + sofirq-preempt=0/1 boot-time option. + + Say N if you are unsure. + +config PREEMPT_HARDIRQS + bool "Thread Hardirqs" + default n + depends on !GENERIC_HARDIRQS_NO__DO_IRQ + select PREEMPT_SOFTIRQS + help + This option reduces the latency of the kernel by 'threading' + hardirqs. This means that all (or selected) hardirqs will run + in their own kernel thread context. 
While this helps latency, + this feature can also reduce performance. + + The threading of hardirqs can also be controlled via the + /proc/sys/kernel/hardirq_preemption runtime flag and the + hardirq-preempt=0/1 boot-time option. Per-irq threading can + be enabled/disable via the /proc/irq/<IRQ>/<handler>/threaded + runtime flags. + + Say N if you are unsure. + +config PREEMPT_BKL + bool + depends on PREEMPT_RT || !SPINLOCK_BKL + default n if !PREEMPT + default y + choice prompt "RCU implementation type:" + default PREEMPT_RCU if PREEMPT_RT default CLASSIC_RCU config CLASSIC_RCU bool "Classic RCU" + depends on !PREEMPT_RT help This option selects the classic RCU implementation that is designed for best read-side performance on non-realtime @@ -91,48 +168,15 @@ config RCU_TRACE Say Y here if you want to enable RCU tracing Say N if you are unsure. -config PREEMPT_SOFTIRQS - bool "Thread Softirqs" - default n -# depends on PREEMPT - help - This option reduces the latency of the kernel by 'threading' - soft interrupts. This means that all softirqs will execute - in softirqd's context. While this helps latency, it can also - reduce performance. - - The threading of softirqs can also be controlled via - /proc/sys/kernel/softirq_preemption runtime flag and the - sofirq-preempt=0/1 boot-time option. - - Say N if you are unsure. - -config PREEMPT_HARDIRQS - bool "Thread Hardirqs" +config SPINLOCK_BKL + bool "Old-Style Big Kernel Lock" + depends on (PREEMPT || SMP) && !PREEMPT_RT default n - depends on !GENERIC_HARDIRQS_NO__DO_IRQ - select PREEMPT_SOFTIRQS - help - This option reduces the latency of the kernel by 'threading' - hardirqs. This means that all (or selected) hardirqs will run - in their own kernel thread context. While this helps latency, - this feature can also reduce performance. - - The threading of hardirqs can also be controlled via the - /proc/sys/kernel/hardirq_preemption runtime flag and the - hardirq-preempt=0/1 boot-time option. Per-irq threading can - be enabled/disable via the /proc/irq/<IRQ>/<handler>/threaded - runtime flags. - - Say N if you are unsure. - -config PREEMPT_BKL - bool "Preempt The Big Kernel Lock" - depends on SMP || PREEMPT - default y help - This option reduces the latency of the kernel by making the - big kernel lock preemptible. + This option increases the latency of the kernel by making the + big kernel lock spinlock-based (which is bad for latency). + However, enable this option if you see any problems to revert + back to the traditional spinlock BKL design. Say Y here if you are building a kernel for a desktop system. Say N if you are unsure. 
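The softirq/hardirq threading switches described in the help texts above can also be flipped at boot with the softirq-preempt=, hardirq-preempt= and preempt= parameters; the __setup() handlers added by this series (preempt_setup(), softirq_preempt_setup(), voluntary_preempt_setup()) all use the same strncmp()/get_option() idiom. A minimal userspace sketch of that parsing pattern follows; parse_flag() and main() are illustrative names only, and atoi() merely stands in for the kernel's get_option():

/*
 * Userspace illustration only -- parse_flag() and main() are made-up
 * names, not part of the patches; the kernel handlers use get_option().
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int parse_flag(const char *str)
{
	if (!strncmp(str, "off", 3))
		return 0;
	if (!strncmp(str, "on", 2))
		return 1;
	return atoi(str);	/* plain numeric fallback */
}

int main(void)
{
	printf("preempt=off -> %d\n", parse_flag("off"));
	printf("preempt=on  -> %d\n", parse_flag("on"));
	printf("preempt=1   -> %d\n", parse_flag("1"));
	return 0;
}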
Index: linux-2.6.24.7/kernel/exit.c =================================================================== --- linux-2.6.24.7.orig/kernel/exit.c +++ linux-2.6.24.7/kernel/exit.c @@ -63,7 +63,9 @@ static void __unhash_process(struct task detach_pid(p, PIDTYPE_SID); list_del_rcu(&p->tasks); + preempt_disable(); __get_cpu_var(process_counts)--; + preempt_enable(); } list_del_rcu(&p->thread_group); remove_parent(p); @@ -585,9 +587,11 @@ static void exit_mm(struct task_struct * task_lock(tsk); tsk->mm = NULL; up_read(&mm->mmap_sem); + preempt_disable(); // FIXME enter_lazy_tlb(mm, current); /* We don't want this task to be frozen prematurely */ clear_freeze_flag(tsk); + preempt_enable(); task_unlock(tsk); mmput(mm); } @@ -1042,15 +1046,18 @@ fastcall NORET_TYPE void do_exit(long co if (tsk->splice_pipe) __free_pipe_info(tsk->splice_pipe); - preempt_disable(); +again: + local_irq_disable(); /* causes final put_task_struct in finish_task_switch(). */ tsk->state = TASK_DEAD; - schedule(); - BUG(); - /* Avoid "noreturn function does return". */ - for (;;) - cpu_relax(); /* For when BUG is null */ + __schedule(); + printk(KERN_ERR "BUG: dead task %s:%d back from the grave!\n", + current->comm, current->pid); + printk(KERN_ERR ".... flags: %08x, count: %d, state: %08lx\n", + current->flags, atomic_read(¤t->usage), current->state); + printk(KERN_ERR ".... trying again ...\n"); + goto again; } EXPORT_SYMBOL_GPL(do_exit); @@ -1553,6 +1560,7 @@ repeat: int ret; list_for_each_entry(p, &tsk->children, sibling) { + BUG_ON(!atomic_read(&p->usage)); ret = eligible_child(pid, options, p); if (!ret) continue; Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -127,10 +127,13 @@ void free_task(struct task_struct *tsk) } EXPORT_SYMBOL(free_task); -void __put_task_struct(struct task_struct *tsk) +#ifdef CONFIG_PREEMPT_RT +void __put_task_struct_cb(struct rcu_head *rhp) { + struct task_struct *tsk = container_of(rhp, struct task_struct, rcu); + + BUG_ON(atomic_read(&tsk->usage)); WARN_ON(!tsk->exit_state); - WARN_ON(atomic_read(&tsk->usage)); WARN_ON(tsk == current); security_task_free(tsk); @@ -142,6 +145,23 @@ void __put_task_struct(struct task_struc free_task(tsk); } +#else + +void __put_task_struct(struct task_struct *tsk) +{ + WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE))); + BUG_ON(atomic_read(&tsk->usage)); + WARN_ON(tsk == current); + + security_task_free(tsk); + free_uid(tsk->user); + put_group_info(tsk->group_info); + + if (!profile_handoff_task(tsk)) + free_task(tsk); +} +#endif + void __init fork_init(unsigned long mempages) { int i; @@ -1264,11 +1284,13 @@ static struct task_struct *copy_process( * to ensure it is on a valid CPU (and if not, just force it back to * parent's CPU). This avoids alot of nasty races. 
*/ + preempt_disable(); p->cpus_allowed = current->cpus_allowed; p->nr_cpus_allowed = current->nr_cpus_allowed; if (unlikely(!cpu_isset(task_cpu(p), p->cpus_allowed) || !cpu_online(task_cpu(p)))) set_task_cpu(p, smp_processor_id()); + preempt_enable(); /* CLONE_PARENT re-uses the old parent */ if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) Index: linux-2.6.24.7/kernel/futex.c =================================================================== --- linux-2.6.24.7.orig/kernel/futex.c +++ linux-2.6.24.7/kernel/futex.c @@ -949,7 +949,7 @@ static int futex_requeue(u32 __user *uad plist_del(&this->list, &hb1->chain); plist_add(&this->list, &hb2->chain); this->lock_ptr = &hb2->lock; -#ifdef CONFIG_DEBUG_PI_LIST +#if defined(CONFIG_DEBUG_PI_LIST) && !defined(CONFIG_PREEMPT_RT) this->list.plist.lock = &hb2->lock; #endif } @@ -1010,7 +1010,7 @@ static inline void __queue_me(struct fut prio = min(current->normal_prio, MAX_RT_PRIO); plist_node_init(&q->list, prio); -#ifdef CONFIG_DEBUG_PI_LIST +#if defined(CONFIG_DEBUG_PI_LIST) && !defined(CONFIG_PREEMPT_RT) q->list.plist.lock = &hb->lock; #endif plist_add(&q->list, &hb->chain); @@ -1300,6 +1300,10 @@ static int futex_wait(u32 __user *uaddr, * q.lock_ptr != 0 is not safe, because of ordering against wakeup. */ if (likely(!plist_node_empty(&q.list))) { + unsigned long nosched_flag = current->flags & PF_NOSCHED; + + current->flags &= ~PF_NOSCHED; + if (!abs_time) schedule(); else { @@ -1322,6 +1326,8 @@ static int futex_wait(u32 __user *uaddr, /* Flag if a timeout occured */ rem = (t.task == NULL); } + + current->flags |= nosched_flag; } __set_current_state(TASK_RUNNING); Index: linux-2.6.24.7/kernel/notifier.c =================================================================== --- linux-2.6.24.7.orig/kernel/notifier.c +++ linux-2.6.24.7/kernel/notifier.c @@ -55,7 +55,7 @@ static int notifier_chain_unregister(str * @returns: notifier_call_chain returns the value returned by the * last notifier function called. */ -static int __kprobes notifier_call_chain(struct notifier_block **nl, +static int __kprobes notrace notifier_call_chain(struct notifier_block **nl, unsigned long val, void *v, int nr_to_call, int *nr_calls) { @@ -193,7 +193,7 @@ int blocking_notifier_chain_register(str * not yet working and interrupts must remain disabled. At * such times we must not call down_write(). */ - if (unlikely(system_state == SYSTEM_BOOTING)) + if (unlikely(system_state < SYSTEM_RUNNING)) return notifier_chain_register(&nh->head, n); down_write(&nh->rwsem); Index: linux-2.6.24.7/kernel/signal.c =================================================================== --- linux-2.6.24.7.orig/kernel/signal.c +++ linux-2.6.24.7/kernel/signal.c @@ -762,8 +762,10 @@ specific_send_sig_info(int sig, struct s { int ret = 0; - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); +#ifdef CONFIG_SMP assert_spin_locked(&t->sighand->siglock); +#endif /* Short-circuit ignored signals. 
*/ if (sig_ignored(t, sig)) @@ -1624,6 +1626,7 @@ static void ptrace_stop(int exit_code, i if (may_ptrace_stop()) { do_notify_parent_cldstop(current, CLD_TRAPPED); read_unlock(&tasklist_lock); + current->flags &= ~PF_NOSCHED; schedule(); } else { /* @@ -1684,6 +1687,7 @@ finish_stop(int stop_count) } do { + current->flags &= ~PF_NOSCHED; schedule(); } while (try_to_freeze()); /* @@ -1795,6 +1799,9 @@ int get_signal_to_deliver(siginfo_t *inf try_to_freeze(); +#ifdef CONFIG_PREEMPT_RT + might_sleep(); +#endif relock: spin_lock_irq(¤t->sighand->siglock); for (;;) { Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -16,6 +16,7 @@ #include <linux/kernel_stat.h> #include <linux/interrupt.h> #include <linux/init.h> +#include <linux/delay.h> #include <linux/mm.h> #include <linux/notifier.h> #include <linux/percpu.h> @@ -120,6 +121,8 @@ static void trigger_softirqs(void) } } +#ifndef CONFIG_PREEMPT_RT + /* * This one is for softirq.c-internal use, * where hardirqs are disabled legitimately: @@ -237,6 +240,8 @@ void local_bh_enable_ip(unsigned long ip } EXPORT_SYMBOL(local_bh_enable_ip); +#endif + /* * We restart softirq processing MAX_SOFTIRQ_RESTART times, * and we fall back to softirqd after that. @@ -647,7 +652,7 @@ void tasklet_kill(struct tasklet_struct while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { do - yield(); + msleep(1); while (test_bit(TASKLET_STATE_SCHED, &t->state)); } tasklet_unlock_wait(t); @@ -910,6 +915,11 @@ int softirq_preemption = 1; EXPORT_SYMBOL(softirq_preemption); +/* + * Real-Time Preemption depends on softirq threading: + */ +#ifndef CONFIG_PREEMPT_RT + static int __init softirq_preempt_setup (char *str) { if (!strncmp(str, "off", 3)) @@ -923,7 +933,7 @@ static int __init softirq_preempt_setup } __setup("softirq-preempt=", softirq_preempt_setup); - +#endif #endif #ifdef CONFIG_SMP Index: linux-2.6.24.7/kernel/stop_machine.c =================================================================== --- linux-2.6.24.7.orig/kernel/stop_machine.c +++ linux-2.6.24.7/kernel/stop_machine.c @@ -63,7 +63,7 @@ static int stopmachine(void *cpu) /* Yield in first stage: migration threads need to * help our sisters onto their CPUs. */ if (!prepared && !irqs_disabled) - yield(); + __yield(); else cpu_relax(); } @@ -109,7 +109,7 @@ static int stop_machine(void) /* Wait for them all to come to life. */ while (atomic_read(&stopmachine_thread_ack) != stopmachine_num_threads) - yield(); + __yield(); /* If some failed, kill them all. 
*/ if (ret < 0) { Index: linux-2.6.24.7/kernel/sys.c =================================================================== --- linux-2.6.24.7.orig/kernel/sys.c +++ linux-2.6.24.7/kernel/sys.c @@ -36,6 +36,7 @@ #include <linux/compat.h> #include <linux/syscalls.h> +#include <linux/rt_lock.h> #include <linux/kprobes.h> #include <linux/user_namespace.h> Index: linux-2.6.24.7/kernel/user.c =================================================================== --- linux-2.6.24.7.orig/kernel/user.c +++ linux-2.6.24.7/kernel/user.c @@ -312,11 +312,11 @@ void free_uid(struct user_struct *up) if (!up) return; - local_irq_save(flags); + local_irq_save_nort(flags); if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) free_user(up, flags); else - local_irq_restore(flags); + local_irq_restore_nort(flags); } struct user_struct * alloc_uid(struct user_namespace *ns, uid_t uid) Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -26,6 +26,7 @@ #include <linux/slab.h> #include <linux/cpu.h> #include <linux/notifier.h> +#include <linux/syscalls.h> #include <linux/kthread.h> #include <linux/hardirq.h> #include <linux/mempolicy.h> @@ -34,6 +35,8 @@ #include <linux/debug_locks.h> #include <linux/lockdep.h> +#include <asm/uaccess.h> + /* * The per-CPU workqueue (if single thread, we always use the first * possible cpu). @@ -161,15 +164,16 @@ static void __queue_work(struct cpu_work * * We queue the work to the CPU it was submitted, but there is no * guarantee that it will be processed by that CPU. + * + * Especially no such guarantee on PREEMPT_RT. */ int fastcall queue_work(struct workqueue_struct *wq, struct work_struct *work) { - int ret = 0; + int ret = 0, cpu = raw_smp_processor_id(); if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) { BUG_ON(!list_empty(&work->entry)); - __queue_work(wq_per_cpu(wq, get_cpu()), work); - put_cpu(); + __queue_work(wq_per_cpu(wq, cpu), work); ret = 1; } return ret; @@ -798,6 +802,47 @@ static void cleanup_workqueue_thread(str cwq->thread = NULL; } +void set_workqueue_thread_prio(struct workqueue_struct *wq, int cpu, + int policy, int rt_priority, int nice) +{ + struct sched_param param = { .sched_priority = rt_priority }; + struct cpu_workqueue_struct *cwq; + mm_segment_t oldfs = get_fs(); + struct task_struct *p; + unsigned long flags; + int ret; + + cwq = per_cpu_ptr(wq->cpu_wq, cpu); + spin_lock_irqsave(&cwq->lock, flags); + p = cwq->thread; + spin_unlock_irqrestore(&cwq->lock, flags); + + set_user_nice(p, nice); + + set_fs(KERNEL_DS); + ret = sys_sched_setscheduler(p->pid, policy, ¶m); + set_fs(oldfs); + + WARN_ON(ret); +} + + void set_workqueue_prio(struct workqueue_struct *wq, int policy, + int rt_priority, int nice) +{ + int cpu; + + /* We don't need the distraction of CPUs appearing and vanishing. 
*/ + mutex_lock(&workqueue_mutex); + if (is_single_threaded(wq)) + set_workqueue_thread_prio(wq, 0, policy, rt_priority, nice); + else { + for_each_online_cpu(cpu) + set_workqueue_thread_prio(wq, cpu, policy, + rt_priority, nice); + } + mutex_unlock(&workqueue_mutex); +} + /** * destroy_workqueue - safely terminate a workqueue * @wq: target workqueue @@ -880,4 +925,5 @@ void __init init_workqueues(void) hotcpu_notifier(workqueue_cpu_callback, 0); keventd_wq = create_workqueue("events"); BUG_ON(!keventd_wq); + set_workqueue_prio(keventd_wq, SCHED_FIFO, 1, -20); } Index: linux-2.6.24.7/lib/Kconfig.debug =================================================================== --- linux-2.6.24.7.orig/lib/Kconfig.debug +++ linux-2.6.24.7/lib/Kconfig.debug @@ -189,6 +189,8 @@ config DEBUG_RT_MUTEXES help This allows rt mutex semantics violations and rt mutex related deadlocks (lockups) to be detected and reported automatically. + When realtime preemption is enabled this includes spinlocks, + rwlocks, mutexes and (rw)semaphores config DEBUG_PI_LIST bool @@ -212,7 +214,7 @@ config DEBUG_SPINLOCK config DEBUG_MUTEXES bool "Mutex debugging: basic checks" - depends on DEBUG_KERNEL + depends on DEBUG_KERNEL && !PREEMPT_RT help This feature allows mutex semantics violations to be detected and reported. Index: linux-2.6.24.7/lib/Makefile =================================================================== --- linux-2.6.24.7.orig/lib/Makefile +++ linux-2.6.24.7/lib/Makefile @@ -36,7 +36,8 @@ obj-$(CONFIG_HAS_IOMEM) += iomap_copy.o obj-$(CONFIG_CHECK_SIGNATURE) += check_signature.o obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o -lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o +obj-$(CONFIG_PREEMPT_RT) += plist.o +obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o lib-$(CONFIG_SEMAPHORE_SLEEPERS) += semaphore-sleepers.o lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o Index: linux-2.6.24.7/lib/kernel_lock.c =================================================================== --- linux-2.6.24.7.orig/lib/kernel_lock.c +++ linux-2.6.24.7/lib/kernel_lock.c @@ -35,22 +35,25 @@ DECLARE_MUTEX(kernel_sem); * about recursion, both due to the down() and due to the enabling of * preemption. schedule() will re-check the preemption flag after * reacquiring the semaphore. + * + * Called with interrupts disabled. 
*/ int __lockfunc __reacquire_kernel_lock(void) { struct task_struct *task = current; int saved_lock_depth = task->lock_depth; + local_irq_enable(); BUG_ON(saved_lock_depth < 0); task->lock_depth = -1; - __preempt_enable_no_resched(); down(&kernel_sem); - preempt_disable(); task->lock_depth = saved_lock_depth; + local_irq_disable(); + return 0; } @@ -67,11 +70,15 @@ void __lockfunc lock_kernel(void) struct task_struct *task = current; int depth = task->lock_depth + 1; - if (likely(!depth)) + if (likely(!depth)) { /* * No recursion worries - we set up lock_depth _after_ */ down(&kernel_sem); +#ifdef CONFIG_DEBUG_RT_MUTEXES + current->last_kernel_lock = __builtin_return_address(0); +#endif + } task->lock_depth = depth; } @@ -82,8 +89,12 @@ void __lockfunc unlock_kernel(void) BUG_ON(task->lock_depth < 0); - if (likely(--task->lock_depth < 0)) + if (likely(--task->lock_depth == -1)) { +#ifdef CONFIG_DEBUG_RT_MUTEXES + current->last_kernel_lock = NULL; +#endif up(&kernel_sem); + } } #else @@ -116,11 +127,9 @@ static __cacheline_aligned_in_smp DEFIN */ int __lockfunc __reacquire_kernel_lock(void) { - while (!_raw_spin_trylock(&kernel_flag)) { - if (test_thread_flag(TIF_NEED_RESCHED)) - return -EAGAIN; - cpu_relax(); - } + local_irq_enable(); + _raw_spin_lock(&kernel_flag); + local_irq_disable(); preempt_disable(); return 0; } Index: linux-2.6.24.7/lib/locking-selftest.c =================================================================== --- linux-2.6.24.7.orig/lib/locking-selftest.c +++ linux-2.6.24.7/lib/locking-selftest.c @@ -158,7 +158,7 @@ static void init_shared_classes(void) local_bh_disable(); \ local_irq_disable(); \ trace_softirq_enter(); \ - WARN_ON(!in_softirq()); + /* FIXME: preemptible softirqs. WARN_ON(!in_softirq()); */ #define SOFTIRQ_EXIT() \ trace_softirq_exit(); \ @@ -550,6 +550,11 @@ GENERATE_TESTCASE(init_held_rsem) #undef E /* + * FIXME: turns these into raw-spinlock tests on -rt + */ +#ifndef CONFIG_PREEMPT_RT + +/* * locking an irq-safe lock with irqs enabled: */ #define E1() \ @@ -890,6 +895,8 @@ GENERATE_PERMUTATIONS_3_EVENTS(irq_read_ #include "locking-selftest-softirq.h" // GENERATE_PERMUTATIONS_3_EVENTS(irq_read_recursion2_soft) +#endif /* !CONFIG_PREEMPT_RT */ + #ifdef CONFIG_DEBUG_LOCK_ALLOC # define I_SPINLOCK(x) lockdep_reset_lock(&lock_##x.dep_map) # define I_RWLOCK(x) lockdep_reset_lock(&rwlock_##x.dep_map) @@ -1004,7 +1011,7 @@ static inline void print_testname(const #define DO_TESTCASE_1(desc, name, nr) \ print_testname(desc"/"#nr); \ - dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ + dotest(name##_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); #define DO_TESTCASE_1B(desc, name, nr) \ @@ -1012,17 +1019,17 @@ static inline void print_testname(const dotest(name##_##nr, FAILURE, LOCKTYPE_RWLOCK); \ printk("\n"); -#define DO_TESTCASE_3(desc, name, nr) \ - print_testname(desc"/"#nr); \ - dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN); \ - dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ +#define DO_TESTCASE_3(desc, name, nr) \ + print_testname(desc"/"#nr); \ + dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN); \ + dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); -#define DO_TESTCASE_3RW(desc, name, nr) \ - print_testname(desc"/"#nr); \ +#define DO_TESTCASE_3RW(desc, name, nr) \ + print_testname(desc"/"#nr); \ dotest(name##_spin_##nr, FAILURE, LOCKTYPE_SPIN|LOCKTYPE_RWLOCK);\ - dotest(name##_wlock_##nr, FAILURE, LOCKTYPE_RWLOCK); \ + dotest(name##_wlock_##nr, FAILURE, 
LOCKTYPE_RWLOCK); \ dotest(name##_rlock_##nr, SUCCESS, LOCKTYPE_RWLOCK); \ printk("\n"); @@ -1053,7 +1060,7 @@ static inline void print_testname(const print_testname(desc); \ dotest(name##_spin, FAILURE, LOCKTYPE_SPIN); \ dotest(name##_wlock, FAILURE, LOCKTYPE_RWLOCK); \ - dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK); \ + dotest(name##_rlock, SUCCESS, LOCKTYPE_RWLOCK); \ dotest(name##_mutex, FAILURE, LOCKTYPE_MUTEX); \ dotest(name##_wsem, FAILURE, LOCKTYPE_RWSEM); \ dotest(name##_rsem, FAILURE, LOCKTYPE_RWSEM); \ @@ -1185,6 +1192,7 @@ void locking_selftest(void) /* * irq-context testcases: */ +#ifndef CONFIG_PREEMPT_RT DO_TESTCASE_2x6("irqs-on + irq-safe-A", irqsafe1); DO_TESTCASE_2x3("sirq-safe-A => hirqs-on", irqsafe2A); DO_TESTCASE_2x6("safe-A + irqs-on", irqsafe2B); @@ -1194,6 +1202,7 @@ void locking_selftest(void) DO_TESTCASE_6x2("irq read-recursion", irq_read_recursion); // DO_TESTCASE_6x2B("irq read-recursion #2", irq_read_recursion2); +#endif if (unexpected_testcase_failures) { printk("-----------------------------------------------------------------\n"); Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -103,12 +103,13 @@ radix_tree_node_alloc(struct radix_tree_ if (ret == NULL && !(gfp_mask & __GFP_WAIT)) { struct radix_tree_preload *rtp; - rtp = &__get_cpu_var(radix_tree_preloads); + rtp = &get_cpu_var(radix_tree_preloads); if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; rtp->nodes[rtp->nr - 1] = NULL; rtp->nr--; } + put_cpu_var(radix_tree_preloads); } BUG_ON(radix_tree_is_indirect_ptr(ret)); return ret; @@ -127,6 +128,8 @@ radix_tree_node_free(struct radix_tree_n call_rcu(&node->rcu_head, radix_tree_node_rcu_free); } +#ifndef CONFIG_PREEMPT_RT + /* * Load up this CPU's radix_tree_node buffer with sufficient objects to * ensure that the addition of a single element in the tree cannot fail. 
On @@ -160,6 +163,8 @@ out: } EXPORT_SYMBOL(radix_tree_preload); +#endif + static inline void tag_set(struct radix_tree_node *node, unsigned int tag, int offset) { Index: linux-2.6.24.7/lib/smp_processor_id.c =================================================================== --- linux-2.6.24.7.orig/lib/smp_processor_id.c +++ linux-2.6.24.7/lib/smp_processor_id.c @@ -42,7 +42,7 @@ unsigned int debug_smp_processor_id(void if (!printk_ratelimit()) goto out_enable; - printk(KERN_ERR "BUG: using smp_processor_id() in preemptible [%08x] code: %s/%d\n", preempt_count(), current->comm, current->pid); + printk(KERN_ERR "BUG: using smp_processor_id() in preemptible [%08x] code: %s/%d\n", preempt_count()-1, current->comm, current->pid); print_symbol("caller is %s\n", (long)__builtin_return_address(0)); dump_stack(); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-fs-block.patch�������������������������������������������������������������0000664�0000764�0000764�00000030455�11041657734�017247� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- block/ll_rw_blk.c | 6 ++-- fs/aio.c | 6 +++- fs/block_dev.c | 34 +++++++++++++++++++++------ fs/dcache.c | 5 ++-- fs/dnotify.c | 2 - fs/exec.c | 8 +++++- fs/file.c | 5 ++-- fs/lockd/svc.c | 8 +----- fs/pipe.c | 12 +++++++++ fs/proc/proc_misc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++ fs/proc/task_mmu.c | 4 ++- fs/xfs/linux-2.6/mrlock.h | 4 +-- fs/xfs/xfs_mount.h | 2 - include/linux/genhd.h | 11 +++++++-- 14 files changed, 132 insertions(+), 31 deletions(-) Index: linux-2.6.24.7/block/ll_rw_blk.c =================================================================== --- linux-2.6.24.7.orig/block/ll_rw_blk.c +++ linux-2.6.24.7/block/ll_rw_blk.c @@ -1548,7 +1548,7 @@ static int ll_merge_requests_fn(struct r */ void blk_plug_device(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); /* * don't plug a stopped queue, it must be paired with blk_start_queue() @@ -1571,7 +1571,7 @@ EXPORT_SYMBOL(blk_plug_device); */ int blk_remove_plug(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) return 0; @@ -1670,7 +1670,7 @@ EXPORT_SYMBOL(blk_unplug); **/ void blk_start_queue(struct request_queue *q) { - WARN_ON(!irqs_disabled()); + WARN_ON_NONRT(!irqs_disabled()); clear_bit(QUEUE_FLAG_STOPPED, &q->queue_flags); Index: linux-2.6.24.7/fs/aio.c =================================================================== --- linux-2.6.24.7.orig/fs/aio.c +++ linux-2.6.24.7/fs/aio.c @@ -582,13 +582,15 @@ static void use_mm(struct mm_struct *mm) tsk->flags |= PF_BORROWED_MM; active_mm = tsk->active_mm; atomic_inc(&mm->mm_count); - tsk->mm = mm; - tsk->active_mm = mm; + local_irq_disable(); // FIXME /* * Note that on UML this *requires* PF_BORROWED_MM to be set, otherwise * it won't work. 
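
The block-layer hunks below switch WARN_ON(!irqs_disabled()) to WARN_ON_NONRT(); the macro itself is defined elsewhere in the -rt queue and is not shown in this file. A hypothetical sketch of how such a wrapper can be built: on PREEMPT_RT these paths legitimately run with interrupts enabled, so the assertion is only meaningful on non-RT configurations.

/* Hypothetical sketch only; the real -rt definition may differ. */
#ifdef CONFIG_PREEMPT_RT
# define WARN_ON_NONRT(condition)       do { } while (0)
#else
# define WARN_ON_NONRT(condition)       WARN_ON(condition)
#endif
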
Update it accordingly if you change it here */ switch_mm(active_mm, mm, tsk); + tsk->mm = mm; + tsk->active_mm = mm; + local_irq_enable(); task_unlock(tsk); mmdrop(active_mm); Index: linux-2.6.24.7/fs/block_dev.c =================================================================== --- linux-2.6.24.7.orig/fs/block_dev.c +++ linux-2.6.24.7/fs/block_dev.c @@ -1225,14 +1225,32 @@ static int __blkdev_get(struct block_dev * For now, block device ->open() routine must _not_ * examine anything in 'inode' argument except ->i_rdev. */ - struct file fake_file = {}; - struct dentry fake_dentry = {}; - fake_file.f_mode = mode; - fake_file.f_flags = flags; - fake_file.f_path.dentry = &fake_dentry; - fake_dentry.d_inode = bdev->bd_inode; - - return do_open(bdev, &fake_file, for_part); + struct file *fake_file; + struct dentry *fake_dentry; + int err = -ENOMEM; + + fake_file = kmalloc(sizeof(*fake_file), GFP_KERNEL); + if (!fake_file) + goto out; + memset(fake_file, 0, sizeof(*fake_file)); + + fake_dentry = kmalloc(sizeof(*fake_dentry), GFP_KERNEL); + if (!fake_dentry) + goto out_free_file; + memset(fake_dentry, 0, sizeof(*fake_dentry)); + + fake_file->f_mode = mode; + fake_file->f_flags = flags; + fake_file->f_path.dentry = fake_dentry; + fake_dentry->d_inode = bdev->bd_inode; + + err = do_open(bdev, fake_file, for_part); + + kfree(fake_dentry); +out_free_file: + kfree(fake_file); +out: + return err; } int blkdev_get(struct block_device *bdev, mode_t mode, unsigned flags) Index: linux-2.6.24.7/fs/dcache.c =================================================================== --- linux-2.6.24.7.orig/fs/dcache.c +++ linux-2.6.24.7/fs/dcache.c @@ -704,8 +704,9 @@ void shrink_dcache_for_umount(struct sup { struct dentry *dentry; - if (down_read_trylock(&sb->s_umount)) - BUG(); +// -rt: this might succeed there ... 
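
The __blkdev_get() hunk above moves two large structures off the kernel stack into kmalloc()ed memory and unwinds with goto labels on failure. A compact sketch of the same allocate-then-unwind shape; struct demo_ctx, demo_alloc_ctx and the 128-byte sizes are illustrative only, and kzalloc() stands in for the patch's kmalloc()+memset() pair:

#include <linux/slab.h>
#include <linux/errno.h>

struct demo_ctx {
        void *file_like;
        void *dentry_like;
};

static int demo_alloc_ctx(struct demo_ctx *ctx)
{
        int err = -ENOMEM;

        ctx->file_like = kzalloc(128, GFP_KERNEL);
        if (!ctx->file_like)
                goto out;

        ctx->dentry_like = kzalloc(128, GFP_KERNEL);
        if (!ctx->dentry_like)
                goto out_free_file;

        return 0;

out_free_file:
        kfree(ctx->file_like);
out:
        return err;
}
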
+// if (down_read_trylock(&sb->s_umount)) +// BUG(); dentry = sb->s_root; sb->s_root = NULL; Index: linux-2.6.24.7/fs/dnotify.c =================================================================== --- linux-2.6.24.7.orig/fs/dnotify.c +++ linux-2.6.24.7/fs/dnotify.c @@ -173,7 +173,7 @@ void dnotify_parent(struct dentry *dentr spin_lock(&dentry->d_lock); parent = dentry->d_parent; - if (parent->d_inode->i_dnotify_mask & event) { + if (unlikely(parent->d_inode->i_dnotify_mask & event)) { dget(parent); spin_unlock(&dentry->d_lock); __inode_dir_notify(parent->d_inode, event); Index: linux-2.6.24.7/fs/exec.c =================================================================== --- linux-2.6.24.7.orig/fs/exec.c +++ linux-2.6.24.7/fs/exec.c @@ -48,6 +48,7 @@ #include <linux/security.h> #include <linux/syscalls.h> #include <linux/rmap.h> +#include <linux/delay.h> #include <linux/tsacct_kern.h> #include <linux/cn_proc.h> #include <linux/audit.h> @@ -721,11 +722,16 @@ static int exec_mmap(struct mm_struct *m } } task_lock(tsk); + + local_irq_disable(); active_mm = tsk->active_mm; + activate_mm(active_mm, mm); tsk->mm = mm; tsk->active_mm = mm; - activate_mm(active_mm, mm); + local_irq_enable(); + task_unlock(tsk); + arch_pick_mmap_layout(mm); if (old_mm) { up_read(&old_mm->mmap_sem); Index: linux-2.6.24.7/fs/file.c =================================================================== --- linux-2.6.24.7.orig/fs/file.c +++ linux-2.6.24.7/fs/file.c @@ -96,14 +96,15 @@ void free_fdtable_rcu(struct rcu_head *r kfree(fdt->open_fds); kfree(fdt); } else { - fddef = &get_cpu_var(fdtable_defer_list); + + fddef = &per_cpu(fdtable_defer_list, raw_smp_processor_id()); + spin_lock(&fddef->lock); fdt->next = fddef->next; fddef->next = fdt; /* vmallocs are handled from the workqueue context */ schedule_work(&fddef->wq); spin_unlock(&fddef->lock); - put_cpu_var(fdtable_defer_list); } } Index: linux-2.6.24.7/fs/lockd/svc.c =================================================================== --- linux-2.6.24.7.orig/fs/lockd/svc.c +++ linux-2.6.24.7/fs/lockd/svc.c @@ -349,16 +349,12 @@ lockd_down(void) * Wait for the lockd process to exit, but since we're holding * the lockd semaphore, we can't wait around forever ... */ - clear_thread_flag(TIF_SIGPENDING); - interruptible_sleep_on_timeout(&lockd_exit, HZ); - if (nlmsvc_pid) { + if (wait_event_interruptible_timeout(lockd_exit, + nlmsvc_pid == 0, HZ) <= 0) { printk(KERN_WARNING "lockd_down: lockd failed to exit, clearing pid\n"); nlmsvc_pid = 0; } - spin_lock_irq(¤t->sighand->siglock); - recalc_sigpending(); - spin_unlock_irq(¤t->sighand->siglock); out: mutex_unlock(&nlmsvc_mutex); } Index: linux-2.6.24.7/fs/pipe.c =================================================================== --- linux-2.6.24.7.orig/fs/pipe.c +++ linux-2.6.24.7/fs/pipe.c @@ -385,8 +385,14 @@ redo: wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); } + /* + * Hack: we turn off atime updates for -RT kernels. + * Who uses them on pipes anyway? + */ +#ifndef CONFIG_PREEMPT_RT if (ret > 0) file_accessed(filp); +#endif return ret; } @@ -558,8 +564,14 @@ out: wake_up_interruptible_sync(&pipe->wait); kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); } + /* + * Hack: we turn off atime updates for -RT kernels. + * Who uses them on pipes anyway? 
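
The lockd_down() hunk above replaces an open-coded interruptible_sleep_on_timeout() plus signal fiddling with a single condition-checked wait. A sketch of that idiom; demo_exit, demo_pid and demo_wait_for_exit are illustrative names:

#include <linux/wait.h>
#include <linux/kernel.h>
#include <linux/param.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_exit);
static int demo_pid;

static void demo_wait_for_exit(void)
{
        /* Returns > 0 only if demo_pid hit zero before the 1s timeout;
         * 0 means timeout, negative means a signal interrupted us. */
        if (wait_event_interruptible_timeout(demo_exit, demo_pid == 0, HZ) <= 0)
                printk(KERN_WARNING "demo: thread failed to exit, giving up\n");
}
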
+ */ +#ifndef CONFIG_PREEMPT_RT if (ret > 0) file_update_time(filp); +#endif return ret; } Index: linux-2.6.24.7/fs/proc/proc_misc.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/proc_misc.c +++ linux-2.6.24.7/fs/proc/proc_misc.c @@ -96,6 +96,27 @@ static int loadavg_read_proc(char *page, return proc_calc_metrics(page, start, off, count, eof, len); } +#ifdef CONFIG_PREEMPT_RT +static int loadavg_rt_read_proc(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + extern unsigned long avenrun_rt[]; + extern unsigned long rt_nr_running(void); + int a, b, c; + int len; + + a = avenrun_rt[0] + (FIXED_1/200); + b = avenrun_rt[1] + (FIXED_1/200); + c = avenrun_rt[2] + (FIXED_1/200); + len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n", + LOAD_INT(a), LOAD_FRAC(a), + LOAD_INT(b), LOAD_FRAC(b), + LOAD_INT(c), LOAD_FRAC(c), + rt_nr_running(), nr_threads, current->nsproxy->pid_ns->last_pid); + return proc_calc_metrics(page, start, off, count, eof, len); +} +#endif + static int uptime_read_proc(char *page, char **start, off_t off, int count, int *eof, void *data) { @@ -555,6 +576,38 @@ static int show_stat(struct seq_file *p, nr_iowait()); kfree(per_irq_sum); +#ifdef CONFIG_PREEMPT_RT + { + unsigned long nr_uninterruptible_cpu(int cpu); + extern int pi_initialized; + unsigned long rt_nr_running(void); + unsigned long rt_nr_running_cpu(int cpu); + unsigned long rt_nr_uninterruptible(void); + unsigned long rt_nr_uninterruptible_cpu(int cpu); + + int i; + + seq_printf(p, "pi_init: %d\n", pi_initialized); + seq_printf(p, "nr_running(): %ld\n", + nr_running()); + seq_printf(p, "nr_uninterruptible(): %ld\n", + nr_uninterruptible()); + for_each_online_cpu(i) + seq_printf(p, "nr_uninterruptible(%d): %ld\n", + i, nr_uninterruptible_cpu(i)); + seq_printf(p, "rt_nr_running(): %ld\n", + rt_nr_running()); + for_each_online_cpu(i) + seq_printf(p, "rt_nr_running(%d): %ld\n", + i, rt_nr_running_cpu(i)); + seq_printf(p, "nr_rt_uninterruptible(): %ld\n", + rt_nr_uninterruptible()); + for_each_online_cpu(i) + seq_printf(p, "nr_rt_uninterruptible(%d): %ld\n", + i, rt_nr_uninterruptible_cpu(i)); + } +#endif + return 0; } @@ -704,6 +757,9 @@ void __init proc_misc_init(void) int (*read_proc)(char*,char**,off_t,int,int*,void*); } *p, simple_ones[] = { {"loadavg", loadavg_read_proc}, +#ifdef CONFIG_PREEMPT_RT + {"loadavgrt", loadavg_rt_read_proc}, +#endif {"uptime", uptime_read_proc}, {"meminfo", meminfo_read_proc}, {"version", version_read_proc}, Index: linux-2.6.24.7/fs/proc/task_mmu.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/task_mmu.c +++ linux-2.6.24.7/fs/proc/task_mmu.c @@ -416,8 +416,10 @@ static void *m_start(struct seq_file *m, vma = NULL; if ((unsigned long)l < mm->map_count) { vma = mm->mmap; - while (l-- && vma) + while (l-- && vma) { vma = vma->vm_next; + cond_resched(); + } goto out; } Index: linux-2.6.24.7/fs/xfs/linux-2.6/mrlock.h =================================================================== --- linux-2.6.24.7.orig/fs/xfs/linux-2.6/mrlock.h +++ linux-2.6.24.7/fs/xfs/linux-2.6/mrlock.h @@ -23,8 +23,8 @@ enum { MR_NONE, MR_ACCESS, MR_UPDATE }; typedef struct { - struct rw_semaphore mr_lock; - int mr_writer; + struct compat_rw_semaphore mr_lock; + int mr_writer; } mrlock_t; #define mrinit(mrp, name) \ Index: linux-2.6.24.7/fs/xfs/xfs_mount.h =================================================================== --- linux-2.6.24.7.orig/fs/xfs/xfs_mount.h +++ 
linux-2.6.24.7/fs/xfs/xfs_mount.h @@ -383,7 +383,7 @@ typedef struct xfs_mount { uint m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */ uint m_in_maxlevels; /* XFS_IN_MAXLEVELS */ struct xfs_perag *m_perag; /* per-ag accounting info */ - struct rw_semaphore m_peraglock; /* lock for m_perag (pointer) */ + struct compat_rw_semaphore m_peraglock; /* lock for m_perag (pointer) */ struct mutex m_growlock; /* growfs mutex */ int m_fixedfsid[2]; /* unchanged for life of FS */ uint m_dmevmask; /* DMI events for this FS */ Index: linux-2.6.24.7/include/linux/genhd.h =================================================================== --- linux-2.6.24.7.orig/include/linux/genhd.h +++ linux-2.6.24.7/include/linux/genhd.h @@ -157,15 +157,22 @@ struct disk_attribute { * variants disable/enable preemption. */ #ifdef CONFIG_SMP -#define __disk_stat_add(gendiskp, field, addnd) \ - (per_cpu_ptr(gendiskp->dkstats, smp_processor_id())->field += addnd) +#define __disk_stat_add(gendiskp, field, addnd) \ +do { \ + preempt_disable(); \ + (per_cpu_ptr(gendiskp->dkstats, \ + smp_processor_id())->field += addnd); \ + preempt_enable(); \ +} while (0) #define disk_stat_read(gendiskp, field) \ ({ \ typeof(gendiskp->dkstats->field) res = 0; \ int i; \ + preempt_disable(); \ for_each_possible_cpu(i) \ res += per_cpu_ptr(gendiskp->dkstats, i)->field; \ + preempt_enable(); \ res; \ }) �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-acpi.patch�����������������������������������������������������������������0000664�0000764�0000764�00000012000�11041657731�016442� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/acpi/ec.c | 12 ++++++++++++ drivers/acpi/hardware/hwregs.c | 12 ++++++------ drivers/acpi/processor_idle.c | 2 +- drivers/acpi/utilities/utmutex.c | 2 +- include/acpi/acglobal.h | 7 ++++++- include/acpi/acpiosxf.h | 2 +- 6 files changed, 27 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/drivers/acpi/ec.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/ec.c +++ linux-2.6.24.7/drivers/acpi/ec.c @@ -531,7 +531,19 @@ static u32 acpi_ec_gpe_handler(void *dat pr_debug(PREFIX "~~~> interrupt\n"); clear_bit(EC_FLAGS_WAIT_GPE, &ec->flags); if (test_bit(EC_FLAGS_GPE_MODE, &ec->flags)) +#if 0 wake_up(&ec->wait); +#else + // hack ... 
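
The genhd.h hunk above brackets the per-CPU disk statistics update with preempt_disable()/preempt_enable(), because smp_processor_id() is only stable while the task cannot migrate. The same shape in isolation; demo_stat and demo_stat_add are illustrative:

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(unsigned long, demo_stat);

static void demo_stat_add(unsigned long addnd)
{
        preempt_disable();
        per_cpu(demo_stat, smp_processor_id()) += addnd;
        preempt_enable();
}
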
+ if (waitqueue_active(&ec->wait)) { + struct task_struct *task; + + task = list_entry(ec->wait.task_list.next, + wait_queue_t, task_list)->private; + if (task) + wake_up_process(task); + } +#endif if (acpi_ec_read_status(ec) & ACPI_EC_FLAG_SCI) { if (!test_and_set_bit(EC_FLAGS_QUERY_PENDING, &ec->flags)) Index: linux-2.6.24.7/drivers/acpi/hardware/hwregs.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/hardware/hwregs.c +++ linux-2.6.24.7/drivers/acpi/hardware/hwregs.c @@ -73,7 +73,7 @@ acpi_status acpi_hw_clear_acpi_status(vo ACPI_BITMASK_ALL_FIXED_STATUS, (u16) acpi_gbl_FADT.xpm1a_event_block.address)); - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); status = acpi_hw_register_write(ACPI_REGISTER_PM1_STATUS, ACPI_BITMASK_ALL_FIXED_STATUS); @@ -97,7 +97,7 @@ acpi_status acpi_hw_clear_acpi_status(vo status = acpi_ev_walk_gpe_list(acpi_hw_clear_gpe_block); unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); return_ACPI_STATUS(status); } @@ -300,9 +300,9 @@ acpi_status acpi_get_register(u32 regist { acpi_status status; acpi_cpu_flags flags; - flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, flags); status = acpi_get_register_unlocked(register_id, return_value); - acpi_os_release_lock(acpi_gbl_hardware_lock, flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, flags); return status; } @@ -339,7 +339,7 @@ acpi_status acpi_set_register(u32 regist return_ACPI_STATUS(AE_BAD_PARAMETER); } - lock_flags = acpi_os_acquire_lock(acpi_gbl_hardware_lock); + spin_lock_irqsave(acpi_gbl_hardware_lock, lock_flags); /* Always do a register read first so we can insert the new bits */ @@ -443,7 +443,7 @@ acpi_status acpi_set_register(u32 regist unlock_and_exit: - acpi_os_release_lock(acpi_gbl_hardware_lock, lock_flags); + spin_unlock_irqrestore(acpi_gbl_hardware_lock, lock_flags); /* Normalize the value that was read */ Index: linux-2.6.24.7/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/processor_idle.c +++ linux-2.6.24.7/drivers/acpi/processor_idle.c @@ -1461,7 +1461,7 @@ static int acpi_idle_enter_simple(struct } static int c3_cpu_count; -static DEFINE_SPINLOCK(c3_lock); +static DEFINE_RAW_SPINLOCK(c3_lock); /** * acpi_idle_enter_bm - enters C3 with proper BM handling Index: linux-2.6.24.7/drivers/acpi/utilities/utmutex.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/utilities/utmutex.c +++ linux-2.6.24.7/drivers/acpi/utilities/utmutex.c @@ -116,7 +116,7 @@ void acpi_ut_mutex_terminate(void) /* Delete the spinlocks */ acpi_os_delete_lock(acpi_gbl_gpe_lock); - acpi_os_delete_lock(acpi_gbl_hardware_lock); +// acpi_os_delete_lock(acpi_gbl_hardware_lock); return_VOID; } Index: linux-2.6.24.7/include/acpi/acglobal.h =================================================================== --- linux-2.6.24.7.orig/include/acpi/acglobal.h +++ linux-2.6.24.7/include/acpi/acglobal.h @@ -184,7 +184,12 @@ ACPI_EXTERN acpi_semaphore acpi_gbl_glob * interrupt level */ ACPI_EXTERN spinlock_t _acpi_gbl_gpe_lock; /* For GPE data structs and registers */ -ACPI_EXTERN spinlock_t _acpi_gbl_hardware_lock; /* For ACPI H/W except GPE registers */ + +/* + * Need to be raw because it might be used in acpi_processor_idle(): + */ 
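
The ACPI hunks above (c3_lock, acpi_gbl_hardware_lock) switch to raw spinlocks because on PREEMPT_RT an ordinary spinlock_t becomes a sleeping rt-mutex, which must not be taken from the idle loop or from code that genuinely needs interrupts hard-disabled. A sketch of that conversion; DEFINE_RAW_SPINLOCK and raw_spinlock_t come from the -rt tree (as used in the hunk), and demo_hw_lock/demo_touch_hw are illustrative:

#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(demo_hw_lock);

static void demo_touch_hw(void)
{
        unsigned long flags;

        /* With a raw lock this really disables interrupts and spins,
         * even on -rt, so it is safe from idle and low-level paths. */
        spin_lock_irqsave(&demo_hw_lock, flags);
        /* ... program hardware registers ... */
        spin_unlock_irqrestore(&demo_hw_lock, flags);
}
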
+ACPI_EXTERN raw_spinlock_t _acpi_gbl_hardware_lock; /* For ACPI H/W except GPE registers */ + #define acpi_gbl_gpe_lock &_acpi_gbl_gpe_lock #define acpi_gbl_hardware_lock &_acpi_gbl_hardware_lock Index: linux-2.6.24.7/include/acpi/acpiosxf.h =================================================================== --- linux-2.6.24.7.orig/include/acpi/acpiosxf.h +++ linux-2.6.24.7/include/acpi/acpiosxf.h @@ -61,7 +61,7 @@ typedef enum { OSL_EC_BURST_HANDLER } acpi_execute_type; -#define ACPI_NO_UNIT_LIMIT ((u32) -1) +#define ACPI_NO_UNIT_LIMIT (INT_MAX/2) #define ACPI_MUTEX_SEM 1 /* Functions for acpi_os_signal */ patches/preempt-realtime-ipc.patch������������������������������������������������������������������0000664�0000764�0000764�00000005734�11041657734�016324� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- ipc/mqueue.c | 5 +++++ ipc/msg.c | 25 +++++++++++++++++++------ ipc/sem.c | 6 ++++++ 3 files changed, 30 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/ipc/mqueue.c =================================================================== --- linux-2.6.24.7.orig/ipc/mqueue.c +++ linux-2.6.24.7/ipc/mqueue.c @@ -779,12 +779,17 @@ static inline void pipelined_send(struct struct msg_msg *message, struct ext_wait_queue *receiver) { + /* + * Keep them in one critical section for PREEMPT_RT: + */ + preempt_disable(); receiver->msg = message; list_del(&receiver->list); receiver->state = STATE_PENDING; wake_up_process(receiver->task); smp_wmb(); receiver->state = STATE_READY; + preempt_enable(); } /* pipelined_receive() - if there is task waiting in sys_mq_timedsend() Index: linux-2.6.24.7/ipc/msg.c =================================================================== --- linux-2.6.24.7.orig/ipc/msg.c +++ linux-2.6.24.7/ipc/msg.c @@ -261,12 +261,19 @@ static void expunge_all(struct msg_queue while (tmp != &msq->q_receivers) { struct msg_receiver *msr; + /* + * Make sure that the wakeup doesnt preempt + * this CPU prematurely. (on PREEMPT_RT) + */ + preempt_disable(); + msr = list_entry(tmp, struct msg_receiver, r_list); tmp = tmp->next; msr->r_msg = NULL; - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = ERR_PTR(res); + + preempt_enable(); } } @@ -637,22 +644,28 @@ static inline int pipelined_send(struct !security_msg_queue_msgrcv(msq, msg, msr->r_tsk, msr->r_msgtype, msr->r_mode)) { + /* + * Make sure that the wakeup doesnt preempt + * this CPU prematurely. 
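
The mqueue/msg hunks around this point all follow one rule: publish the handshake state and call wake_up_process() inside a single preempt-disabled region, so that on PREEMPT_RT the woken, possibly higher-priority task cannot run before the waker has finished. A reduced sketch; demo_complete() and its arguments are illustrative, not the patch's code:

#include <linux/preempt.h>
#include <linux/sched.h>

static void demo_complete(struct task_struct *sleeper, int *status, int result)
{
        preempt_disable();
        *status = result;               /* publish the outcome ...        */
        wake_up_process(sleeper);       /* ... and wake, with no window   */
        preempt_enable();               /* only now may the sleeper run   */
}
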
(on PREEMPT_RT) + */ + preempt_disable(); + list_del(&msr->r_list); if (msr->r_maxsize < msg->m_ts) { msr->r_msg = NULL; - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = ERR_PTR(-E2BIG); } else { msr->r_msg = NULL; msq->q_lrpid = task_pid_vnr(msr->r_tsk); msq->q_rtime = get_seconds(); - wake_up_process(msr->r_tsk); - smp_mb(); + wake_up_process(msr->r_tsk); /* serializes */ msr->r_msg = msg; + preempt_enable(); return 1; } + preempt_enable(); } } return 0; Index: linux-2.6.24.7/ipc/sem.c =================================================================== --- linux-2.6.24.7.orig/ipc/sem.c +++ linux-2.6.24.7/ipc/sem.c @@ -467,6 +467,11 @@ static void update_queue (struct sem_arr if (error <= 0) { struct sem_queue *n; remove_from_queue(sma,q); + /* + * make sure that the wakeup doesnt preempt + * _this_ cpu prematurely. (on preempt_rt) + */ + preempt_disable(); q->status = IN_WAKEUP; /* * Continue scanning. The next operation @@ -489,6 +494,7 @@ static void update_queue (struct sem_arr */ smp_wmb(); q->status = error; + preempt_enable(); q = n; } else { q = q->next; ������������������������������������patches/preempt-realtime-sound.patch����������������������������������������������������������������0000664�0000764�0000764�00000001424�11041657732�016667� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- sound/core/pcm_lib.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/sound/core/pcm_lib.c =================================================================== --- linux-2.6.24.7.orig/sound/core/pcm_lib.c +++ linux-2.6.24.7/sound/core/pcm_lib.c @@ -30,6 +30,7 @@ #include <sound/pcm_params.h> #include <sound/timer.h> +#include <linux/ftrace.h> /* * fill ring buffer with silence * runtime->silence_start: starting pointer to silence area @@ -130,6 +131,7 @@ static void xrun(struct snd_pcm_substrea snd_pcm_stop(substream, SNDRV_PCM_STATE_XRUN); #ifdef CONFIG_SND_PCM_XRUN_DEBUG if (substream->pstr->xrun_debug) { + user_trace_stop(); snd_printd(KERN_DEBUG "XRUN: pcmC%dD%d%c\n", substream->pcm->card->number, substream->pcm->device, ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-mm.patch�������������������������������������������������������������������0000664�0000764�0000764�00000016576�11041657732�016166� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/pagevec.h | 2 +- include/linux/vmstat.h | 10 ++++++++++ mm/bounce.c | 4 ++-- mm/memory.c | 11 +++++++++-- mm/mmap.c | 10 ++++++++-- mm/vmscan.c | 10 ++++++++-- mm/vmstat.c | 38 ++++++++++++++++++++++++++++++++------ 7 files changed, 70 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/include/linux/pagevec.h 
=================================================================== --- linux-2.6.24.7.orig/include/linux/pagevec.h +++ linux-2.6.24.7/include/linux/pagevec.h @@ -9,7 +9,7 @@ #define _LINUX_PAGEVEC_H /* 14 pointers + two long's align the pagevec structure to a power of two */ -#define PAGEVEC_SIZE 14 +#define PAGEVEC_SIZE 8 struct page; struct address_space; Index: linux-2.6.24.7/include/linux/vmstat.h =================================================================== --- linux-2.6.24.7.orig/include/linux/vmstat.h +++ linux-2.6.24.7/include/linux/vmstat.h @@ -59,7 +59,12 @@ DECLARE_PER_CPU(struct vm_event_state, v static inline void __count_vm_event(enum vm_event_item item) { +#ifdef CONFIG_PREEMPT_RT + get_cpu_var(vm_event_states).event[item]++; + put_cpu(); +#else __get_cpu_var(vm_event_states).event[item]++; +#endif } static inline void count_vm_event(enum vm_event_item item) @@ -70,7 +75,12 @@ static inline void count_vm_event(enum v static inline void __count_vm_events(enum vm_event_item item, long delta) { +#ifdef CONFIG_PREEMPT_RT + get_cpu_var(vm_event_states).event[item] += delta; + put_cpu(); +#else __get_cpu_var(vm_event_states).event[item] += delta; +#endif } static inline void count_vm_events(enum vm_event_item item, long delta) Index: linux-2.6.24.7/mm/bounce.c =================================================================== --- linux-2.6.24.7.orig/mm/bounce.c +++ linux-2.6.24.7/mm/bounce.c @@ -48,11 +48,11 @@ static void bounce_copy_vec(struct bio_v unsigned long flags; unsigned char *vto; - local_irq_save(flags); + local_irq_save_nort(flags); vto = kmap_atomic(to->bv_page, KM_BOUNCE_READ); memcpy(vto + to->bv_offset, vfrom, to->bv_len); kunmap_atomic(vto, KM_BOUNCE_READ); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #else /* CONFIG_HIGHMEM */ Index: linux-2.6.24.7/mm/memory.c =================================================================== --- linux-2.6.24.7.orig/mm/memory.c +++ linux-2.6.24.7/mm/memory.c @@ -278,7 +278,9 @@ void free_pgtables(struct mmu_gather **t if (!vma) /* Sometimes when exiting after an oops */ return; +#ifndef CONFIG_PREEMPT_RT if (vma->vm_next) +#endif tlb_finish_mmu(*tlb, tlb_start_addr(*tlb), tlb_end_addr(*tlb)); /* * Hide vma from rmap and vmtruncate before freeeing pgtables, @@ -289,7 +291,9 @@ void free_pgtables(struct mmu_gather **t unlink_file_vma(unlink); unlink = unlink->vm_next; } +#ifndef CONFIG_PREEMPT_RT if (vma->vm_next) +#endif *tlb = tlb_gather_mmu(vma->vm_mm, fullmm); #endif while (vma) { @@ -804,10 +808,13 @@ static unsigned long unmap_page_range(st return addr; } -#ifdef CONFIG_PREEMPT +#if defined(CONFIG_PREEMPT) && !defined(CONFIG_PREEMPT_RT) # define ZAP_BLOCK_SIZE (8 * PAGE_SIZE) #else -/* No preempt: go for improved straight-line efficiency */ +/* + * No preempt: go for improved straight-line efficiency + * on PREEMPT_RT this is not a critical latency-path. 
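
bounce_copy_vec() above (and several later hunks) use the local_irq_save_nort()/local_irq_restore_nort() wrappers, whose definitions live elsewhere in the -rt queue. A hypothetical sketch of the idea, an assumption rather than the patch's code: on PREEMPT_RT the section is already safe without hard-disabling interrupts, so the wrappers degrade to near no-ops there.

/* Hypothetical sketch; the real -rt definitions may differ. */
#ifdef CONFIG_PREEMPT_RT
# define local_irq_save_nort(flags)     local_save_flags(flags)
# define local_irq_restore_nort(flags)  do { (void)(flags); } while (0)
#else
# define local_irq_save_nort(flags)     local_irq_save(flags)
# define local_irq_restore_nort(flags)  local_irq_restore(flags)
#endif
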
+ */ # define ZAP_BLOCK_SIZE (1024 * PAGE_SIZE) #endif Index: linux-2.6.24.7/mm/mmap.c =================================================================== --- linux-2.6.24.7.orig/mm/mmap.c +++ linux-2.6.24.7/mm/mmap.c @@ -1910,10 +1910,16 @@ asmlinkage long sys_munmap(unsigned long static inline void verify_mm_writelocked(struct mm_struct *mm) { #ifdef CONFIG_DEBUG_VM - if (unlikely(down_read_trylock(&mm->mmap_sem))) { +# ifdef CONFIG_PREEMPT_RT + if (unlikely(!rt_rwsem_is_locked(&mm->mmap_sem))) { WARN_ON(1); - up_read(&mm->mmap_sem); } +# else + if (unlikely(down_read_trylock(&mm->mmap_sem))) { + WARN_ON(1); + up_read(&mm->mmap_sem); + } +# endif #endif } Index: linux-2.6.24.7/mm/vmscan.c =================================================================== --- linux-2.6.24.7.orig/mm/vmscan.c +++ linux-2.6.24.7/mm/vmscan.c @@ -23,6 +23,7 @@ #include <linux/file.h> #include <linux/writeback.h> #include <linux/blkdev.h> +#include <linux/interrupt.h> #include <linux/buffer_head.h> /* for try_to_release_page(), buffer_heads_over_limit */ #include <linux/mm_inline.h> @@ -840,7 +841,7 @@ static unsigned long shrink_inactive_lis } nr_reclaimed += nr_freed; - local_irq_disable(); + local_irq_disable_nort(); if (current_is_kswapd()) { __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scan); __count_vm_events(KSWAPD_STEAL, nr_freed); @@ -871,9 +872,14 @@ static unsigned long shrink_inactive_lis } } } while (nr_scanned < max_scan); + /* + * Non-PREEMPT_RT relies on IRQs-off protecting the page_states + * per-CPU data. PREEMPT_RT has that data protected even in + * __mod_page_state(), so no need to keep IRQs disabled. + */ spin_unlock(&zone->lru_lock); done: - local_irq_enable(); + local_irq_enable_nort(); pagevec_release(&pvec); return nr_reclaimed; } Index: linux-2.6.24.7/mm/vmstat.c =================================================================== --- linux-2.6.24.7.orig/mm/vmstat.c +++ linux-2.6.24.7/mm/vmstat.c @@ -157,10 +157,14 @@ static void refresh_zone_stat_thresholds void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta) { - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; long x; + s8 *p; + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; x = delta + *p; if (unlikely(x > pcp->stat_threshold || x < -pcp->stat_threshold)) { @@ -168,6 +172,7 @@ void __mod_zone_page_state(struct zone * x = 0; } *p = x; + put_cpu(); } EXPORT_SYMBOL(__mod_zone_page_state); @@ -210,9 +215,13 @@ EXPORT_SYMBOL(mod_zone_page_state); */ void __inc_zone_state(struct zone *zone, enum zone_stat_item item) { - struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; + s8 *p; + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; (*p)++; if (unlikely(*p > pcp->stat_threshold)) { @@ -221,18 +230,34 @@ void __inc_zone_state(struct zone *zone, zone_page_state_add(*p + overstep, zone, item); *p = -overstep; } + put_cpu(); } void __inc_zone_page_state(struct page *page, enum zone_stat_item item) { +#ifdef CONFIG_PREEMPT_RT + unsigned long flags; + struct zone *zone; + + zone = page_zone(page); + local_irq_save(flags); + __inc_zone_state(zone, item); + local_irq_restore(flags); +#else __inc_zone_state(page_zone(page), item); +#endif } EXPORT_SYMBOL(__inc_zone_page_state); void __dec_zone_state(struct zone *zone, enum zone_stat_item item) { - struct per_cpu_pageset *pcp 
= zone_pcp(zone, smp_processor_id()); - s8 *p = pcp->vm_stat_diff + item; + struct per_cpu_pageset *pcp; + int cpu; + s8 *p; + + cpu = get_cpu(); + pcp = zone_pcp(zone, cpu); + p = pcp->vm_stat_diff + item; (*p)--; @@ -242,6 +267,7 @@ void __dec_zone_state(struct zone *zone, zone_page_state_add(*p - overstep, zone, item); *p = overstep; } + put_cpu(); } void __dec_zone_page_state(struct page *page, enum zone_stat_item item) ����������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-init-show-enabled-debugs.patch���������������������������������������������0000664�0000764�0000764�00000010035�11041657732�022315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- init/main.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -437,6 +437,8 @@ static void noinline __init_refok rest_i { int pid; + system_state = SYSTEM_BOOTING_SCHEDULER_OK; + kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND); numa_default_policy(); pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES); @@ -649,6 +651,9 @@ asmlinkage void __init start_kernel(void acpi_early_init(); /* before LAPIC and SMP init */ +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif /* Do the rest non-__init'ed, we're now alive */ rest_init(); } @@ -753,12 +758,14 @@ __setup("nosoftlockup", nosoftlockup_set static void __init do_pre_smp_initcalls(void) { extern int spawn_ksoftirqd(void); + extern int spawn_desched_task(void); migration_init(); posix_cpu_thread_init(); spawn_ksoftirqd(); if (!nosoftlockup) spawn_softlockup_task(); + spawn_desched_task(); } static void run_init_process(char *init_filename) @@ -792,6 +799,9 @@ static int noinline init_post(void) printk(KERN_WARNING "Failed to execute %s\n", ramdisk_execute_command); } +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif /* * We try each of these until one succeeds. 
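
The vmstat.c hunks above rewrite counters that used to read smp_processor_id() directly: the function now pins itself with get_cpu()/put_cpu() for the duration of the read-modify-write. The pattern in isolation; demo_zone_diff and demo_mod_counter are illustrative:

#include <linux/smp.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(long, demo_zone_diff);

static void demo_mod_counter(long delta)
{
        int cpu = get_cpu();            /* disables preemption, returns this CPU */

        per_cpu(demo_zone_diff, cpu) += delta;
        put_cpu();                      /* re-enables preemption */
}
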
@@ -857,7 +867,51 @@ static int __init kernel_init(void * unu ramdisk_execute_command = NULL; prepare_namespace(); } +#ifdef CONFIG_PREEMPT_RT + WARN_ON(irqs_disabled()); +#endif +#define DEBUG_COUNT (defined(CONFIG_DEBUG_RT_MUTEXES) + defined(CONFIG_CRITICAL_PREEMPT_TIMING) + defined(CONFIG_CRITICAL_IRQSOFF_TIMING) + defined(CONFIG_FUNCTION_TRACE) + defined(CONFIG_DEBUG_SLAB) + defined(CONFIG_DEBUG_PAGEALLOC) + defined(CONFIG_LOCKDEP)) + +#if DEBUG_COUNT > 0 + printk(KERN_ERR "*****************************************************************************\n"); + printk(KERN_ERR "* *\n"); +#if DEBUG_COUNT == 1 + printk(KERN_ERR "* REMINDER, the following debugging option is turned on in your .config: *\n"); +#else + printk(KERN_ERR "* REMINDER, the following debugging options are turned on in your .config: *\n"); +#endif + printk(KERN_ERR "* *\n"); +#ifdef CONFIG_DEBUG_RT_MUTEXES + printk(KERN_ERR "* CONFIG_DEBUG_RT_MUTEXES *\n"); +#endif +#ifdef CONFIG_CRITICAL_PREEMPT_TIMING + printk(KERN_ERR "* CONFIG_CRITICAL_PREEMPT_TIMING *\n"); +#endif +#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING + printk(KERN_ERR "* CONFIG_CRITICAL_IRQSOFF_TIMING *\n"); +#endif +#ifdef CONFIG_FUNCTION_TRACE + printk(KERN_ERR "* CONFIG_FUNCTION_TRACE *\n"); +#endif +#ifdef CONFIG_DEBUG_SLAB + printk(KERN_ERR "* CONFIG_DEBUG_SLAB *\n"); +#endif +#ifdef CONFIG_DEBUG_PAGEALLOC + printk(KERN_ERR "* CONFIG_DEBUG_PAGEALLOC *\n"); +#endif +#ifdef CONFIG_LOCKDEP + printk(KERN_ERR "* CONFIG_LOCKDEP *\n"); +#endif + printk(KERN_ERR "* *\n"); +#if DEBUG_COUNT == 1 + printk(KERN_ERR "* it may increase runtime overhead and latencies. *\n"); +#else + printk(KERN_ERR "* they may increase runtime overhead and latencies. *\n"); +#endif + printk(KERN_ERR "* *\n"); + printk(KERN_ERR "*****************************************************************************\n"); +#endif /* * Ok, we have completed the initial bootup, and * we're essentially up and running. 
Get rid of the

patches/preempt-realtime-compile-fixes.patch

---
 drivers/block/paride/pseudo.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24.7/drivers/block/paride/pseudo.h
===================================================================
--- linux-2.6.24.7.orig/drivers/block/paride/pseudo.h
+++ linux-2.6.24.7/drivers/block/paride/pseudo.h
@@ -43,7 +43,7 @@ static unsigned long ps_timeout;
 static int ps_tq_active = 0;
 static int ps_nice = 0;

-static DEFINE_SPINLOCK(ps_spinlock __attribute__((unused)));
+static __attribute__((unused)) DEFINE_SPINLOCK(ps_spinlock);

 static DECLARE_DELAYED_WORK(ps_tq, ps_tq_int);

patches/preempt-realtime-console.patch

---
 drivers/video/console/fbcon.c |    5 +++--
 include/linux/console.h      |    1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6.24.7/drivers/video/console/fbcon.c
===================================================================
--- linux-2.6.24.7.orig/drivers/video/console/fbcon.c
+++ linux-2.6.24.7/drivers/video/console/fbcon.c
@@ -1306,7 +1306,6 @@ static void fbcon_clear(struct vc_data *
 {
 	struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
 	struct fbcon_ops *ops = info->fbcon_par;
-	struct display *p = &fb_display[vc->vc_num];
 	u_int y_break;

@@ -1335,10 +1334,11 @@ static void fbcon_putcs(struct vc_data *
 	struct display *p = &fb_display[vc->vc_num];
 	struct fbcon_ops *ops = info->fbcon_par;

-	if (!fbcon_is_inactive(vc, info))
+	if (!fbcon_is_inactive(vc, info)) {
 		ops->putcs(vc, info, s, count, real_y(p, ypos), xpos,
 			   get_color(vc, info, scr_readw(s), 1),
 			   get_color(vc, info, scr_readw(s), 0));
+	}
 }

 static void fbcon_putc(struct vc_data *vc, int c, int ypos, int xpos)
@@ -3322,6
+3322,7 @@ static const struct consw fb_con = { .con_screen_pos = fbcon_screen_pos, .con_getxy = fbcon_getxy, .con_resize = fbcon_resize, + .con_preemptible = 1, }; static struct notifier_block fbcon_event_notifier = { Index: linux-2.6.24.7/include/linux/console.h =================================================================== --- linux-2.6.24.7.orig/include/linux/console.h +++ linux-2.6.24.7/include/linux/console.h @@ -55,6 +55,7 @@ struct consw { void (*con_invert_region)(struct vc_data *, u16 *, int); u16 *(*con_screen_pos)(struct vc_data *, int); unsigned long (*con_getxy)(struct vc_data *, unsigned long, int *, int *); + int con_preemptible; // can it reschedule from within printk? }; extern const struct consw *conswitchp; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-debug-sysctl.patch���������������������������������������������������������0000664�0000764�0000764�00000007762�11041657730�020155� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/char/sysrq.c | 18 ++++++++++++++- drivers/char/tty_io.c | 1 kernel/panic.c | 1 kernel/sysctl.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 77 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/char/sysrq.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/sysrq.c +++ linux-2.6.24.7/drivers/char/sysrq.c @@ -209,6 +209,22 @@ static struct sysrq_key_op sysrq_showreg .enable_mask = SYSRQ_ENABLE_DUMP, }; +#if defined(__i386__) + +static void sysrq_handle_showallregs(int key, struct tty_struct *tty) +{ + nmi_show_all_regs(); +} + +static struct sysrq_key_op sysrq_showallregs_op = { + .handler = sysrq_handle_showallregs, + .help_msg = "showalLcpupc", + .action_msg = "Show Regs On All CPUs", +}; +#else +#define sysrq_showallregs_op (*(struct sysrq_key_op *)0) +#endif + static void sysrq_handle_showstate(int key, struct tty_struct *tty) { show_state(); @@ -341,7 +357,7 @@ static struct sysrq_key_op *sysrq_key_ta &sysrq_kill_op, /* i */ NULL, /* j */ &sysrq_SAK_op, /* k */ - NULL, /* l */ + &sysrq_showallregs_op, /* l */ &sysrq_showmem_op, /* m */ &sysrq_unrt_op, /* n */ /* o: This will often be registered as 'Off' at init time */ Index: linux-2.6.24.7/drivers/char/tty_io.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/tty_io.c +++ linux-2.6.24.7/drivers/char/tty_io.c @@ -258,6 +258,7 @@ static int check_tty_count(struct tty_st printk(KERN_WARNING "Warning: dev (%s) tty->count(%d) " "!= #fd's(%d) in %s\n", tty->name, tty->count, count, routine); + dump_stack(); return count; } #endif Index: linux-2.6.24.7/kernel/panic.c =================================================================== --- linux-2.6.24.7.orig/kernel/panic.c +++ linux-2.6.24.7/kernel/panic.c @@ -79,6 +79,7 @@ NORET_TYPE void panic(const char * fmt, vsnprintf(buf, sizeof(buf), fmt, args); va_end(args); printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf); + dump_stack(); bust_spinlocks(0); /* Index: linux-2.6.24.7/kernel/sysctl.c 
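
The sysrq hunk above hooks a new handler into the static key table; out-of-tree or modular code would normally use register_sysrq_key() instead of editing the table. A sketch with the same sysrq_key_op layout as the patch; the demo_* names and the chosen key are illustrative:

#include <linux/sysrq.h>
#include <linux/tty.h>
#include <linux/kernel.h>

static void demo_sysrq_handler(int key, struct tty_struct *tty)
{
        printk(KERN_INFO "demo sysrq handler invoked\n");
}

static struct sysrq_key_op demo_sysrq_op = {
        .handler    = demo_sysrq_handler,
        .help_msg   = "demo(Y)",
        .action_msg = "Demo action",
};

/* e.g. from an init path: register_sysrq_key('y', &demo_sysrq_op); */
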
=================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -340,6 +340,54 @@ static struct ctl_table kern_table[] = { }, #endif { + .ctl_name = CTL_UNNUMBERED, + .procname = "prof_pid", + .data = &prof_pid, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#ifdef CONFIG_PREEMPT + { + .ctl_name = CTL_UNNUMBERED, + .procname = "kernel_preemption", + .data = &kernel_preemption, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif +#ifdef CONFIG_PREEMPT_VOLUNTARY + { + .ctl_name = CTL_UNNUMBERED, + .procname = "voluntary_preemption", + .data = &voluntary_preemption, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif +#if defined(CONFIG_PREEMPT_SOFTIRQS) && !defined(CONFIG_PREEMPT_RT) + { + .ctl_name = CTL_UNNUMBERED, + .procname = "softirq_preemption", + .data = &softirq_preemption, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif +#if defined(CONFIG_PREEMPT_HARDIRQS) && !defined(CONFIG_PREEMPT_RT) + { + .ctl_name = CTL_UNNUMBERED, + .procname = "hardirq_preemption", + .data = &hardirq_preemption, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif + { .ctl_name = KERN_PANIC, .procname = "panic", .data = &panic_timeout, @@ -347,6 +395,16 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, +#ifdef CONFIG_GENERIC_HARDIRQS + { + .ctl_name = CTL_UNNUMBERED, + .procname = "debug_direct_keyboard", + .data = &debug_direct_keyboard, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif { .ctl_name = KERN_CORE_USES_PID, .procname = "core_uses_pid", ��������������patches/preempt-realtime-ide.patch������������������������������������������������������������������0000664�0000764�0000764�00000022747�11041657734�016315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/ide/ide-floppy.c | 4 ++-- drivers/ide/ide-io.c | 4 ++-- drivers/ide/ide-iops.c | 24 +++++++++++------------- drivers/ide/ide-lib.c | 14 +++++--------- drivers/ide/ide-probe.c | 8 ++++---- drivers/ide/ide-taskfile.c | 6 +++--- drivers/ide/pci/alim15x3.c | 12 ++++++------ drivers/ide/pci/hpt366.c | 4 ++-- 8 files changed, 35 insertions(+), 41 deletions(-) Index: linux-2.6.24.7/drivers/ide/ide-floppy.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-floppy.c +++ linux-2.6.24.7/drivers/ide/ide-floppy.c @@ -1668,9 +1668,9 @@ static int idefloppy_get_format_progress atapi_status_t status; unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); status.all = HWIF(drive)->INB(IDE_STATUS_REG); - local_irq_restore(flags); + local_irq_restore_nort(flags); progress_indication = !status.b.dsc ? 
0 : 0x10000; } Index: linux-2.6.24.7/drivers/ide/ide-io.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-io.c +++ linux-2.6.24.7/drivers/ide/ide-io.c @@ -1194,7 +1194,7 @@ static void ide_do_request (ide_hwgroup_ ide_get_lock(ide_intr, hwgroup); /* caller must own ide_lock */ - BUG_ON(!irqs_disabled()); + BUG_ON_NONRT(!irqs_disabled()); while (!hwgroup->busy) { hwgroup->busy = 1; @@ -1462,7 +1462,7 @@ void ide_timer_expiry (unsigned long dat #endif /* DISABLE_IRQ_NOSYNC */ /* local CPU only, * as if we were handling an interrupt */ - local_irq_disable(); + local_irq_disable_nort(); if (hwgroup->polling) { startstop = handler(drive); } else if (drive_is_ready(drive)) { Index: linux-2.6.24.7/drivers/ide/ide-iops.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-iops.c +++ linux-2.6.24.7/drivers/ide/ide-iops.c @@ -220,10 +220,10 @@ static void ata_input_data(ide_drive_t * if (io_32bit) { if (io_32bit & 2) { unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(drive, IDE_NSECTOR_REG); hwif->INSL(IDE_DATA_REG, buffer, wcount); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else hwif->INSL(IDE_DATA_REG, buffer, wcount); } else { @@ -242,10 +242,10 @@ static void ata_output_data(ide_drive_t if (io_32bit) { if (io_32bit & 2) { unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); ata_vlb_sync(drive, IDE_NSECTOR_REG); hwif->OUTSL(IDE_DATA_REG, buffer, wcount); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else hwif->OUTSL(IDE_DATA_REG, buffer, wcount); } else { @@ -506,12 +506,12 @@ static int __ide_wait_stat(ide_drive_t * if (!(stat & BUSY_STAT)) break; - local_irq_restore(flags); + local_irq_restore_nort(flags); *rstat = stat; return -EBUSY; } } - local_irq_restore(flags); + local_irq_restore_nort(flags); } /* * Allow status to settle, then read it again. 
@@ -730,17 +730,15 @@ int ide_driveid_update(ide_drive_t *driv printk("%s: CHECK for good STATUS\n", drive->name); return 0; } - local_irq_save(flags); - SELECT_MASK(drive, 0); id = kmalloc(SECTOR_WORDS*4, GFP_ATOMIC); - if (!id) { - local_irq_restore(flags); + if (!id) return 0; - } + local_irq_save_nort(flags); + SELECT_MASK(drive, 0); ata_input_data(drive, id, SECTOR_WORDS); (void) hwif->INB(IDE_STATUS_REG); /* clear drive IRQ */ - local_irq_enable(); - local_irq_restore(flags); + local_irq_enable_nort(); + local_irq_restore_nort(flags); ide_fix_driveid(id); if (id) { drive->id->dma_ultra = id->dma_ultra; Index: linux-2.6.24.7/drivers/ide/ide-lib.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-lib.c +++ linux-2.6.24.7/drivers/ide/ide-lib.c @@ -447,15 +447,16 @@ int ide_set_xfer_rate(ide_drive_t *drive static void ide_dump_opcode(ide_drive_t *drive) { + unsigned long flags; struct request *rq; u8 opcode = 0; int found = 0; - spin_lock(&ide_lock); + spin_lock_irqsave(&ide_lock, flags); rq = NULL; if (HWGROUP(drive)) rq = HWGROUP(drive)->rq; - spin_unlock(&ide_lock); + spin_unlock_irqrestore(&ide_lock, flags); if (!rq) return; if (rq->cmd_type == REQ_TYPE_ATA_CMD || @@ -484,10 +485,8 @@ static void ide_dump_opcode(ide_drive_t static u8 ide_dump_ata_status(ide_drive_t *drive, const char *msg, u8 stat) { ide_hwif_t *hwif = HWIF(drive); - unsigned long flags; u8 err = 0; - local_irq_save(flags); printk("%s: %s: status=0x%02x { ", drive->name, msg, stat); if (stat & BUSY_STAT) printk("Busy "); @@ -548,7 +547,7 @@ static u8 ide_dump_ata_status(ide_drive_ printk("\n"); } ide_dump_opcode(drive); - local_irq_restore(flags); + return err; } @@ -563,14 +562,11 @@ static u8 ide_dump_ata_status(ide_drive_ static u8 ide_dump_atapi_status(ide_drive_t *drive, const char *msg, u8 stat) { - unsigned long flags; - atapi_status_t status; atapi_error_t error; status.all = stat; error.all = 0; - local_irq_save(flags); printk("%s: %s: status=0x%02x { ", drive->name, msg, stat); if (status.b.bsy) printk("Busy "); @@ -596,7 +592,7 @@ static u8 ide_dump_atapi_status(ide_driv printk("}\n"); } ide_dump_opcode(drive); - local_irq_restore(flags); + return error.all; } Index: linux-2.6.24.7/drivers/ide/ide-probe.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-probe.c +++ linux-2.6.24.7/drivers/ide/ide-probe.c @@ -128,7 +128,7 @@ static inline void do_identify (ide_driv hwif->ata_input_data(drive, id, SECTOR_WORDS); drive->id_read = 1; - local_irq_enable(); + local_irq_enable_nort(); ide_fix_driveid(id); #if defined (CONFIG_SCSI_EATA_PIO) || defined (CONFIG_SCSI_EATA) @@ -311,14 +311,14 @@ static int actual_try_to_identify (ide_d unsigned long flags; /* local CPU only; some systems need this */ - local_irq_save(flags); + local_irq_save_nort(flags); /* drive returned ID */ do_identify(drive, cmd); /* drive responded with ID */ rc = 0; /* clear drive IRQ */ (void) hwif->INB(IDE_STATUS_REG); - local_irq_restore(flags); + local_irq_restore_nort(flags); } else { /* drive refused ID */ rc = 2; @@ -801,7 +801,7 @@ static void probe_hwif(ide_hwif_t *hwif) } while ((stat & BUSY_STAT) && time_after(timeout, jiffies)); } - local_irq_restore(flags); + local_irq_restore_nort(flags); /* * Use cached IRQ number. It might be (and is...) 
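
The ide_driveid_update() hunk above reorders the code so the kmalloc() happens before interrupts are masked; allocating (and possibly failing) inside an interrupts-off window is exactly the kind of latency the -rt work tries to remove. The shape, reduced; demo_read_id and the 512-byte buffer are illustrative:

#include <linux/slab.h>
#include <linux/types.h>
#include <linux/interrupt.h>

static int demo_read_id(void)
{
        unsigned long flags;
        u16 *buf;

        buf = kmalloc(512, GFP_ATOMIC); /* allocate outside the critical section */
        if (!buf)
                return 0;

        local_irq_save(flags);
        /* ... drain the device's data register into buf ... */
        local_irq_restore(flags);

        kfree(buf);
        return 1;
}
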
changed by probe * code above Index: linux-2.6.24.7/drivers/ide/ide-taskfile.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/ide-taskfile.c +++ linux-2.6.24.7/drivers/ide/ide-taskfile.c @@ -269,7 +269,7 @@ static void ide_pio_sector(ide_drive_t * offset %= PAGE_SIZE; #ifdef CONFIG_HIGHMEM - local_irq_save(flags); + local_irq_save_nort(flags); #endif buf = kmap_atomic(page, KM_BIO_SRC_IRQ) + offset; @@ -289,7 +289,7 @@ static void ide_pio_sector(ide_drive_t * kunmap_atomic(buf, KM_BIO_SRC_IRQ); #ifdef CONFIG_HIGHMEM - local_irq_restore(flags); + local_irq_restore_nort(flags); #endif } @@ -457,7 +457,7 @@ ide_startstop_t pre_task_out_intr (ide_d } if (!drive->unmask) - local_irq_disable(); + local_irq_disable_nort(); ide_set_handler(drive, &task_out_intr, WAIT_WORSTCASE, NULL); ide_pio_datablock(drive, rq, 1); Index: linux-2.6.24.7/drivers/ide/pci/alim15x3.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/pci/alim15x3.c +++ linux-2.6.24.7/drivers/ide/pci/alim15x3.c @@ -322,7 +322,7 @@ static void ali_set_pio_mode(ide_drive_t if (r_clc >= 16) r_clc = 0; } - local_irq_save(flags); + local_irq_save_nort(flags); /* * PIO mode => ATA FIFO on, ATAPI FIFO off @@ -344,7 +344,7 @@ static void ali_set_pio_mode(ide_drive_t pci_write_config_byte(dev, port, s_clc); pci_write_config_byte(dev, port+drive->select.b.unit+2, (a_clc << 4) | r_clc); - local_irq_restore(flags); + local_irq_restore_nort(flags); /* * setup active rec @@ -479,7 +479,7 @@ static unsigned int __devinit init_chips } #endif /* defined(DISPLAY_ALI_TIMINGS) && defined(CONFIG_IDE_PROC_FS) */ - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision < 0xC2) { /* @@ -570,7 +570,7 @@ out: } pci_dev_put(north); pci_dev_put(isa_dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); return 0; } @@ -632,7 +632,7 @@ static u8 __devinit ata66_ali15x3(ide_hw unsigned long flags; u8 cbl = ATA_CBL_PATA40, tmpbyte; - local_irq_save(flags); + local_irq_save_nort(flags); if (m5229_revision >= 0xC2) { /* @@ -653,7 +653,7 @@ static u8 __devinit ata66_ali15x3(ide_hw } } - local_irq_restore(flags); + local_irq_restore_nort(flags); return cbl; } Index: linux-2.6.24.7/drivers/ide/pci/hpt366.c =================================================================== --- linux-2.6.24.7.orig/drivers/ide/pci/hpt366.c +++ linux-2.6.24.7/drivers/ide/pci/hpt366.c @@ -1430,7 +1430,7 @@ static void __devinit init_dma_hpt366(id dma_old = inb(dmabase + 2); - local_irq_save(flags); + local_irq_save_nort(flags); dma_new = dma_old; pci_read_config_byte(dev, hwif->channel ? 
0x4b : 0x43, &masterdma); @@ -1441,7 +1441,7 @@ static void __devinit init_dma_hpt366(id if (dma_new != dma_old) outb(dma_new, dmabase + 2); - local_irq_restore(flags); + local_irq_restore_nort(flags); ide_setup_dma(hwif, dmabase, 8); } �������������������������patches/preempt-realtime-input.patch����������������������������������������������������������������0000664�0000764�0000764�00000002472�11041657733�016703� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/input/gameport/gameport.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/drivers/input/gameport/gameport.c =================================================================== --- linux-2.6.24.7.orig/drivers/input/gameport/gameport.c +++ linux-2.6.24.7/drivers/input/gameport/gameport.c @@ -21,6 +21,7 @@ #include <linux/slab.h> #include <linux/delay.h> #include <linux/kthread.h> +#include <linux/interrupt.h> #include <linux/sched.h> /* HZ */ #include <linux/mutex.h> #include <linux/freezer.h> @@ -100,12 +101,12 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); GET_TIME(t1); for (t = 0; t < 50; t++) gameport_read(gameport); GET_TIME(t2); GET_TIME(t3); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if ((t = DELTA(t2,t1) - DELTA(t3,t2)) < tx) tx = t; } @@ -124,11 +125,11 @@ static int gameport_measure_speed(struct tx = 1 << 30; for(i = 0; i < 50; i++) { - local_irq_save(flags); + local_irq_save_nort(flags); rdtscl(t1); for (t = 0; t < 50; t++) gameport_read(gameport); rdtscl(t2); - local_irq_restore(flags); + local_irq_restore_nort(flags); udelay(i * 10); if (t2 - t1 < tx) tx = t2 - t1; } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-irqs.patch�����������������������������������������������������������������0000664�0000764�0000764�00000011065�11041657730�016515� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/irq.h | 10 ++++------ kernel/irq/handle.c | 10 +++++++++- kernel/irq/manage.c | 22 ++++++++++++++++------ kernel/irq/spurious.c | 3 +-- 4 files changed, 30 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/include/linux/irq.h =================================================================== --- linux-2.6.24.7.orig/include/linux/irq.h +++ linux-2.6.24.7/include/linux/irq.h @@ -146,7 +146,6 @@ struct irq_chip { * @last_unhandled: aging timer for unhandled count * @thread: Thread pointer for threaded preemptible irq handling * @wait_for_handler: Waitqueue to wait for a running preemptible handler - * @cycles: Timestamp for stats and debugging * @lock: locking for SMP * @affinity: IRQ affinity on SMP * @cpu: cpu index useful for balancing @@ -169,10 +168,10 @@ struct 
irq_desc { unsigned int irq_count; /* For detecting broken IRQs */ unsigned int irqs_unhandled; unsigned long last_unhandled; /* Aging timer for unhandled count */ - struct task_struct *thread; - wait_queue_head_t wait_for_handler; - cycles_t timestamp; - spinlock_t lock; + struct task_struct *thread; + wait_queue_head_t wait_for_handler; + cycles_t timestamp; + raw_spinlock_t lock; #ifdef CONFIG_SMP cpumask_t affinity; unsigned int cpu; @@ -408,7 +407,6 @@ extern int set_irq_msi(unsigned int irq, /* Early initialization of irqs */ extern void early_init_hardirqs(void); -extern cycles_t irq_timestamp(unsigned int irq); #if defined(CONFIG_PREEMPT_HARDIRQS) extern void init_hardirqs(void); Index: linux-2.6.24.7/kernel/irq/handle.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/handle.c +++ linux-2.6.24.7/kernel/irq/handle.c @@ -54,12 +54,13 @@ struct irq_desc irq_desc[NR_IRQS] __cach .chip = &no_irq_chip, .handle_irq = handle_bad_irq, .depth = 1, - .lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(irq_desc), #ifdef CONFIG_SMP .affinity = CPU_MASK_ALL #endif } }; +EXPORT_SYMBOL_GPL(irq_desc); /* * What should we do if we get a hw irq event on an illegal vector? @@ -248,6 +249,13 @@ fastcall unsigned int __do_IRQ(unsigned desc->chip->end(irq); return 1; } + /* + * If the task is currently running in user mode, don't + * detect soft lockups. If CONFIG_DETECT_SOFTLOCKUP is not + * configured, this should be optimized out. + */ + if (user_mode(get_irq_regs())) + touch_softlockup_watchdog(); spin_lock(&desc->lock); if (desc->chip->ack) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -500,9 +500,9 @@ void free_irq(unsigned int irq, void *de * parallel with our fake */ if (action->flags & IRQF_SHARED) { - local_irq_save(flags); + local_irq_save_nort(flags); action->handler(irq, dev_id); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #endif kfree(action); @@ -594,9 +594,9 @@ int request_irq(unsigned int irq, irq_ha */ unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); handler(irq, dev_id); - local_irq_restore(flags); + local_irq_restore_nort(flags); } #endif @@ -614,6 +614,11 @@ int hardirq_preemption = 1; EXPORT_SYMBOL(hardirq_preemption); +/* + * Real-Time Preemption depends on hardirq threading: + */ +#ifndef CONFIG_PREEMPT_RT + static int __init hardirq_preempt_setup (char *str) { if (!strncmp(str, "off", 3)) @@ -628,6 +633,7 @@ static int __init hardirq_preempt_setup __setup("hardirq-preempt=", hardirq_preempt_setup); +#endif /* * threaded simple handler @@ -787,12 +793,16 @@ static int do_irqd(void * __desc) sys_sched_setscheduler(current->pid, SCHED_FIFO, ¶m); while (!kthread_should_stop()) { - local_irq_disable(); + local_irq_disable_nort(); set_current_state(TASK_INTERRUPTIBLE); +#ifndef CONFIG_PREEMPT_RT irq_enter(); +#endif do_hardirq(desc); +#ifndef CONFIG_PREEMPT_RT irq_exit(); - local_irq_enable(); +#endif + local_irq_enable_nort(); cond_resched(); #ifdef CONFIG_SMP /* Index: linux-2.6.24.7/kernel/irq/spurious.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/spurious.c +++ linux-2.6.24.7/kernel/irq/spurious.c @@ -59,9 +59,8 @@ static int misrouted_irq(int irq) } action = action->next; } - local_irq_disable(); /* Now clean up the flags */ - spin_lock(&desc->lock); 
+ spin_lock_irq(&desc->lock); action = desc->action; /* ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-net-drivers.patch����������������������������������������������������������0000664�0000764�0000764�00000001047�11041657734�020004� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/net/tulip/tulip_core.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/drivers/net/tulip/tulip_core.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/tulip/tulip_core.c +++ linux-2.6.24.7/drivers/net/tulip/tulip_core.c @@ -1797,6 +1797,7 @@ static void __devexit tulip_remove_one ( pci_iounmap(pdev, tp->base_addr); free_netdev (dev); pci_release_regions (pdev); + pci_disable_device (pdev); pci_set_drvdata (pdev, NULL); /* pci_power_off (pdev, -1); */ �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-printk.patch���������������������������������������������������������������0000664�0000764�0000764�00000010505�11041657733�017047� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/printk.c | 54 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 45 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/kernel/printk.c =================================================================== --- linux-2.6.24.7.orig/kernel/printk.c +++ linux-2.6.24.7/kernel/printk.c @@ -84,7 +84,7 @@ static int console_locked, console_suspe * It is also used in interesting ways to provide interlocking in * release_console_sem(). 
*/ -static DEFINE_SPINLOCK(logbuf_lock); +static DEFINE_RAW_SPINLOCK(logbuf_lock); #define LOG_BUF_MASK (log_buf_len-1) #define LOG_BUF(idx) (log_buf[(idx) & LOG_BUF_MASK]) @@ -435,7 +435,7 @@ static void __call_console_drivers(unsig for (con = console_drivers; con; con = con->next) { if ((con->flags & CON_ENABLED) && con->write && - (cpu_online(smp_processor_id()) || + (cpu_online(raw_smp_processor_id()) || (con->flags & CON_ANYTIME))) con->write(con, &LOG_BUF(start), end - start); } @@ -551,6 +551,7 @@ static void zap_locks(void) spin_lock_init(&logbuf_lock); /* And make sure that we print immediately */ init_MUTEX(&console_sem); + zap_rt_locks(); } #if defined(CONFIG_PRINTK_TIME) @@ -649,6 +650,7 @@ asmlinkage int vprintk(const char *fmt, lockdep_off(); spin_lock(&logbuf_lock); printk_cpu = smp_processor_id(); + preempt_enable(); /* Emit the output into the temporary buffer */ printed_len = vscnprintf(printk_buf, sizeof(printk_buf), fmt, args); @@ -718,6 +720,8 @@ asmlinkage int vprintk(const char *fmt, console_locked = 1; printk_cpu = UINT_MAX; spin_unlock(&logbuf_lock); + lockdep_on(); + local_irq_restore(flags); /* * Console drivers may assume that per-cpu resources have @@ -725,7 +729,7 @@ asmlinkage int vprintk(const char *fmt, * being able to cope (CON_ANYTIME) don't call them until * this CPU is officially up. */ - if (cpu_online(smp_processor_id()) || have_callable_console()) { + if (cpu_online(raw_smp_processor_id()) || have_callable_console()) { console_may_schedule = 0; release_console_sem(); } else { @@ -733,8 +737,6 @@ asmlinkage int vprintk(const char *fmt, console_locked = 0; up(&console_sem); } - lockdep_on(); - raw_local_irq_restore(flags); } else { /* * Someone else owns the drivers. We drop the spinlock, which @@ -747,7 +749,6 @@ asmlinkage int vprintk(const char *fmt, raw_local_irq_restore(flags); } - preempt_enable(); return printed_len; } EXPORT_SYMBOL(printk); @@ -971,13 +972,31 @@ void release_console_sem(void) _con_start = con_start; _log_end = log_end; con_start = log_end; /* Flush */ + /* + * on PREEMPT_RT, call console drivers with + * interrupts enabled (if printk was called + * with interrupts disabled): + */ +#ifdef CONFIG_PREEMPT_RT + spin_unlock_irqrestore(&logbuf_lock, flags); +#else spin_unlock(&logbuf_lock); +#endif call_console_drivers(_con_start, _log_end); - local_irq_restore(flags); + local_irq_restore_nort(flags); } console_locked = 0; - up(&console_sem); spin_unlock_irqrestore(&logbuf_lock, flags); + up(&console_sem); + /* + * On PREEMPT_RT kernels __wake_up may sleep, so wake syslogd + * up only if we are in a preemptible section. We normally dont + * printk from non-preemptible sections so this is for the emergency + * case only. 
+ */ +#ifdef CONFIG_PREEMPT_RT + if (!in_atomic() && !irqs_disabled()) +#endif if (wake_klogd) wake_up_klogd(); } @@ -1244,7 +1263,7 @@ void tty_write_message(struct tty_struct */ int __printk_ratelimit(int ratelimit_jiffies, int ratelimit_burst) { - static DEFINE_SPINLOCK(ratelimit_lock); + static DEFINE_RAW_SPINLOCK(ratelimit_lock); static unsigned long toks = 10 * 5 * HZ; static unsigned long last_msg; static int missed; @@ -1285,6 +1304,23 @@ int printk_ratelimit(void) } EXPORT_SYMBOL(printk_ratelimit); +static DEFINE_RAW_SPINLOCK(warn_lock); + +void __WARN_ON(const char *func, const char *file, const int line) +{ + unsigned long flags; + + spin_lock_irqsave(&warn_lock, flags); + printk("%s/%d[CPU#%d]: BUG in %s at %s:%d\n", + current->comm, current->pid, raw_smp_processor_id(), + func, file, line); + dump_stack(); + spin_unlock_irqrestore(&warn_lock, flags); +} + +EXPORT_SYMBOL(__WARN_ON); + + /** * printk_timed_ratelimit - caller-controlled printk ratelimiting * @caller_jiffies: pointer to caller's state �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-profiling.patch������������������������������������������������������������0000664�0000764�0000764�00000002154�11041657730�017527� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/profile.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/profile.c =================================================================== --- linux-2.6.24.7.orig/kernel/profile.c +++ linux-2.6.24.7/kernel/profile.c @@ -23,6 +23,7 @@ #include <linux/profile.h> #include <linux/highmem.h> #include <linux/mutex.h> +#include <linux/sched.h> #include <asm/sections.h> #include <asm/semaphore.h> #include <asm/irq_regs.h> @@ -46,6 +47,7 @@ int prof_on __read_mostly; EXPORT_SYMBOL_GPL(prof_on); static cpumask_t prof_cpu_mask = CPU_MASK_ALL; +int prof_pid = -1; #ifdef CONFIG_SMP static DEFINE_PER_CPU(struct profile_hit *[2], cpu_profile_hits); static DEFINE_PER_CPU(int, cpu_profile_flip); @@ -416,7 +418,8 @@ void __profile_tick(int type, struct pt_ { if (type == CPU_PROFILING && timer_hook) timer_hook(regs); - if (!user_mode(regs) && cpu_isset(smp_processor_id(), prof_cpu_mask)) + if (!user_mode(regs) && cpu_isset(smp_processor_id(), prof_cpu_mask) && + (prof_pid == -1 || prof_pid == current->pid)) profile_hit(type, (void *)profile_pc(regs)); } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-rawlocks.patch�������������������������������������������������������������0000664�0000764�0000764�00000010036�11041657732�017363� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/oprofile/oprofilefs.c | 2 +- drivers/pci/access.c | 2 +- drivers/video/console/vgacon.c | 2 +- include/linux/kprobes.h | 2 +- include/linux/oprofile.h | 2 +- include/linux/percpu_counter.h | 2 +- kernel/kprobes.c | 2 +- kernel/softlockup.c | 2 +- 8 files changed, 8 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/drivers/oprofile/oprofilefs.c =================================================================== --- linux-2.6.24.7.orig/drivers/oprofile/oprofilefs.c +++ linux-2.6.24.7/drivers/oprofile/oprofilefs.c @@ -21,7 +21,7 @@ #define OPROFILEFS_MAGIC 0x6f70726f -DEFINE_SPINLOCK(oprofilefs_lock); +DEFINE_RAW_SPINLOCK(oprofilefs_lock); static struct inode * oprofilefs_get_inode(struct super_block * sb, int mode) { Index: linux-2.6.24.7/drivers/pci/access.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/access.c +++ linux-2.6.24.7/drivers/pci/access.c @@ -11,7 +11,7 @@ * configuration space. */ -static DEFINE_SPINLOCK(pci_lock); +static DEFINE_RAW_SPINLOCK(pci_lock); /* * Wrappers for all PCI configuration access functions. They just check Index: linux-2.6.24.7/drivers/video/console/vgacon.c =================================================================== --- linux-2.6.24.7.orig/drivers/video/console/vgacon.c +++ linux-2.6.24.7/drivers/video/console/vgacon.c @@ -51,7 +51,7 @@ #include <video/vga.h> #include <asm/io.h> -static DEFINE_SPINLOCK(vga_lock); +static DEFINE_RAW_SPINLOCK(vga_lock); static int cursor_size_lastfrom; static int cursor_size_lastto; static u32 vgacon_xres; Index: linux-2.6.24.7/include/linux/kprobes.h =================================================================== --- linux-2.6.24.7.orig/include/linux/kprobes.h +++ linux-2.6.24.7/include/linux/kprobes.h @@ -182,7 +182,7 @@ static inline void kretprobe_assert(stru } } -extern spinlock_t kretprobe_lock; +extern raw_spinlock_t kretprobe_lock; extern struct mutex kprobe_mutex; extern int arch_prepare_kprobe(struct kprobe *p); extern void arch_arm_kprobe(struct kprobe *p); Index: linux-2.6.24.7/include/linux/oprofile.h =================================================================== --- linux-2.6.24.7.orig/include/linux/oprofile.h +++ linux-2.6.24.7/include/linux/oprofile.h @@ -159,6 +159,6 @@ ssize_t oprofilefs_ulong_to_user(unsigne int oprofilefs_ulong_from_user(unsigned long * val, char const __user * buf, size_t count); /** lock for read/write safety */ -extern spinlock_t oprofilefs_lock; +extern raw_spinlock_t oprofilefs_lock; #endif /* OPROFILE_H */ Index: linux-2.6.24.7/include/linux/percpu_counter.h =================================================================== --- linux-2.6.24.7.orig/include/linux/percpu_counter.h +++ linux-2.6.24.7/include/linux/percpu_counter.h @@ -16,7 +16,7 @@ #ifdef CONFIG_SMP struct percpu_counter { - spinlock_t lock; + raw_spinlock_t lock; s64 count; #ifdef CONFIG_HOTPLUG_CPU struct list_head list; /* All percpu_counters are on a list */ Index: linux-2.6.24.7/kernel/kprobes.c =================================================================== --- linux-2.6.24.7.orig/kernel/kprobes.c +++ linux-2.6.24.7/kernel/kprobes.c @@ -69,7 +69,7 @@ static struct hlist_head kretprobe_inst_ static bool kprobe_enabled; DEFINE_MUTEX(kprobe_mutex); /* Protects kprobe_table */ 
-DEFINE_SPINLOCK(kretprobe_lock); /* Protects kretprobe_inst_table */ +DEFINE_RAW_SPINLOCK(kretprobe_lock); /* Protects kretprobe_inst_table */ static DEFINE_PER_CPU(struct kprobe *, kprobe_instance) = NULL; #ifdef __ARCH_WANT_KPROBES_INSN_SLOT Index: linux-2.6.24.7/kernel/softlockup.c =================================================================== --- linux-2.6.24.7.orig/kernel/softlockup.c +++ linux-2.6.24.7/kernel/softlockup.c @@ -17,7 +17,7 @@ #include <asm/irq_regs.h> -static DEFINE_SPINLOCK(print_lock); +static DEFINE_RAW_SPINLOCK(print_lock); static DEFINE_PER_CPU(unsigned long, touch_timestamp); static DEFINE_PER_CPU(unsigned long, print_timestamp); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-rcu.patch������������������������������������������������������������������0000664�0000764�0000764�00000004156�11041657734�016337� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcuclassic.c | 6 +++--- kernel/rcupreempt.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/rcuclassic.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcuclassic.c +++ linux-2.6.24.7/kernel/rcuclassic.c @@ -57,7 +57,7 @@ struct rcu_ctrlblk { int signaled; - spinlock_t lock ____cacheline_internodealigned_in_smp; + raw_spinlock_t lock ____cacheline_internodealigned_in_smp; cpumask_t cpumask; /* CPUs that need to switch in order */ /* for current batch to proceed. */ } ____cacheline_internodealigned_in_smp; @@ -96,13 +96,13 @@ struct rcu_data { static struct rcu_ctrlblk rcu_ctrlblk = { .cur = -300, .completed = -300, - .lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock), .cpumask = CPU_MASK_NONE, }; static struct rcu_ctrlblk rcu_bh_ctrlblk = { .cur = -300, .completed = -300, - .lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock), + .lock = RAW_SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock), .cpumask = CPU_MASK_NONE, }; Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -62,7 +62,7 @@ #define GP_STAGES 2 struct rcu_data { - spinlock_t lock; /* Protect rcu_data fields. */ + raw_spinlock_t lock; /* Protect rcu_data fields. */ long completed; /* Number of last completed batch. */ int waitlistcount; struct rcu_head *nextlist; @@ -76,12 +76,12 @@ struct rcu_data { #endif /* #ifdef CONFIG_RCU_TRACE */ }; struct rcu_ctrlblk { - spinlock_t fliplock; /* Protect state-machine transitions. */ + raw_spinlock_t fliplock; /* Protect state-machine transitions. */ long completed; /* Number of last completed batch. 
*/ }; static DEFINE_PER_CPU(struct rcu_data, rcu_data); static struct rcu_ctrlblk rcu_ctrlblk = { - .fliplock = SPIN_LOCK_UNLOCKED, + .fliplock = RAW_SPIN_LOCK_UNLOCKED(rcu_ctrlblk.fliplock), .completed = 0, }; static DEFINE_PER_CPU(int [2], rcu_flipctr) = { 0, 0 }; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-timer.patch����������������������������������������������������������������0000664�0000764�0000764�00000020767�11041657733�016673� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/hrtimer.h | 2 - include/linux/time.h | 2 - kernel/time/clockevents.c | 2 - kernel/time/clocksource.c | 2 - kernel/time/tick-broadcast.c | 2 - kernel/time/tick-common.c | 2 - kernel/time/tick-internal.h | 2 - kernel/time/tick-sched.c | 2 - kernel/time/timekeeping.c | 2 - kernel/time/timer_stats.c | 6 ++--- kernel/timer.c | 46 +++++++++++++++++++++++++++++++++++++++++-- 11 files changed, 56 insertions(+), 14 deletions(-) Index: linux-2.6.24.7/include/linux/hrtimer.h =================================================================== --- linux-2.6.24.7.orig/include/linux/hrtimer.h +++ linux-2.6.24.7/include/linux/hrtimer.h @@ -191,7 +191,7 @@ struct hrtimer_clock_base { * @nr_events: Total number of timer interrupt events */ struct hrtimer_cpu_base { - spinlock_t lock; + raw_spinlock_t lock; struct lock_class_key lock_key; struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES]; #ifdef CONFIG_HIGH_RES_TIMERS Index: linux-2.6.24.7/include/linux/time.h =================================================================== --- linux-2.6.24.7.orig/include/linux/time.h +++ linux-2.6.24.7/include/linux/time.h @@ -92,7 +92,7 @@ static inline struct timespec timespec_s extern struct timespec xtime; extern struct timespec wall_to_monotonic; -extern seqlock_t xtime_lock; +extern raw_seqlock_t xtime_lock; extern unsigned long read_persistent_clock(void); extern int update_persistent_clock(struct timespec now); Index: linux-2.6.24.7/kernel/time/clockevents.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/clockevents.c +++ linux-2.6.24.7/kernel/time/clockevents.c @@ -27,7 +27,7 @@ static LIST_HEAD(clockevents_released); static RAW_NOTIFIER_HEAD(clockevents_chain); /* Protection for the above */ -static DEFINE_SPINLOCK(clockevents_lock); +static DEFINE_RAW_SPINLOCK(clockevents_lock); /** * clockevents_delta2ns - Convert a latch value (device ticks) to nanoseconds Index: linux-2.6.24.7/kernel/time/clocksource.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/clocksource.c +++ linux-2.6.24.7/kernel/time/clocksource.c @@ -51,7 +51,7 @@ static struct clocksource *curr_clocksou static struct clocksource *next_clocksource; static struct clocksource *clocksource_override; static LIST_HEAD(clocksource_list); -static 
DEFINE_SPINLOCK(clocksource_lock); +static DEFINE_RAW_SPINLOCK(clocksource_lock); static char override_name[32]; static int finished_booting; Index: linux-2.6.24.7/kernel/time/tick-broadcast.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-broadcast.c +++ linux-2.6.24.7/kernel/time/tick-broadcast.c @@ -29,7 +29,7 @@ struct tick_device tick_broadcast_device; static cpumask_t tick_broadcast_mask; -static DEFINE_SPINLOCK(tick_broadcast_lock); +static DEFINE_RAW_SPINLOCK(tick_broadcast_lock); #ifdef CONFIG_TICK_ONESHOT static void tick_broadcast_clear_oneshot(int cpu); Index: linux-2.6.24.7/kernel/time/tick-common.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-common.c +++ linux-2.6.24.7/kernel/time/tick-common.c @@ -32,7 +32,7 @@ DEFINE_PER_CPU(struct tick_device, tick_ ktime_t tick_next_period; ktime_t tick_period; int tick_do_timer_cpu __read_mostly = -1; -DEFINE_SPINLOCK(tick_device_lock); +DEFINE_RAW_SPINLOCK(tick_device_lock); /* * Debugging: see timer_list.c Index: linux-2.6.24.7/kernel/time/tick-internal.h =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-internal.h +++ linux-2.6.24.7/kernel/time/tick-internal.h @@ -2,7 +2,7 @@ * tick internal variable and functions used by low/high res code */ DECLARE_PER_CPU(struct tick_device, tick_cpu_device); -extern spinlock_t tick_device_lock; +extern raw_spinlock_t tick_device_lock; extern ktime_t tick_next_period; extern ktime_t tick_period; extern int tick_do_timer_cpu __read_mostly; Index: linux-2.6.24.7/kernel/time/tick-sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-sched.c +++ linux-2.6.24.7/kernel/time/tick-sched.c @@ -178,7 +178,7 @@ void tick_nohz_stop_sched_tick(void) if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) goto end; - if (need_resched()) + if (need_resched() || need_resched_delayed()) goto end; cpu = smp_processor_id(); Index: linux-2.6.24.7/kernel/time/timekeeping.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/timekeeping.c +++ linux-2.6.24.7/kernel/time/timekeeping.c @@ -24,7 +24,7 @@ * This read-write spinlock protects us from races in SMP while * playing with xtime and avenrun. 
*/ -__cacheline_aligned_in_smp DEFINE_SEQLOCK(xtime_lock); +__cacheline_aligned_in_smp DEFINE_RAW_SEQLOCK(xtime_lock); EXPORT_SYMBOL_GPL(xtime_lock); Index: linux-2.6.24.7/kernel/time/timer_stats.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/timer_stats.c +++ linux-2.6.24.7/kernel/time/timer_stats.c @@ -81,12 +81,12 @@ struct entry { /* * Spinlock protecting the tables - not taken during lookup: */ -static DEFINE_SPINLOCK(table_lock); +static DEFINE_RAW_SPINLOCK(table_lock); /* * Per-CPU lookup locks for fast hash lookup: */ -static DEFINE_PER_CPU(spinlock_t, lookup_lock); +static DEFINE_PER_CPU(raw_spinlock_t, lookup_lock); /* * Mutex to serialize state changes with show-stats activities: @@ -238,7 +238,7 @@ void timer_stats_update_stats(void *time /* * It doesnt matter which lock we take: */ - spinlock_t *lock; + raw_spinlock_t *lock = &per_cpu(lookup_lock, raw_smp_processor_id()); struct entry *entry, input; unsigned long flags; Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -860,9 +860,22 @@ unsigned long get_next_timer_interrupt(u tvec_base_t *base = __get_cpu_var(tvec_bases); unsigned long expires; +#ifdef CONFIG_PREEMPT_RT + /* + * On PREEMPT_RT we cannot sleep here. If the trylock does not + * succeed then we return the worst-case 'expires in 1 tick' + * value: + */ + if (spin_trylock(&base->lock)) { + expires = __next_timer_interrupt(base); + spin_unlock(&base->lock); + } else + expires = now + 1; +#else spin_lock(&base->lock); expires = __next_timer_interrupt(base); spin_unlock(&base->lock); +#endif if (time_before_eq(expires, now)) return now; @@ -915,8 +928,29 @@ void update_process_times(int user_tick) */ static unsigned long count_active_tasks(void) { + /* + * On PREEMPT_RT, we are running in the timer softirq thread, + * so consider 1 less running tasks: + */ +#ifdef CONFIG_PREEMPT_RT + return (nr_active() - 1) * FIXED_1; +#else return nr_active() * FIXED_1; +#endif +} + +#ifdef CONFIG_PREEMPT_RT +/* + * Nr of active tasks - counted in fixed-point numbers + */ +static unsigned long count_active_rt_tasks(void) +{ + extern unsigned long rt_nr_running(void); + extern unsigned long rt_nr_uninterruptible(void); + + return (rt_nr_running() + rt_nr_uninterruptible()) * FIXED_1; } +#endif /* * Hmm.. Changed this, as the GNU make sources (load.c) seems to @@ -930,6 +964,8 @@ unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); +unsigned long avenrun_rt[3]; + /* * calc_load - given tick count, update the avenrun load estimates. * This is called while holding a write_lock on xtime_lock. 
@@ -948,6 +984,12 @@ static inline void calc_load(unsigned lo CALC_LOAD(avenrun[2], EXP_15, active_tasks); count += LOAD_FREQ; } while (count < 0); +#ifdef CONFIG_PREEMPT_RT + active_tasks = count_active_rt_tasks(); + CALC_LOAD(avenrun_rt[0], EXP_1, active_tasks); + CALC_LOAD(avenrun_rt[1], EXP_5, active_tasks); + CALC_LOAD(avenrun_rt[2], EXP_15, active_tasks); +#endif } } @@ -1371,7 +1413,7 @@ static void __cpuinit migrate_timers(int old_base = per_cpu(tvec_bases, cpu); new_base = get_cpu_var(tvec_bases); - local_irq_disable(); + local_irq_disable_nort(); double_spin_lock(&new_base->lock, &old_base->lock, smp_processor_id() < cpu); @@ -1388,7 +1430,7 @@ static void __cpuinit migrate_timers(int double_spin_unlock(&new_base->lock, &old_base->lock, smp_processor_id() < cpu); - local_irq_enable(); + local_irq_enable_nort(); put_cpu_var(tvec_bases); } #endif /* CONFIG_HOTPLUG_CPU */ ���������patches/kstat-fix-spurious-system-load-spikes-in-proc-loadavgrt.patch�������������������������������0000664�0000764�0000764�00000012327�11041657730�025151� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From lclaudio@uudg.org Fri Aug 17 21:40:37 2007 Return-Path: <lclaudio@uudg.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.7-deb Received: from sr02-01.mta.terra.com.br (sr02-01.mta.terra.com.br [200.154.152.30]) by mail.tglx.de (Postfix) with ESMTP id 2E9BC65C3D9 for <tglx@linutronix.de>; Fri, 17 Aug 2007 21:40:37 +0200 (CEST) Received: from tiaro.hst.terra.com.br (tiaro.hst.terra.com.br [200.176.10.7]) by bundure.hst.terra.com.br (Postfix) with ESMTP id 459344D7005C; Fri, 17 Aug 2007 16:40:34 -0300 (BRT) X-Terra-Karma: -2% X-Terra-Hash: 9bbc9fa12a67f4c16ad599245ee6a8fb Received-SPF: none (tiaro.hst.terra.com.br: 200.176.10.7 is neither permitted nor denied by domain of uudg.org) client-ip=200.176.10.7; envelope-from=lclaudio@uudg.org; helo=lclaudio.dyndns.org; Received: from lclaudio.dyndns.org (unknown [189.4.11.102]) (authenticated user lc_poa) by tiaro.hst.terra.com.br (Postfix) with ESMTP id 97492214136; Fri, 17 Aug 2007 16:40:32 -0300 (BRT) Received: by lclaudio.dyndns.org (Postfix, from userid 500) id 7530B117DC8; Fri, 17 Aug 2007 16:37:07 -0300 (BRT) Date: Fri, 17 Aug 2007 16:37:06 -0300 From: "Luis Claudio R. Goncalves" <lclaudio@uudg.org> To: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu> Subject: [PATCH] Fixes spurious system load spikes in /proc/loadavgrt Message-ID: <20070817193706.GB18693@unix.sh> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.14 (2007-02-12) X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Hi! The patch I sent to the list had a minor glitch in the path for the second half of the diff. This is the fixed version. Sorry for any disturbance! Best regards, Luis Hello, The values in /proc/loadavgrt are sometimes the real load and sometimes garbage. As you can see in th tests below, it occurs from in 2.6.21.5-rt20 to 2.6.23-rc2-rt2. The code for calc_load(), in kernel/timer.c has not changed much in -rt patches. 
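(Illustration, not part of the original mail: the garbage quoted below falls out of the kernel's fixed-point exponential averaging. The userspace sketch here mirrors the CALC_LOAD()/FIXED_1/EXP_1 arithmetic from include/linux/sched.h of this era and shows how an active-task count that has wrapped "below zero" - the lockless-read case the rt_nr_uninterruptible() clamp further down guards against - poisons the average.)

/* Sketch only: reproduces the kernel's fixed-point load-average update
 * in userspace so the corruption is easy to see. */
#include <stdio.h>

#define FSHIFT   11                      /* bits of fractional precision */
#define FIXED_1  (1 << FSHIFT)           /* 1.0 in fixed point */
#define EXP_1    1884                    /* 1/exp(5sec/1min) in fixed point */

/* same shape as the kernel macro: load = load*exp + n*(1-exp), fixed point */
#define CALC_LOAD(load, exp, n)          \
	load *= (exp);                   \
	load += (n) * (FIXED_1 - (exp)); \
	load >>= FSHIFT;

int main(void)
{
	unsigned long avenrun = 0;

	/* sane sample: two active tasks */
	unsigned long active = 2 * FIXED_1;
	CALC_LOAD(avenrun, EXP_1, active);
	printf("sane sample:    %lu.%02lu\n", avenrun >> FSHIFT,
	       ((avenrun & (FIXED_1 - 1)) * 100) >> FSHIFT);

	/* buggy sample: the unsigned sum went "below zero" and wrapped,
	 * as the lockless counters could before the clamp in the patch */
	unsigned long wrapped = (unsigned long)-2 * FIXED_1;
	CALC_LOAD(avenrun, EXP_1, wrapped);
	printf("wrapped sample: %lu (nonsense)\n", avenrun >> FSHIFT);
	return 0;
}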
[lclaudio@lab sandbox]$ ls /proc/loadavg* /proc/loadavg /proc/loadavgrt [lclaudio@lab sandbox]$ uname -a Linux lab.casa 2.6.21-34.el5rt #1 SMP PREEMPT RT Thu Jul 12 15:26:48 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux [lclaudio@lab sandbox]$ cat /proc/loadavg* 4.57 4.90 4.16 3/146 23499 0.44 0.98 1.78 0/146 23499 ... [lclaudio@lab sandbox]$ cat /proc/loadavg* 4.65 4.80 4.75 5/144 20720 23896.04 -898421.23 383170.94 2/144 20720 [root@neverland ~]# uname -a Linux neverland.casa 2.6.21.5-rt20 #2 SMP PREEMPT RT Fri Jul 1318:31:38 BRT 2007 i686 athlon i386 GNU/Linux [root@neverland ~]# cat /proc/loadavg* 0.16 0.16 0.15 1/184 11240 344.65 0.38 311.71 0/184 11240 [williams@torg ~]$ uname -a Linux torg 2.6.23-rc2-rt2 #14 SMP PREEMPT RT Tue Aug 7 20:07:31 CDT 2007 x86_64 x86_64 x86_64 GNU/Linux [williams@torg ~]$ cat /proc/loadavg* 0.88 0.76 0.57 1/257 7267 122947.70 103790.53 -564712.87 0/257 7267 ----------> Fixes spurious system load spikes observed in /proc/loadavgrt, as described in: Bug 253103: /proc/loadavgrt issues weird results https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253103 Signed-off-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> --- --- kernel/sched_rt.c | 7 +++++++ kernel/timer.c | 14 ++++++++------ 2 files changed, 15 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -149,6 +149,13 @@ unsigned long rt_nr_uninterruptible(void for_each_online_cpu(i) sum += cpu_rq(i)->rt.rt_nr_uninterruptible; + /* + * Since we read the counters lockless, it might be slightly + * inaccurate. Do not allow it to go below zero though: + */ + if (unlikely((long)sum < 0)) + sum = 0; + return sum; } Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -973,23 +973,25 @@ unsigned long avenrun_rt[3]; static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ + unsigned long active_rt_tasks; /* fixed-point */ static int count = LOAD_FREQ; count -= ticks; if (unlikely(count < 0)) { active_tasks = count_active_tasks(); + active_rt_tasks = count_active_rt_tasks(); do { CALC_LOAD(avenrun[0], EXP_1, active_tasks); CALC_LOAD(avenrun[1], EXP_5, active_tasks); CALC_LOAD(avenrun[2], EXP_15, active_tasks); - count += LOAD_FREQ; - } while (count < 0); #ifdef CONFIG_PREEMPT_RT - active_tasks = count_active_rt_tasks(); - CALC_LOAD(avenrun_rt[0], EXP_1, active_tasks); - CALC_LOAD(avenrun_rt[1], EXP_5, active_tasks); - CALC_LOAD(avenrun_rt[2], EXP_15, active_tasks); + CALC_LOAD(avenrun_rt[0], EXP_1, active_tasks); + CALC_LOAD(avenrun_rt[1], EXP_5, active_tasks); + CALC_LOAD(avenrun_rt[2], EXP_15, active_tasks); #endif + count += LOAD_FREQ; + + } while (count < 0); } } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-usb.patch������������������������������������������������������������������0000664�0000764�0000764�00000005630�11041657730�016331� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/net/usb/usbnet.c | 2 ++ drivers/usb/core/devio.c | 8 +++++--- drivers/usb/core/message.c | 11 ++++++----- 3 files changed, 13 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/drivers/net/usb/usbnet.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/usb/usbnet.c +++ linux-2.6.24.7/drivers/net/usb/usbnet.c @@ -905,6 +905,8 @@ static void tx_complete (struct urb *urb urb->dev = NULL; entry->state = tx_done; + spin_lock_rt(&dev->txq.lock); + spin_unlock_rt(&dev->txq.lock); defer_bh(dev, skb, &dev->txq); } Index: linux-2.6.24.7/drivers/usb/core/devio.c =================================================================== --- linux-2.6.24.7.orig/drivers/usb/core/devio.c +++ linux-2.6.24.7/drivers/usb/core/devio.c @@ -307,10 +307,12 @@ static void async_completed(struct urb * struct async *as = urb->context; struct dev_state *ps = as->ps; struct siginfo sinfo; + unsigned long flags; + + spin_lock_irqsave(&ps->lock, flags); + list_move_tail(&as->asynclist, &ps->async_completed); + spin_unlock_irqrestore(&ps->lock, flags); - spin_lock(&ps->lock); - list_move_tail(&as->asynclist, &ps->async_completed); - spin_unlock(&ps->lock); as->status = urb->status; if (as->signr) { sinfo.si_signo = as->signr; Index: linux-2.6.24.7/drivers/usb/core/message.c =================================================================== --- linux-2.6.24.7.orig/drivers/usb/core/message.c +++ linux-2.6.24.7/drivers/usb/core/message.c @@ -259,8 +259,9 @@ static void sg_complete (struct urb *urb { struct usb_sg_request *io = urb->context; int status = urb->status; + unsigned long flags; - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); /* In 2.5 we require hcds' endpoint queues not to progress after fault * reports, until the completion callback (this!) returns. That lets @@ -294,7 +295,7 @@ static void sg_complete (struct urb *urb * unlink pending urbs so they won't rx/tx bad data. * careful: unlink can sometimes be synchronous... 
*/ - spin_unlock (&io->lock); + spin_unlock_irqrestore (&io->lock, flags); for (i = 0, found = 0; i < io->entries; i++) { if (!io->urbs [i] || !io->urbs [i]->dev) continue; @@ -309,7 +310,7 @@ static void sg_complete (struct urb *urb } else if (urb == io->urbs [i]) found = 1; } - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); } urb->dev = NULL; @@ -319,7 +320,7 @@ static void sg_complete (struct urb *urb if (!io->count) complete (&io->complete); - spin_unlock (&io->lock); + spin_unlock_irqrestore (&io->lock, flags); } @@ -600,7 +601,7 @@ void usb_sg_cancel (struct usb_sg_reques dev_warn (&io->dev->dev, "%s, unlink --> %d\n", __FUNCTION__, retval); } - spin_lock (&io->lock); + spin_lock_irqsave (&io->lock, flags); } spin_unlock_irqrestore (&io->lock, flags); } ��������������������������������������������������������������������������������������������������������patches/preempt-realtime-warn-and-bug-on-fix.patch��������������������������������������������������0000664�0000764�0000764�00000001645�11041657734�021226� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� To fix the following compile error by enclosing it in ifndef __ASSEMBLY__/endif. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - include/asm-generic/bug.h include/asm-generic/bug.h: Assembler messages: include/asm-generic/bug.h:7: Error: Unrecognized opcode: `extern' - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Signed-off-by: Tsutomu Owa <tsutomu.owa@toshiba.co.jp> -- owa --- include/asm-generic/bug.h | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/include/asm-generic/bug.h =================================================================== --- linux-2.6.24.7.orig/include/asm-generic/bug.h +++ linux-2.6.24.7/include/asm-generic/bug.h @@ -3,7 +3,9 @@ #include <linux/compiler.h> +#ifndef __ASSEMBLY__ extern void __WARN_ON(const char *func, const char *file, const int line); +#endif /* __ASSEMBLY__ */ #ifdef CONFIG_BUG �������������������������������������������������������������������������������������������patches/preempt-realtime-supress-cpulock-warning.patch����������������������������������������������0000664�0000764�0000764�00000001071�11041657735�022345� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/cpu.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/cpu.c =================================================================== --- linux-2.6.24.7.orig/kernel/cpu.c +++ linux-2.6.24.7/kernel/cpu.c @@ -37,12 +37,14 @@ void lock_cpu_hotplug(void) struct task_struct *tsk = current; if (tsk == recursive) { +#ifdef CONFIG_PREEMPT_RT static int warnings = 10; if (warnings) { printk(KERN_ERR "Lukewarm IQ detected in hotplug locking\n"); WARN_ON(1); warnings--; } +#endif recursive_depth++; return; } 
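(Aside, not part of the series: the __ASSEMBLY__ guard that preempt-realtime-warn-and-bug-on-fix.patch above adds around the __WARN_ON() prototype is the standard idiom for any header included from both C and .S files. A minimal illustration on a hypothetical header:)

/* illustration only -- hypothetical header name, same guard pattern */
#ifndef _EXAMPLE_BUG_H
#define _EXAMPLE_BUG_H

#ifndef __ASSEMBLY__
/* C prototypes are hidden from the assembler, which would otherwise
 * report "Error: Unrecognized opcode: `extern'" */
extern void __WARN_ON(const char *func, const char *file, const int line);
#endif /* __ASSEMBLY__ */

#endif /* _EXAMPLE_BUG_H */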
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-supress-nohz-softirq-warning.patch�����������������������������������������0000664�0000764�0000764�00000001131�11041657731�023341� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/time/tick-sched.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/time/tick-sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-sched.c +++ linux-2.6.24.7/kernel/time/tick-sched.c @@ -182,6 +182,8 @@ void tick_nohz_stop_sched_tick(void) goto end; cpu = smp_processor_id(); + +#ifndef CONFIG_PREEMPT_RT if (unlikely(local_softirq_pending())) { static int ratelimit; @@ -191,6 +193,7 @@ void tick_nohz_stop_sched_tick(void) ratelimit++; } } +#endif now = ktime_get(); /* ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-net.patch������������������������������������������������������������������0000664�0000764�0000764�00000041727�11041657730�016335� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/netdevice.h | 6 +-- include/net/dn_dev.h | 6 +-- net/core/dev.c | 39 +++++++++++++++++++++---- net/core/netpoll.c | 62 +++++++++++++++++++++++++--------------- net/core/sock.c | 2 - net/decnet/dn_dev.c | 44 ++++++++++++++-------------- net/ipv4/icmp.c | 5 ++- net/ipv4/route.c | 4 +- net/ipv6/netfilter/ip6_tables.c | 4 +- net/sched/sch_generic.c | 12 +++++-- net/unix/af_unix.c | 1 11 files changed, 120 insertions(+), 65 deletions(-) Index: linux-2.6.24.7/include/linux/netdevice.h =================================================================== --- linux-2.6.24.7.orig/include/linux/netdevice.h +++ linux-2.6.24.7/include/linux/netdevice.h @@ -1349,20 +1349,20 @@ static inline void __netif_tx_lock(struc static inline void netif_tx_lock(struct net_device *dev) { - __netif_tx_lock(dev, smp_processor_id()); + __netif_tx_lock(dev, raw_smp_processor_id()); } static inline void netif_tx_lock_bh(struct net_device *dev) { spin_lock_bh(&dev->_xmit_lock); - dev->xmit_lock_owner = smp_processor_id(); + dev->xmit_lock_owner = raw_smp_processor_id(); } static inline 
int netif_tx_trylock(struct net_device *dev) { int ok = spin_trylock(&dev->_xmit_lock); if (likely(ok)) - dev->xmit_lock_owner = smp_processor_id(); + dev->xmit_lock_owner = raw_smp_processor_id(); return ok; } Index: linux-2.6.24.7/include/net/dn_dev.h =================================================================== --- linux-2.6.24.7.orig/include/net/dn_dev.h +++ linux-2.6.24.7/include/net/dn_dev.h @@ -76,9 +76,9 @@ struct dn_dev_parms { int priority; /* Priority to be a router */ char *name; /* Name for sysctl */ int ctl_name; /* Index for sysctl */ - int (*up)(struct net_device *); - void (*down)(struct net_device *); - void (*timer3)(struct net_device *, struct dn_ifaddr *ifa); + int (*dn_up)(struct net_device *); + void (*dn_down)(struct net_device *); + void (*dn_timer3)(struct net_device *, struct dn_ifaddr *ifa); void *sysctl; }; Index: linux-2.6.24.7/net/core/dev.c =================================================================== --- linux-2.6.24.7.orig/net/core/dev.c +++ linux-2.6.24.7/net/core/dev.c @@ -1692,9 +1692,16 @@ gso: Either shot noqueue qdisc, it is even simpler 8) */ if (dev->flags & IFF_UP) { - int cpu = smp_processor_id(); /* ok because BHs are off */ + int cpu = raw_smp_processor_id(); /* ok because BHs are off */ + /* + * No need to check for recursion with threaded interrupts: + */ +#ifdef CONFIG_PREEMPT_RT + if (1) { +#else if (dev->xmit_lock_owner != cpu) { +#endif HARD_TX_LOCK(dev, cpu); @@ -1830,7 +1837,8 @@ static inline struct net_device *skb_bon static void net_tx_action(struct softirq_action *h) { - struct softnet_data *sd = &__get_cpu_var(softnet_data); + struct softnet_data *sd = &per_cpu(softnet_data, + raw_smp_processor_id()); if (sd->completion_queue) { struct sk_buff *clist; @@ -1846,6 +1854,11 @@ static void net_tx_action(struct softirq BUG_TRAP(!atomic_read(&skb->users)); __kfree_skb(skb); + /* + * Safe to reschedule - the list is private + * at this point. + */ + cond_resched_softirq_context(); } } @@ -1864,12 +1877,27 @@ static void net_tx_action(struct softirq smp_mb__before_clear_bit(); clear_bit(__LINK_STATE_SCHED, &dev->state); + /* + * We are executing in softirq context here, and + * if softirqs are preemptible, we must avoid + * infinite reactivation of the softirq by + * either the tx handler, or by netif_schedule(). + * (it would result in an infinitely looping + * softirq context) + * So we take the spinlock unconditionally. 
+ */ +#ifdef CONFIG_PREEMPT_SOFTIRQS + spin_lock(&dev->queue_lock); + qdisc_run(dev); + spin_unlock(&dev->queue_lock); +#else if (spin_trylock(&dev->queue_lock)) { qdisc_run(dev); spin_unlock(&dev->queue_lock); } else { netif_schedule(dev); } +#endif } } } @@ -2037,7 +2065,7 @@ int netif_receive_skb(struct sk_buff *sk if (!orig_dev) return NET_RX_DROP; - __get_cpu_var(netdev_rx_stat).total++; + per_cpu(netdev_rx_stat, raw_smp_processor_id()).total++; skb_reset_network_header(skb); skb_reset_transport_header(skb); @@ -2104,9 +2132,10 @@ out: static int process_backlog(struct napi_struct *napi, int quota) { int work = 0; - struct softnet_data *queue = &__get_cpu_var(softnet_data); + struct softnet_data *queue; unsigned long start_time = jiffies; + queue = &per_cpu(softnet_data, raw_smp_processor_id()); napi->weight = weight_p; do { struct sk_buff *skb; @@ -2144,7 +2173,7 @@ void fastcall __napi_schedule(struct nap local_irq_save(flags); list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list); - __raise_softirq_irqoff(NET_RX_SOFTIRQ); + raise_softirq_irqoff(NET_RX_SOFTIRQ); local_irq_restore(flags); } EXPORT_SYMBOL(__napi_schedule); Index: linux-2.6.24.7/net/core/netpoll.c =================================================================== --- linux-2.6.24.7.orig/net/core/netpoll.c +++ linux-2.6.24.7/net/core/netpoll.c @@ -64,20 +64,20 @@ static void queue_process(struct work_st continue; } - local_irq_save(flags); + local_irq_save_nort(flags); netif_tx_lock(dev); if ((netif_queue_stopped(dev) || netif_subqueue_stopped(dev, skb)) || dev->hard_start_xmit(skb, dev) != NETDEV_TX_OK) { skb_queue_head(&npinfo->txq, skb); netif_tx_unlock(dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); schedule_delayed_work(&npinfo->tx_work, HZ/10); return; } netif_tx_unlock(dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); } } @@ -146,7 +146,7 @@ static void poll_napi(struct netpoll *np int budget = 16; list_for_each_entry(napi, &np->dev->napi_list, dev_list) { - if (napi->poll_owner != smp_processor_id() && + if (napi->poll_owner != raw_smp_processor_id() && spin_trylock(&napi->poll_lock)) { budget = poll_one_napi(npinfo, napi, budget); spin_unlock(&napi->poll_lock); @@ -205,30 +205,33 @@ static void refill_skbs(void) static void zap_completion_queue(void) { - unsigned long flags; struct softnet_data *sd = &get_cpu_var(softnet_data); + struct sk_buff *clist = NULL; + unsigned long flags; if (sd->completion_queue) { - struct sk_buff *clist; - local_irq_save(flags); clist = sd->completion_queue; sd->completion_queue = NULL; local_irq_restore(flags); - - while (clist != NULL) { - struct sk_buff *skb = clist; - clist = clist->next; - if (skb->destructor) { - atomic_inc(&skb->users); - dev_kfree_skb_any(skb); /* put this one back */ - } else { - __kfree_skb(skb); - } - } } + /* + * Took the list private, can drop our softnet + * reference: + */ put_cpu_var(softnet_data); + + while (clist != NULL) { + struct sk_buff *skb = clist; + clist = clist->next; + if (skb->destructor) { + atomic_inc(&skb->users); + dev_kfree_skb_any(skb); /* put this one back */ + } else { + __kfree_skb(skb); + } + } } static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve) @@ -236,13 +239,26 @@ static struct sk_buff *find_skb(struct n int count = 0; struct sk_buff *skb; +#ifdef CONFIG_PREEMPT_RT + /* + * On -rt skb_pool.lock is schedulable, so if we are + * in an atomic context we just try to dequeue from the + * pool and fail if we cannot get one. 
+ */ + if (in_atomic() || irqs_disabled()) + goto pick_atomic; +#endif zap_completion_queue(); refill_skbs(); repeat: skb = alloc_skb(len, GFP_ATOMIC); - if (!skb) + if (!skb) { +#ifdef CONFIG_PREEMPT_RT +pick_atomic: +#endif skb = skb_dequeue(&skb_pool); + } if (!skb) { if (++count < 10) { @@ -262,7 +278,7 @@ static int netpoll_owner_active(struct n struct napi_struct *napi; list_for_each_entry(napi, &dev->napi_list, dev_list) { - if (napi->poll_owner == smp_processor_id()) + if (napi->poll_owner == raw_smp_processor_id()) return 1; } return 0; @@ -284,7 +300,7 @@ static void netpoll_send_skb(struct netp if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) { unsigned long flags; - local_irq_save(flags); + local_irq_save_nort(flags); /* try until next clock tick */ for (tries = jiffies_to_usecs(1)/USEC_PER_POLL; tries > 0; --tries) { @@ -304,7 +320,7 @@ static void netpoll_send_skb(struct netp udelay(USEC_PER_POLL); } - local_irq_restore(flags); + local_irq_restore_nort(flags); } if (status != NETDEV_TX_OK) { @@ -727,7 +743,7 @@ int netpoll_setup(struct netpoll *np) np->name); break; } - cond_resched(); + schedule_timeout_uninterruptible(1); } /* If carrier appears to come up instantly, we don't Index: linux-2.6.24.7/net/core/sock.c =================================================================== --- linux-2.6.24.7.orig/net/core/sock.c +++ linux-2.6.24.7/net/core/sock.c @@ -1504,7 +1504,7 @@ static void sock_def_readable(struct soc { read_lock(&sk->sk_callback_lock); if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) - wake_up_interruptible(sk->sk_sleep); + wake_up_interruptible_sync(sk->sk_sleep); sk_wake_async(sk,1,POLL_IN); read_unlock(&sk->sk_callback_lock); } Index: linux-2.6.24.7/net/decnet/dn_dev.c =================================================================== --- linux-2.6.24.7.orig/net/decnet/dn_dev.c +++ linux-2.6.24.7/net/decnet/dn_dev.c @@ -90,9 +90,9 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ethernet", .ctl_name = NET_DECNET_CONF_ETHER, - .up = dn_eth_up, - .down = dn_eth_down, - .timer3 = dn_send_brd_hello, + .dn_up = dn_eth_up, + .dn_down = dn_eth_down, + .dn_timer3 = dn_send_brd_hello, }, { .type = ARPHRD_IPGRE, /* DECnet tunneled over GRE in IP */ @@ -102,7 +102,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ipgre", .ctl_name = NET_DECNET_CONF_GRE, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, }, #if 0 { @@ -113,7 +113,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 120, .name = "x25", .ctl_name = NET_DECNET_CONF_X25, - .timer3 = dn_send_ptp_hello, + .dn_timer3 = dn_send_ptp_hello, }, #endif #if 0 @@ -125,7 +125,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "ppp", .ctl_name = NET_DECNET_CONF_PPP, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, }, #endif { @@ -136,7 +136,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 120, .name = "ddcmp", .ctl_name = NET_DECNET_CONF_DDCMP, - .timer3 = dn_send_ptp_hello, + .dn_timer3 = dn_send_ptp_hello, }, { .type = ARPHRD_LOOPBACK, /* Loopback interface - always last */ @@ -146,7 +146,7 @@ static struct dn_dev_parms dn_dev_list[] .t3 = 10, .name = "loopback", .ctl_name = NET_DECNET_CONF_LOOPBACK, - .timer3 = dn_send_brd_hello, + .dn_timer3 = dn_send_brd_hello, } }; @@ -327,11 +327,11 @@ static int dn_forwarding_proc(ctl_table */ tmp = dn_db->parms.forwarding; dn_db->parms.forwarding = old; - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); 
dn_db->parms.forwarding = tmp; - if (dn_db->parms.up) - dn_db->parms.up(dev); + if (dn_db->parms.dn_up) + dn_db->parms.dn_up(dev); } return err; @@ -365,11 +365,11 @@ static int dn_forwarding_sysctl(ctl_tabl if (value > 2) return -EINVAL; - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); dn_db->parms.forwarding = value; - if (dn_db->parms.up) - dn_db->parms.up(dev); + if (dn_db->parms.dn_up) + dn_db->parms.dn_up(dev); } return 0; @@ -1090,10 +1090,10 @@ static void dn_dev_timer_func(unsigned l struct dn_ifaddr *ifa; if (dn_db->t3 <= dn_db->parms.t2) { - if (dn_db->parms.timer3) { + if (dn_db->parms.dn_timer3) { for(ifa = dn_db->ifa_list; ifa; ifa = ifa->ifa_next) { if (!(ifa->ifa_flags & IFA_F_SECONDARY)) - dn_db->parms.timer3(dev, ifa); + dn_db->parms.dn_timer3(dev, ifa); } } dn_db->t3 = dn_db->parms.t3; @@ -1152,8 +1152,8 @@ struct dn_dev *dn_dev_create(struct net_ return NULL; } - if (dn_db->parms.up) { - if (dn_db->parms.up(dev) < 0) { + if (dn_db->parms.dn_up) { + if (dn_db->parms.dn_up(dev) < 0) { neigh_parms_release(&dn_neigh_table, dn_db->neigh_parms); dev->dn_ptr = NULL; kfree(dn_db); @@ -1247,8 +1247,8 @@ static void dn_dev_delete(struct net_dev dn_dev_check_default(dev); neigh_ifdown(&dn_neigh_table, dev); - if (dn_db->parms.down) - dn_db->parms.down(dev); + if (dn_db->parms.dn_down) + dn_db->parms.dn_down(dev); dev->dn_ptr = NULL; Index: linux-2.6.24.7/net/ipv4/icmp.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/icmp.c +++ linux-2.6.24.7/net/ipv4/icmp.c @@ -229,7 +229,10 @@ static const struct icmp_control icmp_po * On SMP we have one ICMP socket per-cpu. */ static DEFINE_PER_CPU(struct socket *, __icmp_socket) = NULL; -#define icmp_socket __get_cpu_var(__icmp_socket) +/* + * Should be safe on PREEMPT_SOFTIRQS/HARDIRQS to use raw-smp-processor-id: + */ +#define icmp_socket per_cpu(__icmp_socket, raw_smp_processor_id()) static __inline__ int icmp_xmit_lock(void) { Index: linux-2.6.24.7/net/ipv4/route.c =================================================================== --- linux-2.6.24.7.orig/net/ipv4/route.c +++ linux-2.6.24.7/net/ipv4/route.c @@ -208,13 +208,13 @@ struct rt_hash_bucket { struct rtable *chain; }; #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \ - defined(CONFIG_PROVE_LOCKING) + defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_PREEMPT_RT) /* * Instead of using one spinlock for each rt_hash_bucket, we use a table of spinlocks * The size of this table is a power of two and depends on the number of CPUS. 
* (on lockdep we have a quite big spinlock_t, so keep the size down there) */ -#ifdef CONFIG_LOCKDEP +#if defined(CONFIG_LOCKDEP) || defined(CONFIG_PREEMPT_RT) # define RT_HASH_LOCK_SZ 256 #else # if NR_CPUS >= 32 Index: linux-2.6.24.7/net/ipv6/netfilter/ip6_tables.c =================================================================== --- linux-2.6.24.7.orig/net/ipv6/netfilter/ip6_tables.c +++ linux-2.6.24.7/net/ipv6/netfilter/ip6_tables.c @@ -380,7 +380,7 @@ ip6t_do_table(struct sk_buff *skb, read_lock_bh(&table->lock); private = table->private; IP_NF_ASSERT(table->valid_hooks & (1 << hook)); - table_base = (void *)private->entries[smp_processor_id()]; + table_base = (void *)private->entries[raw_smp_processor_id()]; e = get_entry(table_base, private->hook_entry[hook]); /* For return from builtin chain */ @@ -1190,7 +1190,7 @@ do_add_counters(void __user *user, unsig i = 0; /* Choose the copy that is on our node */ - loc_cpu_entry = private->entries[smp_processor_id()]; + loc_cpu_entry = private->entries[raw_smp_processor_id()]; IP6T_ENTRY_ITERATE(loc_cpu_entry, private->size, add_counter_to_entry, Index: linux-2.6.24.7/net/sched/sch_generic.c =================================================================== --- linux-2.6.24.7.orig/net/sched/sch_generic.c +++ linux-2.6.24.7/net/sched/sch_generic.c @@ -12,6 +12,7 @@ */ #include <linux/bitops.h> +#include <linux/kallsyms.h> #include <linux/module.h> #include <linux/types.h> #include <linux/kernel.h> @@ -24,6 +25,7 @@ #include <linux/init.h> #include <linux/rcupdate.h> #include <linux/list.h> +#include <linux/delay.h> #include <net/pkt_sched.h> /* Main transmission queue. */ @@ -87,7 +89,7 @@ static inline int handle_dev_cpu_collisi { int ret; - if (unlikely(dev->xmit_lock_owner == smp_processor_id())) { + if (unlikely(dev->xmit_lock_owner == raw_smp_processor_id())) { /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. We @@ -144,7 +146,7 @@ static inline int qdisc_restart(struct n /* And release queue */ spin_unlock(&dev->queue_lock); - HARD_TX_LOCK(dev, smp_processor_id()); + HARD_TX_LOCK(dev, raw_smp_processor_id()); if (!netif_subqueue_stopped(dev, skb)) ret = dev_hard_start_xmit(skb, dev); HARD_TX_UNLOCK(dev); @@ -590,8 +592,12 @@ void dev_deactivate(struct net_device *d /* Wait for outstanding qdisc_run calls. */ do { + /* + * Wait for outstanding qdisc_run calls. + * TODO: shouldnt this be wakeup-based, instead of polling it? 
+ */ while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) - yield(); + msleep(1); /* * Double-check inside queue lock to ensure that all effects Index: linux-2.6.24.7/net/unix/af_unix.c =================================================================== --- linux-2.6.24.7.orig/net/unix/af_unix.c +++ linux-2.6.24.7/net/unix/af_unix.c @@ -338,6 +338,7 @@ static void unix_write_space(struct sock sk_wake_async(sk, 2, POLL_OUT); } read_unlock(&sk->sk_callback_lock); + preempt_check_resched_delayed(); } /* When dgram socket disconnects (or changes its peer), we clear its receive
patches/preempt-realtime-net-softirq-fixups.patch
Subject: NOHZ: local_softirq_pending with tickless From: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> On one of my machines with tickless kernel and plip I get messages: NOHZ: local_softirq_pending 08 always when using plip (on another machine with tickless kernel and plip I get no errors). The bug happens both on 2.6.21 and 2.6.22-rc1. This patch fixes that. Note that plip calls netif_rx neither from hardware interrupt nor from ksoftirqd, so there is no one who would wake ksoftirqd then. netif_rx calls only __raise_softirq_irqoff(NET_RX_SOFTIRQ), which sets the softirq bit, but doesn't wake ksoftirqd. Mikulas Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Removed the remaining users of __raise_softirq_irqoff() as well.
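For reference, the distinction this fix relies on can be sketched in a few lines. This is a simplified, illustrative rendering of the two softirq-raising primitives (the real definitions live in kernel/softirq.c and include/linux/interrupt.h), not a verbatim copy:

/* Illustrative sketch only -- simplified from the kernel sources. */

/* Sets the per-CPU pending bit and nothing else; only safe when the
 * caller knows the softirq will be looked at soon anyway (irq_exit()
 * path, or ksoftirqd already running). */
static inline void sketch___raise_softirq_irqoff(unsigned int nr)
{
	or_softirq_pending(1UL << nr);
}

/* Additionally kicks ksoftirqd when not called from interrupt context,
 * so the pending bit cannot linger unnoticed -- exactly the plip/NOHZ
 * case described above. */
static inline void sketch_raise_softirq_irqoff(unsigned int nr)
{
	sketch___raise_softirq_irqoff(nr);
	if (!in_interrupt())
		wakeup_softirqd();
}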
tglx --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/net/core/dev.c =================================================================== --- linux-2.6.24.7.orig/net/core/dev.c +++ linux-2.6.24.7/net/core/dev.c @@ -2267,7 +2267,7 @@ out: softnet_break: __get_cpu_var(netdev_rx_stat).time_squeeze++; - __raise_softirq_irqoff(NET_RX_SOFTIRQ); + raise_softirq_irqoff(NET_RX_SOFTIRQ); goto out; } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-realtime-loopback.patch�������������������������������������������������������������0000664�0000764�0000764�00000000671�11041657731�017333� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/net/loopback.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/drivers/net/loopback.c =================================================================== --- linux-2.6.24.7.orig/drivers/net/loopback.c +++ linux-2.6.24.7/drivers/net/loopback.c @@ -160,7 +160,7 @@ static int loopback_xmit(struct sk_buff lb_stats->packets++; put_cpu(); - netif_rx(skb); + netif_rx_ni(skb); return 0; } �����������������������������������������������������������������������patches/preempt-realtime-mellanox-driver-fix.patch��������������������������������������������������0000664�0000764�0000764�00000006625�11041657732�021443� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-rt-users-owner@vger.kernel.org Fri Aug 24 11:25:36 2007 Return-Path: <linux-rt-users-owner@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id A70B065C292; Fri, 24 Aug 2007 11:25:36 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755219AbXHXJZe (ORCPT <rfc822;jan.altenberg@linutronix.de> + 1 other); Fri, 24 Aug 2007 05:25:34 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755338AbXHXJZe (ORCPT <rfc822;linux-rt-users-outgoing>); Fri, 24 Aug 2007 05:25:34 -0400 Received: from victor.provo.novell.com ([137.65.250.26]:55526 "EHLO victor.provo.novell.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755219AbXHXJZd (ORCPT <rfc822;linux-rt-users@vger.kernel.org>); Fri, 24 Aug 2007 05:25:33 -0400 Received: from [192.168.0.203] (prv-dmz-foundry1.gns.novell.com [137.65.251.211]) by victor.provo.novell.com with ESMTP (TLS encrypted); Fri, 24 Aug 2007 03:25:27 -0600 Subject: [PATCH RT] - Mellanox IB driver patch From: Sven-Thorsten Dietrich 
<sdietrich@novell.com> To: Ingo Molnar <mingo@elte.hu> Cc: "Michael S. Tsirkin" <mst@dev.mellanox.co.il>, LKML <Linux-kernel@vger.kernel.org>, RT Users List <linux-rt-users@vger.kernel.org>, Linux Solutions Group List <lsg@lists.novell.com> Content-Type: text/plain Organization: Suse Date: Fri, 24 Aug 2007 02:25:26 -0700 Message-Id: <1187947526.16573.56.camel@sx.thebigcorporation.com> Mime-Version: 1.0 X-Mailer: Evolution 2.10.3 (2.10.3-2.fc7) Sender: linux-rt-users-owner@vger.kernel.org Precedence: bulk X-Mailing-List: linux-rt-users@vger.kernel.org X-Filter-To: .Kernel.rt-users X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Hi Ingo, RT driver patch to eliminate in_atomic stack dump. The problem code was identified by Michael S. Tsirkin, and he suggested the fix. I adapted to use RT's _nort primitives- should work correctly in all configs. Thanks, Sven Fixes in_atomic stack-dump, when Mellanox module is loaded into the RT Kernel. From: Michael S. Tsirkin <mst@dev.mellanox.co.il> "Basically, if you just make spin_lock_irqsave (and spin_lock_irq) not disable interrupts for non-raw spinlocks, I think all of infiniband will be fine without changes." signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- linux-2.6.24.7.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ linux-2.6.24.7/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -768,7 +768,7 @@ void ipoib_mcast_restart_task(struct wor ipoib_mcast_stop_thread(dev, 0); - local_irq_save(flags); + local_irq_save_nort(flags); netif_tx_lock(dev); spin_lock(&priv->lock); @@ -851,7 +851,7 @@ void ipoib_mcast_restart_task(struct wor spin_unlock(&priv->lock); netif_tx_unlock(dev); - local_irq_restore(flags); + local_irq_restore_nort(flags); /* We have to cancel outside of the spinlock */ list_for_each_entry_safe(mcast, tmcast, &remove_list, list) { �����������������������������������������������������������������������������������������������������������patches/preempt-realtime-supress-rtc-printk.patch���������������������������������������������������0000664�0000764�0000764�00000001053�11041657733�021335� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/char/rtc.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/drivers/char/rtc.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/rtc.c +++ linux-2.6.24.7/drivers/char/rtc.c @@ -1341,8 +1341,10 @@ static void rtc_dropped_irq(unsigned lon spin_unlock_irq(&rtc_lock); +#ifndef CONFIG_PREEMPT_RT if (printk_ratelimit()) printk(KERN_WARNING "rtc: lost some interrupts at %ldHz.\n", freq); +#endif /* Now we have new data */ wake_up_interruptible(&rtc_wait); 
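The local_irq_save_nort()/local_irq_restore_nort() conversion in the IPoIB hunk above follows a recurring -rt pattern: on mainline the code really does want local interrupts off, but on PREEMPT_RT the spinlocks taken inside are sleeping locks, so interrupts must stay enabled. A rough sketch of how such wrappers are typically provided -- the names match the patch, but the fallback definitions below are an illustration of the idea, not copied from the -rt tree:

/* Illustrative assumption: on !PREEMPT_RT the _nort variants behave
 * exactly like the plain primitives; on PREEMPT_RT they only sample
 * the flags and leave interrupts enabled. */
#ifndef CONFIG_PREEMPT_RT
# define local_irq_save_nort(flags)	local_irq_save(flags)
# define local_irq_restore_nort(flags)	local_irq_restore(flags)
#else
# define local_irq_save_nort(flags)	local_save_flags(flags)
# define local_irq_restore_nort(flags)	((void)(flags))
#endif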
�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/hrtimer-no-printk.patch���������������������������������������������������������������������0000664�0000764�0000764�00000002066�11041673174�015657� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/hrtimer.c | 2 -- kernel/time/timekeeping.c | 2 ++ 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -632,8 +632,6 @@ static int hrtimer_switch_to_hres(void) /* "Retrigger" the interrupt to get things going */ retrigger_next_event(NULL); local_irq_restore(flags); - printk(KERN_DEBUG "Switched to high resolution mode on CPU %d\n", - smp_processor_id()); return 1; } Index: linux-2.6.24.7/kernel/time/timekeeping.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/timekeeping.c +++ linux-2.6.24.7/kernel/time/timekeeping.c @@ -204,8 +204,10 @@ static void change_clocksource(void) tick_clock_notify(); +#ifndef CONFIG_PREEMPT_RT printk(KERN_INFO "Time: %s clocksource has been installed.\n", clock->name); +#endif } #else static inline void change_clocksource(void) { } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nmi-profiling.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000006366�11041657732�015051� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/irq_32.c | 2 ++ arch/x86/kernel/nmi_32.c | 5 ++--- arch/x86/kernel/nmi_64.c | 4 ++-- drivers/char/sysrq.c | 2 +- include/asm-x86/apic_64.h | 2 ++ include/linux/sched.h | 1 + 6 files changed, 10 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/irq_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/irq_32.c +++ linux-2.6.24.7/arch/x86/kernel/irq_32.c @@ -79,7 +79,9 @@ fastcall unsigned int do_IRQ(struct pt_r u32 *isp; #endif +#ifdef CONFIG_X86_LOCAL_APIC 
irq_show_regs_callback(smp_processor_id(), regs); +#endif if (unlikely((unsigned)irq >= NR_IRQS)) { printk(KERN_EMERG "%s: cannot handle IRQ %d\n", Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -348,9 +348,9 @@ void nmi_show_all_regs(void) } } -static DEFINE_SPINLOCK(nmi_print_lock); +static DEFINE_RAW_SPINLOCK(nmi_print_lock); -void irq_show_regs_callback(int cpu, struct pt_regs *regs) +notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) { if (!nmi_show_regs[cpu]) return; @@ -435,7 +435,6 @@ nmi_watchdog_tick(struct pt_regs * regs, for_each_online_cpu(i) alert_counter[i] = 0; - } } else { Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -338,9 +338,9 @@ void nmi_show_all_regs(void) } } -static DEFINE_SPINLOCK(nmi_print_lock); +static DEFINE_RAW_SPINLOCK(nmi_print_lock); -void irq_show_regs_callback(int cpu, struct pt_regs *regs) +notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) { if (!nmi_show_regs[cpu]) return; Index: linux-2.6.24.7/drivers/char/sysrq.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/sysrq.c +++ linux-2.6.24.7/drivers/char/sysrq.c @@ -209,7 +209,7 @@ static struct sysrq_key_op sysrq_showreg .enable_mask = SYSRQ_ENABLE_DUMP, }; -#if defined(__i386__) +#if defined(__i386__) || defined(__x86_64__) static void sysrq_handle_showallregs(int key, struct tty_struct *tty) { Index: linux-2.6.24.7/include/asm-x86/apic_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/apic_64.h +++ linux-2.6.24.7/include/asm-x86/apic_64.h @@ -96,6 +96,8 @@ extern void smp_send_nmi_allbutself(void #define K8_APIC_EXT_INT_MSG_EXT 0x7 #define K8_APIC_EXT_LVT_ENTRY_THRESHOLD 0 +extern void smp_send_nmi_allbutself(void); + #define ARCH_APICTIMER_STOPS_ON_C3 1 extern unsigned boot_cpu_id; Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -292,6 +292,7 @@ static inline void show_state(void) } extern void show_regs(struct pt_regs *); +extern void irq_show_regs_callback(int cpu, struct pt_regs *regs); /* * TASK is a pointer to the task whose backtrace we want to see (or NULL for current ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/panic-dont-stop-box.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001002�11041657734�016064� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/panic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/panic.c 
=================================================================== --- linux-2.6.24.7.orig/kernel/panic.c +++ linux-2.6.24.7/kernel/panic.c @@ -95,7 +95,7 @@ NORET_TYPE void panic(const char * fmt, * unfortunately means it may not be hardened to work in a panic * situation. */ - smp_send_stop(); +// smp_send_stop(); #endif atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
patches/nmi-watchdog-disable.patch
Subject: [patch] x86_64: do not enable the NMI watchdog by default From: Ingo Molnar <mingo@elte.hu> Do not enable the NMI watchdog by default. Now that we have lockdep I cannot remember the last time it caught a real bug, but the NMI watchdog can /cause/ problems. Furthermore, to the typical user, an NMI watchdog assert results in a total lockup anyway (if under X). In that sense, all that the NMI watchdog does is make the system /less/ stable and /less/ debuggable. People can still enable it either after bootup via: echo 1 > /proc/sys/kernel/nmi or via the nmi_watchdog=1 or nmi_watchdog=2 boot options. build and boot tested on an Athlon64 box.
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/apic_64.c | 1 - arch/x86/kernel/io_apic_64.c | 2 -- arch/x86/kernel/nmi_64.c | 2 +- arch/x86/kernel/smpboot_64.c | 1 - include/asm-x86/nmi_64.h | 1 - 5 files changed, 1 insertion(+), 6 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/apic_64.c @@ -535,7 +535,6 @@ void __cpuinit setup_local_APIC (void) oldvalue, value); } - nmi_watchdog_default(); setup_apic_nmi_watchdog(NULL); apic_pm_activate(); } Index: linux-2.6.24.7/arch/x86/kernel/io_apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_64.c @@ -1732,7 +1732,6 @@ static inline void __init check_timer(vo */ unmask_IO_APIC_irq(0); if (!no_timer_check && timer_irq_works()) { - nmi_watchdog_default(); if (nmi_watchdog == NMI_IO_APIC) { disable_8259A_irq(0); setup_nmi(); @@ -1758,7 +1757,6 @@ static inline void __init check_timer(vo setup_ExtINT_IRQ0_pin(apic2, pin2, cfg->vector); if (timer_irq_works()) { apic_printk(APIC_VERBOSE," works.\n"); - nmi_watchdog_default(); if (nmi_watchdog == NMI_IO_APIC) { setup_nmi(); } Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -52,7 +52,7 @@ static DEFINE_PER_CPU(short, wd_enabled) static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu); /* Run after command line and cpu_init init, but before all other checks */ -void nmi_watchdog_default(void) +static inline void nmi_watchdog_default(void) { if (nmi_watchdog != NMI_DEFAULT) return; Index: linux-2.6.24.7/arch/x86/kernel/smpboot_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smpboot_64.c +++ linux-2.6.24.7/arch/x86/kernel/smpboot_64.c @@ -867,7 +867,6 @@ void __init smp_set_apicids(void) */ void __init smp_prepare_cpus(unsigned int max_cpus) { - nmi_watchdog_default(); current_cpu_data = boot_cpu_data; current_thread_info()->cpu = 0; /* needed? 
*/ smp_set_apicids(); Index: linux-2.6.24.7/include/asm-x86/nmi_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/nmi_64.h +++ linux-2.6.24.7/include/asm-x86/nmi_64.h @@ -59,7 +59,6 @@ extern void disable_timer_nmi_watchdog(v extern void enable_timer_nmi_watchdog(void); extern int nmi_watchdog_tick (struct pt_regs * regs, unsigned reason); -extern void nmi_watchdog_default(void); extern int setup_nmi_watchdog(char *); extern atomic_t nmi_active; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/softlockup-add-irq-regs-h.patch�������������������������������������������������������������0000664�0000764�0000764�00000001233�11041657734�017160� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: core: make asm/irq_regs.h available on every platform From: Ingo Molnar <mingo@elte.hu> the softlockup detector would like to use get_irq_regs(), so generalize the availability on every Linux architecture. (it is fine for an architecture to always return NULL to get_irq_regs(), which it does by default.) 
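With asm-generic/irq_regs.h available on every architecture, the softlockup detector can sample the register state of the interrupted context via get_irq_regs(). A minimal sketch of the usual pattern -- the arch interrupt entry publishes the regs, a consumer such as the softlockup watchdog reads them; get_irq_regs()/set_irq_regs() are the asm-generic helpers, while the surrounding functions here are invented for illustration:

#include <linux/sched.h>
#include <asm/irq_regs.h>

/* Arch interrupt entry code typically brackets handling like this: */
void sketch_do_IRQ(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);

	/* ... dispatch the interrupt ... */

	set_irq_regs(old_regs);
}

/* A consumer -- e.g. the softlockup watchdog -- can then do: */
void sketch_report_lockup(void)
{
	struct pt_regs *regs = get_irq_regs();

	if (regs)
		show_regs(regs);	/* show where the stuck CPU was */
}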
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/asm-arm26/irq_regs.h | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/include/asm-arm26/irq_regs.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/asm-arm26/irq_regs.h @@ -0,0 +1 @@ +#include <asm-generic/irq_regs.h> ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/gtod-optimize.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000001153�11041657733�015060� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/timer.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -1012,6 +1012,13 @@ static inline void update_times(void) static unsigned long last_tick = INITIAL_JIFFIES; unsigned long ticks, flags; + /* + * Dont take the xtime_lock from every CPU in + * every tick - only when needed: + */ + if (jiffies == last_tick) + return; + write_seqlock_irqsave(&xtime_lock, flags); ticks = jiffies - last_tick; if (ticks) { ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-various-fixups.patch��������������������������������������������������������������������0000664�0000764�0000764�00000004004�11041657731�016054� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- security/selinux/avc.c | 9 +++++++++ security/selinux/netif.c | 2 ++ 2 files changed, 11 insertions(+) Index: linux-2.6.24.7/security/selinux/avc.c =================================================================== --- linux-2.6.24.7.orig/security/selinux/avc.c +++ linux-2.6.24.7/security/selinux/avc.c @@ -312,6 +312,7 @@ static inline int avc_reclaim_node(void) if (!spin_trylock_irqsave(&avc_cache.slots_lock[hvalue], flags)) continue; + rcu_read_lock(); list_for_each_entry(node, &avc_cache.slots[hvalue], list) { if (atomic_dec_and_test(&node->ae.used)) { /* Recently Unused */ @@ -319,11 +320,13 @@ static inline int avc_reclaim_node(void) avc_cache_stats_incr(reclaims); ecx++; if (ecx >= AVC_CACHE_RECLAIM) { + rcu_read_unlock(); spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flags); goto out; } } } + 
rcu_read_unlock(); spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flags); } out: @@ -807,8 +810,14 @@ int avc_ss_reset(u32 seqno) for (i = 0; i < AVC_CACHE_SLOTS; i++) { spin_lock_irqsave(&avc_cache.slots_lock[i], flag); + /* + * On -rt the outer spinlock does not prevent RCU + * from being performed: + */ + rcu_read_lock(); list_for_each_entry(node, &avc_cache.slots[i], list) avc_node_delete(node); + rcu_read_unlock(); spin_unlock_irqrestore(&avc_cache.slots_lock[i], flag); } Index: linux-2.6.24.7/security/selinux/netif.c =================================================================== --- linux-2.6.24.7.orig/security/selinux/netif.c +++ linux-2.6.24.7/security/selinux/netif.c @@ -210,6 +210,7 @@ static void sel_netif_flush(void) { int idx; + rcu_read_lock(); spin_lock_bh(&sel_netif_lock); for (idx = 0; idx < SEL_NETIF_HASH_SIZE; idx++) { struct sel_netif *netif; @@ -218,6 +219,7 @@ static void sel_netif_flush(void) sel_netif_destroy(netif); } spin_unlock_bh(&sel_netif_lock); + rcu_read_unlock(); } static int sel_netif_avc_callback(u32 event, u32 ssid, u32 tsid, ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/futex-performance-hack.patch����������������������������������������������������������������0000664�0000764�0000764�00000003304�11041657733�016623� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/futex.c | 6 ++++-- kernel/sysctl.c | 9 +++++++++ 2 files changed, 13 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/futex.c =================================================================== --- linux-2.6.24.7.orig/kernel/futex.c +++ linux-2.6.24.7/kernel/futex.c @@ -126,12 +126,14 @@ static struct futex_hash_bucket futex_qu /* Futex-fs vfsmount entry: */ static struct vfsmount *futex_mnt; +int futex_performance_hack; + /* * Take mm->mmap_sem, when futex is shared */ static inline void futex_lock_mm(struct rw_semaphore *fshared) { - if (fshared) + if (fshared && !futex_performance_hack) down_read(fshared); } @@ -140,7 +142,7 @@ static inline void futex_lock_mm(struct */ static inline void futex_unlock_mm(struct rw_semaphore *fshared) { - if (fshared) + if (fshared && !futex_performance_hack) up_read(fshared); } Index: linux-2.6.24.7/kernel/sysctl.c =================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -67,6 +67,7 @@ extern int sysctl_overcommit_memory; extern int sysctl_overcommit_ratio; extern int sysctl_panic_on_oom; extern int sysctl_oom_kill_allocating_task; +extern int futex_performance_hack; extern int max_threads; extern int core_uses_pid; extern int suid_dumpable; @@ -341,6 +342,14 @@ static struct ctl_table kern_table[] = { #endif { .ctl_name = CTL_UNNUMBERED, + .procname = "futex_performance_hack", 
+ .data = &futex_performance_hack, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, .procname = "prof_pid", .data = &prof_pid, .maxlen = sizeof(int), ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/futex-performance-hack-sysctl-fix.patch�����������������������������������������������������0000664�0000764�0000764�00000005536�11041657734�020740� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From lethal@linux-sh.org Fri May 18 06:46:43 2007 Return-Path: <lethal@linux-sh.org> Received: from smtp.ocgnet.org (smtp.ocgnet.org [64.20.243.3]) by mail.tglx.de (Postfix) with ESMTP id 0FCC865C065 for <tglx@linutronix.de>; Fri, 18 May 2007 06:46:43 +0200 (CEST) Received: from smtp.ocgnet.org (localhost [127.0.0.1]) by smtp.ocgnet.org (Postfix) with ESMTP id 616355203FB; Thu, 17 May 2007 23:46:39 -0500 (CDT) X-Spam-Checker-Version: SpamAssassin 3.1.3-gr0 (2006-06-01) on smtp.ocgnet.org X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=no version=3.1.3-gr0 Received: from master.linux-sh.org (124x34x33x190.ap124.ftth.ucom.ne.jp [124.34.33.190]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.ocgnet.org (Postfix) with ESMTP id E1F585203E0; Thu, 17 May 2007 23:46:38 -0500 (CDT) Received: from localhost (unknown [127.0.0.1]) by master.linux-sh.org (Postfix) with ESMTP id 4984664C7C; Fri, 18 May 2007 04:46:00 +0000 (UTC) X-Virus-Scanned: amavisd-new at linux-sh.org Received: from master.linux-sh.org ([127.0.0.1]) by localhost (master.linux-sh.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id BE+H5LV2TYuQ; Fri, 18 May 2007 13:46:00 +0900 (JST) Received: by master.linux-sh.org (Postfix, from userid 500) id 08A5664C7D; Fri, 18 May 2007 13:46:00 +0900 (JST) Date: Fri, 18 May 2007 13:45:59 +0900 From: Paul Mundt <lethal@linux-sh.org> To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Subject: [PATCH -rt] futex_performance_hack sysctl build fix Message-ID: <20070518044559.GB22660@linux-sh.org> Mail-Followup-To: Paul Mundt <lethal@linux-sh.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.13 (2006-08-11) X-Virus-Scanned: ClamAV using ClamSMTP X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit -rt adds a futex_performance_hack sysctl, which is only defined if kernel/futex.c is built in. This fixes the build in the CONFIG_FUTEX=n case. 
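With the two patches above applied (and CONFIG_FUTEX=y), the knob shows up as /proc/sys/kernel/futex_performance_hack. A small userspace helper to flip it, purely as a usage illustration (equivalent to sysctl -w kernel.futex_performance_hack=N):

#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/kernel/futex_performance_hack";
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s 0|1\n", argv[0]);
		return 1;
	}

	f = fopen(path, "w");
	if (!f) {
		perror(path);	/* unpatched kernel, or not running as root */
		return 1;
	}
	fprintf(f, "%s\n", argv[1]);
	fclose(f);
	return 0;
}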
Signed-off-by: Paul Mundt <lethal@linux-sh.org> -- kernel/sysctl.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/sysctl.c =================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -340,6 +340,7 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_FUTEX { .ctl_name = CTL_UNNUMBERED, .procname = "futex_performance_hack", @@ -348,6 +349,7 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, +#endif { .ctl_name = CTL_UNNUMBERED, .procname = "prof_pid", ������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/s_files-schedule_on_each_cpu_wq.patch�������������������������������������������������������0000664�0000764�0000764�00000006454�11041657731�020544� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/workqueue.h | 1 kernel/workqueue.c | 65 ++++++++++++++++++++++++++++++++++++++-------- 2 files changed, 55 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -195,6 +195,7 @@ extern int FASTCALL(schedule_delayed_wor unsigned long delay)); extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay); +extern int schedule_on_each_cpu_wq(struct workqueue_struct *wq, work_func_t func); extern int schedule_on_each_cpu(work_func_t func); extern int current_is_keventd(void); extern int keventd_up(void); Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -244,6 +244,20 @@ int queue_delayed_work_on(int cpu, struc } EXPORT_SYMBOL_GPL(queue_delayed_work_on); +static void leak_check(void *func) +{ + if (!in_atomic() && lockdep_depth(current) <= 0) + return; + printk(KERN_ERR "BUG: workqueue leaked lock or atomic: " + "%s/0x%08x/%d\n", + current->comm, preempt_count(), + current->pid); + printk(KERN_ERR " last function: "); + print_symbol("%s\n", (unsigned long)func); + debug_show_held_locks(current); + dump_stack(); +} + static void run_workqueue(struct cpu_workqueue_struct *cwq) { spin_lock_irq(&cwq->lock); @@ -276,22 +290,13 @@ static void run_workqueue(struct cpu_wor BUG_ON(get_wq_data(work) != cwq); work_clear_pending(work); + leak_check(NULL); lock_acquire(&cwq->wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_); lock_acquire(&lockdep_map, 0, 0, 0, 2, _THIS_IP_); f(work); lock_release(&lockdep_map, 1, _THIS_IP_); lock_release(&cwq->wq->lockdep_map, 1, _THIS_IP_); - - if (unlikely(in_atomic() || lockdep_depth(current) > 0)) { - printk(KERN_ERR "BUG: workqueue leaked lock or atomic: " - "%s/0x%08x/%d\n", - current->comm, preempt_count(), - task_pid_nr(current)); - printk(KERN_ERR " last function: "); - print_symbol("%s\n", (unsigned long)f); - debug_show_held_locks(current); - dump_stack(); - } + leak_check(f); 
spin_lock_irq(&cwq->lock); cwq->current_work = NULL; @@ -623,6 +628,44 @@ int schedule_on_each_cpu(work_func_t fun return 0; } +/** + * schedule_on_each_cpu_wq - call a function on each online CPU on a per-CPU wq + * @func: the function to call + * + * Returns zero on success. + * Returns -ve errno on failure. + * + * Appears to be racy against CPU hotplug. + * + * schedule_on_each_cpu() is very slow. + */ +int schedule_on_each_cpu_wq(struct workqueue_struct *wq, work_func_t func) +{ + int cpu; + struct work_struct *works; + + if (is_single_threaded(wq)) { + WARN_ON(1); + return -EINVAL; + } + works = alloc_percpu(struct work_struct); + if (!works) + return -ENOMEM; + + for_each_online_cpu(cpu) { + struct work_struct *work = per_cpu_ptr(works, cpu); + + INIT_WORK(work, func); + set_bit(WORK_STRUCT_PENDING, work_data_bits(work)); + __queue_work(per_cpu_ptr(wq->cpu_wq, cpu), work); + } + flush_workqueue(wq); + free_percpu(works); + + return 0; +} + + void flush_scheduled_work(void) { flush_workqueue(keventd_wq); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/s_files-pipe-fix.patch����������������������������������������������������������������������0000664�0000764�0000764�00000002047�11041657733�015433� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: s_files: free_write_pipe() fix From: Ingo Molnar <mingo@elte.hu> file_kill() has to look at the file's inode (for the barrier logic), hence make sure we free the inode before the file. 
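(Returning briefly to the schedule_on_each_cpu_wq() helper completed just above, before the pipe fix below: it expects a multi-threaded, per-CPU workqueue. A kernel-style usage sketch, with the workqueue name and work function invented for the example:)

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical per-CPU work; runs once in each online CPU's workqueue
 * thread. */
static void example_flush_local_state(struct work_struct *work)
{
	/* touch this CPU's private state here */
}

static int example_run_everywhere(void)
{
	struct workqueue_struct *wq;
	int ret;

	wq = create_workqueue("example_wq");	/* multi-threaded variant */
	if (!wq)
		return -ENOMEM;

	ret = schedule_on_each_cpu_wq(wq, example_flush_local_state);

	destroy_workqueue(wq);
	return ret;
}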
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- fs/pipe.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/fs/pipe.c =================================================================== --- linux-2.6.24.7.orig/fs/pipe.c +++ linux-2.6.24.7/fs/pipe.c @@ -1011,12 +1011,17 @@ struct file *create_write_pipe(void) return ERR_PTR(err); } -void free_write_pipe(struct file *f) +void free_write_pipe(struct file *file) { - free_pipe_info(f->f_dentry->d_inode); - dput(f->f_path.dentry); - mntput(f->f_path.mnt); - put_filp(f); + struct dentry *dentry = file->f_path.dentry; + struct vfsmount *mnt = file->f_path.mnt; + + free_pipe_info(file->f_dentry->d_inode); + file->f_path.dentry = NULL; + file->f_path.mnt = NULL; + put_filp(file); + dput(dentry); + mntput(mnt); } struct file *create_read_pipe(struct file *wrf) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep_lock_set_subclass_fix.patch���������������������������������������������������������0000664�0000764�0000764�00000001003�11041657731�020326� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/lockdep.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -2757,6 +2757,9 @@ lock_set_subclass(struct lockdep_map *lo { unsigned long flags; + if (unlikely(!lock_stat && !prove_locking)) + return; + if (unlikely(current->lockdep_recursion)) return; �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/qrcu.patch����������������������������������������������������������������������������������0000664�0000764�0000764�00000014606�11041657734�013247� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Subject: [PATCH] QRCU with lockless fastpath Hello! 
This is an updated version of Oleg Nesterov's QRCU that avoids the earlier lock acquisition on the synchronize_qrcu() fastpath. This passes rcutorture on x86 and the weakly ordered POWER. A promela model of the code passes as noted before for 2 readers and 3 updaters and for 3 readers and 2 updaters. 3 readers and 3 updaters runs every machine that I have access to out of memory -- nothing like a little combinatorial explosion! However, after some thought, the proof ended up being simple enough: 1. If synchronize_qrcu() exits too soon, then by definition there has been a reader present during synchronize_srcu()'s full execution. 2. The counter corresponding to this reader will be at least 1 at all times. 3. The synchronize_qrcu() code forces at least one of the counters to be at least one at all times -- if there is a reader, the sum will be at least two. (Unfortunately, we cannot fetch the pair of counters atomically.) 4. Therefore, the only way that synchronize_qrcu()s fastpath can see a sum of 1 is if it races with another synchronize_qrcu() -- the first synchronize_qrcu() must read one of the counters before the second synchronize_qrcu() increments it, and must read the other counter after the second synchronize_qrcu() decrements it. There can be at most one reader present through this entire operation -- otherwise, the first synchronize_qrcu() will see a sum of 2 or greater. 5. But the second synchronize_qrcu() will not release the mutex until after the reader is done. During this time, the first synchronize_qrcu() will always see a sum of at least 2, and therefore cannot take the remainder of the fastpath until the reader is done. 6. Because the second synchronize_qrcu() holds the mutex, no other synchronize_qrcu() can manipulate the counters until the reader is done. A repeat of the race called out in #4 above therefore cannot happen until after the reader is done, in which case it is safe for the first synchronize_qrcu() to proceed. Therefore, two summations of the counter separated by a memory barrier suffices and the implementation shown below also suffices. (And, yes, the fastpath -could- check for a sum of zero and exit immediately, but this would help only in case of a three-way race between two synchronize_qrcu()s and a qrcu_read_unlock(), would add another compare, so is not worth it.) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/srcu.h | 22 +++++++++++++ kernel/srcu.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 108 insertions(+) Index: linux-2.6.24.7/include/linux/srcu.h =================================================================== --- linux-2.6.24.7.orig/include/linux/srcu.h +++ linux-2.6.24.7/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include <linux/wait.h> + struct srcu_struct_array { int c[2]; }; @@ -50,4 +52,24 @@ void srcu_read_unlock(struct srcu_struct void synchronize_srcu(struct srcu_struct *sp); long srcu_batches_completed(struct srcu_struct *sp); +/* + * fully compatible with srcu, but optimized for writers. 
+ */ + +struct qrcu_struct { + int completed; + atomic_t ctr[2]; + wait_queue_head_t wq; + struct mutex mutex; +}; + +int init_qrcu_struct(struct qrcu_struct *qp); +int qrcu_read_lock(struct qrcu_struct *qp); +void qrcu_read_unlock(struct qrcu_struct *qp, int idx); +void synchronize_qrcu(struct qrcu_struct *qp); + +static inline void cleanup_qrcu_struct(struct qrcu_struct *qp) +{ +} + #endif Index: linux-2.6.24.7/kernel/srcu.c =================================================================== --- linux-2.6.24.7.orig/kernel/srcu.c +++ linux-2.6.24.7/kernel/srcu.c @@ -256,3 +256,89 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock); EXPORT_SYMBOL_GPL(synchronize_srcu); EXPORT_SYMBOL_GPL(srcu_batches_completed); EXPORT_SYMBOL_GPL(srcu_readers_active); + +int init_qrcu_struct(struct qrcu_struct *qp) +{ + qp->completed = 0; + atomic_set(qp->ctr + 0, 1); + atomic_set(qp->ctr + 1, 0); + init_waitqueue_head(&qp->wq); + mutex_init(&qp->mutex); + + return 0; +} + +int qrcu_read_lock(struct qrcu_struct *qp) +{ + for (;;) { + int idx = qp->completed & 0x1; + if (likely(atomic_inc_not_zero(qp->ctr + idx))) + return idx; + } +} + +void qrcu_read_unlock(struct qrcu_struct *qp, int idx) +{ + if (atomic_dec_and_test(qp->ctr + idx)) + wake_up(&qp->wq); +} + +void synchronize_qrcu(struct qrcu_struct *qp) +{ + int idx; + + smp_mb(); /* Force preceding change to happen before fastpath check. */ + + /* + * Fastpath: If the two counters sum to "1" at a given point in + * time, there are no readers. However, it takes two separate + * loads to sample both counters, which won't occur simultaneously. + * So we might race with a counter switch, so that we might see + * ctr[0]==0, then the counter might switch, then we might see + * ctr[1]==1 (unbeknownst to us because there is a reader still + * there). So we do a read memory barrier and recheck. If the + * same race happens again, there must have been a second counter + * switch. This second counter switch could not have happened + * until all preceding readers finished, so if the condition + * is true both times, we may safely proceed. + * + * This relies critically on the atomic increment and atomic + * decrement being seen as executing in order. + */ + + if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) { + smp_rmb(); /* Keep two checks independent. */ + if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) + goto out; + } + + mutex_lock(&qp->mutex); + + idx = qp->completed & 0x1; + if (atomic_read(qp->ctr + idx) == 1) + goto out_unlock; + + atomic_inc(qp->ctr + (idx ^ 0x1)); + + /* + * Prevent subsequent decrement from being seen before previous + * increment -- such an inversion could cause the fastpath + * above to falsely conclude that there were no readers. Also, + * reduce the likelihood that qrcu_read_lock() will loop. + */ + + smp_mb__after_atomic_inc(); + qp->completed++; + + atomic_dec(qp->ctr + idx); + __wait_event(qp->wq, !atomic_read(qp->ctr + idx)); +out_unlock: + mutex_unlock(&qp->mutex); +out: + smp_mb(); /* force subsequent free after qrcu_read_unlock(). 
*/ +} + +EXPORT_SYMBOL_GPL(init_qrcu_struct); +EXPORT_SYMBOL_GPL(qrcu_read_lock); +EXPORT_SYMBOL_GPL(qrcu_read_unlock); +EXPORT_SYMBOL_GPL(synchronize_qrcu); ��������������������������������������������������������������������������������������������������������������������������patches/lock_list.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000013617�11041657735�014262� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: lock_list - a fine grain locked double linked list Provide a simple fine grain locked double link list. It build upon the regular double linked list primitives, spinlocks and RCU. In order to avoid deadlocks a prev -> next locking order is observed. This prevents reverse iteration. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/lock_list.h | 74 +++++++++++++++++++++++++++++++ lib/Makefile | 2 lib/lock_list.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 182 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/lock_list.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/lock_list.h @@ -0,0 +1,74 @@ +/* + * Copyright (C) 2006, Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com> + * Licenced under the GPLv2. + * + * Simple fine grain locked double linked list. + */ +#ifndef _LINUX_LOCK_LIST_H +#define _LINUX_LOCK_LIST_H + +#ifdef __KERNEL__ + +#include <linux/list.h> +#include <linux/rcupdate.h> +#include <linux/spinlock.h> + +struct lock_list_head { + union { + struct list_head head; + struct { + struct lock_list_head *next, *prev; + }; + }; + spinlock_t lock; +}; + +enum { + LOCK_LIST_NESTING_PREV = 1, + LOCK_LIST_NESTING_CUR, + LOCK_LIST_NESTING_NEXT, +}; + +static inline void INIT_LOCK_LIST_HEAD(struct lock_list_head *list) +{ + INIT_LIST_HEAD(&list->head); + spin_lock_init(&list->lock); +} + +/* + * Passed pointers are assumed stable by external means (refcount, rcu) + */ +extern void lock_list_add(struct lock_list_head *new, + struct lock_list_head *list); +extern void lock_list_del_init(struct lock_list_head *entry); +extern void lock_list_splice_init(struct lock_list_head *list, + struct lock_list_head *head); + +struct lock_list_head *lock_list_next_entry(struct lock_list_head *list, + struct lock_list_head *entry); +struct lock_list_head *lock_list_first_entry(struct lock_list_head *list); + +#define lock_list_for_each_entry(pos, list, member) \ + for (pos = list_entry(lock_list_first_entry(list), \ + typeof(*pos), member); \ + pos; \ + pos = list_entry(lock_list_next_entry(list, &pos->member), \ + typeof(*pos), member)) + +/* + * to be used when iteration is terminated by breaking out of the + * lock_list_for_each_entry() loop. 
+ * + * lock_list_for_each_entry(i, list, member) { + * if (cond) { + * lock_list_for_each_entry_stop(i, member); + * goto foo; + * } + * } + * + */ +#define lock_list_for_each_entry_stop(pos, member) \ + spin_unlock(&(pos->member.lock)) + +#endif /* __KERNEL__ */ +#endif /* _LINUX_LOCK_LIST_H */ Index: linux-2.6.24.7/lib/Makefile =================================================================== --- linux-2.6.24.7.orig/lib/Makefile +++ linux-2.6.24.7/lib/Makefile @@ -3,7 +3,7 @@ # lib-y := ctype.o string.o vsprintf.o cmdline.o \ - rbtree.o radix-tree.o dump_stack.o \ + rbtree.o radix-tree.o dump_stack.o lock_list.o \ idr.o int_sqrt.o extable.o prio_tree.o \ sha1.o irq_regs.o reciprocal_div.o argv_split.o \ proportions.o prio_heap.o Index: linux-2.6.24.7/lib/lock_list.c =================================================================== --- /dev/null +++ linux-2.6.24.7/lib/lock_list.c @@ -0,0 +1,107 @@ +/* + * Copyright (C) 2006, Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com> + * Licenced under the GPLv2. + * + * Simple fine grain locked double linked list. + * + * Locking order is from prev -> next. + * Edges are locked not nodes; that is, cur->lock protects: + * - cur->next, + * - cur->next->prev. + * + * Passed pointers are assumed to be stable by external means such as + * refcounts or RCU. The individual list entries are assumed to be RCU + * freed (requirement of __lock_list_del). + */ + +#include <linux/lock_list.h> + +void lock_list_add(struct lock_list_head *new, + struct lock_list_head *list) +{ + struct lock_list_head *next; + + spin_lock(&new->lock); + spin_lock_nested(&list->lock, LOCK_LIST_NESTING_PREV); + next = list->next; + __list_add(&new->head, &list->head, &next->head); + spin_unlock(&list->lock); + spin_unlock(&new->lock); +} + +static spinlock_t *__lock_list(struct lock_list_head *entry) +{ + struct lock_list_head *prev; + spinlock_t *lock = NULL; + +again: + prev = entry->prev; + if (prev == entry) + goto one; + spin_lock_nested(&prev->lock, LOCK_LIST_NESTING_PREV); + if (unlikely(entry->prev != prev)) { + /* + * we lost + */ + spin_unlock(&prev->lock); + goto again; + } + lock = &prev->lock; +one: + spin_lock_nested(&entry->lock, LOCK_LIST_NESTING_CUR); + return lock; +} + +void lock_list_del_init(struct lock_list_head *entry) +{ + spinlock_t *lock; + + rcu_read_lock(); + lock = __lock_list(entry); + list_del_init(&entry->head); + spin_unlock(&entry->lock); + if (lock) + spin_unlock(lock); + rcu_read_unlock(); +} + +void lock_list_splice_init(struct lock_list_head *list, + struct lock_list_head *head) +{ + spinlock_t *lock; + + rcu_read_lock(); + lock = __lock_list(list); + if (!list_empty(&list->head)) { + spin_lock_nested(&head->lock, LOCK_LIST_NESTING_NEXT); + __list_splice(&list->head, &head->head); + INIT_LIST_HEAD(&list->head); + spin_unlock(&head->lock); + } + spin_unlock(&list->lock); + if (lock) + spin_unlock(lock); + rcu_read_unlock(); +} + +struct lock_list_head *lock_list_next_entry(struct lock_list_head *list, + struct lock_list_head *entry) +{ + struct lock_list_head *next = entry->next; + if (likely(next != list)) { + lock_set_subclass(&entry->lock.dep_map, + LOCK_LIST_NESTING_CUR, _THIS_IP_); + spin_lock_nested(&next->lock, LOCK_LIST_NESTING_NEXT); + BUG_ON(entry->next != next); + } else + next = NULL; + spin_unlock(&entry->lock); + return next; +} + +struct lock_list_head *lock_list_first_entry(struct lock_list_head *list) +{ + spin_lock(&list->lock); + return lock_list_next_entry(list, list); +} + 
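A short usage sketch of the lock_list API introduced above; the element type and functions are invented for illustration, the list primitives are the ones declared in include/linux/lock_list.h, and (as the header notes) the entries themselves must be kept stable by external means such as a refcount or RCU:

#include <linux/lock_list.h>

/* Hypothetical element type for the example. */
struct item {
	int			value;
	struct lock_list_head	link;
};

static struct lock_list_head items;	/* INIT_LOCK_LIST_HEAD() at init time */

static void example_add(struct item *it)
{
	INIT_LOCK_LIST_HEAD(&it->link);
	lock_list_add(&it->link, &items);	/* locks only the new entry and the head */
}

static struct item *example_find(int value)
{
	struct item *it;

	lock_list_for_each_entry(it, &items, link) {
		if (it->value == value) {
			/* leaving the loop early: drop the held entry lock */
			lock_list_for_each_entry_stop(it, link);
			return it;
		}
	}
	return NULL;	/* loop ran to completion, no lock is held */
}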
�����������������������������������������������������������������������������������������������������������������patches/percpu_list.patch���������������������������������������������������������������������������0000664�0000764�0000764�00000005372�11041657732�014624� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: percpu_list give the lock_list a percpu_head to in order to decrease list head contention due to list adding. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/percpu_list.h | 119 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) Index: linux-2.6.24.7/include/linux/percpu_list.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/percpu_list.h @@ -0,0 +1,119 @@ +#ifndef _LINUX_PERCPU_LIST_H +#define _LINUX_PERCPU_LIST_H + +#include <linux/lock_list.h> +#include <linux/percpu.h> + +#ifdef CONFIG_SMP + +struct percpu_list_element { + spinlock_t lock; + unsigned long nr; + struct lock_list_head list; +}; + +struct percpu_list { + struct lock_list_head list; + struct percpu_list_element *percpu_list; +}; + +static inline +void percpu_list_init(struct percpu_list *pcl) +{ + int cpu; + + INIT_LOCK_LIST_HEAD(&pcl->list); + pcl->percpu_list = alloc_percpu(struct percpu_list_element); + + for_each_possible_cpu(cpu) { + struct percpu_list_element *pcle; + + pcle = per_cpu_ptr(pcl->percpu_list, cpu); + spin_lock_init(&pcle->lock); + pcle->nr = 0; + INIT_LOCK_LIST_HEAD(&pcle->list); + } +} + +static inline +void percpu_list_destroy(struct percpu_list *pcl) +{ + free_percpu(pcl->percpu_list); +} + +static inline +void percpu_list_fold_cpu(struct percpu_list *pcl, int cpu) +{ + struct percpu_list_element *pcle = per_cpu_ptr(pcl->percpu_list, cpu); + + spin_lock(&pcle->lock); + if (pcle->nr) { + pcle->nr = 0; + lock_list_splice_init(&pcle->list, &pcl->list); + } + spin_unlock(&pcle->lock); +} + +static inline +void percpu_list_add(struct percpu_list *pcl, struct lock_list_head *elm) +{ + struct percpu_list_element *pcle; + int cpu = raw_smp_processor_id(); + unsigned long nr; + + pcle = per_cpu_ptr(pcl->percpu_list, cpu); + spin_lock(&pcle->lock); + nr = ++pcle->nr; + lock_list_add(elm, &pcle->list); + spin_unlock(&pcle->lock); + + if (nr >= 16) + percpu_list_fold_cpu(pcl, cpu); +} + +static inline +void percpu_list_fold(struct percpu_list *pcl) +{ + int cpu; + + for_each_possible_cpu(cpu) + percpu_list_fold_cpu(pcl, cpu); +} + +#else /* CONFIG_SMP */ + +struct percpu_list { + struct lock_list_head list; +}; + +static inline +void percpu_list_init(struct percpu_list *pcl) +{ + INIT_LOCK_LIST_HEAD(&pcl->list); +} + +static inline +void percpu_list_destroy(struct percpu_list *pcl) +{ +} + +static inline +void percpu_list_add(struct percpu_list *pcl, struct lock_list_head *elm) +{ + lock_list_add(elm, &pcl->list); +} + +static inline +void percpu_list_fold(struct percpu_list *pcl) +{ +} + +#endif + +static inline +struct lock_list_head *percpu_list_head(struct percpu_list *pcl) +{ + return &pcl->list; +} + +#endif /* _LINUX_PERCPU_LIST_H */ 
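And a matching sketch for the percpu_list layer just above, which batches additions on the local CPU and folds them into the global lock_list every 16 entries or on demand. Again the element type is invented for illustration; the percpu_list calls are the ones defined in include/linux/percpu_list.h:

#include <linux/percpu_list.h>

/* Hypothetical element type for the example. */
struct tracked {
	struct lock_list_head	link;
	/* payload ... */
};

static struct percpu_list tracked_list;

static void example_init(void)
{
	percpu_list_init(&tracked_list);	/* sets up the per-CPU heads on SMP */
}

static void example_track(struct tracked *t)
{
	INIT_LOCK_LIST_HEAD(&t->link);
	percpu_list_add(&tracked_list, &t->link);	/* cheap, mostly CPU-local */
}

static void example_walk_all(void)
{
	struct tracked *t;

	/* pull the per-CPU fragments into the global list first */
	percpu_list_fold(&tracked_list);

	lock_list_for_each_entry(t, percpu_list_head(&tracked_list), link) {
		/* inspect t */
	}
}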
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/s_files.patch�������������������������������������������������������������������������������0000664�0000764�0000764�00000024023�11041657735�013714� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: remove global files_lock remove the global files_lock by reworking super_block and tty file lists. these are replaced by percpu_lists which are fine grain locked lists (lock_list) with a per cpu list head. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- drivers/char/tty_io.c | 23 ++++++++++------------- fs/file_table.c | 34 ++++++++++++++++++---------------- fs/proc/generic.c | 2 ++ fs/super.c | 12 ++++++++---- include/linux/fs.h | 14 +++++++------- include/linux/tty.h | 2 +- security/selinux/hooks.c | 9 ++++++--- 7 files changed, 52 insertions(+), 44 deletions(-) Index: linux-2.6.24.7/drivers/char/tty_io.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/tty_io.c +++ linux-2.6.24.7/drivers/char/tty_io.c @@ -242,14 +242,13 @@ int tty_paranoia_check(struct tty_struct static int check_tty_count(struct tty_struct *tty, const char *routine) { #ifdef CHECK_TTY_COUNT - struct list_head *p; + struct file *filp; int count = 0; - - file_list_lock(); - list_for_each(p, &tty->tty_files) { + + percpu_list_fold(&tty->tty_files); + lock_list_for_each_entry(filp, percpu_list_head(&tty->tty_files), f_u.fu_llist) count++; - } - file_list_unlock(); + if (tty->driver->type == TTY_DRIVER_TYPE_PTY && tty->driver->subtype == PTY_TYPE_SLAVE && tty->link && tty->link->count) @@ -1376,9 +1375,8 @@ static void do_tty_hangup(struct work_st spin_unlock(&redirect_lock); check_tty_count(tty, "do_tty_hangup"); - file_list_lock(); /* This breaks for file handles being sent over AF_UNIX sockets ? */ - list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) { + lock_list_for_each_entry(filp, percpu_list_head(&tty->tty_files), f_u.fu_llist) { if (filp->f_op->write == redirected_tty_write) cons_filp = filp; if (filp->f_op->write != tty_write) @@ -1387,7 +1385,6 @@ static void do_tty_hangup(struct work_st tty_fasync(-1, filp, 0); /* can't block */ filp->f_op = &hung_up_tty_fops; } - file_list_unlock(); /* FIXME! What are the locking issues here? This may me overdoing things.. * this question is especially important now that we've removed the irqlock. 
*/ @@ -2268,9 +2265,9 @@ static void release_one_tty(struct tty_s tty->magic = 0; tty->driver->refcount--; - file_list_lock(); - list_del_init(&tty->tty_files); - file_list_unlock(); + percpu_list_fold(&tty->tty_files); + lock_list_del_init(percpu_list_head(&tty->tty_files)); + percpu_list_destroy(&tty->tty_files); free_tty_struct(tty); } @@ -3734,7 +3731,7 @@ static void initialize_tty_struct(struct mutex_init(&tty->atomic_read_lock); mutex_init(&tty->atomic_write_lock); spin_lock_init(&tty->read_lock); - INIT_LIST_HEAD(&tty->tty_files); + percpu_list_init(&tty->tty_files); INIT_WORK(&tty->SAK_work, do_SAK_work); } Index: linux-2.6.24.7/fs/file_table.c =================================================================== --- linux-2.6.24.7.orig/fs/file_table.c +++ linux-2.6.24.7/fs/file_table.c @@ -28,9 +28,6 @@ struct files_stat_struct files_stat = { .max_files = NR_FILE }; -/* public. Not pretty! */ -__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock); - static struct percpu_counter nr_files __cacheline_aligned_in_smp; static inline void file_free_rcu(struct rcu_head *head) @@ -111,7 +108,7 @@ struct file *get_empty_filp(void) goto fail_sec; tsk = current; - INIT_LIST_HEAD(&f->f_u.fu_list); + INIT_LOCK_LIST_HEAD(&f->f_u.fu_llist); atomic_set(&f->f_count, 1); rwlock_init(&f->f_owner.lock); f->f_uid = tsk->fsuid; @@ -303,31 +300,35 @@ void put_filp(struct file *file) } } -void file_move(struct file *file, struct list_head *list) +void file_move(struct file *file, struct percpu_list *list) { if (!list) return; - file_list_lock(); - list_move(&file->f_u.fu_list, list); - file_list_unlock(); + + file_kill(file); + percpu_list_add(list, &file->f_u.fu_llist); } void file_kill(struct file *file) { - if (!list_empty(&file->f_u.fu_list)) { - file_list_lock(); - list_del_init(&file->f_u.fu_list); - file_list_unlock(); + if (file && file->f_mapping && file->f_mapping->host) { + struct super_block *sb = file->f_mapping->host->i_sb; + if (sb) + synchronize_qrcu(&sb->s_qrcu); } + + lock_list_del_init(&file->f_u.fu_llist); } int fs_may_remount_ro(struct super_block *sb) { struct file *file; + int idx; /* Check that no files are currently opened for writing. */ - file_list_lock(); - list_for_each_entry(file, &sb->s_files, f_u.fu_list) { + idx = qrcu_read_lock(&sb->s_qrcu); + percpu_list_fold(&sb->s_files); + lock_list_for_each_entry(file, percpu_list_head(&sb->s_files), f_u.fu_llist) { struct inode *inode = file->f_path.dentry->d_inode; /* File with pending delete? */ @@ -338,10 +339,11 @@ int fs_may_remount_ro(struct super_block if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE)) goto too_bad; } - file_list_unlock(); + qrcu_read_unlock(&sb->s_qrcu, idx); return 1; /* Tis' cool bro. 
*/ too_bad: - file_list_unlock(); + lock_list_for_each_entry_stop(file, f_u.fu_llist); + qrcu_read_unlock(&sb->s_qrcu, idx); return 0; } Index: linux-2.6.24.7/fs/proc/generic.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/generic.c +++ linux-2.6.24.7/fs/proc/generic.c @@ -698,6 +698,8 @@ void remove_proc_entry(const char *name, goto out; len = strlen(fn); + percpu_list_fold(&proc_mnt->mnt_sb->s_files); + spin_lock(&proc_subdir_lock); for (p = &parent->subdir; *p; p=&(*p)->next ) { if (!proc_match(len, fn, *p)) Index: linux-2.6.24.7/fs/super.c =================================================================== --- linux-2.6.24.7.orig/fs/super.c +++ linux-2.6.24.7/fs/super.c @@ -64,7 +64,8 @@ static struct super_block *alloc_super(s INIT_LIST_HEAD(&s->s_dirty); INIT_LIST_HEAD(&s->s_io); INIT_LIST_HEAD(&s->s_more_io); - INIT_LIST_HEAD(&s->s_files); + percpu_list_init(&s->s_files); + init_qrcu_struct(&s->s_qrcu); INIT_LIST_HEAD(&s->s_instances); INIT_HLIST_HEAD(&s->s_anon); INIT_LIST_HEAD(&s->s_inodes); @@ -103,6 +104,7 @@ out: */ static inline void destroy_super(struct super_block *s) { + percpu_list_destroy(&s->s_files); security_sb_free(s); kfree(s->s_subtype); kfree(s); @@ -565,13 +567,15 @@ out: static void mark_files_ro(struct super_block *sb) { struct file *f; + int idx; - file_list_lock(); - list_for_each_entry(f, &sb->s_files, f_u.fu_list) { + idx = qrcu_read_lock(&sb->s_qrcu); + percpu_list_fold(&sb->s_files); + lock_list_for_each_entry(f, percpu_list_head(&sb->s_files), f_u.fu_llist) { if (S_ISREG(f->f_path.dentry->d_inode->i_mode) && file_count(f)) f->f_mode &= ~FMODE_WRITE; } - file_list_unlock(); + qrcu_read_unlock(&sb->s_qrcu, idx); } /** Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -279,12 +279,14 @@ extern int dir_notify_enable; #include <linux/cache.h> #include <linux/kobject.h> #include <linux/list.h> +#include <linux/percpu_list.h> #include <linux/radix-tree.h> #include <linux/prio_tree.h> #include <linux/init.h> #include <linux/pid.h> #include <linux/mutex.h> #include <linux/capability.h> +#include <linux/srcu.h> #include <asm/atomic.h> #include <asm/semaphore.h> @@ -776,11 +778,11 @@ static inline int ra_has_index(struct fi struct file { /* - * fu_list becomes invalid after file_free is called and queued via + * fu_llist becomes invalid after file_free is called and queued via * fu_rcuhead for RCU freeing */ union { - struct list_head fu_list; + struct lock_list_head fu_llist; struct rcu_head fu_rcuhead; } f_u; struct path f_path; @@ -809,9 +811,6 @@ struct file { #endif /* #ifdef CONFIG_EPOLL */ struct address_space *f_mapping; }; -extern spinlock_t files_lock; -#define file_list_lock() spin_lock(&files_lock); -#define file_list_unlock() spin_unlock(&files_lock); #define get_file(x) atomic_inc(&(x)->f_count) #define file_count(x) atomic_read(&(x)->f_count) @@ -1007,7 +1006,8 @@ struct super_block { struct list_head s_io; /* parked for writeback */ struct list_head s_more_io; /* parked for more writeback */ struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ - struct list_head s_files; + struct percpu_list s_files; + struct qrcu_struct s_qrcu; struct block_device *s_bdev; struct mtd_info *s_mtd; @@ -1777,7 +1777,7 @@ static inline void insert_inode_hash(str } extern struct file * get_empty_filp(void); -extern void file_move(struct file *f, struct 
list_head *list); +extern void file_move(struct file *f, struct percpu_list *list); extern void file_kill(struct file *f); #ifdef CONFIG_BLOCK struct bio; Index: linux-2.6.24.7/include/linux/tty.h =================================================================== --- linux-2.6.24.7.orig/include/linux/tty.h +++ linux-2.6.24.7/include/linux/tty.h @@ -211,7 +211,7 @@ struct tty_struct { struct work_struct hangup_work; void *disc_data; void *driver_data; - struct list_head tty_files; + struct percpu_list tty_files; #define N_TTY_BUF_SIZE 4096 Index: linux-2.6.24.7/security/selinux/hooks.c =================================================================== --- linux-2.6.24.7.orig/security/selinux/hooks.c +++ linux-2.6.24.7/security/selinux/hooks.c @@ -1747,8 +1747,11 @@ static inline void flush_unauthorized_fi mutex_lock(&tty_mutex); tty = get_current_tty(); if (tty) { - file_list_lock(); - file = list_entry(tty->tty_files.next, typeof(*file), f_u.fu_list); + lock_list_for_each_entry(file, + percpu_list_head(&tty->tty_files), + f_u.fu_llist) + break; + if (file) { /* Revalidate access to controlling tty. Use inode_has_perm on the tty inode directly rather @@ -1760,8 +1763,8 @@ static inline void flush_unauthorized_fi FILE__READ | FILE__WRITE, NULL)) { drop_tty = 1; } + lock_list_for_each_entry_stop(file, f_u.fu_llist); } - file_list_unlock(); } mutex_unlock(&tty_mutex); /* Reset controlling tty. */ �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-circular-locking-deadlock.patch���������������������������������������������������������0000664�0000764�0000764�00000007340�11041657732�020050� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������On Thu, 2007-08-16 at 09:39 +0200, Peter Zijlstra wrote: > On Wed, 2007-08-15 at 18:39 -0700, john stultz wrote: > > Hey Ingo, Thomas, > > > > I was playing with the latency tracer on 2.6.23-rc2-rt2 while a "make > > -j8" was going on in the background and the box hung with this on the > > console: > > Hmm, this would have been me :-/ > > I'll go play... Could you give this a spin... (not sure on the added rmbs, but they can't hurt) --- Fix a deadlock in the fine grain locked list primitives. Delete and splice use a double lock, which normally locks in the prev->cur order. For delete this is deadlock free provided one will never delete the list head - which is exactly what splice attempts. In order to solve this, use the reverse locking order for splice - which then assumes that the list passes is indeed the list head (no fancy dummy item headless lists here). 
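To make the orderings concrete (an editorial worked example, not part of the changelog): on a ring ..-> [H] <-> [1] <-> [2] <-.., a __lock_list() based delete of [1] takes H then 1, a delete of [2] takes 1 then 2, and a head splice that also went through __lock_list() would take 2 then H - together those edges close the cycle H -> 1 -> 2 -> H, so three concurrent callers can deadlock. Taking the head lock first on the splice path (H, then prev(H) == 2) turns the last edge around into H -> 2; the resulting edge set H -> 1, 1 -> 2, H -> 2 is acyclic, at the price of assuming the argument really is the list head.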
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- lib/lock_list.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 60 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/lib/lock_list.c =================================================================== --- linux-2.6.24.7.orig/lib/lock_list.c +++ linux-2.6.24.7/lib/lock_list.c @@ -11,7 +11,7 @@ * * Passed pointers are assumed to be stable by external means such as * refcounts or RCU. The individual list entries are assumed to be RCU - * freed (requirement of __lock_list_del). + * freed (requirement of __lock_list). */ #include <linux/lock_list.h> @@ -19,12 +19,9 @@ void lock_list_add(struct lock_list_head *new, struct lock_list_head *list) { - struct lock_list_head *next; - spin_lock(&new->lock); spin_lock_nested(&list->lock, LOCK_LIST_NESTING_PREV); - next = list->next; - __list_add(&new->head, &list->head, &next->head); + __list_add(&new->head, &list->head, &list->next->head); spin_unlock(&list->lock); spin_unlock(&new->lock); } @@ -35,6 +32,13 @@ static spinlock_t *__lock_list(struct lo spinlock_t *lock = NULL; again: + /* + * all modifications are done under spinlocks + * but this read is not, the unlock acks as a wmb + * for modifications. + */ + smp_rmb(); + prev = entry->prev; if (prev == entry) goto one; @@ -52,6 +56,56 @@ one: return lock; } +/* + * deadlock galore... + * + * when using __lock_list to lock the list head we get this: + * + * lock H 2 1 + * lock 1 a b + * lock 2 A B + * + * list: ..-> [H] <-> [1] <-> [2] <-.. + * + * obvious dead-lock, to solve this we must use a reverse order + * when trying to acquire a double lock on the head: + * + * lock H r 1 2 + * lock 1 a b + * lock 2 A B + * + * list: ..-> [H] <-> [1] <-> [2] <-.. + */ +static spinlock_t *__lock_list_reverse(struct lock_list_head *entry) +{ + struct lock_list_head *prev; + spinlock_t *lock = NULL; + + spin_lock(&entry->lock); +again: + /* + * all modifications are done under spinlocks + * but this read is not, the unlock acks as a wmb + * for modifications. 
+ */ + smp_rmb(); + prev = entry->prev; + if (prev == entry) + goto done; + + spin_lock_nested(&prev->lock, LOCK_LIST_NESTING_PREV); + if (unlikely(entry->prev != prev)) { + /* + * we lost + */ + spin_unlock(&prev->lock); + goto again; + } + lock = &prev->lock; +done: + return lock; +} + void lock_list_del_init(struct lock_list_head *entry) { spinlock_t *lock; @@ -71,7 +125,7 @@ void lock_list_splice_init(struct lock_l spinlock_t *lock; rcu_read_lock(); - lock = __lock_list(list); + lock = __lock_list_reverse(list); if (!list_empty(&list->head)) { spin_lock_nested(&head->lock, LOCK_LIST_NESTING_NEXT); __list_splice(&list->head, &head->head); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/2.6.21-rc6-lockless3-radix-tree-gang-slot-lookups.patch�������������������������������������0000664�0000764�0000764�00000024273�11041657732�023037� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Nick Piggin <npiggin@suse.de> Subject: [patch 3/9] radix-tree: gang slot lookups Introduce gang_lookup_slot and gang_lookup_slot_tag functions, which are used by lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> --- include/linux/radix-tree.h | 12 ++- lib/radix-tree.c | 176 +++++++++++++++++++++++++++++++++++++++------ 2 files changed, 166 insertions(+), 22 deletions(-) Index: linux-2.6.24.7/include/linux/radix-tree.h =================================================================== --- linux-2.6.24.7.orig/include/linux/radix-tree.h +++ linux-2.6.24.7/include/linux/radix-tree.h @@ -99,12 +99,15 @@ do { \ * * The notable exceptions to this rule are the following functions: * radix_tree_lookup + * radix_tree_lookup_slot * radix_tree_tag_get * radix_tree_gang_lookup + * radix_tree_gang_lookup_slot * radix_tree_gang_lookup_tag + * radix_tree_gang_lookup_tag_slot * radix_tree_tagged * - * The first 4 functions are able to be called locklessly, using RCU. The + * The first 7 functions are able to be called locklessly, using RCU. The * caller must ensure calls to these functions are made within rcu_read_lock() * regions. Other readers (lock-free or otherwise) and modifications may be * running concurrently. 
@@ -159,6 +162,9 @@ void *radix_tree_delete(struct radix_tre unsigned int radix_tree_gang_lookup(struct radix_tree_root *root, void **results, unsigned long first_index, unsigned int max_items); +unsigned int +radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results, + unsigned long first_index, unsigned int max_items); unsigned long radix_tree_next_hole(struct radix_tree_root *root, unsigned long index, unsigned long max_scan); /* @@ -184,6 +190,10 @@ unsigned int radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results, unsigned long first_index, unsigned int max_items, unsigned int tag); +unsigned int +radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results, + unsigned long first_index, unsigned int max_items, + unsigned int tag); int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag); static inline void radix_tree_preload_end(void) Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -350,18 +350,17 @@ EXPORT_SYMBOL(radix_tree_insert); * Returns: the slot corresponding to the position @index in the * radix tree @root. This is useful for update-if-exists operations. * - * This function cannot be called under rcu_read_lock, it must be - * excluded from writers, as must the returned slot for subsequent - * use by radix_tree_deref_slot() and radix_tree_replace slot. - * Caller must hold tree write locked across slot lookup and - * replace. + * This function can be called under rcu_read_lock iff the slot is not + * modified by radix_tree_replace_slot, otherwise it must be called + * exclusive from other writers. Any dereference of the slot must be done + * using radix_tree_deref_slot. 
*/ void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index) { unsigned int height, shift; struct radix_tree_node *node, **slot; - node = root->rnode; + node = rcu_dereference(root->rnode); if (node == NULL) return NULL; @@ -381,7 +380,7 @@ void **radix_tree_lookup_slot(struct rad do { slot = (struct radix_tree_node **) (node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK)); - node = *slot; + node = rcu_dereference(*slot); if (node == NULL) return NULL; @@ -658,7 +657,7 @@ unsigned long radix_tree_next_hole(struc EXPORT_SYMBOL(radix_tree_next_hole); static unsigned int -__lookup(struct radix_tree_node *slot, void **results, unsigned long index, +__lookup(struct radix_tree_node *slot, void ***results, unsigned long index, unsigned int max_items, unsigned long *next_index) { unsigned int nr_found = 0; @@ -692,11 +691,9 @@ __lookup(struct radix_tree_node *slot, v /* Bottom level: grab some items */ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++) { - struct radix_tree_node *node; index++; - node = slot->slots[i]; - if (node) { - results[nr_found++] = rcu_dereference(node); + if (slot->slots[i]) { + results[nr_found++] = &(slot->slots[i]); if (nr_found == max_items) goto out; } @@ -750,13 +747,22 @@ radix_tree_gang_lookup(struct radix_tree ret = 0; while (ret < max_items) { - unsigned int nr_found; + unsigned int nr_found, slots_found, i; unsigned long next_index; /* Index of next search */ if (cur_index > max_index) break; - nr_found = __lookup(node, results + ret, cur_index, + slots_found = __lookup(node, (void ***)results + ret, cur_index, max_items - ret, &next_index); + nr_found = 0; + for (i = 0; i < slots_found; i++) { + struct radix_tree_node *slot; + slot = *(((void ***)results)[ret + i]); + if (!slot) + continue; + results[ret + nr_found] = rcu_dereference(slot); + nr_found++; + } ret += nr_found; if (next_index == 0) break; @@ -767,12 +773,71 @@ radix_tree_gang_lookup(struct radix_tree } EXPORT_SYMBOL(radix_tree_gang_lookup); +/** + * radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree + * @root: radix tree root + * @results: where the results of the lookup are placed + * @first_index: start the lookup from this key + * @max_items: place up to this many items at *results + * + * Performs an index-ascending scan of the tree for present items. Places + * their slots at *@results and returns the number of items which were + * placed at *@results. + * + * The implementation is naive. + * + * Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must + * be dereferenced with radix_tree_deref_slot, and if using only RCU + * protection, radix_tree_deref_slot may fail requiring a retry. 
+ */ +unsigned int +radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results, + unsigned long first_index, unsigned int max_items) +{ + unsigned long max_index; + struct radix_tree_node *node; + unsigned long cur_index = first_index; + unsigned int ret; + + node = rcu_dereference(root->rnode); + if (!node) + return 0; + + if (!radix_tree_is_indirect_ptr(node)) { + if (first_index > 0) + return 0; + results[0] = (void **)&root->rnode; + return 1; + } + node = radix_tree_indirect_to_ptr(node); + + max_index = radix_tree_maxindex(node->height); + + ret = 0; + while (ret < max_items) { + unsigned int slots_found; + unsigned long next_index; /* Index of next search */ + + if (cur_index > max_index) + break; + slots_found = __lookup(node, results + ret, cur_index, + max_items - ret, &next_index); + ret += slots_found; + if (next_index == 0) + break; + cur_index = next_index; + } + + return ret; +} +EXPORT_SYMBOL(radix_tree_gang_lookup_slot); + /* * FIXME: the two tag_get()s here should use find_next_bit() instead of * open-coding the search. */ static unsigned int -__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index, +__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index, unsigned int max_items, unsigned long *next_index, unsigned int tag) { unsigned int nr_found = 0; @@ -817,9 +882,8 @@ __lookup_tag(struct radix_tree_node *slo * lookup ->slots[x] without a lock (ie. can't * rely on its value remaining the same). */ - if (node) { - node = rcu_dereference(node); - results[nr_found++] = node; + if (slot->slots[j]) { + results[nr_found++] = &slot->slots[j]; if (nr_found == max_items) goto out; } @@ -878,13 +942,22 @@ radix_tree_gang_lookup_tag(struct radix_ ret = 0; while (ret < max_items) { - unsigned int nr_found; + unsigned int slots_found, nr_found, i; unsigned long next_index; /* Index of next search */ if (cur_index > max_index) break; - nr_found = __lookup_tag(node, results + ret, cur_index, - max_items - ret, &next_index, tag); + slots_found = __lookup_tag(node, (void ***)results + ret, + cur_index, max_items - ret, &next_index, tag); + nr_found = 0; + for (i = 0; i < slots_found; i++) { + struct radix_tree_node *slot; + slot = *((void ***)results)[ret + i]; + if (!slot) + continue; + results[ret + nr_found] = rcu_dereference(slot); + nr_found++; + } ret += nr_found; if (next_index == 0) break; @@ -896,6 +969,67 @@ radix_tree_gang_lookup_tag(struct radix_ EXPORT_SYMBOL(radix_tree_gang_lookup_tag); /** + * radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a + * radix tree based on a tag + * @root: radix tree root + * @results: where the results of the lookup are placed + * @first_index: start the lookup from this key + * @max_items: place up to this many items at *results + * @tag: the tag index (< RADIX_TREE_MAX_TAGS) + * + * Performs an index-ascending scan of the tree for present items which + * have the tag indexed by @tag set. Places the slots at *@results and + * returns the number of slots which were placed at *@results. 
+ */ +unsigned int +radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results, + unsigned long first_index, unsigned int max_items, + unsigned int tag) +{ + struct radix_tree_node *node; + unsigned long max_index; + unsigned long cur_index = first_index; + unsigned int ret; + + /* check the root's tag bit */ + if (!root_tag_get(root, tag)) + return 0; + + node = rcu_dereference(root->rnode); + if (!node) + return 0; + + if (!radix_tree_is_indirect_ptr(node)) { + if (first_index > 0) + return 0; + results[0] = (void **)&root->rnode; + return 1; + } + node = radix_tree_indirect_to_ptr(node); + + max_index = radix_tree_maxindex(node->height); + + ret = 0; + while (ret < max_items) { + unsigned int slots_found; + unsigned long next_index; /* Index of next search */ + + if (cur_index > max_index) + break; + slots_found = __lookup_tag(node, results + ret, + cur_index, max_items - ret, &next_index, tag); + ret += slots_found; + if (next_index == 0) + break; + cur_index = next_index; + } + + return ret; +} +EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot); + + +/** * radix_tree_shrink - shrink height of a radix tree to minimal * @root radix tree root */ �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/2.6.21-rc6-lockless5-lockless-probe.patch���������������������������������������������������0000664�0000764�0000764�00000001577�11041657731�020337� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Nick Piggin <npiggin@suse.de> Subject: [patch 5/9] mm: lockless probe Probing pages and radix_tree_tagged are lockless operations with the lockless radix-tree. Convert these users to RCU locking rather than using tree_lock. 
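The shape of that conversion, as an editorial sketch rather than anything taken from the patch: a read-side probe that used to take mapping->tree_lock can run under rcu_read_lock() once the slot lookups are RCU-safe. The helper below and its name are hypothetical; only radix_tree_lookup_slot(), radix_tree_deref_slot() and the page_tree field come from this series.

static int mapping_page_present(struct address_space *mapping, pgoff_t index)
{
	void **slot;
	int present;

	rcu_read_lock();			/* was: read_lock_irq(&mapping->tree_lock) */
	slot = radix_tree_lookup_slot(&mapping->page_tree, index);
	present = slot && radix_tree_deref_slot(slot) != NULL;
	rcu_read_unlock();			/* was: read_unlock_irq(&mapping->tree_lock) */

	return present;
}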
Signed-off-by: Nick Piggin <npiggin@suse.de> --- mm/readahead.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/mm/readahead.c =================================================================== --- linux-2.6.24.7.orig/mm/readahead.c +++ linux-2.6.24.7/mm/readahead.c @@ -376,9 +376,9 @@ ondemand_readahead(struct address_space if (hit_readahead_marker) { pgoff_t start; - read_lock_irq(&mapping->tree_lock); + rcu_read_lock(); start = radix_tree_next_hole(&mapping->page_tree, offset, max+1); - read_unlock_irq(&mapping->tree_lock); + rcu_read_unlock(); if (!start || start - offset > max) return 0; ���������������������������������������������������������������������������������������������������������������������������������patches/2.6.21-rc6-lockless6-speculative-get-page.patch���������������������������������������������0000664�0000764�0000764�00000026535�11041657732�021431� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Nick Piggin <npiggin@suse.de> Subject: [patch 6/9] mm: speculative get page If we can be sure that elevating the page_count on a pagecache page will pin it, we can speculatively run this operation, and subsequently check to see if we hit the right page rather than relying on holding a lock or otherwise pinning a reference to the page. This can be done if get_page/put_page behaves consistently throughout the whole tree (ie. if we "get" the page after it has been used for something else, we must be able to free it with a put_page). Actually, there is a period where the count behaves differently: when the page is free or if it is a constituent page of a compound page. We need an atomic_inc_not_zero operation to ensure we don't try to grab the page in either case. This patch introduces the core locking protocol to the pagecache (ie. adds page_cache_get_speculative, and tweaks some update-side code to make it work). [Hugh notices that PG_nonewrefs might be dispensed with entirely if current set_page_nonewrefs instead atomically save the page count and temporarily set it to zero. This is a nice idea, and simplifies find_get_page very much, but cannot be applied to all current SetPageNoNewRefs sites. Need to verify that add_to_page_cache and add_to_swap_cache can cope without it or make do some other way. Also, migration pages with PagePrivate set means that the filesystem has a ref on the page, so it might muck with page count, which is a big problem. 
] Signed-off-by: Nick Piggin <npiggin@suse.de> --- include/linux/page-flags.h | 28 ++++++++++++ include/linux/pagemap.h | 105 +++++++++++++++++++++++++++++++++++++++++++++ mm/filemap.c | 2 mm/migrate.c | 7 ++- mm/swap_state.c | 2 mm/vmscan.c | 10 +++- 6 files changed, 149 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/page-flags.h =================================================================== --- linux-2.6.24.7.orig/include/linux/page-flags.h +++ linux-2.6.24.7/include/linux/page-flags.h @@ -83,6 +83,8 @@ #define PG_private 11 /* If pagecache, has fs-private data */ #define PG_writeback 12 /* Page is under writeback */ +#define PG_nonewrefs 13 /* Block concurrent pagecache lookups + * while testing refcount */ #define PG_compound 14 /* Part of a compound page */ #define PG_swapcache 15 /* Swap page: swp_entry_t in private */ @@ -260,6 +262,11 @@ static inline void __ClearPageTail(struc #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags) +#define PageNoNewRefs(page) test_bit(PG_nonewrefs, &(page)->flags) +#define SetPageNoNewRefs(page) set_bit(PG_nonewrefs, &(page)->flags) +#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags) +#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags) + struct page; /* forward declaration */ extern void cancel_dirty_page(struct page *page, unsigned int account_size); @@ -272,4 +279,25 @@ static inline void set_page_writeback(st test_set_page_writeback(page); } +static inline void set_page_nonewrefs(struct page *page) +{ + preempt_disable(); + SetPageNoNewRefs(page); + smp_wmb(); +} + +static inline void __clear_page_nonewrefs(struct page *page) +{ + smp_wmb(); + __ClearPageNoNewRefs(page); + preempt_enable(); +} + +static inline void clear_page_nonewrefs(struct page *page) +{ + smp_wmb(); + ClearPageNoNewRefs(page); + preempt_enable(); +} + #endif /* PAGE_FLAGS_H */ Index: linux-2.6.24.7/include/linux/pagemap.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pagemap.h +++ linux-2.6.24.7/include/linux/pagemap.h @@ -12,6 +12,8 @@ #include <asm/uaccess.h> #include <linux/gfp.h> #include <linux/bitops.h> +#include <linux/page-flags.h> +#include <linux/hardirq.h> /* for in_interrupt() */ /* * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page @@ -62,6 +64,109 @@ static inline void mapping_set_gfp_mask( #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); +/* + * speculatively take a reference to a page. + * If the page is free (_count == 0), then _count is untouched, and 0 + * is returned. Otherwise, _count is incremented by 1 and 1 is returned. + * + * This function must be run in the same rcu_read_lock() section as has + * been used to lookup the page in the pagecache radix-tree: this allows + * allocators to use a synchronize_rcu() to stabilize _count. + * + * Unless an RCU grace period has passed, the count of all pages coming out + * of the allocator must be considered unstable. page_count may return higher + * than expected, and put_page must be able to do the right thing when the + * page has been finished with (because put_page is what is used to drop an + * invalid speculative reference). + * + * After incrementing the refcount, this function spins until PageNoNewRefs + * is clear, then a read memory barrier is issued. 
+ * + * This forms the core of the lockless pagecache locking protocol, where + * the lookup-side (eg. find_get_page) has the following pattern: + * 1. find page in radix tree + * 2. conditionally increment refcount + * 3. wait for PageNoNewRefs + * 4. check the page is still in pagecache + * + * Remove-side (that cares about _count, eg. reclaim) has the following: + * A. SetPageNoNewRefs + * B. check refcount is correct + * C. remove page + * D. ClearPageNoNewRefs + * + * There are 2 critical interleavings that matter: + * - 2 runs before B: in this case, B sees elevated refcount and bails out + * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C + * (after D, even). In which case, 4 will notice C and lookup side can retry + * + * It is possible that between 1 and 2, the page is removed then the exact same + * page is inserted into the same position in pagecache. That's OK: the + * old find_get_page using tree_lock could equally have run before or after + * the write-side, depending on timing. + * + * Pagecache insertion isn't a big problem: either 1 will find the page or + * it will not. Likewise, the old find_get_page could run either before the + * insertion or afterwards, depending on timing. + */ +static inline int page_cache_get_speculative(struct page *page) +{ + VM_BUG_ON(in_interrupt()); + +#ifndef CONFIG_SMP +# ifdef CONFIG_PREEMPT + VM_BUG_ON(!in_atomic()); +# endif + /* + * Preempt must be disabled here - we rely on rcu_read_lock doing + * this for us. + * + * Pagecache won't be truncated from interrupt context, so if we have + * found a page in the radix tree here, we have pinned its refcount by + * disabling preempt, and hence no need for the "speculative get" that + * SMP requires. + */ + VM_BUG_ON(page_count(page) == 0); + atomic_inc(&page->_count); + +#else + if (unlikely(!get_page_unless_zero(page))) + return 0; /* page has been freed */ + + /* + * Note that get_page_unless_zero provides a memory barrier. + * This is needed to ensure PageNoNewRefs is evaluated after the + * page refcount has been raised. See below comment. + */ + + while (unlikely(PageNoNewRefs(page))) + cpu_relax(); + + /* + * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs()) + * is performed before a future load used to ensure the page is + * the correct on (usually: page->mapping and page->index). + * + * Those places that set PageNoNewRefs have the following pattern: + * SetPageNoNewRefs(page) + * wmb(); + * if (page_count(page) == X) + * remove page from pagecache + * wmb(); + * ClearPageNoNewRefs(page) + * + * If the load was out of order, page->mapping might be loaded before + * the page is removed from pagecache but PageNoNewRefs evaluated + * after the ClearPageNoNewRefs(). 
+ */ + smp_rmb(); + +#endif + VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page); + + return 1; +} + #ifdef CONFIG_NUMA extern struct page *__page_cache_alloc(gfp_t gfp); #else Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -456,6 +456,7 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { + set_page_nonewrefs(page); write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { @@ -467,6 +468,7 @@ int add_to_page_cache(struct page *page, __inc_zone_page_state(page, NR_FILE_PAGES); } write_unlock_irq(&mapping->tree_lock); + clear_page_nonewrefs(page); radix_tree_preload_end(); } return error; Index: linux-2.6.24.7/mm/migrate.c =================================================================== --- linux-2.6.24.7.orig/mm/migrate.c +++ linux-2.6.24.7/mm/migrate.c @@ -303,6 +303,7 @@ static int migrate_page_move_mapping(str return 0; } + set_page_nonewrefs(page); write_lock_irq(&mapping->tree_lock); pslot = radix_tree_lookup_slot(&mapping->page_tree, @@ -311,6 +312,7 @@ static int migrate_page_move_mapping(str if (page_count(page) != 2 + !!PagePrivate(page) || (struct page *)radix_tree_deref_slot(pslot) != page) { write_unlock_irq(&mapping->tree_lock); + clear_page_nonewrefs(page); return -EAGAIN; } @@ -326,6 +328,9 @@ static int migrate_page_move_mapping(str #endif radix_tree_replace_slot(pslot, newpage); + page->mapping = NULL; + write_unlock_irq(&mapping->tree_lock); + clear_page_nonewrefs(page); /* * Drop cache reference from old page. @@ -346,8 +351,6 @@ static int migrate_page_move_mapping(str __dec_zone_page_state(page, NR_FILE_PAGES); __inc_zone_page_state(newpage, NR_FILE_PAGES); - write_unlock_irq(&mapping->tree_lock); - return 0; } Index: linux-2.6.24.7/mm/swap_state.c =================================================================== --- linux-2.6.24.7.orig/mm/swap_state.c +++ linux-2.6.24.7/mm/swap_state.c @@ -79,6 +79,7 @@ static int __add_to_swap_cache(struct pa BUG_ON(PagePrivate(page)); error = radix_tree_preload(gfp_mask); if (!error) { + set_page_nonewrefs(page); write_lock_irq(&swapper_space.tree_lock); error = radix_tree_insert(&swapper_space.page_tree, entry.val, page); @@ -90,6 +91,7 @@ static int __add_to_swap_cache(struct pa __inc_zone_page_state(page, NR_FILE_PAGES); } write_unlock_irq(&swapper_space.tree_lock); + clear_page_nonewrefs(page); radix_tree_preload_end(); } return error; Index: linux-2.6.24.7/mm/vmscan.c =================================================================== --- linux-2.6.24.7.orig/mm/vmscan.c +++ linux-2.6.24.7/mm/vmscan.c @@ -385,6 +385,7 @@ int remove_mapping(struct address_space BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); + set_page_nonewrefs(page); write_lock_irq(&mapping->tree_lock); /* * The non racy check for a busy page. 
@@ -422,17 +423,20 @@ int remove_mapping(struct address_space __delete_from_swap_cache(page); write_unlock_irq(&mapping->tree_lock); swap_free(swap); - __put_page(page); /* The pagecache ref */ - return 1; + goto free_it; } __remove_from_page_cache(page); write_unlock_irq(&mapping->tree_lock); - __put_page(page); + +free_it: + __clear_page_nonewrefs(page); + __put_page(page); /* The pagecache ref */ return 1; cannot_free: write_unlock_irq(&mapping->tree_lock); + clear_page_nonewrefs(page); return 0; } �������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/2.6.21-rc6-lockless7-lockless-pagecache-lookups.patch���������������������������������������0000664�0000764�0000764�00000014371�11041657731�022620� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Nick Piggin <npiggin@suse.de> Subject: [patch 7/9] mm: lockless pagecache lookups Combine page_cache_get_speculative with lockless radix tree lookups to introduce lockless page cache lookups (ie. no mapping->tree_lock on the read-side). The only atomicity changes this introduces is that the gang pagecache lookup functions now behave as if they are implemented with multiple find_get_page calls, rather than operating on a snapshot of the pages. In practice, this atomicity guarantee is not used anyway, and it is difficult to see how it could be. Gang pagecache lookups are designed to replace individual lookups, so these semantics are natural. Signed-off-by: Nick Piggin <npiggin@suse.de> --- mm/filemap.c | 173 ++++++++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 131 insertions(+), 42 deletions(-) Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -612,13 +612,33 @@ void fastcall __lock_page_nosync(struct */ struct page * find_get_page(struct address_space *mapping, pgoff_t offset) { + void **pagep; struct page *page; - read_lock_irq(&mapping->tree_lock); - page = radix_tree_lookup(&mapping->page_tree, offset); - if (page) - page_cache_get(page); - read_unlock_irq(&mapping->tree_lock); + rcu_read_lock(); +repeat: + page = NULL; + pagep = radix_tree_lookup_slot(&mapping->page_tree, offset); + if (pagep) { + page = radix_tree_deref_slot(pagep); + if (unlikely(!page || page == RADIX_TREE_RETRY)) + goto repeat; + + if (!page_cache_get_speculative(page)) + goto repeat; + + /* + * Has the page moved? + * This is part of the lockless pagecache protocol. See + * include/linux/pagemap.h for details. + */ + if (unlikely(page != *pagep)) { + page_cache_release(page); + goto repeat; + } + } + rcu_read_unlock(); + return page; } EXPORT_SYMBOL(find_get_page); @@ -639,26 +659,16 @@ struct page *find_lock_page(struct addre struct page *page; repeat: - read_lock_irq(&mapping->tree_lock); - page = radix_tree_lookup(&mapping->page_tree, offset); + page = find_get_page(mapping, offset); if (page) { - page_cache_get(page); - if (TestSetPageLocked(page)) { - read_unlock_irq(&mapping->tree_lock); - __lock_page(page); - - /* Has the page been truncated while we slept? 
*/ - if (unlikely(page->mapping != mapping)) { - unlock_page(page); - page_cache_release(page); - goto repeat; - } - VM_BUG_ON(page->index != offset); - goto out; + lock_page(page); + /* Has the page been truncated? */ + if (unlikely(page->mapping != mapping)) { + unlock_page(page); + page_cache_release(page); + goto repeat; } } - read_unlock_irq(&mapping->tree_lock); -out: return page; } EXPORT_SYMBOL(find_lock_page); @@ -724,13 +734,39 @@ unsigned find_get_pages(struct address_s { unsigned int i; unsigned int ret; + unsigned int nr_found; + + rcu_read_lock(); +restart: + nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, + (void ***)pages, start, nr_pages); + ret = 0; + for (i = 0; i < nr_found; i++) { + struct page *page; +repeat: + page = radix_tree_deref_slot((void **)pages[i]); + if (unlikely(!page)) + continue; + /* + * this can only trigger if nr_found == 1, making livelock + * a non issue. + */ + if (unlikely(page == RADIX_TREE_RETRY)) + goto restart; + + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *((void **)pages[i]))) { + page_cache_release(page); + goto repeat; + } - read_lock_irq(&mapping->tree_lock); - ret = radix_tree_gang_lookup(&mapping->page_tree, - (void **)pages, start, nr_pages); - for (i = 0; i < ret; i++) - page_cache_get(pages[i]); - read_unlock_irq(&mapping->tree_lock); + pages[ret] = page; + ret++; + } + rcu_read_unlock(); return ret; } @@ -751,19 +787,44 @@ unsigned find_get_pages_contig(struct ad { unsigned int i; unsigned int ret; + unsigned int nr_found; + + rcu_read_lock(); +restart: + nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree, + (void ***)pages, index, nr_pages); + ret = 0; + for (i = 0; i < nr_found; i++) { + struct page *page; +repeat: + page = radix_tree_deref_slot((void **)pages[i]); + if (unlikely(!page)) + continue; + /* + * this can only trigger if nr_found == 1, making livelock + * a non issue. + */ + if (unlikely(page == RADIX_TREE_RETRY)) + goto restart; - read_lock_irq(&mapping->tree_lock); - ret = radix_tree_gang_lookup(&mapping->page_tree, - (void **)pages, index, nr_pages); - for (i = 0; i < ret; i++) { - if (pages[i]->mapping == NULL || pages[i]->index != index) + if (page->mapping == NULL || page->index != index) break; - page_cache_get(pages[i]); + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? */ + if (unlikely(page != *((void **)pages[i]))) { + page_cache_release(page); + goto repeat; + } + + pages[ret] = page; + ret++; index++; } - read_unlock_irq(&mapping->tree_lock); - return i; + rcu_read_unlock(); + return ret; } EXPORT_SYMBOL(find_get_pages_contig); @@ -783,15 +844,43 @@ unsigned find_get_pages_tag(struct addre { unsigned int i; unsigned int ret; + unsigned int nr_found; + + rcu_read_lock(); +restart: + nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree, + (void ***)pages, *index, nr_pages, tag); + ret = 0; + for (i = 0; i < nr_found; i++) { + struct page *page; +repeat: + page = radix_tree_deref_slot((void **)pages[i]); + if (unlikely(!page)) + continue; + /* + * this can only trigger if nr_found == 1, making livelock + * a non issue. + */ + if (unlikely(page == RADIX_TREE_RETRY)) + goto restart; + + if (!page_cache_get_speculative(page)) + goto repeat; + + /* Has the page moved? 
*/ + if (unlikely(page != *((void **)pages[i]))) { + page_cache_release(page); + goto repeat; + } + + pages[ret] = page; + ret++; + } + rcu_read_unlock(); - read_lock_irq(&mapping->tree_lock); - ret = radix_tree_gang_lookup_tag(&mapping->page_tree, - (void **)pages, *index, nr_pages, tag); - for (i = 0; i < ret; i++) - page_cache_get(pages[i]); if (ret) *index = pages[ret - 1]->index + 1; - read_unlock_irq(&mapping->tree_lock); + return ret; } EXPORT_SYMBOL(find_get_pages_tag); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/2.6.21-rc6-lockless8-spinlock-tree_lock.patch�����������������������������������������������0000664�0000764�0000764�00000031072�11041657734�021201� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Nick Piggin <npiggin@suse.de> Subject: [patch 8/9] mm: spinlock tree_lock mapping->tree_lock has no read lockers. convert the lock from an rwlock to a spinlock. Signed-off-by: Nick Piggin <npiggin@suse.de> --- fs/buffer.c | 4 ++-- fs/inode.c | 2 +- include/asm-arm/cacheflush.h | 4 ++-- include/asm-parisc/cacheflush.h | 4 ++-- include/linux/fs.h | 2 +- mm/filemap.c | 10 +++++----- mm/migrate.c | 6 +++--- mm/page-writeback.c | 12 ++++++------ mm/swap_state.c | 10 +++++----- mm/swapfile.c | 4 ++-- mm/truncate.c | 6 +++--- mm/vmscan.c | 8 ++++---- 12 files changed, 36 insertions(+), 36 deletions(-) Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -697,7 +697,7 @@ static int __set_page_dirty(struct page if (TestSetPageDirty(page)) return 0; - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? 
*/ WARN_ON_ONCE(warn && !PageUptodate(page)); @@ -710,7 +710,7 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); return 1; Index: linux-2.6.24.7/fs/inode.c =================================================================== --- linux-2.6.24.7.orig/fs/inode.c +++ linux-2.6.24.7/fs/inode.c @@ -209,7 +209,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_dentry); INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); - rwlock_init(&inode->i_data.tree_lock); + spin_lock_init(&inode->i_data.tree_lock); spin_lock_init(&inode->i_data.i_mmap_lock); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); Index: linux-2.6.24.7/include/asm-arm/cacheflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/cacheflush.h +++ linux-2.6.24.7/include/asm-arm/cacheflush.h @@ -413,9 +413,9 @@ static inline void flush_anon_page(struc } #define flush_dcache_mmap_lock(mapping) \ - write_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->tree_lock) #define flush_dcache_mmap_unlock(mapping) \ - write_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->tree_lock) #define flush_icache_user_range(vma,page,addr,len) \ flush_dcache_page(page) Index: linux-2.6.24.7/include/asm-parisc/cacheflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-parisc/cacheflush.h +++ linux-2.6.24.7/include/asm-parisc/cacheflush.h @@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm extern void flush_dcache_page(struct page *page); #define flush_dcache_mmap_lock(mapping) \ - write_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->tree_lock) #define flush_dcache_mmap_unlock(mapping) \ - write_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->tree_lock) #define flush_icache_page(vma,page) do { \ flush_kernel_dcache_page(page); \ Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -499,7 +499,7 @@ struct backing_dev_info; struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ - rwlock_t tree_lock; /* and rwlock protecting it */ + spinlock_t tree_lock; /* and lock protecting it */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -113,7 +113,7 @@ generic_file_direct_IO(int rw, struct ki /* * Remove a page from the page cache and free it. Caller has to make * sure the page is locked and that nobody else uses it - or that usage - * is safe. The caller must hold a write_lock on the mapping's tree_lock. + * is safe. The caller must hold the mapping's tree_lock. 
*/ void __remove_from_page_cache(struct page *page) { @@ -144,9 +144,9 @@ void remove_from_page_cache(struct page BUG_ON(!PageLocked(page)); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); } static int sync_page(void *word) @@ -457,7 +457,7 @@ int add_to_page_cache(struct page *page, if (error == 0) { set_page_nonewrefs(page); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { page_cache_get(page); @@ -467,7 +467,7 @@ int add_to_page_cache(struct page *page, mapping->nrpages++; __inc_zone_page_state(page, NR_FILE_PAGES); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); clear_page_nonewrefs(page); radix_tree_preload_end(); } Index: linux-2.6.24.7/mm/migrate.c =================================================================== --- linux-2.6.24.7.orig/mm/migrate.c +++ linux-2.6.24.7/mm/migrate.c @@ -304,14 +304,14 @@ static int migrate_page_move_mapping(str } set_page_nonewrefs(page); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); if (page_count(page) != 2 + !!PagePrivate(page) || (struct page *)radix_tree_deref_slot(pslot) != page) { - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); clear_page_nonewrefs(page); return -EAGAIN; } @@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str radix_tree_replace_slot(pslot, newpage); page->mapping = NULL; - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); clear_page_nonewrefs(page); /* Index: linux-2.6.24.7/mm/page-writeback.c =================================================================== --- linux-2.6.24.7.orig/mm/page-writeback.c +++ linux-2.6.24.7/mm/page-writeback.c @@ -1008,7 +1008,7 @@ int __set_page_dirty_nobuffers(struct pa if (!mapping) return 1; - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); mapping2 = page_mapping(page); if (mapping2) { /* Race with truncate? 
*/ BUG_ON(mapping2 != mapping); @@ -1022,7 +1022,7 @@ int __set_page_dirty_nobuffers(struct pa radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); @@ -1178,7 +1178,7 @@ int test_clear_page_writeback(struct pag struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; - write_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); ret = TestClearPageWriteback(page); if (ret) { radix_tree_tag_clear(&mapping->page_tree, @@ -1189,7 +1189,7 @@ int test_clear_page_writeback(struct pag __bdi_writeout_inc(bdi); } } - write_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestClearPageWriteback(page); } @@ -1207,7 +1207,7 @@ int test_set_page_writeback(struct page struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; - write_lock_irqsave(&mapping->tree_lock, flags); + spin_lock_irqsave(&mapping->tree_lock, flags); ret = TestSetPageWriteback(page); if (!ret) { radix_tree_tag_set(&mapping->page_tree, @@ -1220,7 +1220,7 @@ int test_set_page_writeback(struct page radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestSetPageWriteback(page); } Index: linux-2.6.24.7/mm/swap_state.c =================================================================== --- linux-2.6.24.7.orig/mm/swap_state.c +++ linux-2.6.24.7/mm/swap_state.c @@ -38,7 +38,7 @@ static struct backing_dev_info swap_back struct address_space swapper_space = { .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), - .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock), + .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock), .a_ops = &swap_aops, .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear), .backing_dev_info = &swap_backing_dev_info, @@ -80,7 +80,7 @@ static int __add_to_swap_cache(struct pa error = radix_tree_preload(gfp_mask); if (!error) { set_page_nonewrefs(page); - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); error = radix_tree_insert(&swapper_space.page_tree, entry.val, page); if (!error) { @@ -90,7 +90,7 @@ static int __add_to_swap_cache(struct pa total_swapcache_pages++; __inc_zone_page_state(page, NR_FILE_PAGES); } - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); clear_page_nonewrefs(page); radix_tree_preload_end(); } @@ -205,9 +205,9 @@ void delete_from_swap_cache(struct page entry.val = page_private(page); - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); __delete_from_swap_cache(page); - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); swap_free(entry); page_cache_release(page); Index: linux-2.6.24.7/mm/swapfile.c =================================================================== --- linux-2.6.24.7.orig/mm/swapfile.c +++ linux-2.6.24.7/mm/swapfile.c @@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa retval = 0; if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the swapcache lock held.. 
*/ - write_lock_irq(&swapper_space.tree_lock); + spin_lock_irq(&swapper_space.tree_lock); if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; } - write_unlock_irq(&swapper_space.tree_lock); + spin_unlock_irq(&swapper_space.tree_lock); } spin_unlock(&swap_lock); Index: linux-2.6.24.7/mm/truncate.c =================================================================== --- linux-2.6.24.7.orig/mm/truncate.c +++ linux-2.6.24.7/mm/truncate.c @@ -350,18 +350,18 @@ invalidate_complete_page2(struct address if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); if (PageDirty(page)) goto failed; BUG_ON(PagePrivate(page)); __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); ClearPageUptodate(page); page_cache_release(page); /* pagecache ref */ return 1; failed: - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); return 0; } Index: linux-2.6.24.7/mm/vmscan.c =================================================================== --- linux-2.6.24.7.orig/mm/vmscan.c +++ linux-2.6.24.7/mm/vmscan.c @@ -386,7 +386,7 @@ int remove_mapping(struct address_space BUG_ON(mapping != page_mapping(page)); set_page_nonewrefs(page); - write_lock_irq(&mapping->tree_lock); + spin_lock_irq(&mapping->tree_lock); /* * The non racy check for a busy page. * @@ -421,13 +421,13 @@ int remove_mapping(struct address_space if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; __delete_from_swap_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); swap_free(swap); goto free_it; } __remove_from_page_cache(page); - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); free_it: __clear_page_nonewrefs(page); @@ -435,7 +435,7 @@ free_it: return 1; cannot_free: - write_unlock_irq(&mapping->tree_lock); + spin_unlock_irq(&mapping->tree_lock); clear_page_nonewrefs(page); return 0; } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/radix-tree-concurrent.patch�����������������������������������������������������������������0000664�0000764�0000764�00000046525�11041657732�016524� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: radix-tree: concurrent write side support Provide support for concurrent write side operations without changing the API for all current uses. Concurrency is realized by means of two locking models; the simple one is ladder locking, the more complex one is path locking. Ladder locking is like walking down a ladder, you place your foot on a spoke below the one your other foot finds support etc.. There is no walking with both feet in the air. 
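For illustration only, a minimal user-space sketch of the same hand-over-hand idea on a plain linked chain, using POSIX spinlocks; the chain_node type and chain_find() helper are invented for this sketch and are not part of the patch:

/* Illustration only: user-space model of ladder (hand-over-hand) locking. */
#include <pthread.h>
#include <stdio.h>

struct chain_node {
        pthread_spinlock_t lock;
        struct chain_node *next;
        int value;
};

/*
 * Hand-over-hand walk: take the next node's lock before dropping the
 * current one, so a second walker may start as soon as the head lock
 * is released, yet can never overtake us on the same path.  Returns
 * with the found node's lock held (caller unlocks), or NULL.
 */
struct chain_node *chain_find(struct chain_node *head, int value)
{
        struct chain_node *cur = head;

        pthread_spin_lock(&cur->lock);
        while (cur && cur->value != value) {
                struct chain_node *next = cur->next;

                if (next)
                        pthread_spin_lock(&next->lock); /* foot on the lower spoke */
                pthread_spin_unlock(&cur->lock);        /* only then lift the other */
                cur = next;
        }
        return cur;
}

int main(void)
{
        struct chain_node n2 = { .next = NULL, .value = 2 };
        struct chain_node n1 = { .next = &n2, .value = 1 };
        struct chain_node n0 = { .next = &n1, .value = 0 };
        struct chain_node *hit;

        pthread_spin_init(&n0.lock, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_init(&n1.lock, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_init(&n2.lock, PTHREAD_PROCESS_PRIVATE);

        hit = chain_find(&n0, 2);
        if (hit) {
                printf("found %d\n", hit->value);
                pthread_spin_unlock(&hit->lock);
        }
        return 0;
}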
Likewise with walking a tree, you lock a node below the current node before releasing it. This allows other modifying operations to start as soon as you release the lock on the root node and even complete before you if they walk another path downward. The modifying operations: insert, lookup_slot and set_tag, use this simple method. The more complex path locking method is needed for operations that need to walk upwards again after they walked down, those are: tag_clear and delete. These lock their whole path downwards and release whole sections at points where it can be determined the walk upwards will stop, thus also allowing concurrency. Finding the conditions for the terminated walk upwards while doing the downward walk is the 'interesting' part of this approach. The remaining - unmodified - operations will have exclusive locking (since they're unmodified, they never move the lock downwards from the root node). The API for this looks like: DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree) radix_tree_lock(&ctx) ... do _1_ modifying operation ... radix_tree_unlock(&ctx) Note that before the radix operation the root node is held and will provide exclusive locking, after the operation the held lock might only be enough to protect a single item. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/radix-tree.h | 77 +++++++++++- init/Kconfig | 4 lib/radix-tree.c | 283 ++++++++++++++++++++++++++++++++++++--------- 3 files changed, 302 insertions(+), 62 deletions(-) Index: linux-2.6.24.7/include/linux/radix-tree.h =================================================================== --- linux-2.6.24.7.orig/include/linux/radix-tree.h +++ linux-2.6.24.7/include/linux/radix-tree.h @@ -62,23 +62,65 @@ struct radix_tree_root { unsigned int height; gfp_t gfp_mask; struct radix_tree_node *rnode; + spinlock_t lock; }; #define RADIX_TREE_INIT(mask) { \ .height = 0, \ .gfp_mask = (mask), \ .rnode = NULL, \ + .lock = __SPIN_LOCK_UNLOCKED(radix_tree_root.lock), \ } #define RADIX_TREE(name, mask) \ struct radix_tree_root name = RADIX_TREE_INIT(mask) -#define INIT_RADIX_TREE(root, mask) \ -do { \ - (root)->height = 0; \ - (root)->gfp_mask = (mask); \ - (root)->rnode = NULL; \ -} while (0) +static inline void INIT_RADIX_TREE(struct radix_tree_root *root, gfp_t gfp_mask) +{ + root->height = 0; + root->gfp_mask = gfp_mask; + root->rnode = NULL; + spin_lock_init(&root->lock); +} + +struct radix_tree_context { + struct radix_tree_root *tree; + struct radix_tree_root *root; +#ifdef CONFIG_RADIX_TREE_CONCURRENT + spinlock_t *locked; +#endif +}; + +#ifdef CONFIG_RADIX_TREE_CONCURRENT +#define RADIX_CONTEXT_ROOT(context) \ + ((struct radix_tree_root *)(((unsigned long)context) + 1)) + +#define __RADIX_TREE_CONTEXT_INIT(context, _tree) \ + .tree = RADIX_CONTEXT_ROOT(&context), \ + .locked = NULL, +#else +#define __RADIX_TREE_CONTEXT_INIT(context, _tree) \ + .tree = (_tree), +#endif + +#define DEFINE_RADIX_TREE_CONTEXT(context, _tree) \ + struct radix_tree_context context = { \ + .root = (_tree), \ + __RADIX_TREE_CONTEXT_INIT(context, _tree) \ + } + +static inline void +init_radix_tree_context(struct radix_tree_context *ctx, + struct radix_tree_root *root) +{ + ctx->root = root; +#ifdef CONFIG_RADIX_TREE_CONCURRENT + ctx->tree = RADIX_CONTEXT_ROOT(ctx); + ctx->locked = NULL; +#else + ctx->tree = root; +#endif +} /** * Radix-tree synchronization @@ -155,6 +197,29 @@ static inline void radix_tree_replace_sl rcu_assign_pointer(*pslot, item); } +static inline void radix_tree_lock(struct 
radix_tree_context *context) +{ + struct radix_tree_root *root = context->root; + rcu_read_lock(); + spin_lock(&root->lock); +#ifdef CONFIG_RADIX_TREE_CONCURRENT + BUG_ON(context->locked); + context->locked = &root->lock; +#endif +} + +static inline void radix_tree_unlock(struct radix_tree_context *context) +{ +#ifdef CONFIG_RADIX_TREE_CONCURRENT + BUG_ON(!context->locked); + spin_unlock(context->locked); + context->locked = NULL; +#else + spin_unlock(&context->root->lock); +#endif + rcu_read_unlock(); +} + int radix_tree_insert(struct radix_tree_root *, unsigned long, void *); void *radix_tree_lookup(struct radix_tree_root *, unsigned long); void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long); Index: linux-2.6.24.7/init/Kconfig =================================================================== --- linux-2.6.24.7.orig/init/Kconfig +++ linux-2.6.24.7/init/Kconfig @@ -435,6 +435,10 @@ config CC_OPTIMIZE_FOR_SIZE config SYSCTL bool +config RADIX_TREE_CONCURRENT + bool "Enable concurrent radix tree operations (EXPERIMENTAL)" + default y if SMP + menuconfig EMBEDDED bool "Configure standard kernel features (for small systems)" help Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -32,6 +32,7 @@ #include <linux/string.h> #include <linux/bitops.h> #include <linux/rcupdate.h> +#include <linux/spinlock.h> #ifdef __KERNEL__ @@ -52,11 +53,17 @@ struct radix_tree_node { struct rcu_head rcu_head; void *slots[RADIX_TREE_MAP_SIZE]; unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS]; +#ifdef CONFIG_RADIX_TREE_CONCURRENT + spinlock_t lock; +#endif }; struct radix_tree_path { struct radix_tree_node *node; int offset; +#ifdef CONFIG_RADIX_TREE_CONCURRENT + spinlock_t *locked; +#endif }; #define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long)) @@ -69,6 +76,10 @@ struct radix_tree_path { */ static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly; +#ifdef CONFIG_RADIX_TREE_CONCURRENT +static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH]; +#endif + /* * Radix tree node cache. */ @@ -93,7 +104,7 @@ static inline gfp_t root_gfp_mask(struct * that the caller has pinned this thread of control to the current CPU. */ static struct radix_tree_node * -radix_tree_node_alloc(struct radix_tree_root *root) +radix_tree_node_alloc(struct radix_tree_root *root, int height) { struct radix_tree_node *ret; gfp_t gfp_mask = root_gfp_mask(root); @@ -112,6 +123,11 @@ radix_tree_node_alloc(struct radix_tree_ put_cpu_var(radix_tree_preloads); } BUG_ON(radix_tree_is_indirect_ptr(ret)); +#ifdef CONFIG_RADIX_TREE_CONCURRENT + spin_lock_init(&ret->lock); + lockdep_set_class(&ret->lock, &radix_node_class[height]); +#endif + ret->height = height; return ret; } @@ -218,6 +234,22 @@ static inline int any_tag_set(struct rad return 0; } +static inline int any_tag_set_but(struct radix_tree_node *node, + unsigned int tag, int offset) +{ + int idx; + int offset_idx = offset / BITS_PER_LONG; + unsigned long offset_mask = ~(1UL << (offset % BITS_PER_LONG)); + for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) { + unsigned long mask = ~0UL; + if (idx == offset_idx) + mask = offset_mask; + if (node->tags[tag][idx] & mask) + return 1; + } + return 0; +} + /* * Return the maximum key which can be store into a * radix tree with height HEIGHT. 
@@ -247,8 +279,8 @@ static int radix_tree_extend(struct radi } do { - unsigned int newheight; - if (!(node = radix_tree_node_alloc(root))) + unsigned int newheight = root->height + 1; + if (!(node = radix_tree_node_alloc(root, newheight))) return -ENOMEM; /* Increase the height. */ @@ -260,8 +292,6 @@ static int radix_tree_extend(struct radi tag_set(node, tag, 0); } - newheight = root->height+1; - node->height = newheight; node->count = 1; node = radix_tree_ptr_to_indirect(node); rcu_assign_pointer(root->rnode, node); @@ -271,6 +301,80 @@ out: return 0; } +#ifdef CONFIG_RADIX_TREE_CONCURRENT +static inline struct radix_tree_context * +radix_tree_get_context(struct radix_tree_root **rootp) +{ + struct radix_tree_context *context = NULL; + unsigned long addr = (unsigned long)*rootp; + + if (addr & 1) { + context = (struct radix_tree_context *)(addr - 1); + *rootp = context->root; + } + + return context; +} + +#define RADIX_TREE_CONTEXT(context, root) \ + struct radix_tree_context *context = \ + radix_tree_get_context(&root) + +static inline spinlock_t *radix_node_lock(struct radix_tree_root *root, + struct radix_tree_node *node) +{ + spinlock_t *locked = &node->lock; + spin_lock(locked); + return locked; +} + +static inline void radix_ladder_lock(struct radix_tree_context *context, + struct radix_tree_node *node) +{ + if (context) { + struct radix_tree_root *root = context->root; + spinlock_t *locked = radix_node_lock(root, node); + if (locked) { + spin_unlock(context->locked); + context->locked = locked; + } + } +} + +static inline void radix_path_init(struct radix_tree_context *context, + struct radix_tree_path *pathp) +{ + pathp->locked = context ? context->locked : NULL; +} + +static inline void radix_path_lock(struct radix_tree_context *context, + struct radix_tree_path *pathp, struct radix_tree_node *node) +{ + if (context) { + struct radix_tree_root *root = context->root; + spinlock_t *locked = radix_node_lock(root, node); + if (locked) + context->locked = locked; + pathp->locked = locked; + } else + pathp->locked = NULL; +} + +static inline void radix_path_unlock(struct radix_tree_context *context, + struct radix_tree_path *punlock) +{ + if (context && punlock->locked && + context->locked != punlock->locked) + spin_unlock(punlock->locked); +} +#else +#define RADIX_TREE_CONTEXT(context, root) do { } while (0) +#define radix_ladder_lock(context, node) do { } while (0) +#define radix_path_init(context, pathp) do { } while (0) +#define radix_path_lock(context, pathp, node) do { } while (0) +#define radix_path_unlock(context, punlock) do { } while (0) +#endif + /** * radix_tree_insert - insert into a radix tree * @root: radix tree root @@ -286,6 +390,8 @@ int radix_tree_insert(struct radix_tree_ unsigned int height, shift; int offset; int error; + int tag; + RADIX_TREE_CONTEXT(context, root); BUG_ON(radix_tree_is_indirect_ptr(item)); @@ -305,9 +411,8 @@ int radix_tree_insert(struct radix_tree_ while (height > 0) { if (slot == NULL) { /* Have to add a child node. 
*/ - if (!(slot = radix_tree_node_alloc(root))) + if (!(slot = radix_tree_node_alloc(root, height))) return -ENOMEM; - slot->height = height; if (node) { rcu_assign_pointer(node->slots[offset], slot); node->count++; @@ -319,6 +424,9 @@ int radix_tree_insert(struct radix_tree_ /* Go a level down */ offset = (index >> shift) & RADIX_TREE_MAP_MASK; node = slot; + + radix_ladder_lock(context, node); + slot = node->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; @@ -330,12 +438,12 @@ int radix_tree_insert(struct radix_tree_ if (node) { node->count++; rcu_assign_pointer(node->slots[offset], item); - BUG_ON(tag_get(node, 0, offset)); - BUG_ON(tag_get(node, 1, offset)); + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) + BUG_ON(tag_get(node, tag, offset)); } else { rcu_assign_pointer(root->rnode, item); - BUG_ON(root_tag_get(root, 0)); - BUG_ON(root_tag_get(root, 1)); + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) + BUG_ON(root_tag_get(root, tag)); } return 0; @@ -359,6 +467,7 @@ void **radix_tree_lookup_slot(struct rad { unsigned int height, shift; struct radix_tree_node *node, **slot; + RADIX_TREE_CONTEXT(context, root); node = rcu_dereference(root->rnode); if (node == NULL) @@ -384,6 +493,8 @@ void **radix_tree_lookup_slot(struct rad if (node == NULL) return NULL; + radix_ladder_lock(context, node); + shift -= RADIX_TREE_MAP_SHIFT; height--; } while (height > 0); @@ -459,6 +570,7 @@ void *radix_tree_tag_set(struct radix_tr { unsigned int height, shift; struct radix_tree_node *slot; + RADIX_TREE_CONTEXT(context, root); height = root->height; BUG_ON(index > radix_tree_maxindex(height)); @@ -466,9 +578,15 @@ void *radix_tree_tag_set(struct radix_tr slot = radix_tree_indirect_to_ptr(root->rnode); shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + /* set the root's tag bit */ + if (slot && !root_tag_get(root, tag)) + root_tag_set(root, tag); + while (height > 0) { int offset; + radix_ladder_lock(context, slot); + offset = (index >> shift) & RADIX_TREE_MAP_MASK; if (!tag_get(slot, tag, offset)) tag_set(slot, tag, offset); @@ -478,14 +596,24 @@ void *radix_tree_tag_set(struct radix_tr height--; } - /* set the root's tag bit */ - if (slot && !root_tag_get(root, tag)) - root_tag_set(root, tag); - return slot; } EXPORT_SYMBOL(radix_tree_tag_set); +/* + * the change can never propagate upwards from here. + */ +static inline int radix_tree_unlock_tag(struct radix_tree_root *root, + struct radix_tree_path *pathp, int tag) +{ + int this, other; + + this = tag_get(pathp->node, tag, pathp->offset); + other = any_tag_set_but(pathp->node, tag, pathp->offset); + + return !this || other; +} + /** * radix_tree_tag_clear - clear a tag on a radix tree node * @root: radix tree root @@ -508,15 +636,19 @@ void *radix_tree_tag_clear(struct radix_ * since the "list" is null terminated. 
*/ struct radix_tree_path path[RADIX_TREE_MAX_PATH + 1], *pathp = path; + struct radix_tree_path *punlock = path, *piter; struct radix_tree_node *slot = NULL; unsigned int height, shift; + RADIX_TREE_CONTEXT(context, root); + + pathp->node = NULL; + radix_path_init(context, pathp); height = root->height; if (index > radix_tree_maxindex(height)) goto out; shift = (height - 1) * RADIX_TREE_MAP_SHIFT; - pathp->node = NULL; slot = radix_tree_indirect_to_ptr(root->rnode); while (height > 0) { @@ -526,10 +658,17 @@ void *radix_tree_tag_clear(struct radix_ goto out; offset = (index >> shift) & RADIX_TREE_MAP_MASK; - pathp[1].offset = offset; - pathp[1].node = slot; - slot = slot->slots[offset]; pathp++; + pathp->offset = offset; + pathp->node = slot; + radix_path_lock(context, pathp, slot); + + if (radix_tree_unlock_tag(root, pathp, tag)) { + for (; punlock < pathp; punlock++) + radix_path_unlock(context, punlock); + } + + slot = slot->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; } @@ -537,20 +676,22 @@ void *radix_tree_tag_clear(struct radix_ if (slot == NULL) goto out; - while (pathp->node) { - if (!tag_get(pathp->node, tag, pathp->offset)) - goto out; - tag_clear(pathp->node, tag, pathp->offset); - if (any_tag_set(pathp->node, tag)) - goto out; - pathp--; + for (piter = pathp; piter >= punlock; piter--) { + if (piter->node) { + if (!tag_get(piter->node, tag, piter->offset)) + break; + tag_clear(piter->node, tag, piter->offset); + if (any_tag_set(piter->node, tag)) + break; + } else { + if (root_tag_get(root, tag)) + root_tag_clear(root, tag); + } } - /* clear the root's tag bit */ - if (root_tag_get(root, tag)) - root_tag_clear(root, tag); - out: + for (; punlock < pathp; punlock++) + radix_path_unlock(context, punlock); return slot; } EXPORT_SYMBOL(radix_tree_tag_clear); @@ -1039,6 +1180,7 @@ static inline void radix_tree_shrink(str while (root->height > 0) { struct radix_tree_node *to_free = root->rnode; void *newptr; + int tag; BUG_ON(!radix_tree_is_indirect_ptr(to_free)); to_free = radix_tree_indirect_to_ptr(to_free); @@ -1065,14 +1207,29 @@ static inline void radix_tree_shrink(str root->rnode = newptr; root->height--; /* must only free zeroed nodes into the slab */ - tag_clear(to_free, 0, 0); - tag_clear(to_free, 1, 0); + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) + tag_clear(to_free, tag, 0); to_free->slots[0] = NULL; to_free->count = 0; - radix_tree_node_free(to_free); } } +static inline int radix_tree_unlock_all(struct radix_tree_root *root, + struct radix_tree_path *pathp) +{ + int tag; + int unlock = 1; + + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { + if (!radix_tree_unlock_tag(root, pathp, tag)) { + unlock = 0; + break; + } + } + + return unlock; +} + /** * radix_tree_delete - delete an item from a radix tree * @root: radix tree root @@ -1089,11 +1246,15 @@ void *radix_tree_delete(struct radix_tre * since the "list" is null terminated. 
*/ struct radix_tree_path path[RADIX_TREE_MAX_PATH + 1], *pathp = path; + struct radix_tree_path *punlock = path, *piter; struct radix_tree_node *slot = NULL; - struct radix_tree_node *to_free; unsigned int height, shift; int tag; int offset; + RADIX_TREE_CONTEXT(context, root); + + pathp->node = NULL; + radix_path_init(context, pathp); height = root->height; if (index > radix_tree_maxindex(height)) @@ -1108,7 +1269,6 @@ void *radix_tree_delete(struct radix_tre slot = radix_tree_indirect_to_ptr(slot); shift = (height - 1) * RADIX_TREE_MAP_SHIFT; - pathp->node = NULL; do { if (slot == NULL) @@ -1118,6 +1278,13 @@ void *radix_tree_delete(struct radix_tre offset = (index >> shift) & RADIX_TREE_MAP_MASK; pathp->offset = offset; pathp->node = slot; + radix_path_lock(context, pathp, slot); + + if (slot->count > 2 && radix_tree_unlock_all(root, pathp)) { + for (; punlock < pathp; punlock++) + radix_path_unlock(context, punlock); + } + slot = slot->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; @@ -1130,41 +1297,45 @@ void *radix_tree_delete(struct radix_tre * Clear all tags associated with the just-deleted item */ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { - if (tag_get(pathp->node, tag, pathp->offset)) - radix_tree_tag_clear(root, index, tag); + for (piter = pathp; piter >= punlock; piter--) { + if (piter->node) { + if (!tag_get(piter->node, tag, piter->offset)) + break; + tag_clear(piter->node, tag, piter->offset); + if (any_tag_set(piter->node, tag)) + break; + } else { + if (root_tag_get(root, tag)) + root_tag_clear(root, tag); + } + } } - to_free = NULL; - /* Now free the nodes we do not need anymore */ - while (pathp->node) { - pathp->node->slots[pathp->offset] = NULL; - pathp->node->count--; - /* - * Queue the node for deferred freeing after the - * last reference to it disappears (set NULL, above). - */ - if (to_free) - radix_tree_node_free(to_free); + /* Now unhook the nodes we do not need anymore */ + for (piter = pathp; piter >= punlock && piter->node; piter--) { + piter->node->slots[piter->offset] = NULL; + piter->node->count--; - if (pathp->node->count) { - if (pathp->node == + if (piter->node->count) { + if (piter->node == radix_tree_indirect_to_ptr(root->rnode)) radix_tree_shrink(root); goto out; } + } - /* Node with zero slots in use so free it */ - to_free = pathp->node; - pathp--; + BUG_ON(piter->node); - } root_tag_clear_all(root); root->height = 0; root->rnode = NULL; - if (to_free) - radix_tree_node_free(to_free); out: + for (; punlock <= pathp; punlock++) { + radix_path_unlock(context, punlock); + if (punlock->node && punlock->node->count == 0) + radix_tree_node_free(punlock->node); + } return slot; } EXPORT_SYMBOL(radix_tree_delete); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mapping_nrpages.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000043027�11041657734�015446� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: mm/fs: abstract address_space::nrpages Currently the tree_lock protects mapping->nrpages, this will not be possible much longer. 
Hence abstract the access to this variable so that it can be easily replaced by an atomic_ulong_t. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- arch/sh64/lib/dbg.c | 2 +- fs/block_dev.c | 4 ++-- fs/buffer.c | 2 +- fs/gfs2/glock.c | 2 +- fs/gfs2/glops.c | 4 ++-- fs/gfs2/meta_io.c | 2 +- fs/hugetlbfs/inode.c | 2 +- fs/inode.c | 10 +++++----- fs/jffs2/dir.c | 4 ++-- fs/jffs2/fs.c | 2 +- fs/libfs.c | 2 +- fs/nfs/inode.c | 6 +++--- fs/xfs/linux-2.6/xfs_vnode.h | 2 +- include/linux/fs.h | 22 +++++++++++++++++++++- include/linux/swap.h | 2 +- ipc/shm.c | 4 ++-- mm/filemap.c | 14 +++++++------- mm/shmem.c | 8 ++++---- mm/swap_state.c | 4 ++-- mm/truncate.c | 2 +- 20 files changed, 60 insertions(+), 40 deletions(-) Index: linux-2.6.24.7/arch/sh64/lib/dbg.c =================================================================== --- linux-2.6.24.7.orig/arch/sh64/lib/dbg.c +++ linux-2.6.24.7/arch/sh64/lib/dbg.c @@ -425,6 +425,6 @@ void print_page(struct page *page) printk(" page[%p] -> index 0x%lx, count 0x%x, flags 0x%lx\n", page, page->index, page_count(page), page->flags); printk(" address_space = %p, pages =%ld\n", page->mapping, - page->mapping->nrpages); + mapping_nrpages(page->mapping)); } Index: linux-2.6.24.7/fs/block_dev.c =================================================================== --- linux-2.6.24.7.orig/fs/block_dev.c +++ linux-2.6.24.7/fs/block_dev.c @@ -59,7 +59,7 @@ static sector_t max_block(struct block_d /* Kill _all_ buffers and pagecache , dirty or not.. */ static void kill_bdev(struct block_device *bdev) { - if (bdev->bd_inode->i_mapping->nrpages == 0) + if (mapping_nrpages(bdev->bd_inode->i_mapping) == 0) return; invalidate_bh_lrus(); truncate_inode_pages(bdev->bd_inode->i_mapping, 0); @@ -604,7 +604,7 @@ long nr_blockdev_pages(void) long ret = 0; spin_lock(&bdev_lock); list_for_each_entry(bdev, &all_bdevs, bd_list) { - ret += bdev->bd_inode->i_mapping->nrpages; + ret += mapping_nrpages(bdev->bd_inode->i_mapping); } spin_unlock(&bdev_lock); return ret; Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -347,7 +347,7 @@ void invalidate_bdev(struct block_device { struct address_space *mapping = bdev->bd_inode->i_mapping; - if (mapping->nrpages == 0) + if (mapping_nrpages(mapping) == 0) return; invalidate_bh_lrus(); Index: linux-2.6.24.7/fs/gfs2/glock.c =================================================================== --- linux-2.6.24.7.orig/fs/gfs2/glock.c +++ linux-2.6.24.7/fs/gfs2/glock.c @@ -1916,7 +1916,7 @@ static int dump_glock(struct glock_iter (list_empty(&gl->gl_reclaim)) ? 
"no" : "yes"); if (gl->gl_aspace) print_dbg(gi, " aspace = 0x%p nrpages = %lu\n", gl->gl_aspace, - gl->gl_aspace->i_mapping->nrpages); + mapping_nrpages(gl->gl_aspace->i_mapping)); else print_dbg(gi, " aspace = no\n"); print_dbg(gi, " ail = %d\n", atomic_read(&gl->gl_ail_count)); Index: linux-2.6.24.7/fs/gfs2/glops.c =================================================================== --- linux-2.6.24.7.orig/fs/gfs2/glops.c +++ linux-2.6.24.7/fs/gfs2/glops.c @@ -252,7 +252,7 @@ static int inode_go_demote_ok(struct gfs struct gfs2_sbd *sdp = gl->gl_sbd; int demote = 0; - if (!gl->gl_object && !gl->gl_aspace->i_mapping->nrpages) + if (!gl->gl_object && !mapping_nrpages(gl->gl_aspace->i_mapping)) demote = 1; else if (!sdp->sd_args.ar_localcaching && time_after_eq(jiffies, gl->gl_stamp + @@ -319,7 +319,7 @@ static void inode_go_unlock(struct gfs2_ static int rgrp_go_demote_ok(struct gfs2_glock *gl) { - return !gl->gl_aspace->i_mapping->nrpages; + return !mapping_nrpages(gl->gl_aspace->i_mapping); } /** Index: linux-2.6.24.7/fs/gfs2/meta_io.c =================================================================== --- linux-2.6.24.7.orig/fs/gfs2/meta_io.c +++ linux-2.6.24.7/fs/gfs2/meta_io.c @@ -104,7 +104,7 @@ void gfs2_meta_inval(struct gfs2_glock * truncate_inode_pages(mapping, 0); atomic_dec(&aspace->i_writecount); - gfs2_assert_withdraw(sdp, !mapping->nrpages); + gfs2_assert_withdraw(sdp, !mapping_nrpages(mapping)); } /** Index: linux-2.6.24.7/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.24.7.orig/fs/hugetlbfs/inode.c +++ linux-2.6.24.7/fs/hugetlbfs/inode.c @@ -368,7 +368,7 @@ static void truncate_hugepages(struct in } huge_pagevec_release(&pvec); } - BUG_ON(!lstart && mapping->nrpages); + BUG_ON(!lstart && mapping_nrpages(mapping)); hugetlb_unreserve_pages(inode, start, freed); } Index: linux-2.6.24.7/fs/inode.c =================================================================== --- linux-2.6.24.7.orig/fs/inode.c +++ linux-2.6.24.7/fs/inode.c @@ -259,7 +259,7 @@ void clear_inode(struct inode *inode) might_sleep(); invalidate_inode_buffers(inode); - BUG_ON(inode->i_data.nrpages); + BUG_ON(mapping_nrpages(&inode->i_data)); BUG_ON(!(inode->i_state & I_FREEING)); BUG_ON(inode->i_state & I_CLEAR); inode_sync_wait(inode); @@ -292,7 +292,7 @@ static void dispose_list(struct list_hea inode = list_first_entry(head, struct inode, i_list); list_del(&inode->i_list); - if (inode->i_data.nrpages) + if (mapping_nrpages(&inode->i_data)) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); @@ -384,7 +384,7 @@ static int can_unuse(struct inode *inode return 0; if (atomic_read(&inode->i_count)) return 0; - if (inode->i_data.nrpages) + if (mapping_nrpages(&inode->i_data)) return 0; return 1; } @@ -423,7 +423,7 @@ static void prune_icache(int nr_to_scan) list_move(&inode->i_list, &inode_unused); continue; } - if (inode_has_buffers(inode) || inode->i_data.nrpages) { + if (inode_has_buffers(inode) || mapping_nrpages(&inode->i_data)) { __iget(inode); spin_unlock(&inode_lock); if (remove_inode_buffers(inode)) @@ -1100,7 +1100,7 @@ static void generic_forget_inode(struct inode->i_state |= I_FREEING; inodes_stat.nr_inodes--; spin_unlock(&inode_lock); - if (inode->i_data.nrpages) + if (mapping_nrpages(&inode->i_data)) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); wake_up_inode(inode); Index: linux-2.6.24.7/fs/jffs2/dir.c =================================================================== --- linux-2.6.24.7.orig/fs/jffs2/dir.c +++ 
linux-2.6.24.7/fs/jffs2/dir.c @@ -203,7 +203,7 @@ static int jffs2_create(struct inode *di inode->i_op = &jffs2_file_inode_operations; inode->i_fop = &jffs2_file_operations; inode->i_mapping->a_ops = &jffs2_file_address_operations; - inode->i_mapping->nrpages = 0; + mapping_nrpages_init(inode->i_mapping); f = JFFS2_INODE_INFO(inode); dir_f = JFFS2_INODE_INFO(dir_i); @@ -219,7 +219,7 @@ static int jffs2_create(struct inode *di d_instantiate(dentry, inode); D1(printk(KERN_DEBUG "jffs2_create: Created ino #%lu with mode %o, nlink %d(%d). nrpages %ld\n", - inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, inode->i_mapping->nrpages)); + inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, mapping_nrpages(inode->i_mapping))); return 0; fail: Index: linux-2.6.24.7/fs/jffs2/fs.c =================================================================== --- linux-2.6.24.7.orig/fs/jffs2/fs.c +++ linux-2.6.24.7/fs/jffs2/fs.c @@ -294,7 +294,7 @@ void jffs2_read_inode (struct inode *ino inode->i_op = &jffs2_file_inode_operations; inode->i_fop = &jffs2_file_operations; inode->i_mapping->a_ops = &jffs2_file_address_operations; - inode->i_mapping->nrpages = 0; + mapping_nrpages_init(inode->i_mapping); break; case S_IFBLK: Index: linux-2.6.24.7/fs/libfs.c =================================================================== --- linux-2.6.24.7.orig/fs/libfs.c +++ linux-2.6.24.7/fs/libfs.c @@ -17,7 +17,7 @@ int simple_getattr(struct vfsmount *mnt, { struct inode *inode = dentry->d_inode; generic_fillattr(inode, stat); - stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9); + stat->blocks = mapping_nrpages(inode->i_mapping) << (PAGE_CACHE_SHIFT - 9); return 0; } Index: linux-2.6.24.7/fs/nfs/inode.c =================================================================== --- linux-2.6.24.7.orig/fs/nfs/inode.c +++ linux-2.6.24.7/fs/nfs/inode.c @@ -120,7 +120,7 @@ int nfs_sync_mapping(struct address_spac { int ret; - if (mapping->nrpages == 0) + if (mapping_nrpages(mapping) == 0) return 0; unmap_mapping_range(mapping, 0, 0, 0); ret = filemap_write_and_wait(mapping); @@ -160,7 +160,7 @@ void nfs_zap_caches(struct inode *inode) void nfs_zap_mapping(struct inode *inode, struct address_space *mapping) { - if (mapping->nrpages != 0) { + if (mapping_nrpages(mapping) != 0) { spin_lock(&inode->i_lock); NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA; spin_unlock(&inode->i_lock); @@ -718,7 +718,7 @@ static int nfs_invalidate_mapping_nolock { struct nfs_inode *nfsi = NFS_I(inode); - if (mapping->nrpages != 0) { + if (mapping_nrpages(mapping) != 0) { int ret = invalidate_inode_pages2(mapping); if (ret < 0) return ret; Index: linux-2.6.24.7/fs/xfs/linux-2.6/xfs_vnode.h =================================================================== --- linux-2.6.24.7.orig/fs/xfs/linux-2.6/xfs_vnode.h +++ linux-2.6.24.7/fs/xfs/linux-2.6/xfs_vnode.h @@ -271,7 +271,7 @@ static inline void vn_atime_to_time_t(bh * Some useful predicates. 
*/ #define VN_MAPPED(vp) mapping_mapped(vn_to_inode(vp)->i_mapping) -#define VN_CACHED(vp) (vn_to_inode(vp)->i_mapping->nrpages) +#define VN_CACHED(vp) mapping_nrpages(vn_to_inode(vp)->i_mapping) #define VN_DIRTY(vp) mapping_tagged(vn_to_inode(vp)->i_mapping, \ PAGECACHE_TAG_DIRTY) Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -505,7 +505,7 @@ struct address_space { struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ spinlock_t i_mmap_lock; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ - unsigned long nrpages; /* number of total pages */ + unsigned long __nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ @@ -520,6 +520,26 @@ struct address_space { * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON. */ +static inline void mapping_nrpages_init(struct address_space *mapping) +{ + mapping->__nrpages = 0; +} + +static inline unsigned long mapping_nrpages(struct address_space *mapping) +{ + return mapping->__nrpages; +} + +static inline void mapping_nrpages_inc(struct address_space *mapping) +{ + mapping->__nrpages++; +} + +static inline void mapping_nrpages_dec(struct address_space *mapping) +{ + mapping->__nrpages--; +} + struct block_device { dev_t bd_dev; /* not a kdev_t - it's a search key */ struct inode * bd_inode; /* will die */ Index: linux-2.6.24.7/include/linux/swap.h =================================================================== --- linux-2.6.24.7.orig/include/linux/swap.h +++ linux-2.6.24.7/include/linux/swap.h @@ -220,7 +220,7 @@ extern void end_swap_bio_read(struct bio /* linux/mm/swap_state.c */ extern struct address_space swapper_space; -#define total_swapcache_pages swapper_space.nrpages +#define total_swapcache_pages mapping_nrpages(&swapper_space) extern void show_swap_cache_info(void); extern int add_to_swap(struct page *, gfp_t); extern void __delete_from_swap_cache(struct page *); Index: linux-2.6.24.7/ipc/shm.c =================================================================== --- linux-2.6.24.7.orig/ipc/shm.c +++ linux-2.6.24.7/ipc/shm.c @@ -628,11 +628,11 @@ static void shm_get_stat(struct ipc_name if (is_file_hugepages(shp->shm_file)) { struct address_space *mapping = inode->i_mapping; - *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages; + *rss += (HPAGE_SIZE/PAGE_SIZE)*mapping_nrpages(mapping); } else { struct shmem_inode_info *info = SHMEM_I(inode); spin_lock(&info->lock); - *rss += inode->i_mapping->nrpages; + *rss += mapping_nrpages(inode->i_mapping); *swp += info->swapped; spin_unlock(&info->lock); } Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -121,7 +121,7 @@ void __remove_from_page_cache(struct pag radix_tree_delete(&mapping->page_tree, page->index); page->mapping = NULL; - mapping->nrpages--; + mapping_nrpages_dec(mapping); __dec_zone_page_state(page, NR_FILE_PAGES); BUG_ON(page_mapped(page)); @@ -206,7 +206,7 @@ int __filemap_fdatawrite_range(struct ad int ret; struct writeback_control wbc = { .sync_mode = sync_mode, - .nr_to_write = mapping->nrpages * 2, + .nr_to_write = mapping_nrpages(mapping) * 2, .range_start = start, .range_end = end, }; @@ -388,7 
+388,7 @@ int filemap_write_and_wait(struct addres { int err = 0; - if (mapping->nrpages) { + if (mapping_nrpages(mapping)) { err = filemap_fdatawrite(mapping); /* * Even if the above returned error, the pages may be @@ -422,7 +422,7 @@ int filemap_write_and_wait_range(struct { int err = 0; - if (mapping->nrpages) { + if (mapping_nrpages(mapping)) { err = __filemap_fdatawrite_range(mapping, lstart, lend, WB_SYNC_ALL); /* See comment of filemap_write_and_wait() */ @@ -464,7 +464,7 @@ int add_to_page_cache(struct page *page, SetPageLocked(page); page->mapping = mapping; page->index = offset; - mapping->nrpages++; + mapping_nrpages_inc(mapping); __inc_zone_page_state(page, NR_FILE_PAGES); } spin_unlock_irq(&mapping->tree_lock); @@ -2599,7 +2599,7 @@ generic_file_direct_IO(int rw, struct ki * about to write. We do this *before* the write so that we can return * -EIO without clobbering -EIOCBQUEUED from ->direct_IO(). */ - if (rw == WRITE && mapping->nrpages) { + if (rw == WRITE && mapping_nrpages(mapping)) { retval = invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end); if (retval) @@ -2616,7 +2616,7 @@ generic_file_direct_IO(int rw, struct ki * so we don't support it 100%. If this invalidation * fails, tough, the write still worked... */ - if (rw == WRITE && mapping->nrpages) { + if (rw == WRITE && mapping_nrpages(mapping)) { invalidate_inode_pages2_range(mapping, offset >> PAGE_CACHE_SHIFT, end); } out: Index: linux-2.6.24.7/mm/shmem.c =================================================================== --- linux-2.6.24.7.orig/mm/shmem.c +++ linux-2.6.24.7/mm/shmem.c @@ -215,8 +215,8 @@ static void shmem_free_blocks(struct ino * We have to calculate the free blocks since the mm can drop * undirtied hole pages behind our back. * - * But normally info->alloced == inode->i_mapping->nrpages + info->swapped - * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped) + * But normally info->alloced == mapping_nrpages(inode->i_mapping) + info->swapped + * So mm freed is info->alloced - (mapping_nrpages(inode->i_mapping) + info->swapped) * * It has to be called with the spinlock held. 
*/ @@ -225,7 +225,7 @@ static void shmem_recalc_inode(struct in struct shmem_inode_info *info = SHMEM_I(inode); long freed; - freed = info->alloced - info->swapped - inode->i_mapping->nrpages; + freed = info->alloced - info->swapped - mapping_nrpages(inode->i_mapping); if (freed > 0) { info->alloced -= freed; shmem_unacct_blocks(info->flags, freed); @@ -671,7 +671,7 @@ static void shmem_truncate_range(struct done1: shmem_dir_unmap(dir); done2: - if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) { + if (mapping_nrpages(inode->i_mapping) && (info->flags & SHMEM_PAGEIN)) { /* * Call truncate_inode_pages again: racing shmem_unuse_inode * may have swizzled a page in from swap since vmtruncate or Index: linux-2.6.24.7/mm/swap_state.c =================================================================== --- linux-2.6.24.7.orig/mm/swap_state.c +++ linux-2.6.24.7/mm/swap_state.c @@ -87,7 +87,7 @@ static int __add_to_swap_cache(struct pa page_cache_get(page); SetPageSwapCache(page); set_page_private(page, entry.val); - total_swapcache_pages++; + mapping_nrpages_inc(&swapper_space); __inc_zone_page_state(page, NR_FILE_PAGES); } spin_unlock_irq(&swapper_space.tree_lock); @@ -136,7 +136,7 @@ void __delete_from_swap_cache(struct pag radix_tree_delete(&swapper_space.page_tree, page_private(page)); set_page_private(page, 0); ClearPageSwapCache(page); - total_swapcache_pages--; + mapping_nrpages_dec(&swapper_space); __dec_zone_page_state(page, NR_FILE_PAGES); INC_CACHE_INFO(del_total); } Index: linux-2.6.24.7/mm/truncate.c =================================================================== --- linux-2.6.24.7.orig/mm/truncate.c +++ linux-2.6.24.7/mm/truncate.c @@ -167,7 +167,7 @@ void truncate_inode_pages_range(struct a pgoff_t next; int i; - if (mapping->nrpages == 0) + if (mapping_nrpages(mapping) == 0) return; BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lock_page_ref.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000032456�11041657734�015060� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: mm: lock_page_ref Change the PG_nonewref operations into locking primitives and place them so that they provide page level serialization with regard to the page_tree operations. (basically replace the tree_lock with a per page lock). The normal page lock has sufficiently different (and overlapping) scope and protection rules that this second lock is needed. 
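For illustration only, a rough user-space model of the idea using C11 atomics: one bit of a per-page flags word acts as a tiny spinlock that the write side holds around page-tree updates, while reference-takers wait for it to clear before raising the count. The toy_page type and all helper names are invented for this sketch, and the ordering is a simplification of what page_cache_get_speculative() actually does:

/* Illustration only: user-space model, not the kernel code. */
#include <stdatomic.h>
#include <stdbool.h>

#define REF_LOCK_BIT 0x1UL              /* plays the role of PG_nonewrefs */

struct toy_page {
        atomic_ulong flags;
        atomic_int count;
};

/* Write side: hold the per-page bit lock around any page-tree update. */
void ref_lock(struct toy_page *p)
{
        while (atomic_fetch_or_explicit(&p->flags, REF_LOCK_BIT,
                                        memory_order_acquire) & REF_LOCK_BIT)
                ;                       /* spin until the holder clears the bit */
}

void ref_unlock(struct toy_page *p)
{
        atomic_fetch_and_explicit(&p->flags, ~REF_LOCK_BIT,
                                  memory_order_release);
}

/* Read side: only raise the refcount once no writer holds the bit,
 * and never resurrect a page whose count already dropped to zero. */
bool get_ref_speculative(struct toy_page *p)
{
        while (atomic_load_explicit(&p->flags, memory_order_acquire) & REF_LOCK_BIT)
                ;                       /* the model's wait_on_page_ref() */

        int old = atomic_load_explicit(&p->count, memory_order_relaxed);
        do {
                if (old == 0)
                        return false;   /* page is being freed, back off */
        } while (!atomic_compare_exchange_weak_explicit(&p->count, &old, old + 1,
                                                        memory_order_acq_rel,
                                                        memory_order_relaxed));
        return true;
}

int main(void)
{
        struct toy_page page = { .flags = 0, .count = 1 };

        ref_lock(&page);                /* writer: about to touch the tree slot */
        /* ...the radix tree update would go here... */
        ref_unlock(&page);

        return get_ref_speculative(&page) ? 0 : 1;
}

In the patch itself the bit is PG_nonewrefs and the lock/unlock pair is built on bit_spin_lock(), as the include/linux/pagemap.h hunk below shows.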
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- fs/buffer.c | 6 ++++-- include/linux/page-flags.h | 21 --------------------- include/linux/pagemap.h | 45 +++++++++++++++++++++++++++++++++++++++++++-- mm/filemap.c | 14 ++++++++------ mm/migrate.c | 25 +++++++++++++------------ mm/page-writeback.c | 18 ++++++++++++------ mm/swap_state.c | 14 ++++++++------ mm/swapfile.c | 6 ++++-- mm/truncate.c | 9 ++++++--- mm/vmscan.c | 14 +++++++------- 10 files changed, 105 insertions(+), 67 deletions(-) Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -697,7 +697,8 @@ static int __set_page_dirty(struct page if (TestSetPageDirty(page)) return 0; - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); @@ -710,7 +711,8 @@ static int __set_page_dirty(struct page radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); return 1; Index: linux-2.6.24.7/include/linux/page-flags.h =================================================================== --- linux-2.6.24.7.orig/include/linux/page-flags.h +++ linux-2.6.24.7/include/linux/page-flags.h @@ -279,25 +279,4 @@ static inline void set_page_writeback(st test_set_page_writeback(page); } -static inline void set_page_nonewrefs(struct page *page) -{ - preempt_disable(); - SetPageNoNewRefs(page); - smp_wmb(); -} - -static inline void __clear_page_nonewrefs(struct page *page) -{ - smp_wmb(); - __ClearPageNoNewRefs(page); - preempt_enable(); -} - -static inline void clear_page_nonewrefs(struct page *page) -{ - smp_wmb(); - ClearPageNoNewRefs(page); - preempt_enable(); -} - #endif /* PAGE_FLAGS_H */ Index: linux-2.6.24.7/include/linux/pagemap.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pagemap.h +++ linux-2.6.24.7/include/linux/pagemap.h @@ -14,6 +14,7 @@ #include <linux/bitops.h> #include <linux/page-flags.h> #include <linux/hardirq.h> /* for in_interrupt() */ +#include <linux/bit_spinlock.h> /* * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page @@ -64,6 +65,47 @@ static inline void mapping_set_gfp_mask( #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); +static inline void lock_page_ref(struct page *page) +{ + bit_spin_lock(PG_nonewrefs, &page->flags); + smp_wmb(); +} + +static inline void unlock_page_ref(struct page *page) +{ + bit_spin_unlock(PG_nonewrefs, &page->flags); +} + +static inline void wait_on_page_ref(struct page *page) +{ + while (unlikely(test_bit(PG_nonewrefs, &page->flags))) + cpu_relax(); +} + +#define lock_page_ref_irq(page) \ + do { \ + local_irq_disable(); \ + lock_page_ref(page); \ + } while (0) + +#define unlock_page_ref_irq(page) \ + do { \ + unlock_page_ref(page); \ + local_irq_enable(); \ + } while (0) + +#define lock_page_ref_irqsave(page, flags) \ + do { \ + local_irq_save(flags); \ + lock_page_ref(page); \ + } while (0) + +#define unlock_page_ref_irqrestore(page, flags) \ + do { \ + unlock_page_ref(page); \ + local_irq_restore(flags); \ + } while (0) + /* * speculatively take a reference to a page. 
* If the page is free (_count == 0), then _count is untouched, and 0 @@ -139,8 +181,7 @@ static inline int page_cache_get_specula * page refcount has been raised. See below comment. */ - while (unlikely(PageNoNewRefs(page))) - cpu_relax(); + wait_on_page_ref(page); /* * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs()) Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -144,9 +144,11 @@ void remove_from_page_cache(struct page BUG_ON(!PageLocked(page)); - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); __remove_from_page_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); } static int sync_page(void *word) @@ -456,8 +458,8 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { - set_page_nonewrefs(page); - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); if (!error) { page_cache_get(page); @@ -467,8 +469,8 @@ int add_to_page_cache(struct page *page, mapping_nrpages_inc(mapping); __inc_zone_page_state(page, NR_FILE_PAGES); } - spin_unlock_irq(&mapping->tree_lock); - clear_page_nonewrefs(page); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); radix_tree_preload_end(); } return error; Index: linux-2.6.24.7/mm/migrate.c =================================================================== --- linux-2.6.24.7.orig/mm/migrate.c +++ linux-2.6.24.7/mm/migrate.c @@ -303,16 +303,16 @@ static int migrate_page_move_mapping(str return 0; } - set_page_nonewrefs(page); - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); if (page_count(page) != 2 + !!PagePrivate(page) || (struct page *)radix_tree_deref_slot(pslot) != page) { - spin_unlock_irq(&mapping->tree_lock); - clear_page_nonewrefs(page); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); return -EAGAIN; } @@ -329,14 +329,7 @@ static int migrate_page_move_mapping(str radix_tree_replace_slot(pslot, newpage); page->mapping = NULL; - spin_unlock_irq(&mapping->tree_lock); - clear_page_nonewrefs(page); - - /* - * Drop cache reference from old page. - * We know this isn't the last reference. - */ - __put_page(page); + spin_unlock(&mapping->tree_lock); /* * If moved to a different zone then also account @@ -351,6 +344,14 @@ static int migrate_page_move_mapping(str __dec_zone_page_state(page, NR_FILE_PAGES); __inc_zone_page_state(newpage, NR_FILE_PAGES); + unlock_page_ref_irq(page); + + /* + * Drop cache reference from old page. + * We know this isn't the last reference. + */ + __put_page(page); + return 0; } Index: linux-2.6.24.7/mm/page-writeback.c =================================================================== --- linux-2.6.24.7.orig/mm/page-writeback.c +++ linux-2.6.24.7/mm/page-writeback.c @@ -1008,7 +1008,8 @@ int __set_page_dirty_nobuffers(struct pa if (!mapping) return 1; - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); mapping2 = page_mapping(page); if (mapping2) { /* Race with truncate? 
*/ BUG_ON(mapping2 != mapping); @@ -1022,7 +1023,8 @@ int __set_page_dirty_nobuffers(struct pa radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); @@ -1178,7 +1180,8 @@ int test_clear_page_writeback(struct pag struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; - spin_lock_irqsave(&mapping->tree_lock, flags); + lock_page_ref_irqsave(page, flags); + spin_lock(&mapping->tree_lock); ret = TestClearPageWriteback(page); if (ret) { radix_tree_tag_clear(&mapping->page_tree, @@ -1189,7 +1192,8 @@ int test_clear_page_writeback(struct pag __bdi_writeout_inc(bdi); } } - spin_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irqrestore(page, flags); } else { ret = TestClearPageWriteback(page); } @@ -1207,7 +1211,8 @@ int test_set_page_writeback(struct page struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; - spin_lock_irqsave(&mapping->tree_lock, flags); + lock_page_ref_irqsave(page, flags); + spin_lock(&mapping->tree_lock); ret = TestSetPageWriteback(page); if (!ret) { radix_tree_tag_set(&mapping->page_tree, @@ -1220,7 +1225,8 @@ int test_set_page_writeback(struct page radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); - spin_unlock_irqrestore(&mapping->tree_lock, flags); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irqrestore(page, flags); } else { ret = TestSetPageWriteback(page); } Index: linux-2.6.24.7/mm/swap_state.c =================================================================== --- linux-2.6.24.7.orig/mm/swap_state.c +++ linux-2.6.24.7/mm/swap_state.c @@ -79,8 +79,8 @@ static int __add_to_swap_cache(struct pa BUG_ON(PagePrivate(page)); error = radix_tree_preload(gfp_mask); if (!error) { - set_page_nonewrefs(page); - spin_lock_irq(&swapper_space.tree_lock); + lock_page_ref_irq(page); + spin_lock(&swapper_space.tree_lock); error = radix_tree_insert(&swapper_space.page_tree, entry.val, page); if (!error) { @@ -90,8 +90,8 @@ static int __add_to_swap_cache(struct pa mapping_nrpages_inc(&swapper_space); __inc_zone_page_state(page, NR_FILE_PAGES); } - spin_unlock_irq(&swapper_space.tree_lock); - clear_page_nonewrefs(page); + spin_unlock(&swapper_space.tree_lock); + unlock_page_ref_irq(page); radix_tree_preload_end(); } return error; @@ -205,9 +205,11 @@ void delete_from_swap_cache(struct page entry.val = page_private(page); - spin_lock_irq(&swapper_space.tree_lock); + lock_page_ref_irq(page); + spin_lock(&swapper_space.tree_lock); __delete_from_swap_cache(page); - spin_unlock_irq(&swapper_space.tree_lock); + spin_unlock(&swapper_space.tree_lock); + unlock_page_ref_irq(page); swap_free(entry); page_cache_release(page); Index: linux-2.6.24.7/mm/swapfile.c =================================================================== --- linux-2.6.24.7.orig/mm/swapfile.c +++ linux-2.6.24.7/mm/swapfile.c @@ -367,13 +367,15 @@ int remove_exclusive_swap_page(struct pa retval = 0; if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the swapcache lock held.. 
*/ - spin_lock_irq(&swapper_space.tree_lock); + lock_page_ref_irq(page); + spin_lock(&swapper_space.tree_lock); if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; } - spin_unlock_irq(&swapper_space.tree_lock); + spin_unlock(&swapper_space.tree_lock); + unlock_page_ref_irq(page); } spin_unlock(&swap_lock); Index: linux-2.6.24.7/mm/truncate.c =================================================================== --- linux-2.6.24.7.orig/mm/truncate.c +++ linux-2.6.24.7/mm/truncate.c @@ -350,18 +350,21 @@ invalidate_complete_page2(struct address if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); if (PageDirty(page)) goto failed; BUG_ON(PagePrivate(page)); __remove_from_page_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); ClearPageUptodate(page); page_cache_release(page); /* pagecache ref */ return 1; failed: - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); return 0; } Index: linux-2.6.24.7/mm/vmscan.c =================================================================== --- linux-2.6.24.7.orig/mm/vmscan.c +++ linux-2.6.24.7/mm/vmscan.c @@ -385,8 +385,8 @@ int remove_mapping(struct address_space BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - set_page_nonewrefs(page); - spin_lock_irq(&mapping->tree_lock); + lock_page_ref_irq(page); + spin_lock(&mapping->tree_lock); /* * The non racy check for a busy page. * @@ -421,22 +421,22 @@ int remove_mapping(struct address_space if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; __delete_from_swap_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); swap_free(swap); goto free_it; } __remove_from_page_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&mapping->tree_lock); free_it: - __clear_page_nonewrefs(page); + unlock_page_ref_irq(page); __put_page(page); /* The pagecache ref */ return 1; cannot_free: - spin_unlock_irq(&mapping->tree_lock); - clear_page_nonewrefs(page); + spin_unlock(&mapping->tree_lock); + unlock_page_ref_irq(page); return 0; } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mm-concurrent-pagecache.patch���������������������������������������������������������������0000664�0000764�0000764�00000037512�11041657735�016766� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: mm: concurrent pagecache write side Remove the tree_lock, change address_space::nrpages to atomic_long_t because its not protected any longer and use the concurrent radix tree API to protect the modifying radix tree operations. The tree_lock is actually renamed to priv_lock and its only remaining user will be the __flush_dcache_page logic on arm an parisc. Another potential user would be the per address_space node mask allocation Christoph is working on. 
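For illustration only, a small stand-alone C program showing why the page count has to become atomic once no single lock covers it; the churn() thread body stands in for concurrent add/remove paths and every name here is invented for the sketch:

/* Illustration only: user-space model, not the kernel code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/*
 * Stand-in for address_space::__nrpages once no tree_lock covers it:
 * concurrent insertions and removals may update the count at the same
 * time, so the counter itself must be atomic.
 */
static atomic_long nrpages;

static void *churn(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++) {
                atomic_fetch_add_explicit(&nrpages, 1, memory_order_relaxed); /* add side */
                atomic_fetch_sub_explicit(&nrpages, 1, memory_order_relaxed); /* remove side */
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, churn, NULL);
        pthread_create(&b, NULL, churn, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* Always prints 0; with a plain unsigned long and no lock,
         * interleaved updates could be lost. */
        printf("nrpages = %ld\n", atomic_load(&nrpages));
        return 0;
}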
[ BUG: the NFS client code seems to rely on mapping->tree_lock in some hidden way, which makes it crash... ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- fs/buffer.c | 7 ++++--- fs/inode.c | 2 +- include/asm-arm/cacheflush.h | 4 ++-- include/asm-parisc/cacheflush.h | 4 ++-- include/linux/fs.h | 12 ++++++------ mm/filemap.c | 17 +++++++++-------- mm/migrate.c | 12 ++++++------ mm/page-writeback.c | 33 +++++++++++++++++++-------------- mm/swap_state.c | 18 ++++++++++-------- mm/swapfile.c | 2 -- mm/truncate.c | 3 --- mm/vmscan.c | 4 ---- 12 files changed, 59 insertions(+), 59 deletions(-) Index: linux-2.6.24.7/fs/buffer.c =================================================================== --- linux-2.6.24.7.orig/fs/buffer.c +++ linux-2.6.24.7/fs/buffer.c @@ -698,8 +698,8 @@ static int __set_page_dirty(struct page return 0; lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? */ + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); WARN_ON_ONCE(warn && !PageUptodate(page)); if (mapping_cap_account_dirty(mapping)) { @@ -708,10 +708,11 @@ static int __set_page_dirty(struct page BDI_RECLAIMABLE); task_io_account_write(PAGE_CACHE_SIZE); } - radix_tree_tag_set(&mapping->page_tree, + radix_tree_lock(&ctx); + radix_tree_tag_set(ctx.tree, page_index(page), PAGECACHE_TAG_DIRTY); + radix_tree_unlock(&ctx); } - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); Index: linux-2.6.24.7/fs/inode.c =================================================================== --- linux-2.6.24.7.orig/fs/inode.c +++ linux-2.6.24.7/fs/inode.c @@ -209,7 +209,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_dentry); INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); - spin_lock_init(&inode->i_data.tree_lock); + spin_lock_init(&inode->i_data.priv_lock); spin_lock_init(&inode->i_data.i_mmap_lock); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); Index: linux-2.6.24.7/include/asm-arm/cacheflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/cacheflush.h +++ linux-2.6.24.7/include/asm-arm/cacheflush.h @@ -413,9 +413,9 @@ static inline void flush_anon_page(struc } #define flush_dcache_mmap_lock(mapping) \ - spin_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->priv_lock) #define flush_dcache_mmap_unlock(mapping) \ - spin_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->priv_lock) #define flush_icache_user_range(vma,page,addr,len) \ flush_dcache_page(page) Index: linux-2.6.24.7/include/asm-parisc/cacheflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-parisc/cacheflush.h +++ linux-2.6.24.7/include/asm-parisc/cacheflush.h @@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm extern void flush_dcache_page(struct page *page); #define flush_dcache_mmap_lock(mapping) \ - spin_lock_irq(&(mapping)->tree_lock) + spin_lock_irq(&(mapping)->priv_lock) #define flush_dcache_mmap_unlock(mapping) \ - spin_unlock_irq(&(mapping)->tree_lock) + spin_unlock_irq(&(mapping)->priv_lock) #define flush_icache_page(vma,page) do { \ flush_kernel_dcache_page(page); \ Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -499,13 +499,13 @@ struct 
backing_dev_info; struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ - spinlock_t tree_lock; /* and lock protecting it */ + spinlock_t priv_lock; /* spinlock protecting various stuffs */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ spinlock_t i_mmap_lock; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ - unsigned long __nrpages; /* number of total pages */ + atomic_long_t __nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ const struct address_space_operations *a_ops; /* methods */ unsigned long flags; /* error bits/gfp mask */ @@ -522,22 +522,22 @@ struct address_space { static inline void mapping_nrpages_init(struct address_space *mapping) { - mapping->__nrpages = 0; + mapping->__nrpages = (atomic_long_t)ATOMIC_LONG_INIT(0); } static inline unsigned long mapping_nrpages(struct address_space *mapping) { - return mapping->__nrpages; + return (unsigned long)atomic_long_read(&mapping->__nrpages); } static inline void mapping_nrpages_inc(struct address_space *mapping) { - mapping->__nrpages++; + atomic_long_inc(&mapping->__nrpages); } static inline void mapping_nrpages_dec(struct address_space *mapping) { - mapping->__nrpages--; + atomic_long_dec(&mapping->__nrpages); } struct block_device { Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -118,8 +118,11 @@ generic_file_direct_IO(int rw, struct ki void __remove_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); - radix_tree_delete(&mapping->page_tree, page->index); + radix_tree_lock(&ctx); + radix_tree_delete(ctx.tree, page->index); + radix_tree_unlock(&ctx); page->mapping = NULL; mapping_nrpages_dec(mapping); __dec_zone_page_state(page, NR_FILE_PAGES); @@ -140,14 +143,10 @@ void __remove_from_page_cache(struct pag void remove_from_page_cache(struct page *page) { - struct address_space *mapping = page->mapping; - BUG_ON(!PageLocked(page)); lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); __remove_from_page_cache(page); - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); } @@ -458,9 +457,12 @@ int add_to_page_cache(struct page *page, int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); if (error == 0) { + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); + lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); - error = radix_tree_insert(&mapping->page_tree, offset, page); + radix_tree_lock(&ctx); + error = radix_tree_insert(ctx.tree, offset, page); + radix_tree_unlock(&ctx); if (!error) { page_cache_get(page); SetPageLocked(page); @@ -469,7 +471,6 @@ int add_to_page_cache(struct page *page, mapping_nrpages_inc(mapping); __inc_zone_page_state(page, NR_FILE_PAGES); } - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); radix_tree_preload_end(); } Index: linux-2.6.24.7/mm/migrate.c =================================================================== --- linux-2.6.24.7.orig/mm/migrate.c +++ linux-2.6.24.7/mm/migrate.c @@ -295,6 +295,7 @@ static int migrate_page_move_mapping(str struct page *newpage, struct page *page) { void **pslot; + struct radix_tree_context ctx; 
if (!mapping) { /* Anonymous page without mapping */ @@ -303,15 +304,14 @@ static int migrate_page_move_mapping(str return 0; } + init_radix_tree_context(&ctx, &mapping->page_tree); lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); - - pslot = radix_tree_lookup_slot(&mapping->page_tree, - page_index(page)); + radix_tree_lock(&ctx); + pslot = radix_tree_lookup_slot(ctx.tree, page_index(page)); if (page_count(page) != 2 + !!PagePrivate(page) || (struct page *)radix_tree_deref_slot(pslot) != page) { - spin_unlock(&mapping->tree_lock); + radix_tree_unlock(&ctx); unlock_page_ref_irq(page); return -EAGAIN; } @@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str radix_tree_replace_slot(pslot, newpage); page->mapping = NULL; - spin_unlock(&mapping->tree_lock); + radix_tree_unlock(&ctx); /* * If moved to a different zone then also account Index: linux-2.6.24.7/mm/page-writeback.c =================================================================== --- linux-2.6.24.7.orig/mm/page-writeback.c +++ linux-2.6.24.7/mm/page-writeback.c @@ -1009,9 +1009,10 @@ int __set_page_dirty_nobuffers(struct pa return 1; lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); mapping2 = page_mapping(page); if (mapping2) { /* Race with truncate? */ + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); + BUG_ON(mapping2 != mapping); WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); if (mapping_cap_account_dirty(mapping)) { @@ -1020,10 +1021,11 @@ int __set_page_dirty_nobuffers(struct pa BDI_RECLAIMABLE); task_io_account_write(PAGE_CACHE_SIZE); } - radix_tree_tag_set(&mapping->page_tree, + radix_tree_lock(&ctx); + radix_tree_tag_set(ctx.tree, page_index(page), PAGECACHE_TAG_DIRTY); + radix_tree_unlock(&ctx); } - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); if (mapping->host) { /* !PageAnon && !swapper_space */ @@ -1181,18 +1183,19 @@ int test_clear_page_writeback(struct pag unsigned long flags; lock_page_ref_irqsave(page, flags); - spin_lock(&mapping->tree_lock); ret = TestClearPageWriteback(page); if (ret) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); + + radix_tree_lock(&ctx); + radix_tree_tag_clear(ctx.tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + radix_tree_unlock(&ctx); if (bdi_cap_writeback_dirty(bdi)) { __dec_bdi_stat(bdi, BDI_WRITEBACK); __bdi_writeout_inc(bdi); } } - spin_unlock(&mapping->tree_lock); unlock_page_ref_irqrestore(page, flags); } else { ret = TestClearPageWriteback(page); @@ -1210,22 +1213,24 @@ int test_set_page_writeback(struct page if (mapping) { struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; + DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree); lock_page_ref_irqsave(page, flags); - spin_lock(&mapping->tree_lock); ret = TestSetPageWriteback(page); if (!ret) { - radix_tree_tag_set(&mapping->page_tree, - page_index(page), + radix_tree_lock(&ctx); + radix_tree_tag_set(ctx.tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + radix_tree_unlock(&ctx); if (bdi_cap_writeback_dirty(bdi)) __inc_bdi_stat(bdi, BDI_WRITEBACK); } - if (!PageDirty(page)) - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), + if (!PageDirty(page)) { + radix_tree_lock(&ctx); + radix_tree_tag_clear(ctx.tree, page_index(page), PAGECACHE_TAG_DIRTY); - spin_unlock(&mapping->tree_lock); + radix_tree_unlock(&ctx); + } unlock_page_ref_irqrestore(page, flags); } else { ret = TestSetPageWriteback(page); Index: linux-2.6.24.7/mm/swap_state.c 
=================================================================== --- linux-2.6.24.7.orig/mm/swap_state.c +++ linux-2.6.24.7/mm/swap_state.c @@ -38,7 +38,6 @@ static struct backing_dev_info swap_back struct address_space swapper_space = { .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), - .tree_lock = __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock), .a_ops = &swap_aops, .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear), .backing_dev_info = &swap_backing_dev_info, @@ -79,10 +78,12 @@ static int __add_to_swap_cache(struct pa BUG_ON(PagePrivate(page)); error = radix_tree_preload(gfp_mask); if (!error) { + DEFINE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree); + lock_page_ref_irq(page); - spin_lock(&swapper_space.tree_lock); - error = radix_tree_insert(&swapper_space.page_tree, - entry.val, page); + radix_tree_lock(&ctx); + error = radix_tree_insert(ctx.tree, entry.val, page); + radix_tree_unlock(&ctx); if (!error) { page_cache_get(page); SetPageSwapCache(page); @@ -90,7 +91,6 @@ static int __add_to_swap_cache(struct pa mapping_nrpages_inc(&swapper_space); __inc_zone_page_state(page, NR_FILE_PAGES); } - spin_unlock(&swapper_space.tree_lock); unlock_page_ref_irq(page); radix_tree_preload_end(); } @@ -128,12 +128,16 @@ static int add_to_swap_cache(struct page */ void __delete_from_swap_cache(struct page *page) { + DEFINE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree); + BUG_ON(!PageLocked(page)); BUG_ON(!PageSwapCache(page)); BUG_ON(PageWriteback(page)); BUG_ON(PagePrivate(page)); - radix_tree_delete(&swapper_space.page_tree, page_private(page)); + radix_tree_lock(&ctx); + radix_tree_delete(ctx.tree, page_private(page)); + radix_tree_unlock(&ctx); set_page_private(page, 0); ClearPageSwapCache(page); mapping_nrpages_dec(&swapper_space); @@ -206,9 +210,7 @@ void delete_from_swap_cache(struct page entry.val = page_private(page); lock_page_ref_irq(page); - spin_lock(&swapper_space.tree_lock); __delete_from_swap_cache(page); - spin_unlock(&swapper_space.tree_lock); unlock_page_ref_irq(page); swap_free(entry); Index: linux-2.6.24.7/mm/swapfile.c =================================================================== --- linux-2.6.24.7.orig/mm/swapfile.c +++ linux-2.6.24.7/mm/swapfile.c @@ -368,13 +368,11 @@ int remove_exclusive_swap_page(struct pa if (p->swap_map[swp_offset(entry)] == 1) { /* Recheck the page count with the swapcache lock held.. 
*/ lock_page_ref_irq(page); - spin_lock(&swapper_space.tree_lock); if ((page_count(page) == 2) && !PageWriteback(page)) { __delete_from_swap_cache(page); SetPageDirty(page); retval = 1; } - spin_unlock(&swapper_space.tree_lock); unlock_page_ref_irq(page); } spin_unlock(&swap_lock); Index: linux-2.6.24.7/mm/truncate.c =================================================================== --- linux-2.6.24.7.orig/mm/truncate.c +++ linux-2.6.24.7/mm/truncate.c @@ -351,19 +351,16 @@ invalidate_complete_page2(struct address return 0; lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); if (PageDirty(page)) goto failed; BUG_ON(PagePrivate(page)); __remove_from_page_cache(page); - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); ClearPageUptodate(page); page_cache_release(page); /* pagecache ref */ return 1; failed: - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); return 0; } Index: linux-2.6.24.7/mm/vmscan.c =================================================================== --- linux-2.6.24.7.orig/mm/vmscan.c +++ linux-2.6.24.7/mm/vmscan.c @@ -386,7 +386,6 @@ int remove_mapping(struct address_space BUG_ON(mapping != page_mapping(page)); lock_page_ref_irq(page); - spin_lock(&mapping->tree_lock); /* * The non racy check for a busy page. * @@ -421,13 +420,11 @@ int remove_mapping(struct address_space if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; __delete_from_swap_cache(page); - spin_unlock(&mapping->tree_lock); swap_free(swap); goto free_it; } __remove_from_page_cache(page); - spin_unlock(&mapping->tree_lock); free_it: unlock_page_ref_irq(page); @@ -435,7 +432,6 @@ free_it: return 1; cannot_free: - spin_unlock(&mapping->tree_lock); unlock_page_ref_irq(page); return 0; } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/radix-tree-optimistic.patch�����������������������������������������������������������������0000664�0000764�0000764�00000026562�11041657731�016524� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: radix-tree: optimistic locking Implement optimistic locking for the concurrent radix tree. Optimistic locking is aimed at avoiding taking higher level node locks. We decent the tree using an RCU lookup, looking for the lowest modification termination point. If found, we try to acquire the lock of that node. After we have obtained this lock, we will need to validate if the initial conditions still hold true. We do this by repeating the steps that found us this node in the first place. 
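For illustration, a compressed user-space sketch of the descend, lock and revalidate cycle described above. Every name in it (toy_node, toy_valid, toy_find_term, toy_optimistic_lock, the 4-bit fanout) is invented for the example, and pthread spinlocks stand in for RCU plus the per-node spinlock of the concurrent radix tree; it is a model of the idea, not the kernel code. When the recheck under the node lock fails, the caller falls back to taking the root lock, just as radix_optimistic_lock() below does.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FANOUT 16
#define TOY_SHIFT  4

struct toy_node {
	pthread_spinlock_t lock;
	int count;                        /* live slots below this node */
	int height;                       /* 1 == leaf level */
	struct toy_node *slots[TOY_FANOUT];
};

/* "would the modification terminate here?" -- in this toy: slot already populated */
static bool toy_valid(struct toy_node *node, int offset)
{
	return node->slots[offset] != NULL;
}

/* lockless descent, remembering the lowest node where the change terminates */
static struct toy_node *toy_find_term(struct toy_node *root,
				      unsigned long index, int *offsetp)
{
	struct toy_node *node = root, *term = NULL;

	while (node) {
		int shift  = (node->height - 1) * TOY_SHIFT;
		int offset = (index >> shift) & (TOY_FANOUT - 1);

		if (toy_valid(node, offset)) {
			term = node;
			*offsetp = offset;
		}
		if (node->height == 1)
			break;
		node = node->slots[offset];
	}
	return term;
}

/* optimistic lock: lock the candidate node, then repeat the checks under the lock */
static struct toy_node *toy_optimistic_lock(struct toy_node *root,
					    unsigned long index)
{
	int offset = 0;
	struct toy_node *node = toy_find_term(root, index, &offset);

	if (!node)
		return NULL;                    /* caller falls back to the root lock */

	pthread_spin_lock(&node->lock);
	if (node->count && toy_valid(node, offset))
		return node;                    /* returned with node->lock held */

	pthread_spin_unlock(&node->lock);
	return NULL;                            /* raced: fall back to the root lock */
}

The win, in the common case, is that a modification only serializes on one low-level node instead of holding the ladder locks all the way down from the root.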
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/radix-tree.h | 27 +++++- init/Kconfig | 6 + lib/radix-tree.c | 194 +++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 206 insertions(+), 21 deletions(-) Index: linux-2.6.24.7/include/linux/radix-tree.h =================================================================== --- linux-2.6.24.7.orig/include/linux/radix-tree.h +++ linux-2.6.24.7/include/linux/radix-tree.h @@ -197,28 +197,47 @@ static inline void radix_tree_replace_sl rcu_assign_pointer(*pslot, item); } +#if defined(CONFIG_RADIX_TREE_OPTIMISTIC) +static inline void radix_tree_lock(struct radix_tree_context *context) +{ + rcu_read_lock(); + BUG_ON(context->locked); +} +#elif defined(CONFIG_RADIX_TREE_CONCURRENT) static inline void radix_tree_lock(struct radix_tree_context *context) { struct radix_tree_root *root = context->root; + rcu_read_lock(); spin_lock(&root->lock); -#ifdef CONFIG_RADIX_TREE_CONCURRENT BUG_ON(context->locked); context->locked = &root->lock; -#endif } +#else +static inline void radix_tree_lock(struct radix_tree_context *context) +{ + struct radix_tree_root *root = context->root; + + rcu_read_lock(); + spin_lock(&root->lock); +} +#endif +#if defined(CONFIG_RADIX_TREE_CONCURRENT) static inline void radix_tree_unlock(struct radix_tree_context *context) { -#ifdef CONFIG_RADIX_TREE_CONCURRENT BUG_ON(!context->locked); spin_unlock(context->locked); context->locked = NULL; + rcu_read_unlock(); +} #else +static inline void radix_tree_unlock(struct radix_tree_context *context) +{ spin_unlock(&context->root->lock); -#endif rcu_read_unlock(); } +#endif int radix_tree_insert(struct radix_tree_root *, unsigned long, void *); void *radix_tree_lookup(struct radix_tree_root *, unsigned long); Index: linux-2.6.24.7/init/Kconfig =================================================================== --- linux-2.6.24.7.orig/init/Kconfig +++ linux-2.6.24.7/init/Kconfig @@ -437,8 +437,14 @@ config SYSCTL config RADIX_TREE_CONCURRENT bool "Enable concurrent radix tree operations (EXPERIMENTAL)" + depends on EXPERIMENTAL default y if SMP +config RADIX_TREE_OPTIMISTIC + bool "Enabled optimistic locking (EXPERIMENTAL)" + depends on RADIX_TREE_CONCURRENT + default y + menuconfig EMBEDDED bool "Configure standard kernel features (for small systems)" help Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -375,6 +375,117 @@ static inline void radix_path_unlock(str #define radix_path_unlock(context, punlock) do { } while (0) #endif +#ifdef CONFIG_RADIX_TREE_OPTIMISTIC +typedef int (*radix_valid_fn)(struct radix_tree_node *, int, int); + +static struct radix_tree_node * +radix_optimistic_lookup(struct radix_tree_context *context, unsigned long index, + int tag, radix_valid_fn valid) +{ + unsigned int height, shift; + struct radix_tree_node *node, *ret = NULL, **slot; + struct radix_tree_root *root = context->root; + + node = rcu_dereference(root->rnode); + if (node == NULL) + return NULL; + + if (!radix_tree_is_indirect_ptr(node)) + return NULL; + + node = radix_tree_indirect_to_ptr(node); + + height = node->height; + if (index > radix_tree_maxindex(height)) + return NULL; + + shift = (height-1) * RADIX_TREE_MAP_SHIFT; + do { + int offset = (index >> shift) & RADIX_TREE_MAP_MASK; + if ((*valid)(node, offset, tag)) + ret = node; + slot = (struct radix_tree_node **)(node->slots + offset); + node = rcu_dereference(*slot); + if 
(!node) + break; + + shift -= RADIX_TREE_MAP_SHIFT; + height--; + } while (height > 0); + + return ret; +} + +static struct radix_tree_node * +__radix_optimistic_lock(struct radix_tree_context *context, unsigned long index, + int tag, radix_valid_fn valid) +{ + struct radix_tree_node *node; + spinlock_t *locked; + unsigned int shift, offset; + + node = radix_optimistic_lookup(context, index, tag, valid); + if (!node) + goto out; + + locked = radix_node_lock(context->root, node); + if (!locked) + goto out; + +#if 0 + if (node != radix_optimistic_lookup(context, index, tag, valid)) + goto out_unlock; +#else + /* check if the node got freed */ + if (!node->count) + goto out_unlock; + + /* check if the node is still a valid termination point */ + shift = (node->height - 1) * RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + if (!(*valid)(node, offset, tag)) + goto out_unlock; +#endif + + context->locked = locked; + return node; + +out_unlock: + spin_unlock(locked); +out: + return NULL; +} + +static struct radix_tree_node * +radix_optimistic_lock(struct radix_tree_context *context, unsigned long index, + int tag, radix_valid_fn valid) +{ + struct radix_tree_node *node = NULL; + + if (context) { + node = __radix_optimistic_lock(context, index, tag, valid); + if (!node) { + BUG_ON(context->locked); + spin_lock(&context->root->lock); + context->locked = &context->root->lock; + } + } + return node; +} + +static int radix_valid_always(struct radix_tree_node *node, int offset, int tag) +{ + return 1; +} + +static int radix_valid_tag(struct radix_tree_node *node, int offset, int tag) +{ + return tag_get(node, tag, offset); +} +#else +#define radix_optimistic_lock(context, index, tag, valid) NULL +#endif + /** * radix_tree_insert - insert into a radix tree * @root: radix tree root @@ -395,6 +506,13 @@ int radix_tree_insert(struct radix_tree_ BUG_ON(radix_tree_is_indirect_ptr(item)); + node = radix_optimistic_lock(context, index, 0, radix_valid_always); + if (node) { + height = node->height; + shift = (height-1) * RADIX_TREE_MAP_SHIFT; + goto optimistic; + } + /* Make sure the tree is high enough. 
*/ if (index > radix_tree_maxindex(root->height)) { error = radix_tree_extend(root, index); @@ -403,7 +521,6 @@ int radix_tree_insert(struct radix_tree_ } slot = radix_tree_indirect_to_ptr(root->rnode); - height = root->height; shift = (height-1) * RADIX_TREE_MAP_SHIFT; @@ -422,11 +539,11 @@ int radix_tree_insert(struct radix_tree_ } /* Go a level down */ - offset = (index >> shift) & RADIX_TREE_MAP_MASK; node = slot; - radix_ladder_lock(context, node); +optimistic: + offset = (index >> shift) & RADIX_TREE_MAP_MASK; slot = node->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; @@ -469,6 +586,10 @@ void **radix_tree_lookup_slot(struct rad struct radix_tree_node *node, **slot; RADIX_TREE_CONTEXT(context, root); + node = radix_optimistic_lock(context, index, 0, radix_valid_always); + if (node) + goto optimistic; + node = rcu_dereference(root->rnode); if (node == NULL) return NULL; @@ -480,6 +601,7 @@ void **radix_tree_lookup_slot(struct rad } node = radix_tree_indirect_to_ptr(node); +optimistic: height = node->height; if (index > radix_tree_maxindex(height)) return NULL; @@ -572,6 +694,13 @@ void *radix_tree_tag_set(struct radix_tr struct radix_tree_node *slot; RADIX_TREE_CONTEXT(context, root); + slot = radix_optimistic_lock(context, index, tag, radix_valid_tag); + if (slot) { + height = slot->height; + shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + goto optimistic; + } + height = root->height; BUG_ON(index > radix_tree_maxindex(height)); @@ -587,6 +716,7 @@ void *radix_tree_tag_set(struct radix_tr radix_ladder_lock(context, slot); +optimistic: offset = (index >> shift) & RADIX_TREE_MAP_MASK; if (!tag_get(slot, tag, offset)) tag_set(slot, tag, offset); @@ -603,13 +733,13 @@ EXPORT_SYMBOL(radix_tree_tag_set); /* * the change can never propagate upwards from here. 
*/ -static inline int radix_tree_unlock_tag(struct radix_tree_root *root, - struct radix_tree_path *pathp, int tag) +static +int radix_valid_tag_clear(struct radix_tree_node *node, int offset, int tag) { int this, other; - this = tag_get(pathp->node, tag, pathp->offset); - other = any_tag_set_but(pathp->node, tag, pathp->offset); + this = tag_get(node, tag, offset); + other = any_tag_set_but(node, tag, offset); return !this || other; } @@ -638,9 +768,22 @@ void *radix_tree_tag_clear(struct radix_ struct radix_tree_path path[RADIX_TREE_MAX_PATH + 1], *pathp = path; struct radix_tree_path *punlock = path, *piter; struct radix_tree_node *slot = NULL; - unsigned int height, shift; + unsigned int height, shift, offset; + RADIX_TREE_CONTEXT(context, root); + slot = radix_optimistic_lock(context, index, tag, + radix_valid_tag_clear); + if (slot) { + height = slot->height; + shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + pathp->offset = offset; + pathp->node = slot; + radix_path_init(context, pathp); + goto optimistic; + } + pathp->node = NULL; radix_path_init(context, pathp); @@ -652,8 +795,6 @@ void *radix_tree_tag_clear(struct radix_ slot = radix_tree_indirect_to_ptr(root->rnode); while (height > 0) { - int offset; - if (slot == NULL) goto out; @@ -663,11 +804,12 @@ void *radix_tree_tag_clear(struct radix_ pathp->node = slot; radix_path_lock(context, pathp, slot); - if (radix_tree_unlock_tag(root, pathp, tag)) { + if (radix_valid_tag_clear(slot, offset, tag)) { for (; punlock < pathp; punlock++) radix_path_unlock(context, punlock); } +optimistic: slot = slot->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; @@ -1214,14 +1356,20 @@ static inline void radix_tree_shrink(str } } -static inline int radix_tree_unlock_all(struct radix_tree_root *root, - struct radix_tree_path *pathp) +static +int radix_valid_delete(struct radix_tree_node *node, int offset, int tag) { - int tag; - int unlock = 1; + /* + * we need to check for > 2, because nodes with a single child + * can still be deleted, see radix_tree_shrink(). 
+ */ + int unlock = (node->count > 2); + + if (!unlock) + return unlock; for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { - if (!radix_tree_unlock_tag(root, pathp, tag)) { + if (!radix_valid_tag_clear(node, offset, tag)) { unlock = 0; break; } @@ -1253,6 +1401,17 @@ void *radix_tree_delete(struct radix_tre int offset; RADIX_TREE_CONTEXT(context, root); + slot = radix_optimistic_lock(context, index, 0, radix_valid_delete); + if (slot) { + height = slot->height; + shift = (height - 1) * RADIX_TREE_MAP_SHIFT; + offset = (index >> shift) & RADIX_TREE_MAP_MASK; + pathp->offset = offset; + pathp->node = slot; + radix_path_init(context, pathp); + goto optimistic; + } + pathp->node = NULL; radix_path_init(context, pathp); @@ -1280,11 +1439,12 @@ void *radix_tree_delete(struct radix_tre pathp->node = slot; radix_path_lock(context, pathp, slot); - if (slot->count > 2 && radix_tree_unlock_all(root, pathp)) { + if (radix_valid_delete(slot, offset, 0)) { for (; punlock < pathp; punlock++) radix_path_unlock(context, punlock); } +optimistic: slot = slot->slots[offset]; shift -= RADIX_TREE_MAP_SHIFT; height--; ����������������������������������������������������������������������������������������������������������������������������������������������patches/radix-tree-optimistic-hist.patch������������������������������������������������������������0000664�0000764�0000764�00000010417�11041657731�017461� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: debug: optimistic lock histogram A simple histogram measuring the efficiency of the optimistic locking Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- fs/proc/proc_misc.c | 22 +++++++++++ lib/radix-tree.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 124 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/fs/proc/proc_misc.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/proc_misc.c +++ linux-2.6.24.7/fs/proc/proc_misc.c @@ -302,6 +302,25 @@ static const struct file_operations proc .release = seq_release, }; +#ifdef CONFIG_RADIX_TREE_OPTIMISTIC +extern struct seq_operations optimistic_op; +static int optimistic_open(struct inode *inode, struct file *file) +{ + (void)inode; + return seq_open(file, &optimistic_op); +} + +extern ssize_t optimistic_write(struct file *, const char __user *, size_t, loff_t *); + +static struct file_operations optimistic_file_operations = { + .open = optimistic_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, + .write = optimistic_write, +}; +#endif + static int devinfo_show(struct seq_file *f, void *v) { int i = *(loff_t *) v; @@ -788,6 +807,9 @@ void __init proc_misc_init(void) entry->proc_fops = &proc_kmsg_operations; } #endif +#ifdef CONFIG_RADIX_TREE_OPTIMISTIC + create_seq_entry("radix_optimistic", 0, &optimistic_file_operations); +#endif create_seq_entry("locks", 0, &proc_locks_operations); create_seq_entry("devices", 0, &proc_devinfo_operations); create_seq_entry("cpuinfo", 0, &proc_cpuinfo_operations); Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -80,6 
+80,105 @@ static unsigned long height_to_maxindex[ static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH]; #endif +#ifdef CONFIG_RADIX_TREE_OPTIMISTIC +static DEFINE_PER_CPU(unsigned long[RADIX_TREE_MAX_PATH+1], optimistic_histogram); + +static void optimistic_hit(unsigned long height) +{ + if (height > RADIX_TREE_MAX_PATH) + height = RADIX_TREE_MAX_PATH; + + __get_cpu_var(optimistic_histogram)[height]++; +} + +#ifdef CONFIG_PROC_FS + +#include <linux/seq_file.h> +#include <linux/uaccess.h> + +static void *frag_start(struct seq_file *m, loff_t *pos) +{ + if (*pos < 0 || *pos > RADIX_TREE_MAX_PATH) + return NULL; + + m->private = (void *)(unsigned long)*pos; + return pos; +} + +static void *frag_next(struct seq_file *m, void *arg, loff_t *pos) +{ + if (*pos < RADIX_TREE_MAX_PATH) { + (*pos)++; + (*((unsigned long *)&m->private))++; + return pos; + } + return NULL; +} + +static void frag_stop(struct seq_file *m, void *arg) +{ +} + +unsigned long get_optimistic_stat(unsigned long index) +{ + unsigned long total = 0; + int cpu; + + for_each_possible_cpu(cpu) { + total += per_cpu(optimistic_histogram, cpu)[index]; + } + return total; +} + +static int frag_show(struct seq_file *m, void *arg) +{ + unsigned long index = (unsigned long)m->private; + unsigned long hits = get_optimistic_stat(index); + + if (index == 0) + seq_printf(m, "levels skipped\thits\n"); + + if (index < RADIX_TREE_MAX_PATH) + seq_printf(m, "%9lu\t%9lu\n", index, hits); + else + seq_printf(m, "failed\t%9lu\n", hits); + + return 0; +} + +struct seq_operations optimistic_op = { + .start = frag_start, + .next = frag_next, + .stop = frag_stop, + .show = frag_show, +}; + +static void optimistic_reset(void) +{ + int cpu; + int height; + for_each_possible_cpu(cpu) { + for (height = 0; height <= RADIX_TREE_MAX_PATH; height++) + per_cpu(optimistic_histogram, cpu)[height] = 0; + } +} + +ssize_t optimistic_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + if (count) { + char c; + if (get_user(c, buf)) + return -EFAULT; + if (c == '0') + optimistic_reset(); + } + return count; +} + +#endif // CONFIG_PROC_FS +#endif // CONFIG_RADIX_TREE_OPTIMISTIC + /* * Radix tree node cache. 
*/ @@ -468,7 +567,9 @@ radix_optimistic_lock(struct radix_tree_ BUG_ON(context->locked); spin_lock(&context->root->lock); context->locked = &context->root->lock; - } + optimistic_hit(RADIX_TREE_MAX_PATH); + } else + optimistic_hit(context->root->height - node->height); } return node; } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/radix-concurrent-lockdep.patch��������������������������������������������������������������0000664�0000764�0000764�00000002551�11041657730�017173� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- lib/radix-tree.c | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -79,6 +79,26 @@ static unsigned long height_to_maxindex[ #ifdef CONFIG_RADIX_TREE_CONCURRENT static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH]; #endif +#ifdef CONFIG_DEBUG_LOCK_ALLOC +static const char *radix_node_key_string[RADIX_TREE_MAX_PATH] = { + "radix-node-00", + "radix-node-01", + "radix-node-02", + "radix-node-03", + "radix-node-04", + "radix-node-05", + "radix-node-06", + "radix-node-07", + "radix-node-08", + "radix-node-09", + "radix-node-10", + "radix-node-11", + "radix-node-12", + "radix-node-13", + "radix-node-14", + "radix-node-15", +}; +#endif #ifdef CONFIG_RADIX_TREE_OPTIMISTIC static DEFINE_PER_CPU(unsigned long[RADIX_TREE_MAX_PATH+1], optimistic_histogram); @@ -224,7 +244,9 @@ radix_tree_node_alloc(struct radix_tree_ BUG_ON(radix_tree_is_indirect_ptr(ret)); #ifdef CONFIG_RADIX_TREE_CONCURRENT spin_lock_init(&ret->lock); - lockdep_set_class(&ret->lock, &radix_node_class[height]); + lockdep_set_class_and_name(&ret->lock, + &radix_node_class[height], + radix_node_key_string[height]); #endif ret->height = height; return ret; �������������������������������������������������������������������������������������������������������������������������������������������������������patches/mm-concurrent-pagecache-rt.patch������������������������������������������������������������0000664�0000764�0000764�00000011050�11041657732�017373� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: mm: -rt bits for concurrent pagecache Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/pagemap.h | 62 ++++++++++++++++++++++++++++++++++++++++++++---- mm/filemap.c | 17 ++----------- 2 files changed, 60 insertions(+), 19 deletions(-) Index: linux-2.6.24.7/include/linux/pagemap.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pagemap.h +++ linux-2.6.24.7/include/linux/pagemap.h @@ -15,6 +15,9 @@ 
#include <linux/page-flags.h> #include <linux/hardirq.h> /* for in_interrupt() */ #include <linux/bit_spinlock.h> +#include <linux/wait.h> +#include <linux/hash.h> +#include <linux/interrupt.h> /* * Bits in mapping->flags. The lower __GFP_BITS_SHIFT bits are the page @@ -65,6 +68,26 @@ static inline void mapping_set_gfp_mask( #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); +/* + * In order to wait for pages to become available there must be + * waitqueues associated with pages. By using a hash table of + * waitqueues where the bucket discipline is to maintain all + * waiters on the same queue and wake all when any of the pages + * become available, and for the woken contexts to check to be + * sure the appropriate page became available, this saves space + * at a cost of "thundering herd" phenomena during rare hash + * collisions. + */ +static inline wait_queue_head_t *page_waitqueue(struct page *page) +{ + const struct zone *zone = page_zone(page); + + return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; +} + +extern int __sleep_on_page(void *); + +#ifndef CONFIG_PREEMPT_RT static inline void lock_page_ref(struct page *page) { bit_spin_lock(PG_nonewrefs, &page->flags); @@ -81,29 +104,58 @@ static inline void wait_on_page_ref(stru while (unlikely(test_bit(PG_nonewrefs, &page->flags))) cpu_relax(); } +#else // CONFIG_PREEMPT_RT +static inline void wait_on_page_ref(struct page *page) +{ + might_sleep(); + if (unlikely(PageNoNewRefs(page))) { + DEFINE_WAIT_BIT(wait, &page->flags, PG_nonewrefs); + __wait_on_bit(page_waitqueue(page), &wait, __sleep_on_page, + TASK_UNINTERRUPTIBLE); + } +} + +static inline void lock_page_ref(struct page *page) +{ + while (test_and_set_bit(PG_nonewrefs, &page->flags)) + wait_on_page_ref(page); + __acquire(bitlock); + smp_wmb(); +} + +static inline void unlock_page_ref(struct page *page) +{ + VM_BUG_ON(!PageNoNewRefs(page)); + smp_mb__before_clear_bit(); + ClearPageNoNewRefs(page); + smp_mb__after_clear_bit(); + __wake_up_bit(page_waitqueue(page), &page->flags, PG_nonewrefs); + __release(bitlock); +} +#endif // CONFIG_PREEMPT_RT #define lock_page_ref_irq(page) \ do { \ - local_irq_disable(); \ + local_irq_disable_nort(); \ lock_page_ref(page); \ } while (0) #define unlock_page_ref_irq(page) \ do { \ unlock_page_ref(page); \ - local_irq_enable(); \ + local_irq_enable_nort(); \ } while (0) #define lock_page_ref_irqsave(page, flags) \ do { \ - local_irq_save(flags); \ + local_irq_save_nort(flags); \ lock_page_ref(page); \ } while (0) #define unlock_page_ref_irqrestore(page, flags) \ do { \ unlock_page_ref(page); \ - local_irq_restore(flags); \ + local_irq_restore_nort(flags); \ } while (0) /* @@ -155,7 +207,7 @@ static inline int page_cache_get_specula { VM_BUG_ON(in_interrupt()); -#ifndef CONFIG_SMP +#if !defined(CONFIG_SMP) && !defined(CONFIG_PREEMPT_RT) # ifdef CONFIG_PREEMPT VM_BUG_ON(!in_atomic()); # endif Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -505,21 +505,10 @@ static int __sleep_on_page_lock(void *wo return 0; } -/* - * In order to wait for pages to become available there must be - * waitqueues associated with pages. 
By using a hash table of - * waitqueues where the bucket discipline is to maintain all - * waiters on the same queue and wake all when any of the pages - * become available, and for the woken contexts to check to be - * sure the appropriate page became available, this saves space - * at a cost of "thundering herd" phenomena during rare hash - * collisions. - */ -static wait_queue_head_t *page_waitqueue(struct page *page) +int __sleep_on_page(void *word) { - const struct zone *zone = page_zone(page); - - return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)]; + schedule(); + return 0; } static inline void wake_up_page(struct page *page, int bit) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kmap-atomic-prepare.patch�������������������������������������������������������������������0000664�0000764�0000764�00000011126�11041657730�016121� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� With the separation of pagefault_{disable,enable}() from the preempt_count a previously overlooked dependancy became painfully clear. kmap_atomic() is per cpu and relies not only on disabling the pagefault handler, but really needs preemption disabled too. make this explicit now - so that we can change pagefault_disable(). 
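As a rough user-space illustration of the dependency spelled out above, the sketch below models why a per-CPU mapping slot is only safe to use while the task cannot migrate. The names (toy_kmap_atomic, NR_CPUS_TOY, KM_SLOTS_TOY, the char-array slots) are invented for the example; the preempt counter here is pure bookkeeping, whereas in the kernel preempt_disable() really does keep the task on its CPU, and a fixmap PTE takes the place of the array entry.

#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

#define NR_CPUS_TOY  256
#define KM_SLOTS_TOY 4

static char slots[NR_CPUS_TOY][KM_SLOTS_TOY]; /* stand-in for the per-CPU fixmap slots */
static int preempt_count_toy;                 /* bookkeeping only in this model */

static void preempt_disable_toy(void)   { preempt_count_toy++; }
static void preempt_enable_toy(void)    { preempt_count_toy--; }
static void pagefault_disable_toy(void) { }   /* faults would take the fixup path */
static void pagefault_enable_toy(void)  { }

static char *toy_kmap_atomic(int type)
{
	/*
	 * Without preemption disabled the task could migrate between reading
	 * the CPU number and touching the slot, and two tasks could end up
	 * sharing one per-CPU mapping; making the preempt_disable() explicit
	 * keeps this safe once pagefault_disable() no longer implies it.
	 */
	preempt_disable_toy();
	pagefault_disable_toy();
	return &slots[sched_getcpu() % NR_CPUS_TOY][type];
}

static void toy_kunmap_atomic(char *addr)
{
	(void)addr;
	pagefault_enable_toy();
	preempt_enable_toy();
}

int main(void)
{
	char *p = toy_kmap_atomic(0);

	*p = 42;                               /* use the mapping */
	toy_kunmap_atomic(p);
	assert(preempt_count_toy == 0);
	printf("used slot 0 on cpu %d\n", sched_getcpu());
	return 0;
}

The ordering mirrors the hunks below: preempt_disable() before pagefault_disable() on the way in, and the reverse on the way out.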
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- arch/mips/mm/highmem.c | 5 ++++- arch/sparc/mm/highmem.c | 4 +++- arch/x86/mm/highmem_32.c | 4 +++- include/asm-frv/highmem.h | 2 ++ include/asm-ppc/highmem.h | 4 +++- 5 files changed, 15 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/mips/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/mm/highmem.c +++ linux-2.6.24.7/arch/mips/mm/highmem.c @@ -38,7 +38,7 @@ void *__kmap_atomic(struct page *page, e enum fixed_addresses idx; unsigned long vaddr; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); @@ -63,6 +63,7 @@ void __kunmap_atomic(void *kvaddr, enum if (vaddr < FIXADDR_START) { // FIXME pagefault_enable(); + preempt_enable(); return; } @@ -78,6 +79,7 @@ void __kunmap_atomic(void *kvaddr, enum #endif pagefault_enable(); + preempt_enable(); } /* @@ -89,6 +91,7 @@ void *kmap_atomic_pfn(unsigned long pfn, enum fixed_addresses idx; unsigned long vaddr; + preempt_disable(); pagefault_disable(); idx = type + KM_TYPE_NR*smp_processor_id(); Index: linux-2.6.24.7/arch/sparc/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/arch/sparc/mm/highmem.c +++ linux-2.6.24.7/arch/sparc/mm/highmem.c @@ -34,7 +34,7 @@ void *kmap_atomic(struct page *page, enu unsigned long idx; unsigned long vaddr; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); @@ -71,6 +71,7 @@ void kunmap_atomic(void *kvaddr, enum km if (vaddr < FIXADDR_START) { // FIXME pagefault_enable(); + preempt_enable(); return; } @@ -97,6 +98,7 @@ void kunmap_atomic(void *kvaddr, enum km #endif pagefault_enable(); + preempt_enable(); } /* We may be fed a pagetable here by ptep_to_xxx and others. 
*/ Index: linux-2.6.24.7/arch/x86/mm/highmem_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/highmem_32.c +++ linux-2.6.24.7/arch/x86/mm/highmem_32.c @@ -51,7 +51,7 @@ void *__kmap_atomic_prot(struct page *pa enum fixed_addresses idx; unsigned long vaddr; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) @@ -93,6 +93,7 @@ void __kunmap_atomic(void *kvaddr, enum arch_flush_lazy_mmu_mode(); pagefault_enable(); + preempt_enable(); } /* This is the same as kmap_atomic() but can map memory that doesn't @@ -103,6 +104,7 @@ void *__kmap_atomic_pfn(unsigned long pf enum fixed_addresses idx; unsigned long vaddr; + preempt_disable(); pagefault_disable(); idx = type + KM_TYPE_NR*smp_processor_id(); Index: linux-2.6.24.7/include/asm-frv/highmem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-frv/highmem.h +++ linux-2.6.24.7/include/asm-frv/highmem.h @@ -115,6 +115,7 @@ static inline void *kmap_atomic(struct p { unsigned long paddr; + preempt_disable(); pagefault_disable(); paddr = page_to_phys(page); @@ -171,6 +172,7 @@ static inline void kunmap_atomic(void *k BUG(); } pagefault_enable(); + preempt_enable(); } #endif /* !__ASSEMBLY__ */ Index: linux-2.6.24.7/include/asm-ppc/highmem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-ppc/highmem.h +++ linux-2.6.24.7/include/asm-ppc/highmem.h @@ -78,7 +78,7 @@ static inline void *kmap_atomic(struct p unsigned int idx; unsigned long vaddr; - /* even !CONFIG_PREEMPT needs this, for in_atomic in do_page_fault */ + preempt_disable(); pagefault_disable(); if (!PageHighMem(page)) return page_address(page); @@ -102,6 +102,7 @@ static inline void kunmap_atomic(void *k if (vaddr < KMAP_FIX_BEGIN) { // FIXME pagefault_enable(); + preempt_enable(); return; } @@ -115,6 +116,7 @@ static inline void kunmap_atomic(void *k flush_tlb_page(NULL, vaddr); #endif pagefault_enable(); + preempt_enable(); } static inline struct page *kmap_atomic_to_page(void *ptr) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/pagefault-disable-cleanup.patch�������������������������������������������������������������0000664�0000764�0000764�00000013533�11041673163�017263� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] clean up the page fault disabling logic From: Ingo Molnar <mingo@elte.hu> decouple the pagefault-disabled logic from the preempt count. 
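Purely as an illustration of the decoupling, the user-space model below has the fault path consult a per-task flag instead of the preempt count, which is what lets pagefault_disable()/pagefault_enable() stop touching the preempt count altogether. struct task_toy, fault_may_sleep() and the *_toy helpers are invented names; the real change is the current->pagefault_disabled counter added in mm/memory.c and tested by each arch fault handler in the hunks that follow.

#include <stdbool.h>
#include <stdio.h>

struct task_toy {
	int pagefault_disabled;   /* mirrors the new current->pagefault_disabled */
	int preempt_count;        /* no longer consulted for this decision */
};

static struct task_toy current_toy;

static void pagefault_disable_toy(void) { current_toy.pagefault_disabled++; }
static void pagefault_enable_toy(void)  { current_toy.pagefault_disabled--; }

/* the check each arch fault handler performs after this patch */
static bool fault_may_sleep(const struct task_toy *t, bool in_atomic)
{
	/* if (in_atomic() || !mm || current->pagefault_disabled) goto no_context; */
	return !in_atomic && !t->pagefault_disabled;
}

int main(void)
{
	printf("may sleep: %d\n", fault_may_sleep(&current_toy, false)); /* 1 */

	pagefault_disable_toy();
	printf("may sleep: %d\n", fault_may_sleep(&current_toy, false)); /* 0: fixup path */
	pagefault_enable_toy();

	printf("may sleep: %d\n", fault_may_sleep(&current_toy, false)); /* 1 again */
	return 0;
}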
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/arm/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 +- arch/powerpc/mm/fault.c | 2 +- arch/x86/mm/fault_32.c | 2 +- arch/x86/mm/fault_64.c | 2 +- include/linux/sched.h | 1 + include/linux/uaccess.h | 33 +++------------------------------ kernel/fork.c | 1 + mm/memory.c | 22 ++++++++++++++++++++++ 9 files changed, 32 insertions(+), 35 deletions(-) Index: linux-2.6.24.7/arch/arm/mm/fault.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mm/fault.c +++ linux-2.6.24.7/arch/arm/mm/fault.c @@ -229,7 +229,7 @@ do_page_fault(unsigned long addr, unsign * If we're in an interrupt or have no user * context, we must not take the fault.. */ - if (in_atomic() || !mm) + if (in_atomic() || !mm || current->pagefault_disabled) goto no_context; /* Index: linux-2.6.24.7/arch/mips/mm/fault.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/mm/fault.c +++ linux-2.6.24.7/arch/mips/mm/fault.c @@ -69,7 +69,7 @@ asmlinkage void do_page_fault(struct pt_ * If we're in an interrupt or have no user * context, we must not take the fault.. */ - if (in_atomic() || !mm) + if (in_atomic() || !mm || current->pagefault_disabled) goto bad_area_nosemaphore; down_read(&mm->mmap_sem); Index: linux-2.6.24.7/arch/powerpc/mm/fault.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/fault.c +++ linux-2.6.24.7/arch/powerpc/mm/fault.c @@ -184,7 +184,7 @@ int __kprobes do_page_fault(struct pt_re } #endif /* !(CONFIG_4xx || CONFIG_BOOKE)*/ - if (in_atomic() || mm == NULL) { + if (in_atomic() || mm == NULL || current->pagefault_disabled) { if (!user_mode(regs)) return SIGSEGV; /* in_atomic() in user mode is really bad, Index: linux-2.6.24.7/arch/x86/mm/fault_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/fault_32.c +++ linux-2.6.24.7/arch/x86/mm/fault_32.c @@ -358,7 +358,7 @@ fastcall void __kprobes do_page_fault(st * If we're in an interrupt, have no user context or are running in an * atomic region then we must not take the fault.. */ - if (in_atomic() || !mm) + if (in_atomic() || !mm || current->pagefault_disabled) goto bad_area_nosemaphore; /* When running in the kernel we expect faults to occur only to Index: linux-2.6.24.7/arch/x86/mm/fault_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/fault_64.c +++ linux-2.6.24.7/arch/x86/mm/fault_64.c @@ -369,7 +369,7 @@ asmlinkage void __kprobes do_page_fault( * If we're in an interrupt or have no user * context, we must not take the fault.. 
*/ - if (unlikely(in_atomic() || !mm)) + if (unlikely(in_atomic() || !mm || current->pagefault_disabled)) goto bad_area_nosemaphore; /* Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1179,6 +1179,7 @@ struct task_struct { /* mutex deadlock detection */ struct mutex_waiter *blocked_on; #endif + int pagefault_disabled; #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; int hardirqs_enabled; Index: linux-2.6.24.7/include/linux/uaccess.h =================================================================== --- linux-2.6.24.7.orig/include/linux/uaccess.h +++ linux-2.6.24.7/include/linux/uaccess.h @@ -6,37 +6,10 @@ /* * These routines enable/disable the pagefault handler in that - * it will not take any locks and go straight to the fixup table. - * - * They have great resemblance to the preempt_disable/enable calls - * and in fact they are identical; this is because currently there is - * no other way to make the pagefault handlers do this. So we do - * disable preemption but we don't necessarily care about that. + * it will not take any MM locks and go straight to the fixup table. */ -static inline void pagefault_disable(void) -{ - inc_preempt_count(); - /* - * make sure to have issued the store before a pagefault - * can hit. - */ - barrier(); -} - -static inline void pagefault_enable(void) -{ - /* - * make sure to issue those last loads/stores before enabling - * the pagefault handler again. - */ - barrier(); - dec_preempt_count(); - /* - * make sure we do.. - */ - barrier(); - preempt_check_resched(); -} +extern void pagefault_disable(void); +extern void pagefault_enable(void); #ifndef ARCH_HAS_NOCACHE_UACCESS Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1158,6 +1158,7 @@ static struct task_struct *copy_process( p->hardirq_context = 0; p->softirq_context = 0; #endif + p->pagefault_disabled = 0; #ifdef CONFIG_LOCKDEP p->lockdep_depth = 0; /* no locks held yet */ p->curr_chain_key = 0; Index: linux-2.6.24.7/mm/memory.c =================================================================== --- linux-2.6.24.7.orig/mm/memory.c +++ linux-2.6.24.7/mm/memory.c @@ -2613,6 +2613,28 @@ unlock: return 0; } +void pagefault_disable(void) +{ + current->pagefault_disabled++; + /* + * make sure to have issued the store before a pagefault + * can hit. + */ + barrier(); +} +EXPORT_SYMBOL(pagefault_disable); + +void pagefault_enable(void) +{ + /* + * make sure to issue those last loads/stores before enabling + * the pagefault handler again. 
+ */ + barrier(); + current->pagefault_disabled--; +} +EXPORT_SYMBOL(pagefault_enable); + /* * By the time we get here, we already hold the mm semaphore */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nommu-fix-build.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000002141�11041657733�015277� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From f575ccd685ef46c6abf88922f4df5cfb8283ae77 Mon Sep 17 00:00:00 2001 From: Sebastian Siewior <bigeasy@linutronix.de> Date: Fri, 18 Apr 2008 17:02:28 +0200 Subject: [PATCH] add CONFIG_MMU in uaccess.h because non-MMU arches don't need to enable or disable page faults. Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/uaccess.h | 5 +++++ 1 file changed, 5 insertions(+) Index: linux-2.6.24.7/include/linux/uaccess.h =================================================================== --- linux-2.6.24.7.orig/include/linux/uaccess.h +++ linux-2.6.24.7/include/linux/uaccess.h @@ -4,12 +4,17 @@ #include <linux/preempt.h> #include <asm/uaccess.h> +#ifdef CONFIG_MMU /* * These routines enable/disable the pagefault handler in that * it will not take any MM locks and go straight to the fixup table. */ extern void pagefault_disable(void); extern void pagefault_enable(void); +#else +static inline void pagefault_disable(void) { } +static inline void pagefault_enable(void) { } +#endif #ifndef ARCH_HAS_NOCACHE_UACCESS �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kmap-atomic-i386-fix.patch������������������������������������������������������������������0000664�0000764�0000764�00000002733�11041657735�015751� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/mm/highmem_32.c | 2 +- include/asm-x86/highmem.h | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/x86/mm/highmem_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/mm/highmem_32.c +++ linux-2.6.24.7/arch/x86/mm/highmem_32.c @@ -3,9 +3,9 @@ void *kmap(struct page *page) { - might_sleep(); if (!PageHighMem(page)) return page_address(page); + might_sleep(); return kmap_high(page); } Index: linux-2.6.24.7/include/asm-x86/highmem.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/highmem.h +++ linux-2.6.24.7/include/asm-x86/highmem.h 
@@ -94,10 +94,10 @@ struct page *kmap_atomic_to_page(void *p * on PREEMPT_RT kmap_atomic() is a wrapper that uses kmap(): */ #ifdef CONFIG_PREEMPT_RT -# define kmap_atomic_prot(page, type, prot) kmap(page) -# define kmap_atomic(page, type) kmap(page) +# define kmap_atomic_prot(page, type, prot) ({ pagefault_disable(); kmap(page); }) +# define kmap_atomic(page, type) ({ pagefault_disable(); kmap(page); }) # define kmap_atomic_pfn(pfn, type) kmap(pfn_to_page(pfn)) -# define kunmap_atomic(kvaddr, type) kunmap_virt(kvaddr) +# define kunmap_atomic(kvaddr, type) do { pagefault_enable(); kunmap_virt(kvaddr); } while(0) # define kmap_atomic_to_page(kvaddr) kmap_to_page(kvaddr) #else # define kmap_atomic_prot(page, type, prot) __kmap_atomic_prot(page, type, prot) �������������������������������������patches/select-error-leak-fix.patch�����������������������������������������������������������������0000664�0000764�0000764�00000002636�11041657735�016402� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������As it is currently written, sys_select checks its return code to convert ERESTARTNOHAND to EINTR. However, the check is within an if (tvp) clause, and so if select is called from userspace with a NULL timeval, then it is possible for the ERESTARTNOHAND errno to leak into userspace, which is incorrect. This patch moves that check outside of the conditional, and prevents the errno leak. Thanks & Regards Neil Signed-Off-By: Neil Horman <nhorman@tuxdriver.com> fs/select.c | 18 +++++------------- 1 file changed, 5 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/fs/select.c =================================================================== --- linux-2.6.24.7.orig/fs/select.c +++ linux-2.6.24.7/fs/select.c @@ -407,20 +407,12 @@ asmlinkage long sys_select(int n, fd_set rtv.tv_sec = timeout; if (timeval_compare(&rtv, &tv) >= 0) rtv = tv; - if (copy_to_user(tvp, &rtv, sizeof(rtv))) { -sticky: - /* - * If an application puts its timeval in read-only - * memory, we don't want the Linux-specific update to - * the timeval to cause a fault after the select has - * completed successfully. However, because we're not - * updating the timeval, we can't restart the system - * call. - */ - if (ret == -ERESTARTNOHAND) - ret = -EINTR; - } + if (copy_to_user(tvp, &rtv, sizeof(rtv))) + return -EFAULT; } +sticky: + if (ret == -ERESTARTNOHAND) + ret = -EINTR; return ret; } ��������������������������������������������������������������������������������������������������patches/fix-emergency-reboot.patch������������������������������������������������������������������0000664�0000764�0000764�00000003006�11041657734�016317� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] call reboot notifier list when doing an emergency reboot From: Ingo Molnar <mingo@elte.hu> my laptop does not reboot unless the shutdown notifiers are called first. 
So the following command, which i use as a fast way to reboot into a new kernel: echo b > /proc/sysrq-trigger just hangs indefinitely after the kernel prints "System rebooting". the thing is, that the kernel is actually reschedulable in this stage, so we could as well process the reboot_notifier_list. (furthermore, on -rt kernels this place is preemptable even during SysRq-b) So just process the reboot notifier list if we are preemptable. This will shut disk caches and chipsets off. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sys.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.24.7/kernel/sys.c =================================================================== --- linux-2.6.24.7.orig/kernel/sys.c +++ linux-2.6.24.7/kernel/sys.c @@ -32,6 +32,7 @@ #include <linux/getcpu.h> #include <linux/task_io_accounting_ops.h> #include <linux/seccomp.h> +#include <linux/hardirq.h> #include <linux/cpu.h> #include <linux/compat.h> @@ -265,6 +266,15 @@ out_unlock: */ void emergency_restart(void) { + /* + * Call the notifier chain if we are not in an + * atomic context: + */ +#ifdef CONFIG_PREEMPT + if (!in_atomic() && !irqs_disabled()) + blocking_notifier_call_chain(&reboot_notifier_list, + SYS_RESTART, NULL); +#endif machine_emergency_restart(); } EXPORT_SYMBOL_GPL(emergency_restart); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/timer-freq-tweaks.patch���������������������������������������������������������������������0000664�0000764�0000764�00000007402�11041657735�015641� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcutorture.c | 2 +- mm/slab.c | 25 +++++++++++++++---------- 2 files changed, 16 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/rcutorture.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcutorture.c +++ linux-2.6.24.7/kernel/rcutorture.c @@ -647,7 +647,7 @@ rcu_torture_reader(void *arg) if (p == NULL) { /* Wait for rcu_torture_writer to get underway */ cur_ops->readunlock(idx); - schedule_timeout_interruptible(HZ); + schedule_timeout_interruptible(round_jiffies_relative(HZ)); continue; } if (p->rtort_mbtest == 0) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -1051,7 +1051,7 @@ static int transfer_objects(struct array #ifndef CONFIG_NUMA #define drain_alien_cache(cachep, alien) do { } while (0) -#define reap_alien(cachep, l3, this_cpu) do { } while (0) +#define reap_alien(cachep, l3, this_cpu) 0 static inline struct array_cache **alloc_alien_cache(int node, int limit) { @@ -1149,7 +1149,7 @@ static void __drain_alien_cache(struct k /* * Called from cache_reap() to regularly drain alien caches round robin. 
*/ -static void +static int reap_alien(struct kmem_cache *cachep, struct kmem_list3 *l3, int *this_cpu) { int node = per_cpu(reap_node, *this_cpu); @@ -1160,8 +1160,10 @@ reap_alien(struct kmem_cache *cachep, st if (ac && ac->avail && spin_trylock_irq(&ac->lock)) { __drain_alien_cache(cachep, ac, node, this_cpu); spin_unlock_irq(&ac->lock); + return 1; } } + return 0; } static void drain_alien_cache(struct kmem_cache *cachep, @@ -2514,7 +2516,7 @@ static void check_spinlock_acquired_node #define check_spinlock_acquired_node(x, y) do { } while(0) #endif -static void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, +static int drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, struct array_cache *ac, int force, int node); @@ -4148,14 +4150,15 @@ static int enable_cpucache(struct kmem_c * Drain an array if it contains any elements taking the l3 lock only if * necessary. Note that the l3 listlock also protects the array_cache * if drain_array() is used on the shared array. + * returns non-zero if some work is done */ -void drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, +int drain_array(struct kmem_cache *cachep, struct kmem_list3 *l3, struct array_cache *ac, int force, int node) { int tofree, this_cpu; if (!ac || !ac->avail) - return; + return 0; if (ac->touched && !force) { ac->touched = 0; } else { @@ -4171,6 +4174,7 @@ void drain_array(struct kmem_cache *cach } slab_spin_unlock_irq(&l3->list_lock, this_cpu); } + return 1; } /** @@ -4208,10 +4212,10 @@ static void cache_reap(struct work_struc */ l3 = searchp->nodelists[node]; - reap_alien(searchp, l3, &this_cpu); + work_done += reap_alien(searchp, l3, &this_cpu); - drain_array(searchp, l3, cpu_cache_get(searchp, this_cpu), - 0, node); + work_done += drain_array(searchp, l3, + cpu_cache_get(searchp, this_cpu), 0, node); /* * These are racy checks but it does not matter @@ -4222,7 +4226,7 @@ static void cache_reap(struct work_struc l3->next_reap = jiffies + REAPTIMEOUT_LIST3; - drain_array(searchp, l3, l3->shared, 0, node); + work_done += drain_array(searchp, l3, l3->shared, 0, node); if (l3->free_touched) l3->free_touched = 0; @@ -4241,7 +4245,8 @@ next: next_reap_node(); out: /* Set up the next iteration */ - schedule_delayed_work(work, round_jiffies_relative(REAPTIMEOUT_CPUC)); + schedule_delayed_work(work, + round_jiffies_relative((1+!work_done) * REAPTIMEOUT_CPUC)); } #ifdef CONFIG_SLABINFO ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/highmem-revert-mainline.patch���������������������������������������������������������������0000664�0000764�0000764�00000001217�11041657730�017000� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- mm/highmem.c | 9 --------- 1 file changed, 9 deletions(-) Index: linux-2.6.24.7/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/mm/highmem.c +++ linux-2.6.24.7/mm/highmem.c @@ -104,15 +104,6 @@ static void flush_all_zero_pkmaps(void) flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); } -/* 
Flush all unused kmap mappings in order to remove stray - mappings. */ -void kmap_flush_unused(void) -{ - spin_lock(&kmap_lock); - flush_all_zero_pkmaps(); - spin_unlock(&kmap_lock); -} - static inline unsigned long map_new_virtual(struct page *page) { unsigned long vaddr; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/highmem_rewrite.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000037656�11041657734�015466� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: mm: remove kmap_lock Eradicate global locks. - kmap_lock is removed by extensive use of atomic_t and a new flush scheme. - pool_lock is removed by using the pkmap index for the page_address_maps and modifying set_page_address to only allow NULL<->virt transitions. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/mm.h | 32 ++- mm/highmem.c | 433 ++++++++++++++++++++++++++++++----------------------- 2 files changed, 276 insertions(+), 189 deletions(-) Index: linux-2.6.24.7/include/linux/mm.h =================================================================== --- linux-2.6.24.7.orig/include/linux/mm.h +++ linux-2.6.24.7/include/linux/mm.h @@ -543,23 +543,39 @@ static __always_inline void *lowmem_page #endif #if defined(WANT_PAGE_VIRTUAL) -#define page_address(page) ((page)->virtual) -#define set_page_address(page, address) \ - do { \ - (page)->virtual = (address); \ - } while(0) -#define page_address_init() do { } while(0) +/* + * wrap page->virtual so it is safe to set/read locklessly + */ +#define page_address(page) \ + ({ typeof((page)->virtual) v = (page)->virtual; \ + smp_read_barrier_depends(); \ + v; }) + +static inline int set_page_address(struct page *page, void *address) +{ + if (address) + return cmpxchg(&page->virtual, NULL, address) == NULL; + else { + /* + * cmpxchg is a bit abused because it is not guaranteed + * safe wrt direct assignment on all platforms. + */ + void *virt = page->virtual; + return cmpxchg(&page->vitrual, virt, NULL) == virt; + } +} +void page_address_init(void); #endif #if defined(HASHED_PAGE_VIRTUAL) void *page_address(struct page *page); -void set_page_address(struct page *page, void *virtual); +int set_page_address(struct page *page, void *virtual); void page_address_init(void); #endif #if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL) #define page_address(page) lowmem_page_address(page) -#define set_page_address(page, address) do { } while(0) +#define set_page_address(page, address) (0) #define page_address_init() do { } while(0) #endif Index: linux-2.6.24.7/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/mm/highmem.c +++ linux-2.6.24.7/mm/highmem.c @@ -14,6 +14,11 @@ * based on Linus' idea. 
* * Copyright (C) 1999 Ingo Molnar <mingo@redhat.com> + * + * Largely rewritten to get rid of all global locks + * + * Copyright (C) 2006 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com> + * */ #include <linux/mm.h> @@ -27,18 +32,14 @@ #include <linux/hash.h> #include <linux/highmem.h> #include <linux/blktrace_api.h> + #include <asm/tlbflush.h> +#include <asm/pgtable.h> -/* - * Virtual_count is not a pure "count". - * 0 means that it is not mapped, and has not been mapped - * since a TLB flush - it is usable. - * 1 means that there are no users, but it has been mapped - * since the last TLB flush - so we can't use it. - * n means that there are (n-1) current users of it. - */ #ifdef CONFIG_HIGHMEM +static int __set_page_address(struct page *page, void *virtual, int pos); + unsigned long totalhigh_pages __read_mostly; unsigned int nr_free_highpages (void) @@ -58,164 +59,208 @@ unsigned int nr_free_highpages (void) return pages; } -static int pkmap_count[LAST_PKMAP]; -static unsigned int last_pkmap_nr; -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kmap_lock); +/* + * count is not a pure "count". + * 0 means its owned exclusively by someone + * 1 means its free for use - either mapped or not. + * n means that there are (n-1) current users of it. + */ +static atomic_t pkmap_count[LAST_PKMAP]; +static atomic_t pkmap_hand; pte_t * pkmap_page_table; static DECLARE_WAIT_QUEUE_HEAD(pkmap_map_wait); -static void flush_all_zero_pkmaps(void) +/* + * Try to free a given kmap slot. + * + * Returns: + * -1 - in use + * 0 - free, no TLB flush needed + * 1 - free, needs TLB flush + */ +static int pkmap_try_free(int pos) { - int i; - - flush_cache_kmaps(); + if (atomic_cmpxchg(&pkmap_count[pos], 1, 0) != 1) + return -1; - for (i = 0; i < LAST_PKMAP; i++) { - struct page *page; + /* + * TODO: add a young bit to make it CLOCK + */ + if (!pte_none(pkmap_page_table[pos])) { + struct page *page = pte_page(pkmap_page_table[pos]); + unsigned long addr = PKMAP_ADDR(pos); + pte_t *ptep = &pkmap_page_table[pos]; + + VM_BUG_ON(addr != (unsigned long)page_address(page)); + + if (!__set_page_address(page, NULL, pos)) + BUG(); + flush_kernel_dcache_page(page); + pte_clear(&init_mm, addr, ptep); - /* - * zero means we don't have anything to do, - * >1 means that it is still in use. Only - * a count of 1 means that it is free but - * needs to be unmapped - */ - if (pkmap_count[i] != 1) - continue; - pkmap_count[i] = 0; + return 1; + } - /* sanity check */ - BUG_ON(pte_none(pkmap_page_table[i])); + return 0; +} - /* - * Don't need an atomic fetch-and-clear op here; - * no-one has the page mapped, and cannot get at - * its virtual address (and hence PTE) without first - * getting the kmap_lock (which is held here). - * So no dangers, even with speculative execution. 
- */ - page = pte_page(pkmap_page_table[i]); - pte_clear(&init_mm, (unsigned long)page_address(page), - &pkmap_page_table[i]); +static inline void pkmap_put(atomic_t *counter) +{ + switch (atomic_dec_return(counter)) { + case 0: + BUG(); - set_page_address(page, NULL); + case 1: + wake_up(&pkmap_map_wait); } - flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); } -static inline unsigned long map_new_virtual(struct page *page) +#define TLB_BATCH 32 + +static int pkmap_get_free(void) { - unsigned long vaddr; - int count; + int i, pos, flush; + DECLARE_WAITQUEUE(wait, current); -start: - count = LAST_PKMAP; - /* Find an empty entry */ - for (;;) { - last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK; - if (!last_pkmap_nr) { - flush_all_zero_pkmaps(); - count = LAST_PKMAP; - } - if (!pkmap_count[last_pkmap_nr]) - break; /* Found a usable entry */ - if (--count) - continue; +restart: + for (i = 0; i < LAST_PKMAP; i++) { + pos = atomic_inc_return(&pkmap_hand) % LAST_PKMAP; + flush = pkmap_try_free(pos); + if (flush >= 0) + goto got_one; + } + + /* + * wait for somebody else to unmap their entries + */ + __set_current_state(TASK_UNINTERRUPTIBLE); + add_wait_queue(&pkmap_map_wait, &wait); + schedule(); + remove_wait_queue(&pkmap_map_wait, &wait); + + goto restart; + +got_one: + if (flush) { +#if 0 + flush_tlb_kernel_range(PKMAP_ADDR(pos), PKMAP_ADDR(pos+1)); +#else + int pos2 = (pos + 1) % LAST_PKMAP; + int nr; + int entries[TLB_BATCH]; /* - * Sleep for somebody else to unmap their entries + * For those architectures that cannot help but flush the + * whole TLB, flush some more entries to make it worthwhile. + * Scan ahead of the hand to minimise search distances. */ - { - DECLARE_WAITQUEUE(wait, current); + for (i = 0, nr = 0; i < LAST_PKMAP && nr < TLB_BATCH; + i++, pos2 = (pos2 + 1) % LAST_PKMAP) { - __set_current_state(TASK_UNINTERRUPTIBLE); - add_wait_queue(&pkmap_map_wait, &wait); - spin_unlock(&kmap_lock); - schedule(); - remove_wait_queue(&pkmap_map_wait, &wait); - spin_lock(&kmap_lock); - - /* Somebody else might have mapped it while we slept */ - if (page_address(page)) - return (unsigned long)page_address(page); + flush = pkmap_try_free(pos2); + if (flush < 0) + continue; + + if (!flush) { + atomic_t *counter = &pkmap_count[pos2]; + VM_BUG_ON(atomic_read(counter) != 0); + atomic_set(counter, 2); + pkmap_put(counter); + } else + entries[nr++] = pos2; + } + flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP)); - /* Re-start */ - goto start; + for (i = 0; i < nr; i++) { + atomic_t *counter = &pkmap_count[entries[i]]; + VM_BUG_ON(atomic_read(counter) != 0); + atomic_set(counter, 2); + pkmap_put(counter); } +#endif } - vaddr = PKMAP_ADDR(last_pkmap_nr); - set_pte_at(&init_mm, vaddr, - &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot)); + return pos; +} + +static unsigned long pkmap_insert(struct page *page) +{ + int pos = pkmap_get_free(); + unsigned long vaddr = PKMAP_ADDR(pos); + pte_t *ptep = &pkmap_page_table[pos]; + pte_t entry = mk_pte(page, kmap_prot); + atomic_t *counter = &pkmap_count[pos]; + + VM_BUG_ON(atomic_read(counter) != 0); - pkmap_count[last_pkmap_nr] = 1; - set_page_address(page, (void *)vaddr); + set_pte_at(&init_mm, vaddr, ptep, entry); + if (unlikely(!__set_page_address(page, (void *)vaddr, pos))) { + /* + * concurrent pkmap_inserts for this page - + * the other won the race, release this entry. + * + * we can still clear the pte without a tlb flush since + * it couldn't have been used yet. 
+ */ + pte_clear(&init_mm, vaddr, ptep); + VM_BUG_ON(atomic_read(counter) != 0); + atomic_set(counter, 2); + pkmap_put(counter); + vaddr = 0; + } else + atomic_set(counter, 2); return vaddr; } -void fastcall *kmap_high(struct page *page) +fastcall void *kmap_high(struct page *page) { unsigned long vaddr; - - /* - * For highmem pages, we can't trust "virtual" until - * after we have the lock. - * - * We cannot call this from interrupts, as it may block - */ - spin_lock(&kmap_lock); +again: vaddr = (unsigned long)page_address(page); + if (vaddr) { + atomic_t *counter = &pkmap_count[PKMAP_NR(vaddr)]; + if (atomic_inc_not_zero(counter)) { + /* + * atomic_inc_not_zero implies a (memory) barrier on success + * so page address will be reloaded. + */ + unsigned long vaddr2 = (unsigned long)page_address(page); + if (likely(vaddr == vaddr2)) + return (void *)vaddr; + + /* + * Oops, we got someone else. + * + * This can happen if we get preempted after + * page_address() and before atomic_inc_not_zero() + * and during that preemption this slot is freed and + * reused. + */ + pkmap_put(counter); + goto again; + } + } + + vaddr = pkmap_insert(page); if (!vaddr) - vaddr = map_new_virtual(page); - pkmap_count[PKMAP_NR(vaddr)]++; - BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2); - spin_unlock(&kmap_lock); - return (void*) vaddr; + goto again; + + return (void *)vaddr; } EXPORT_SYMBOL(kmap_high); -void fastcall kunmap_high(struct page *page) +fastcall void kunmap_high(struct page *page) { - unsigned long vaddr; - unsigned long nr; - int need_wakeup; - - spin_lock(&kmap_lock); - vaddr = (unsigned long)page_address(page); + unsigned long vaddr = (unsigned long)page_address(page); BUG_ON(!vaddr); - nr = PKMAP_NR(vaddr); - - /* - * A count must never go down to zero - * without a TLB flush! - */ - need_wakeup = 0; - switch (--pkmap_count[nr]) { - case 0: - BUG(); - case 1: - /* - * Avoid an unnecessary wake_up() function call. - * The common case is pkmap_count[] == 1, but - * no waiters. - * The tasks queued in the wait-queue are guarded - * by both the lock in the wait-queue-head and by - * the kmap_lock. As the kmap_lock is held here, - * no need for the wait-queue-head's lock. Simply - * test if the queue is empty. - */ - need_wakeup = waitqueue_active(&pkmap_map_wait); - } - spin_unlock(&kmap_lock); - - /* do wake-up, if needed, race-free outside of the spin lock */ - if (need_wakeup) - wake_up(&pkmap_map_wait); + pkmap_put(&pkmap_count[PKMAP_NR(vaddr)]); } EXPORT_SYMBOL(kunmap_high); + #endif #if defined(HASHED_PAGE_VIRTUAL) @@ -223,19 +268,13 @@ EXPORT_SYMBOL(kunmap_high); #define PA_HASH_ORDER 7 /* - * Describes one page->virtual association + * Describes one page->virtual address association. */ -struct page_address_map { +static struct page_address_map { struct page *page; void *virtual; struct list_head list; -}; - -/* - * page_address_map freelist, allocated from page_address_maps. 
- */ -static struct list_head page_address_pool; /* freelist */ -static spinlock_t pool_lock; /* protects page_address_pool */ +} page_address_maps[LAST_PKMAP]; /* * Hash table bucket @@ -250,91 +289,123 @@ static struct page_address_slot *page_sl return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)]; } -void *page_address(struct page *page) +static void *__page_address(struct page_address_slot *pas, struct page *page) { - unsigned long flags; - void *ret; - struct page_address_slot *pas; - - if (!PageHighMem(page)) - return lowmem_page_address(page); + void *ret = NULL; - pas = page_slot(page); - ret = NULL; - spin_lock_irqsave(&pas->lock, flags); if (!list_empty(&pas->lh)) { struct page_address_map *pam; list_for_each_entry(pam, &pas->lh, list) { if (pam->page == page) { ret = pam->virtual; - goto done; + break; } } } -done: + + return ret; +} + +void *page_address(struct page *page) +{ + unsigned long flags; + void *ret; + struct page_address_slot *pas; + + if (!PageHighMem(page)) + return lowmem_page_address(page); + + pas = page_slot(page); + spin_lock_irqsave(&pas->lock, flags); + ret = __page_address(pas, page); spin_unlock_irqrestore(&pas->lock, flags); return ret; } EXPORT_SYMBOL(page_address); -void set_page_address(struct page *page, void *virtual) +static int __set_page_address(struct page *page, void *virtual, int pos) { + int ret = 0; unsigned long flags; struct page_address_slot *pas; struct page_address_map *pam; - BUG_ON(!PageHighMem(page)); + VM_BUG_ON(!PageHighMem(page)); + VM_BUG_ON(atomic_read(&pkmap_count[pos]) != 0); + VM_BUG_ON(pos < 0 || pos >= LAST_PKMAP); pas = page_slot(page); - if (virtual) { /* Add */ - BUG_ON(list_empty(&page_address_pool)); + pam = &page_address_maps[pos]; - spin_lock_irqsave(&pool_lock, flags); - pam = list_entry(page_address_pool.next, - struct page_address_map, list); - list_del(&pam->list); - spin_unlock_irqrestore(&pool_lock, flags); - - pam->page = page; - pam->virtual = virtual; - - spin_lock_irqsave(&pas->lock, flags); - list_add_tail(&pam->list, &pas->lh); - spin_unlock_irqrestore(&pas->lock, flags); - } else { /* Remove */ - spin_lock_irqsave(&pas->lock, flags); - list_for_each_entry(pam, &pas->lh, list) { - if (pam->page == page) { - list_del(&pam->list); - spin_unlock_irqrestore(&pas->lock, flags); - spin_lock_irqsave(&pool_lock, flags); - list_add_tail(&pam->list, &page_address_pool); - spin_unlock_irqrestore(&pool_lock, flags); - goto done; - } + spin_lock_irqsave(&pas->lock, flags); + if (virtual) { /* add */ + VM_BUG_ON(!list_empty(&pam->list)); + + if (!__page_address(pas, page)) { + pam->page = page; + pam->virtual = virtual; + list_add_tail(&pam->list, &pas->lh); + ret = 1; + } + } else { /* remove */ + if (!list_empty(&pam->list)) { + list_del_init(&pam->list); + ret = 1; } - spin_unlock_irqrestore(&pas->lock, flags); } -done: - return; + spin_unlock_irqrestore(&pas->lock, flags); + + return ret; } -static struct page_address_map page_address_maps[LAST_PKMAP]; +int set_page_address(struct page *page, void *virtual) +{ + /* + * set_page_address is not supposed to be called when using + * hashed virtual addresses. 
+ */ + BUG(); + return 0; +} -void __init page_address_init(void) +void __init __page_address_init(void) { int i; - INIT_LIST_HEAD(&page_address_pool); for (i = 0; i < ARRAY_SIZE(page_address_maps); i++) - list_add(&page_address_maps[i].list, &page_address_pool); + INIT_LIST_HEAD(&page_address_maps[i].list); + for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) { INIT_LIST_HEAD(&page_address_htable[i].lh); spin_lock_init(&page_address_htable[i].lock); } - spin_lock_init(&pool_lock); +} + +#elif defined (CONFIG_HIGHMEM) /* HASHED_PAGE_VIRTUAL */ + +static int __set_page_address(struct page *page, void *virtual, int pos) +{ + return set_page_address(page, virtual); } #endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */ + +#if defined(CONFIG_HIGHMEM) || defined(HASHED_PAGE_VIRTUAL) + +void __init page_address_init(void) +{ +#ifdef CONFIG_HIGHMEM + int i; + + for (i = 0; i < ARRAY_SIZE(pkmap_count); i++) + atomic_set(&pkmap_count[i], 1); +#endif + +#ifdef HASHED_PAGE_VIRTUAL + __page_address_init(); +#endif +} + +#endif ����������������������������������������������������������������������������������patches/highmem-redo-mainline.patch�����������������������������������������������������������������0000664�0000764�0000764�00000001022�11041657734�016420� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- mm/highmem.c | 8 ++++++++ 1 file changed, 8 insertions(+) Index: linux-2.6.24.7/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/mm/highmem.c +++ linux-2.6.24.7/mm/highmem.c @@ -214,6 +214,14 @@ static unsigned long pkmap_insert(struct return vaddr; } +/* + * Flush all unused kmap mappings in order to remove stray mappings. + */ +void kmap_flush_unused(void) +{ + WARN_ON_ONCE(1); +} + fastcall void *kmap_high(struct page *page) { unsigned long vaddr; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-kmap-scale-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000014454�11041657731�015517� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Hi Ingo, Apply on top of what is still in -rt. This seems to survive a kbuild -j64 & -j512 (although with that latter the machine goes off for a while, but does return with a kernel). If you can spare a cycle between hacking syslets and -rt, could you have a look at the logic this patch adds? --- Solve 2 deadlocks in the current kmap code. 
1) akpm spotted a race in the waitqueue usage that could deadlock the machine. The very unlikely scenario was that we would not find a usable map in LAST_PKMAP tries, but right before we hit schedule the very last map gets returned. Solve this by keeping a free count. 2) akpm told about the kmap deadlock where multiple processes each require 2 maps (src, dst). When they deplete the maps for the src maps they will be stuck waiting for their dst maps. Solve this by tracking (and limiting) kmap users and accounting two maps for each. This all adds more atomic globals; these will bounce like mad on really large SMP systems. (perhaps add some __cacheline_aligned_on_smp) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/sched.h | 1 mm/highmem.c | 96 ++++++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 87 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1493,6 +1493,7 @@ static inline void put_task_struct(struc #define PF_MEMALLOC 0x00000800 /* Allocating memory */ #define PF_FLUSHER 0x00001000 /* responsible for disk writeback */ #define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */ +#define PF_KMAP 0x00004000 /* this context has a kmap */ #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */ #define PF_FROZEN 0x00010000 /* frozen for system suspend */ #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */ Index: linux-2.6.24.7/mm/highmem.c =================================================================== --- linux-2.6.24.7.orig/mm/highmem.c +++ linux-2.6.24.7/mm/highmem.c @@ -32,6 +32,7 @@ #include <linux/hash.h> #include <linux/highmem.h> #include <linux/blktrace_api.h> +#include <linux/hardirq.h> #include <asm/tlbflush.h> #include <asm/pgtable.h> @@ -67,10 +68,12 @@ unsigned int nr_free_highpages (void) */ static atomic_t pkmap_count[LAST_PKMAP]; static atomic_t pkmap_hand; +static atomic_t pkmap_free; +static atomic_t pkmap_users; pte_t * pkmap_page_table; -static DECLARE_WAIT_QUEUE_HEAD(pkmap_map_wait); +static DECLARE_WAIT_QUEUE_HEAD(pkmap_wait); /* * Try to free a given kmap slot.
@@ -85,6 +88,7 @@ static int pkmap_try_free(int pos) if (atomic_cmpxchg(&pkmap_count[pos], 1, 0) != 1) return -1; + atomic_dec(&pkmap_free); /* * TODO: add a young bit to make it CLOCK */ @@ -113,7 +117,8 @@ static inline void pkmap_put(atomic_t *c BUG(); case 1: - wake_up(&pkmap_map_wait); + atomic_inc(&pkmap_free); + wake_up(&pkmap_wait); } } @@ -122,11 +127,10 @@ static inline void pkmap_put(atomic_t *c static int pkmap_get_free(void) { int i, pos, flush; - DECLARE_WAITQUEUE(wait, current); restart: for (i = 0; i < LAST_PKMAP; i++) { - pos = atomic_inc_return(&pkmap_hand) % LAST_PKMAP; + pos = atomic_inc_return(&pkmap_hand) & LAST_PKMAP_MASK; flush = pkmap_try_free(pos); if (flush >= 0) goto got_one; @@ -135,10 +139,8 @@ restart: /* * wait for somebody else to unmap their entries */ - __set_current_state(TASK_UNINTERRUPTIBLE); - add_wait_queue(&pkmap_map_wait, &wait); - schedule(); - remove_wait_queue(&pkmap_map_wait, &wait); + if (likely(!in_interrupt())) + wait_event(pkmap_wait, atomic_read(&pkmap_free) != 0); goto restart; @@ -147,7 +149,7 @@ got_one: #if 0 flush_tlb_kernel_range(PKMAP_ADDR(pos), PKMAP_ADDR(pos+1)); #else - int pos2 = (pos + 1) % LAST_PKMAP; + int pos2 = (pos + 1) & LAST_PKMAP_MASK; int nr; int entries[TLB_BATCH]; @@ -157,7 +159,7 @@ got_one: * Scan ahead of the hand to minimise search distances. */ for (i = 0, nr = 0; i < LAST_PKMAP && nr < TLB_BATCH; - i++, pos2 = (pos2 + 1) % LAST_PKMAP) { + i++, pos2 = (pos2 + 1) & LAST_PKMAP_MASK) { flush = pkmap_try_free(pos2); if (flush < 0) @@ -222,9 +224,79 @@ void kmap_flush_unused(void) WARN_ON_ONCE(1); } +/* + * Avoid starvation deadlock by limiting the number of tasks that can obtain a + * kmap to (LAST_PKMAP - KM_TYPE_NR*NR_CPUS)/2. + */ +static void kmap_account(void) +{ + int weight; + +#ifndef CONFIG_PREEMPT_RT + if (in_interrupt()) { + /* irqs can always get them */ + weight = -1; + } else +#endif + if (current->flags & PF_KMAP) { + current->flags &= ~PF_KMAP; + /* we already accounted the second */ + weight = 0; + } else { + /* mark 1, account 2 */ + current->flags |= PF_KMAP; + weight = 2; + } + + if (weight > 0) { + /* + * reserve KM_TYPE_NR maps per CPU for interrupt context + */ + const int target = LAST_PKMAP +#ifndef CONFIG_PREEMPT_RT + - KM_TYPE_NR*NR_CPUS +#endif + ; + +again: + wait_event(pkmap_wait, + atomic_read(&pkmap_users) + weight <= target); + + if (atomic_add_return(weight, &pkmap_users) > target) { + atomic_sub(weight, &pkmap_users); + goto again; + } + } +} + +static void kunmap_account(void) +{ + int weight; + +#ifndef CONFIG_PREEMPT_RT + if (in_irq()) { + weight = -1; + } else +#endif + if (current->flags & PF_KMAP) { + /* there was only 1 kmap, un-account both */ + current->flags &= ~PF_KMAP; + weight = 2; + } else { + /* there were two kmaps, un-account per kunmap */ + weight = 1; + } + + if (weight > 0) + atomic_sub(weight, &pkmap_users); + wake_up(&pkmap_wait); +} + fastcall void *kmap_high(struct page *page) { unsigned long vaddr; + + kmap_account(); again: vaddr = (unsigned long)page_address(page); if (vaddr) { @@ -265,6 +337,7 @@ fastcall void kunmap_high(struct page *p unsigned long vaddr = (unsigned long)page_address(page); BUG_ON(!vaddr); pkmap_put(&pkmap_count[PKMAP_NR(vaddr)]); + kunmap_account(); } EXPORT_SYMBOL(kunmap_high); @@ -409,6 +482,9 @@ void __init page_address_init(void) for (i = 0; i < ARRAY_SIZE(pkmap_count); i++) atomic_set(&pkmap_count[i], 1); + atomic_set(&pkmap_hand, 0); + atomic_set(&pkmap_free, LAST_PKMAP); + atomic_set(&pkmap_users, 0); #endif #ifdef 
HASHED_PAGE_VIRTUAL ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/pause-on-oops-head-tail.patch���������������������������������������������������������������0000664�0000764�0000764�00000007450�11041657732�016625� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] introduce pause_on_oops_head/tail boot options From: Ingo Molnar <mingo@elte.hu> if a system crashes with hard to debug oopses which scroll off the screen then it's useful to stop the crash right after the register info or right after the callback printout. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/traps_32.c | 6 +++++ arch/x86/kernel/traps_64.c | 2 + include/linux/kernel.h | 4 +++ kernel/panic.c | 49 ++++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 60 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -269,8 +269,14 @@ static void show_stack_log_lvl(struct ta printk("\n%s ", log_lvl); printk("%08lx ", *stack++); } + + pause_on_oops_head(); + printk("\n%sCall Trace:\n", log_lvl); show_trace_log_lvl(task, regs, esp, log_lvl); + + pause_on_oops_tail(); + debug_show_held_locks(task); } Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -349,9 +349,11 @@ static const struct stacktrace_ops print void show_trace(struct task_struct *tsk, struct pt_regs *regs, unsigned long *stack) { + pause_on_oops_head(); printk("\nCall Trace:\n"); dump_trace(tsk, regs, stack, &print_trace_ops, NULL); printk("\n"); + pause_on_oops_tail(); print_preempt_trace(tsk); } Index: linux-2.6.24.7/include/linux/kernel.h =================================================================== --- linux-2.6.24.7.orig/include/linux/kernel.h +++ linux-2.6.24.7/include/linux/kernel.h @@ -229,6 +229,10 @@ extern void wake_up_klogd(void); extern int oops_in_progress; /* If set, an oops, panic(), BUG() or die() is in progress */ extern int panic_timeout; extern int panic_on_oops; + +extern void pause_on_oops_head(void); +extern void pause_on_oops_tail(void); + extern int panic_on_unrecovered_nmi; extern int tainted; extern const char *print_tainted(void); Index: linux-2.6.24.7/kernel/panic.c =================================================================== --- linux-2.6.24.7.orig/kernel/panic.c +++ linux-2.6.24.7/kernel/panic.c @@ -27,7 +27,38 @@ static int pause_on_oops; static int pause_on_oops_flag; static DEFINE_SPINLOCK(pause_on_oops_lock); -int panic_timeout; +/* + * Debugging helper: freeze all console output after printing the + * first oops's head (or tail): + */ +static int pause_on_oops_head_flag = 0; +static int pause_on_oops_tail_flag = 0; + +static void pause_on_oops_loop(int flag) +{ + switch (flag) { + default: + break; + case 1: + for (;;) + local_irq_disable(); + case 2: + for (;;) 
+ local_irq_enable(); + } +} + +void pause_on_oops_head(void) +{ + pause_on_oops_loop(pause_on_oops_head_flag); +} + +void pause_on_oops_tail(void) +{ + pause_on_oops_loop(pause_on_oops_tail_flag); +} + +int panic_timeout __read_mostly; ATOMIC_NOTIFIER_HEAD(panic_notifier_list); @@ -190,6 +221,22 @@ static int __init pause_on_oops_setup(ch } __setup("pause_on_oops=", pause_on_oops_setup); +static int __init pause_on_oops_head_setup(char *str) +{ + pause_on_oops_head_flag = simple_strtoul(str, NULL, 0); + printk(KERN_INFO "pause_on_oops_head: %d\n", pause_on_oops_head_flag); + return 1; +} +__setup("pause_on_oops_head=", pause_on_oops_head_setup); + +static int __init pause_on_oops_tail_setup(char *str) +{ + pause_on_oops_tail_flag = simple_strtoul(str, NULL, 0); + printk(KERN_INFO "pause_on_oops_tail: %d\n", pause_on_oops_tail_flag); + return 1; +} +__setup("pause_on_oops_tail=", pause_on_oops_tail_setup); + static void spin_msec(int msecs) { int i; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/i386-nmi-watchdog-show-regs.patch�����������������������������������������������������������0000664�0000764�0000764�00000001015�11041657732�017245� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/nmi_32.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -392,7 +392,7 @@ nmi_watchdog_tick(struct pt_regs * regs, spin_lock(&lock); printk("NMI backtrace for cpu %d\n", cpu); - dump_stack(); + show_regs(regs); spin_unlock(&lock); cpu_clear(cpu, backtrace_mask); } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86-64-traps-move-held-locks-output.patch���������������������������������������������������0000664�0000764�0000764�00000001324�11041657733�020615� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/traps_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -319,7 +319,6 @@ print_trace_warning_symbol(void 
*data, c { print_symbol(msg, symbol); printk("\n"); - debug_show_held_locks(tsk); } static void print_trace_warning(void *data, char *msg) @@ -354,6 +353,7 @@ show_trace(struct task_struct *tsk, stru dump_trace(tsk, regs, stack, &print_trace_ops, NULL); printk("\n"); pause_on_oops_tail(); + debug_show_held_locks(tsk); print_preempt_trace(tsk); } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86-64-tscless-vgettimeofday.patch����������������������������������������������������������0000664�0000764�0000764�00000002500�11041673160�017451� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] x86_64 GTOD: offer scalable vgettimeofday From: Ingo Molnar <mingo@elte.hu> offer scalable vgettimeofday independently of whether the TSC is synchronous or not. Off by default. this patch also fixes an SMP bug in sys_vtime(): we should read __vsyscall_gtod_data.wall_time_tv.tv_sec only once. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/kernel/vsyscall_64.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c @@ -119,6 +119,25 @@ static __always_inline void do_vgettimeo unsigned seq; unsigned long mult, shift, nsec; cycle_t (*vread)(void); + + if (likely(__vsyscall_gtod_data.sysctl_enabled == 2)) { + struct timeval tmp; + + do { + barrier(); + tv->tv_sec = __vsyscall_gtod_data.wall_time_sec; + tv->tv_usec = __vsyscall_gtod_data.wall_time_nsec; + barrier(); + tmp.tv_sec = __vsyscall_gtod_data.wall_time_sec; + tmp.tv_usec = __vsyscall_gtod_data.wall_time_nsec; + + } while (tmp.tv_usec != tv->tv_usec || + tmp.tv_sec != tv->tv_sec); + + tv->tv_usec /= NSEC_PER_USEC; + return; + } + do { seq = read_seqbegin(&__vsyscall_gtod_data.lock); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/Add-dev-rmem-device-driver-for-real-time-JVM-testing.patch����������������������������������0000664�0000764�0000764�00000013301�11041657731�023757� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Add /dev/rmem device driver for real-time JVM testing From: Theodore Ts'o <tytso@mit.edu> This kernel modules is needed for use by the TCK conformance test which tests the JVM's RTSJ implementation. 
Unfortunately, RTSJ requires that Java programs have direct access to physical memory, and /dev/mem does not allow mmap to work to anything beyond I/O mapped memory regions on the x86 platform. Since this is a spectacularly bad idea (so much for write once, debug everywehere) and could potentially destablize the kernel, set the TAINT_USER flag if available. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- drivers/char/Kconfig | 11 ++++ drivers/char/Makefile | 1 drivers/char/rmem.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 146 insertions(+) Index: linux-2.6.24.7/drivers/char/Kconfig =================================================================== --- linux-2.6.24.7.orig/drivers/char/Kconfig +++ linux-2.6.24.7/drivers/char/Kconfig @@ -1072,6 +1072,17 @@ config TELCLOCK /sys/devices/platform/telco_clock, with a number of files for controlling the behavior of this hardware. +config RMEM + tristate "Access to physical memory via /dev/rmem" + default m + help + The /dev/mem device only allows mmap() memory available to + I/O mapped memory; it does not allow access to "real" + physical memory. The /dev/rmem device is a hack which does + allow access to physical memory. We use this instead of + patching /dev/mem because we don't expect this functionality + to ever be accepted into mainline. + config DEVPORT bool depends on !M68K Index: linux-2.6.24.7/drivers/char/Makefile =================================================================== --- linux-2.6.24.7.orig/drivers/char/Makefile +++ linux-2.6.24.7/drivers/char/Makefile @@ -98,6 +98,7 @@ obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio obj-$(CONFIG_GPIO_VR41XX) += vr41xx_giu.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_RMEM) += rmem.o obj-$(CONFIG_MWAVE) += mwave/ obj-$(CONFIG_AGP) += agp/ Index: linux-2.6.24.7/drivers/char/rmem.c =================================================================== --- /dev/null +++ linux-2.6.24.7/drivers/char/rmem.c @@ -0,0 +1,134 @@ +/* + * Rmem - REALLY simple memory mapping demonstration. + * + * Copyright (C) 2005 by Theodore Ts'o + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/init.h> + +#include <linux/kernel.h> +#include <linux/slab.h> +#include <linux/fs.h> +#include <linux/errno.h> +#include <linux/types.h> +#include <linux/mm.h> +#include <linux/kdev_t.h> +#include <asm/page.h> +#include <linux/cdev.h> +#include <linux/device.h> + +static int rmem_major = 0; +module_param(rmem_major, int, 0444); + +static struct class *rmem_class; + +MODULE_AUTHOR("Theodore Ts'o"); +MODULE_LICENSE("GPL"); + +struct page *rmem_vma_nopage(struct vm_area_struct *vma, + unsigned long address, int *type) +{ + struct page *pageptr; + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + unsigned long physaddr = address - vma->vm_start + offset; + unsigned long pageframe = physaddr >> PAGE_SHIFT; + + if (!pfn_valid(pageframe)) + return NOPAGE_SIGBUS; + pageptr = pfn_to_page(pageframe); + get_page(pageptr); + if (type) + *type = VM_FAULT_MINOR; + return pageptr; +} + +static struct vm_operations_struct rmem_nopage_vm_ops = { + .nopage = rmem_vma_nopage, +}; + +static int rmem_nopage_mmap(struct file *filp, struct vm_area_struct *vma) +{ + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; + + if (offset >= __pa(high_memory) || (filp->f_flags & O_SYNC)) + vma->vm_flags |= VM_IO; + vma->vm_flags |= VM_RESERVED; + vma->vm_ops = &rmem_nopage_vm_ops; +#ifdef TAINT_USER + add_taint(TAINT_USER); +#endif + return 0; +} + +static struct file_operations rmem_nopage_ops = { + .owner = THIS_MODULE, + .mmap = rmem_nopage_mmap, +}; + +static struct cdev rmem_cdev = { + .kobj = {.k_name = "rmem", }, + .owner = THIS_MODULE, +}; + +static int __init rmem_init(void) +{ + int result; + dev_t dev = MKDEV(rmem_major, 0); + + /* Figure out our device number. 
*/ + if (rmem_major) + result = register_chrdev_region(dev, 1, "rmem"); + else { + result = alloc_chrdev_region(&dev, 0, 1, "rmem"); + rmem_major = MAJOR(dev); + } + if (result < 0) { + printk(KERN_WARNING "rmem: unable to get major %d\n", rmem_major); + return result; + } + if (rmem_major == 0) + rmem_major = result; + + cdev_init(&rmem_cdev, &rmem_nopage_ops); + result = cdev_add(&rmem_cdev, dev, 1); + if (result) { + printk (KERN_NOTICE "Error %d adding /dev/rmem", result); + kobject_put(&rmem_cdev.kobj); + unregister_chrdev_region(dev, 1); + return 1; + } + + rmem_class = class_create(THIS_MODULE, "rmem"); + class_device_create(rmem_class, NULL, dev, NULL, "rmem"); + + return 0; +} + + +static void __exit rmem_cleanup(void) +{ + cdev_del(&rmem_cdev); + unregister_chrdev_region(MKDEV(rmem_major, 0), 1); + class_destroy(rmem_class); +} + + +module_init(rmem_init); +module_exit(rmem_cleanup); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/Allocate-RTSJ-memory-for-TCK-conformance-test.patch�����������������������������������������0000664�0000764�0000764�00000011651�11041657732�022552� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Allocate RTSJ memory for TCK conformance test. From: Theodore Ts'o <tytso@mit.edu> This kernel message allocates memory which is required by the real-time TCK conformance test which tests the JVM's RTSJ implementation. Unfortunately, RTSJ requires that Java programs have direct access to physical memory. This kernel reserves memory which can then be used by an external /dev/rmem loadable kernel module. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> --- drivers/char/Kconfig | 7 +++ drivers/char/Makefile | 2 drivers/char/alloc_rtsj_mem.c | 88 ++++++++++++++++++++++++++++++++++++++++++ init/main.c | 7 +++ 4 files changed, 104 insertions(+) Index: linux-2.6.24.7/drivers/char/Kconfig =================================================================== --- linux-2.6.24.7.orig/drivers/char/Kconfig +++ linux-2.6.24.7/drivers/char/Kconfig @@ -1083,6 +1083,13 @@ config RMEM patching /dev/mem because we don't expect this functionality to ever be accepted into mainline. +config ALLOC_RTSJ_MEM + tristate "RTSJ-specific hack to reserve memory" + default m + help + The RTSJ TCK conformance test requires reserving some physical + memory for testing /dev/rmem. 
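For reference, user space consumes the /dev/rmem device introduced above through an ordinary mmap() whose file offset is interpreted as a physical address; the driver's nopage handler then resolves each faulting page to the corresponding page frame. The following is a minimal sketch of such a consumer (a hypothetical test program, not part of this patch series); the physical address and length are assumed to come from the alloc_rtsj_mem module parameters (addr=, size=) described in the next patch, and mmap() additionally requires the offset to be page aligned.

/*
 * Hypothetical user-space consumer of /dev/rmem (illustration only,
 * not part of this patch series).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* Physical address and length, e.g. taken from alloc_rtsj_mem's
	 * addr= and size= module parameters. */
	off_t phys = argc > 1 ? (off_t)strtoull(argv[1], NULL, 0) : 0;
	size_t len = argc > 2 ? (size_t)strtoull(argv[2], NULL, 0) : 4096;
	int fd;
	void *p;

	fd = open("/dev/rmem", O_RDWR);
	if (fd < 0) {
		perror("open /dev/rmem");
		return 1;
	}

	/*
	 * The mmap offset is treated as a physical address; the driver
	 * maps each page frame on demand via its nopage handler.
	 */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
	if (p == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Dump the start of the region (e.g. the alloc_rtsj_mem test string). */
	fwrite(p, 1, len < 80 ? len : 80, stdout);

	munmap(p, len);
	close(fd);
	return 0;
}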
+ config DEVPORT bool depends on !M68K Index: linux-2.6.24.7/drivers/char/Makefile =================================================================== --- linux-2.6.24.7.orig/drivers/char/Makefile +++ linux-2.6.24.7/drivers/char/Makefile @@ -114,6 +114,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o obj-$(CONFIG_JS_RTC) += js-rtc.o js-rtc-y = rtc.o +obj-$(CONFIG_ALLOC_RTSJ_MEM) += alloc_rtsj_mem.o + # Files generated that shall be removed upon make clean clean-files := consolemap_deftbl.c defkeymap.c Index: linux-2.6.24.7/drivers/char/alloc_rtsj_mem.c =================================================================== --- /dev/null +++ linux-2.6.24.7/drivers/char/alloc_rtsj_mem.c @@ -0,0 +1,88 @@ +/* + * alloc_rtsj_mem.c -- Hack to allocate some memory + * + * Copyright (C) 2005 by Theodore Ts'o + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + * + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/init.h> +#include <linux/types.h> +#include <linux/sysctl.h> +#include <linux/bootmem.h> + +#include <asm/io.h> + +MODULE_AUTHOR("Theodore Tso"); +MODULE_DESCRIPTION("RTSJ alloc memory"); +MODULE_LICENSE("GPL"); + +static void *mem = 0; +int size = 0, addr = 0; + +module_param(size, int, 0444); +module_param(addr, int, 0444); + +static void __exit shutdown_module(void) +{ + kfree(mem); +} + +#ifndef MODULE +void __init alloc_rtsj_mem_early_setup(void) +{ + if (size > PAGE_SIZE*2) { + mem = alloc_bootmem(size); + if (mem) { + printk(KERN_INFO "alloc_rtsj_mem: got %d bytes " + "using alloc_bootmem\n", size); + } else { + printk(KERN_INFO "alloc_rtsj_mem: failed to " + "get %d bytes from alloc_bootmem\n", size); + } + } +} +#endif + +static int __init startup_module(void) +{ + static char test_string[] = "The BOFH: Servicing users the way the " + "military\n\tservices targets for 15 years.\n"; + + if (!size) + return 0; + + if (!mem) { + mem = kmalloc(size, GFP_KERNEL); + if (mem) { + printk(KERN_INFO "alloc_rtsj_mem: got %d bytes " + "using kmalloc\n", size); + } else { + printk(KERN_ERR "alloc_rtsj_mem: failed to get " + "%d bytes using kmalloc\n", size); + return -ENOMEM; + } + } + memcpy(mem, test_string, min(sizeof(test_string), (size_t) size)); + addr = virt_to_phys(mem); + return 0; +} + +module_init(startup_module); +module_exit(shutdown_module); + Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -100,6 +100,12 @@ static inline void acpi_early_init(void) #ifndef CONFIG_DEBUG_RODATA static inline void mark_rodata_ro(void) { } #endif +#ifdef CONFIG_ALLOC_RTSJ_MEM +extern void alloc_rtsj_mem_early_setup(void); +#else +static inline void alloc_rtsj_mem_early_setup(void) { } +#endif + #ifdef CONFIG_TC extern void tc_init(void); @@ -613,6 +619,7 @@ asmlinkage void __init start_kernel(void 
#endif vfs_caches_init_early(); cpuset_init_early(); + alloc_rtsj_mem_early_setup(); mem_init(); kmem_cache_init(); setup_per_cpu_pageset(); ���������������������������������������������������������������������������������������patches/new-softirq-code.patch����������������������������������������������������������������������0000664�0000764�0000764�00000023153�11041657734�015460� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [patch] softirq preemption: optimization From: Ingo Molnar <mingo@elte.hu> optimize softirq preemption by allowing a hardirq context to pick up softirq processing. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/irq/manage.c | 19 +----- kernel/softirq.c | 160 ++++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 131 insertions(+), 48 deletions(-) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -708,7 +708,6 @@ static void thread_edge_irq(irq_desc_t * desc->status &= ~IRQ_PENDING; spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); - cond_resched_hardirq_context(); spin_lock_irq(&desc->lock); if (!noirqdebug) note_interrupt(irq, desc, action_ret); @@ -737,7 +736,6 @@ static void thread_do_irq(irq_desc_t *de desc->status &= ~IRQ_PENDING; spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); - cond_resched_hardirq_context(); spin_lock_irq(&desc->lock); if (!noirqdebug) note_interrupt(irq, desc, action_ret); @@ -773,8 +771,6 @@ static void do_hardirq(struct irq_desc * wake_up(&desc->wait_for_handler); } -extern asmlinkage void __do_softirq(void); - static int do_irqd(void * __desc) { struct sched_param param = { 0, }; @@ -794,16 +790,13 @@ static int do_irqd(void * __desc) while (!kthread_should_stop()) { local_irq_disable_nort(); - set_current_state(TASK_INTERRUPTIBLE); -#ifndef CONFIG_PREEMPT_RT - irq_enter(); -#endif - do_hardirq(desc); -#ifndef CONFIG_PREEMPT_RT - irq_exit(); -#endif + do { + set_current_state(TASK_INTERRUPTIBLE); + do_hardirq(desc); + do_softirq_from_hardirq(); + } while (current->state == TASK_RUNNING); + local_irq_enable_nort(); - cond_resched(); #ifdef CONFIG_SMP /* * Did IRQ affinities change? Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -101,8 +101,26 @@ static void wakeup_softirqd(int softirq) /* Interrupts are disabled: no need to stop preemption */ struct task_struct *tsk = __get_cpu_var(ksoftirqd)[softirq].tsk; - if (tsk && tsk->state != TASK_RUNNING) - wake_up_process(tsk); + if (unlikely(!tsk)) + return; +#if defined(CONFIG_PREEMPT_SOFTIRQS) && defined(CONFIG_PREEMPT_HARDIRQS) + /* + * Optimization: if we are in a hardirq thread context, and + * if the priority of the softirq thread is the same as the + * priority of the hardirq thread, then 'merge' softirq + * processing into the hardirq context. (it will later on + * execute softirqs via do_softirq_from_hardirq()). + * So here we can skip the wakeup and can rely on the hardirq + * context processing it later on. 
+ */ + if ((current->flags & PF_HARDIRQ) && !hardirq_count() && + (tsk->normal_prio == current->normal_prio)) + return; +#endif + /* + * Wake up the softirq task: + */ + wake_up_process(tsk); } /* @@ -251,50 +269,100 @@ EXPORT_SYMBOL(local_bh_enable_ip); * we want to handle softirqs as soon as possible, but they * should not be able to lock up the box. */ -#define MAX_SOFTIRQ_RESTART 10 +#define MAX_SOFTIRQ_RESTART 20 + +static DEFINE_PER_CPU(u32, softirq_running); -asmlinkage void ___do_softirq(void) +static void ___do_softirq(const int same_prio_only) { + int max_restart = MAX_SOFTIRQ_RESTART, max_loops = MAX_SOFTIRQ_RESTART; + __u32 pending, available_mask, same_prio_skipped; struct softirq_action *h; - __u32 pending; - int max_restart = MAX_SOFTIRQ_RESTART; - int cpu; + struct task_struct *tsk; + int cpu, softirq; pending = local_softirq_pending(); account_system_vtime(current); cpu = smp_processor_id(); restart: + available_mask = -1; + softirq = 0; + same_prio_skipped = 0; /* Reset the pending bitmask before enabling irqs */ set_softirq_pending(0); - local_irq_enable(); - h = softirq_vec; do { + u32 softirq_mask = 1 << softirq; + if (pending & 1) { - { - u32 preempt_count = preempt_count(); - h->action(h); - if (preempt_count != preempt_count()) { - print_symbol("BUG: softirq exited %s with wrong preemption count!\n", (unsigned long) h->action); - printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); - preempt_count() = preempt_count; + u32 preempt_count = preempt_count(); + +#if defined(CONFIG_PREEMPT_SOFTIRQS) && defined(CONFIG_PREEMPT_HARDIRQS) + /* + * If executed by a same-prio hardirq thread + * then skip pending softirqs that belong + * to softirq threads with different priority: + */ + if (same_prio_only) { + tsk = __get_cpu_var(ksoftirqd)[softirq].tsk; + if (tsk && tsk->normal_prio != + current->normal_prio) { + same_prio_skipped |= softirq_mask; + available_mask &= ~softirq_mask; + goto next; } } +#endif + /* + * Is this softirq already being processed? + */ + if (per_cpu(softirq_running, cpu) & softirq_mask) { + available_mask &= ~softirq_mask; + goto next; + } + per_cpu(softirq_running, cpu) |= softirq_mask; + local_irq_enable(); + + h->action(h); + if (preempt_count != preempt_count()) { + print_symbol("BUG: softirq exited %s with wrong preemption count!\n", (unsigned long) h->action); + printk("entered with %08x, exited with %08x.\n", preempt_count, preempt_count()); + preempt_count() = preempt_count; + } rcu_bh_qsctr_inc(cpu); cond_resched_softirq_context(); + local_irq_disable(); + per_cpu(softirq_running, cpu) &= ~softirq_mask; } +next: h++; + softirq++; pending >>= 1; } while (pending); - local_irq_disable(); - + or_softirq_pending(same_prio_skipped); pending = local_softirq_pending(); - if (pending && --max_restart) - goto restart; + if (pending & available_mask) { + if (--max_restart) + goto restart; + /* + * With softirq threading there's no reason not to + * finish the workload we have: + */ +#ifdef CONFIG_PREEMPT_SOFTIRQS + if (--max_loops) { + if (printk_ratelimit()) + printk("INFO: softirq overload: %08x\n", pending); + max_restart = MAX_SOFTIRQ_RESTART; + goto restart; + } + if (printk_ratelimit()) + printk("BUG: softirq loop! 
%08x\n", pending); +#endif + } if (pending) trigger_softirqs(); @@ -322,7 +390,7 @@ asmlinkage void __do_softirq(void) p_flags = current->flags & PF_HARDIRQ; current->flags &= ~PF_HARDIRQ; - ___do_softirq(); + ___do_softirq(0); trace_softirq_exit(); @@ -346,20 +414,29 @@ void do_softirq_from_hardirq(void) if (!local_softirq_pending()) return; /* - * 'immediate' softirq execution: + * 'immediate' softirq execution, from hardirq context: */ + local_irq_disable(); __local_bh_disable((unsigned long)__builtin_return_address(0)); +#ifndef CONFIG_PREEMPT_SOFTIRQS + trace_softirq_enter(); +#endif p_flags = current->flags & PF_HARDIRQ; current->flags &= ~PF_HARDIRQ; + current->flags |= PF_SOFTIRQ; - ___do_softirq(); + ___do_softirq(1); +#ifndef CONFIG_PREEMPT_SOFTIRQS trace_softirq_exit(); - +#endif account_system_vtime(current); - _local_bh_enable(); current->flags |= p_flags; + current->flags &= ~PF_SOFTIRQ; + + _local_bh_enable(); + local_irq_enable(); } #ifndef __ARCH_HAS_DO_SOFTIRQ @@ -690,8 +767,9 @@ static int ksoftirqd(void * __data) { struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; struct softirqdata *data = __data; - u32 mask = (1 << data->nr); + u32 softirq_mask = (1 << data->nr); struct softirq_action *h; + int cpu = data->cpu; #ifdef CONFIG_PREEMPT_SOFTIRQS init_waitqueue_head(&data->wait); @@ -703,7 +781,8 @@ static int ksoftirqd(void * __data) while (!kthread_should_stop()) { preempt_disable(); - if (!(local_softirq_pending() & mask)) { + if (!(local_softirq_pending() & softirq_mask)) { +sleep_more: __preempt_enable_no_resched(); schedule(); preempt_disable(); @@ -715,16 +794,26 @@ static int ksoftirqd(void * __data) data->running = 1; #endif - while (local_softirq_pending() & mask) { + while (local_softirq_pending() & softirq_mask) { /* Preempt disable stops cpu going offline. If already offline, we'll be on wrong CPU: don't process */ - if (cpu_is_offline(data->cpu)) + if (cpu_is_offline(cpu)) goto wait_to_die; local_irq_disable(); + /* + * Is the softirq already being executed by + * a hardirq context? 
+ */ + if (per_cpu(softirq_running, cpu) & softirq_mask) { + local_irq_enable(); + set_current_state(TASK_INTERRUPTIBLE); + goto sleep_more; + } + per_cpu(softirq_running, cpu) |= softirq_mask; __preempt_enable_no_resched(); - set_softirq_pending(local_softirq_pending() & ~mask); + set_softirq_pending(local_softirq_pending() & ~softirq_mask); local_bh_disable(); local_irq_enable(); @@ -734,6 +823,7 @@ static int ksoftirqd(void * __data) rcu_bh_qsctr_inc(data->cpu); local_irq_disable(); + per_cpu(softirq_running, cpu) &= ~softirq_mask; _local_bh_enable(); local_irq_enable(); @@ -876,19 +966,19 @@ static int __cpuinit cpu_callback(struct } #endif case CPU_DEAD: - case CPU_DEAD_FROZEN: { - struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; - - sched_setscheduler(p, SCHED_FIFO, ¶m); + case CPU_DEAD_FROZEN: for (i = 0; i < MAX_SOFTIRQ; i++) { + struct sched_param param; + + param.sched_priority = MAX_RT_PRIO-1; p = per_cpu(ksoftirqd, hotcpu)[i].tsk; + sched_setscheduler(p, SCHED_FIFO, ¶m); per_cpu(ksoftirqd, hotcpu)[i].tsk = NULL; kthread_stop(p); } takeover_tasklets(hotcpu); break; #endif /* CONFIG_HOTPLUG_CPU */ - } } return NOTIFY_OK; } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/softirq-per-cpu-assumptions-fixes.patch�����������������������������������������������������0000664�0000764�0000764�00000012773�11041673157�021034� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/hrtimer.c | 38 +++++++++++++++++++++----------------- kernel/sched.c | 2 +- kernel/softirq.c | 5 +++-- kernel/timer.c | 2 +- 4 files changed, 26 insertions(+), 21 deletions(-) Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -380,9 +380,9 @@ static inline int hrtimer_is_hres_enable /* * Is the high resolution mode active ? 
*/ -static inline int hrtimer_hres_active(void) +static inline int hrtimer_hres_active(struct hrtimer_cpu_base *cpu_base) { - return __get_cpu_var(hrtimer_bases).hres_active; + return cpu_base->hres_active; } /* @@ -470,11 +470,12 @@ static int hrtimer_reprogram(struct hrti */ static void retrigger_next_event(void *arg) { - struct hrtimer_cpu_base *base; + struct hrtimer_cpu_base *base = &__get_cpu_var(hrtimer_bases); + struct timespec realtime_offset; unsigned long seq; - if (!hrtimer_hres_active()) + if (!hrtimer_hres_active(base)) return; do { @@ -484,8 +485,6 @@ static void retrigger_next_event(void *a -wall_to_monotonic.tv_nsec); } while (read_seqretry(&xtime_lock, seq)); - base = &__get_cpu_var(hrtimer_bases); - /* Adjust CLOCK_REALTIME offset */ spin_lock(&base->lock); base->clock_base[CLOCK_REALTIME].offset = @@ -606,10 +605,8 @@ static inline int hrtimer_enqueue_reprog /* * Switch to high resolution mode */ -static int hrtimer_switch_to_hres(void) +static int hrtimer_switch_to_hres(struct hrtimer_cpu_base *base) { - int cpu = smp_processor_id(); - struct hrtimer_cpu_base *base = &per_cpu(hrtimer_bases, cpu); unsigned long flags; if (base->hres_active) @@ -620,7 +617,7 @@ static int hrtimer_switch_to_hres(void) if (tick_init_highres()) { local_irq_restore(flags); printk(KERN_WARNING "Could not switch to high resolution " - "mode on CPU %d\n", cpu); + "mode on CPU %d\n", raw_smp_processor_id()); return 0; } base->hres_active = 1; @@ -642,9 +639,15 @@ static inline void hrtimer_raise_softirq #else -static inline int hrtimer_hres_active(void) { return 0; } +static inline int hrtimer_hres_active(struct hrtimer_cpu_base *base) +{ + return 0; +} static inline int hrtimer_is_hres_enabled(void) { return 0; } -static inline int hrtimer_switch_to_hres(void) { return 0; } +static inline int hrtimer_switch_to_hres(struct hrtimer_cpu_base *base) +{ + return 0; +} static inline void hrtimer_force_reprogram(struct hrtimer_cpu_base *base) { } static inline int hrtimer_enqueue_reprogram(struct hrtimer *timer, struct hrtimer_clock_base *base) @@ -836,7 +839,7 @@ static void __remove_hrtimer(struct hrti if (base->first == &timer->node) { base->first = rb_next(&timer->node); /* Reprogram the clock event device. if enabled */ - if (reprogram && hrtimer_hres_active()) + if (reprogram && hrtimer_hres_active(base->cpu_base)) hrtimer_force_reprogram(base->cpu_base); } rb_erase(&timer->node, &base->active); @@ -1027,7 +1030,7 @@ ktime_t hrtimer_get_next_event(void) spin_lock_irqsave(&cpu_base->lock, flags); - if (!hrtimer_hres_active()) { + if (!hrtimer_hres_active(cpu_base)) { for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++, base++) { struct hrtimer *timer; @@ -1335,10 +1338,11 @@ static inline void run_hrtimer_queue(str */ void hrtimer_run_queues(void) { - struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases); + struct hrtimer_cpu_base *cpu_base; int i; - if (hrtimer_hres_active()) + cpu_base = &per_cpu(hrtimer_bases, raw_smp_processor_id()); + if (hrtimer_hres_active(cpu_base)) return; /* @@ -1350,7 +1354,7 @@ void hrtimer_run_queues(void) * deadlock vs. xtime_lock. 
*/ if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())) - if (hrtimer_switch_to_hres()) + if (hrtimer_switch_to_hres(cpu_base)) return; hrtimer_get_softirq_time(cpu_base); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3392,7 +3392,7 @@ out: */ static void run_rebalance_domains(struct softirq_action *h) { - int this_cpu = smp_processor_id(); + int this_cpu = raw_smp_processor_id(); struct rq *this_rq = cpu_rq(this_cpu); enum cpu_idle_type idle = this_rq->idle_at_tick ? CPU_IDLE : CPU_NOT_IDLE; Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -411,12 +411,12 @@ void do_softirq_from_hardirq(void) { unsigned long p_flags; - if (!local_softirq_pending()) - return; /* * 'immediate' softirq execution, from hardirq context: */ local_irq_disable(); + if (!local_softirq_pending()) + goto out; __local_bh_disable((unsigned long)__builtin_return_address(0)); #ifndef CONFIG_PREEMPT_SOFTIRQS trace_softirq_enter(); @@ -436,6 +436,7 @@ void do_softirq_from_hardirq(void) current->flags &= ~PF_SOFTIRQ; _local_bh_enable(); +out: local_irq_enable(); } Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -1035,7 +1035,7 @@ static inline void update_times(void) */ static void run_timer_softirq(struct softirq_action *h) { - tvec_base_t *base = __get_cpu_var(tvec_bases); + tvec_base_t *base = per_cpu(tvec_bases, raw_smp_processor_id()); update_times(); hrtimer_run_queues(); �����patches/fix-migrating-softirq.patch�����������������������������������������������������������������0000664�0000764�0000764�00000011163�11041657735�016523� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From rostedt@goodmis.org Wed Jun 13 14:47:26 2007 Return-Path: <rostedt@goodmis.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from ms-smtp-02.nyroc.rr.com (ms-smtp-02.nyroc.rr.com [24.24.2.56]) by mail.tglx.de (Postfix) with ESMTP id AB7B665C3D9 for <tglx@linutronix.de>; Wed, 13 Jun 2007 14:47:26 +0200 (CEST) Received: from [192.168.23.10] (cpe-24-94-51-176.stny.res.rr.com [24.94.51.176]) by ms-smtp-02.nyroc.rr.com (8.13.6/8.13.6) with ESMTP id l5DClGVg022890; Wed, 13 Jun 2007 08:47:17 -0400 (EDT) Subject: [PATCH RT] fix migrating softirq [cause of network hang] From: Steven Rostedt <rostedt@goodmis.org> To: Ingo Molnar <mingo@elte.hu> Cc: LKML <linux-kernel@vger.kernel.org>, RT <linux-rt-users@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, john stultz <johnstul@us.ibm.com> Content-Type: text/plain Date: Wed, 13 Jun 2007 08:47:16 -0400 Message-Id: <1181738836.10408.54.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.6.3 X-Virus-Scanned: Symantec AntiVirus Scan Engine X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit 
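The per-cpu fixes above (hrtimer, sched, timer) all converge on the same discipline: once softirqs run in preemptible threads, __get_cpu_var() and smp_processor_id() cannot be used casually from softirq handlers, so the per-CPU base is resolved once via raw_smp_processor_id() and passed down explicitly. A minimal sketch of that shape, mirroring the hrtimer_run_queues() hunk (hrtimer_bases, hrtimer_cpu_base and the reworked hrtimer_hres_active() are the symbols from that patch; example_run_queues() is only an illustrative name):

/*
 * Illustrative only: resolve the per-CPU base once, then hand it to
 * every helper instead of re-deriving the CPU deeper in the chain.
 */
static void example_run_queues(void)
{
	struct hrtimer_cpu_base *cpu_base;

	cpu_base = &per_cpu(hrtimer_bases, raw_smp_processor_id());
	if (hrtimer_hres_active(cpu_base))
		return;
	/* ... all further work operates on cpu_base ... */
}
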
Softirqs are bound to a single CPU. That is to say, that once a softirq function starts to run, it will stay on the CPU that it is running on while it's running. In RT, softirqs are threads, and we have a softirq thread per cpu. Each softirq thread is bound to a single CPU that it represents. In order to speed things up and lower context switches in RT, if a softirq thread is of the same priority as an interrupt thread, then when the interrupt thread is about to exit, it tests to see if any softirq threads need to be run on that cpu. Instead of running the softirq thread, it simply performs the functions for the softirq within the interrupt thread. The problem is, nothing prevents the interrupt thread from migrating. So while the interrupt thread is running the softirq function, it may migrate to another CPU in the middle of that function. This means that any CPU data that the softirq is touching can be corrupted. I was experiencing a network hang that sometimes would come back, and sometimes not. Using my logdev debugger, I started to debug this problem. I came across this at the moment of the hang: [ 389.131279] cpu:0 (IRQ-11:427) tcp_rcv_established:4056 rcv_nxt=-1665585797 [ 389.131615] cpu:1 192.168.23.72:22 <== 192.168.23.60:41352 ack:2629381499 seq:1773074099 (----A-) len:0 win:790 end_seq:1773074099 [ 389.131626] cpu:1 (IRQ-11:427) ip_finish_output2:187 dst->hh=ffff81003b213080 [ 389.131635] cpu:1 (IRQ-11:427) ip_finish_output2:189 hh_output=ffffffff80429009 Here we see IRQ-11 in the process of finishing up the softirq-net-tx function. In the middle of it, we receive a packet, and that must have pushed the interrupt thread over to CPU 1, and it finished up the softirq there. This patch temporarily binds the hardirq thread on the CPU that it runs the softirqs on. With this patch I have not seen my network hang. I ran it over night, doing compiles and such, and it seems fine. I would be able to cause the hang with various loads within a minute, now I can't cause it after several minutes. I'm assuming that this fix may fix other bugs too. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/irq/manage.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -777,7 +777,15 @@ static int do_irqd(void * __desc) struct irq_desc *desc = __desc; #ifdef CONFIG_SMP - set_cpus_allowed(current, desc->affinity); + cpumask_t cpus_allowed, mask; + + cpus_allowed = desc->affinity; + /* + * Restrict it to one cpu so we avoid being migrated inside of + * do_softirq_from_hardirq() + */ + mask = cpumask_of_cpu(first_cpu(desc->affinity)); + set_cpus_allowed(current, mask); #endif current->flags |= PF_NOFREEZE | PF_HARDIRQ; @@ -801,8 +809,16 @@ static int do_irqd(void * __desc) /* * Did IRQ affinities change? 
*/ - if (!cpus_equal(current->cpus_allowed, desc->affinity)) - set_cpus_allowed(current, desc->affinity); + if (!cpus_equal(cpus_allowed, desc->affinity)) { + cpus_allowed = desc->affinity; + /* + * Restrict it to one cpu so we avoid being + * migrated inside of + * do_softirq_from_hardirq() + */ + mask = cpumask_of_cpu(first_cpu(desc->affinity)); + set_cpus_allowed(current, mask); + } #endif schedule(); } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/only-run-softirqs-from-irq-thread-when-irq-affinity-is-set.patch����������������������������0000664�0000764�0000764�00000016023�11041657734�025463� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From linux-rt-users-owner@vger.kernel.org Wed Aug 8 22:43:28 2007 Return-Path: <linux-rt-users-owner@vger.kernel.org> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=AWL autolearn=unavailable version=3.1.7-deb Received: from vger.kernel.org (vger.kernel.org [209.132.176.167]) by mail.tglx.de (Postfix) with ESMTP id 6193665C3D9; Wed, 8 Aug 2007 22:43:28 +0200 (CEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755519AbXHHUn0 (ORCPT <rfc822;jan.altenberg@linutronix.de> + 1 other); Wed, 8 Aug 2007 16:43:26 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755399AbXHHUn0 (ORCPT <rfc822;linux-rt-users-outgoing>); Wed, 8 Aug 2007 16:43:26 -0400 Received: from ms-smtp-03.nyroc.rr.com ([24.24.2.57]:59763 "EHLO ms-smtp-03.nyroc.rr.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754194AbXHHUnY (ORCPT <rfc822;linux-rt-users@vger.kernel.org>); Wed, 8 Aug 2007 16:43:24 -0400 Received: from gandalf.stny.rr.com (cpe-24-94-51-176.stny.res.rr.com [24.94.51.176]) by ms-smtp-03.nyroc.rr.com (8.13.6/8.13.6) with ESMTP id l78KgX4S011873; Wed, 8 Aug 2007 16:42:33 -0400 (EDT) Received: from localhost ([127.0.0.1] ident=rostedt) by gandalf.stny.rr.com with esmtp (Exim 4.67) (envelope-from <rostedt@goodmis.org>) id 1IIsMT-0003mx-ET; Wed, 08 Aug 2007 16:42:33 -0400 Subject: [PATCH RT] Only run softirqs from the irq thread if the irq affinity is set to 1 CPU From: Steven Rostedt <rostedt@goodmis.org> To: Ingo Molnar <mingo@elte.hu> Cc: RT <linux-rt-users@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, john stultz <johnstul@us.ibm.com> Content-Type: text/plain Date: Wed, 08 Aug 2007 16:42:32 -0400 Message-Id: <1186605752.29097.18.camel@localhost.localdomain> Mime-Version: 1.0 X-Mailer: Evolution 2.10.2 X-Virus-Scanned: Symantec AntiVirus Scan Engine Sender: linux-rt-users-owner@vger.kernel.org Precedence: bulk X-Mailing-List: linux-rt-users@vger.kernel.org X-Filter-To: .Kernel.rt-users X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit Ingo and Thomas, John and I have been discussing all 
the "run softirq from IRQ thread" lately and discovered something nasty. Now it is a nice optimization to run softirqs from the IRQ thread, but it may not be feasible because of the semantics of the IRQ thread compared with the softirq thread. Namely, the softirq thread is bound to a single CPU and the IRQ thread is not. We use to think that it would be fine to simply bind an IRQ thread to a single CPU, either at the start of the IRQ thread code, or just while it is running the softirq code. But this has a major flaw as John Stultz discovered. If a RT hog that is of higher priority than the IRQ thread preempts the IRQ thread while it is bound to the CPU (more likely with the latest code that always binds the IRQ thread to 1 CPU), then that IRQ is, in essence, masked. That means no more actions will be taken place by that IRQ while the RT thread is running. Normally, one would expect, that if the IRQ has its affinity set to all CPUS, if a RT thread were to preempt the IRQ thread and run for a long time, it would be expected that the IRQ thread would migrate to another CPU and finish. Letting more interrupts from the IRQ line in (remember that the IRQ line is masked until the IRQ finishes its handler). This patch will only run the softirq functions if the IRQ thread and the softirq thread have the same priority **and** the IRQ thread is already bound to a single CPU. If we are running on UP or the IRQ thread is bound to a single CPU, we already have the possibility of having a RT hog starve the IRQ. But we should not add that scenario when the IRQ thread has its affinity set to run on other CPUS that don't have RT hogs on them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/irq/manage.c | 32 +++++++++++++++++++++----------- kernel/softirq.c | 9 ++++++++- 2 files changed, 29 insertions(+), 12 deletions(-) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -775,17 +775,28 @@ static int do_irqd(void * __desc) { struct sched_param param = { 0, }; struct irq_desc *desc = __desc; + int run_softirq = 1; #ifdef CONFIG_SMP - cpumask_t cpus_allowed, mask; + cpumask_t cpus_allowed; cpus_allowed = desc->affinity; /* - * Restrict it to one cpu so we avoid being migrated inside of - * do_softirq_from_hardirq() + * If the irqd is bound to one CPU we let it run softirqs + * that have the same priority as the irqd thread. We do + * not run it if the irqd is bound to more than one CPU + * due to the fact that it can + * 1) migrate to other CPUS while running the softirqd + * 2) if we pin the irqd to a CPU to run the softirqd, then + * we risk a high priority process from waking up and + * preempting the irqd. Although the irqd may be able to + * run on other CPUS due to its irq affinity, it will not + * be able to since we bound it to a CPU to run softirqs. + * So a RT hog could starve the irqd from running on + * other CPUS that it's allowed to run on. 
*/ - mask = cpumask_of_cpu(first_cpu(desc->affinity)); - set_cpus_allowed(current, mask); + if (cpus_weight(cpus_allowed) != 1) + run_softirq = 0; /* turn it off */ #endif current->flags |= PF_NOFREEZE | PF_HARDIRQ; @@ -801,7 +812,8 @@ static int do_irqd(void * __desc) do { set_current_state(TASK_INTERRUPTIBLE); do_hardirq(desc); - do_softirq_from_hardirq(); + if (run_softirq) + do_softirq_from_hardirq(); } while (current->state == TASK_RUNNING); local_irq_enable_nort(); @@ -812,12 +824,10 @@ static int do_irqd(void * __desc) if (!cpus_equal(cpus_allowed, desc->affinity)) { cpus_allowed = desc->affinity; /* - * Restrict it to one cpu so we avoid being - * migrated inside of - * do_softirq_from_hardirq() + * Only allow the irq thread to run the softirqs + * if it is bound to a single CPU. */ - mask = cpumask_of_cpu(first_cpu(desc->affinity)); - set_cpus_allowed(current, mask); + run_softirq = (cpus_weight(cpus_allowed) == 1); } #endif schedule(); Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -114,7 +114,14 @@ static void wakeup_softirqd(int softirq) * context processing it later on. */ if ((current->flags & PF_HARDIRQ) && !hardirq_count() && - (tsk->normal_prio == current->normal_prio)) + (tsk->normal_prio == current->normal_prio) && + /* + * The hard irq thread must be bound to a single CPU to run + * a softirq. Don't worry about locking, the irq thread + * should be the only one to modify the cpus_allowed, when + * the irq affinity changes. + */ + (cpus_weight(current->cpus_allowed) == 1)) return; #endif /* �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-softirq-checks-for-non-rt-preempt-hardirq.patch�����������������������������������������0000664�0000764�0000764�00000001713�11041657735�023103� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/bottom_half.h | 2 +- kernel/softirq.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/linux/bottom_half.h =================================================================== --- linux-2.6.24.7.orig/include/linux/bottom_half.h +++ linux-2.6.24.7/include/linux/bottom_half.h @@ -1,7 +1,7 @@ #ifndef _LINUX_BH_H #define _LINUX_BH_H -#ifdef CONFIG_PREEMPT_RT +#ifdef CONFIG_PREEMPT_HARDIRQS # define local_bh_disable() do { } while (0) # define __local_bh_disable(ip) do { } while (0) # define _local_bh_enable() do { } while (0) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -146,7 +146,7 @@ static void trigger_softirqs(void) } } -#ifndef CONFIG_PREEMPT_RT +#ifndef 
CONFIG_PREEMPT_HARDIRQS /* * This one is for softirq.c-internal use, �����������������������������������������������������patches/smp-processor-id-fixups.patch���������������������������������������������������������������0000664�0000764�0000764�00000005741�11041673156�017013� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/kprobes_32.c | 5 ++--- arch/x86/kernel/kprobes_64.c | 4 +--- include/linux/netpoll.h | 2 +- kernel/hrtimer.c | 4 +++- kernel/workqueue.c | 2 +- 5 files changed, 8 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/kprobes_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/kprobes_32.c +++ linux-2.6.24.7/arch/x86/kernel/kprobes_32.c @@ -668,12 +668,11 @@ int __kprobes kprobe_exceptions_notify(s ret = NOTIFY_STOP; break; case DIE_GPF: + // TODO: do this better on PREEMPT_RT /* kprobe_running() needs smp_processor_id() */ - preempt_disable(); - if (kprobe_running() && + if (per_cpu(current_kprobe, raw_smp_processor_id()) && kprobe_fault_handler(args->regs, args->trapnr)) ret = NOTIFY_STOP; - preempt_enable(); break; default: break; Index: linux-2.6.24.7/arch/x86/kernel/kprobes_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/kprobes_64.c +++ linux-2.6.24.7/arch/x86/kernel/kprobes_64.c @@ -655,11 +655,9 @@ int __kprobes kprobe_exceptions_notify(s break; case DIE_GPF: /* kprobe_running() needs smp_processor_id() */ - preempt_disable(); - if (kprobe_running() && + if (per_cpu(current_kprobe, raw_smp_processor_id()) && kprobe_fault_handler(args->regs, args->trapnr)) ret = NOTIFY_STOP; - preempt_enable(); break; default: break; Index: linux-2.6.24.7/include/linux/netpoll.h =================================================================== --- linux-2.6.24.7.orig/include/linux/netpoll.h +++ linux-2.6.24.7/include/linux/netpoll.h @@ -77,7 +77,7 @@ static inline void *netpoll_poll_lock(st rcu_read_lock(); /* deal with race on ->npinfo */ if (dev && dev->npinfo) { spin_lock(&napi->poll_lock); - napi->poll_owner = smp_processor_id(); + napi->poll_owner = raw_smp_processor_id(); return napi; } return NULL; Index: linux-2.6.24.7/kernel/hrtimer.c =================================================================== --- linux-2.6.24.7.orig/kernel/hrtimer.c +++ linux-2.6.24.7/kernel/hrtimer.c @@ -1222,7 +1222,9 @@ void hrtimer_interrupt(struct clock_even static void run_hrtimer_softirq(struct softirq_action *h) { - struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases); + struct hrtimer_cpu_base *cpu_base; + + cpu_base = &per_cpu(hrtimer_bases, raw_smp_processor_id()); spin_lock_irq(&cpu_base->lock); Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -186,7 +186,7 @@ void delayed_work_timer_fn(unsigned long struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work); struct workqueue_struct *wq = cwq->wq; - __queue_work(wq_per_cpu(wq, smp_processor_id()), &dwork->work); + __queue_work(wq_per_cpu(wq, raw_smp_processor_id()), &dwork->work); } /** 
�������������������������������patches/irda-fix.patch������������������������������������������������������������������������������0000664�0000764�0000764�00000001755�11041657734�014001� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������This was found around the 2.6.10 timeframe when testing with the -rt patch and I believe is still is an issue. irttp_dup() does a memcpy() of the tsap_cb structure causing the spinlock protecting various fields in the structure to be duped. This works OK in the non-RT case but in the RT case we end up with two mutexes pointing to the same wait_list and leading to an OOPS. Fix is to simply initialize the spinlock after the memcpy(). Signed-off-by: Deepak Saxena <dsaxena@mvista.com> --- net/irda/irttp.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/net/irda/irttp.c =================================================================== --- linux-2.6.24.7.orig/net/irda/irttp.c +++ linux-2.6.24.7/net/irda/irttp.c @@ -1453,6 +1453,7 @@ struct tsap_cb *irttp_dup(struct tsap_cb } /* Dup */ memcpy(new, orig, sizeof(struct tsap_cb)); + spin_lock_init(&new->lock); /* We don't need the old instance any more */ spin_unlock_irqrestore(&irttp->tsaps->hb_spinlock, flags); �������������������patches/nf_conntrack-weird-crash-fix.patch����������������������������������������������������������0000664�0000764�0000764�00000002220�11041657731�017716� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- net/netfilter/nf_conntrack_core.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) Index: linux-2.6.24.7/net/netfilter/nf_conntrack_core.c =================================================================== --- linux-2.6.24.7.orig/net/netfilter/nf_conntrack_core.c +++ linux-2.6.24.7/net/netfilter/nf_conntrack_core.c @@ -1133,6 +1133,24 @@ int __init nf_conntrack_init(void) /* - and look it like as a confirmed connection */ set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status); + /* + * There's something really weird (read: crash) going on in + * this module when lockdep and rt is enabled - the locks are + * not initialized in the per-CPU area properly - or they might + * be initialized by getting a copy of the first CPU's per-cpu + * area? Only seems to happen when things are modular. Maybe + * per-cpu-alloc does not zero buffers properly? Needs + * investigating. Reported and fixed by Mike. 
+ */ +#if defined(CONFIG_NF_CONNTRACK_EVENTS) && defined(CONFIG_SMP) + { + int cpu; + + for_each_possible_cpu(cpu) + spin_lock_init(&per_cpu_lock(nf_conntrack_ecache, cpu)); + } +#endif + return ret; out_fini_expect: ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nf_conntrack-fix-smp-processor-id.patch�����������������������������������������������������0000664�0000764�0000764�00000001275�11041657732�020726� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/net/netfilter/nf_conntrack.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/net/netfilter/nf_conntrack.h =================================================================== --- linux-2.6.24.7.orig/include/net/netfilter/nf_conntrack.h +++ linux-2.6.24.7/include/net/netfilter/nf_conntrack.h @@ -262,7 +262,7 @@ DECLARE_PER_CPU(struct ip_conntrack_stat #define NF_CT_STAT_INC_ATOMIC(count) \ do { \ local_bh_disable(); \ - __get_cpu_var(nf_conntrack_stat).count++; \ + __raw_get_cpu_var(nf_conntrack_stat).count++; \ local_bh_enable(); \ } while (0) #define NF_CT_STAT_INC(count) (__raw_get_cpu_var(nf_conntrack_stat).count++) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/print-might-sleep-hack.patch����������������������������������������������������������������0000664�0000764�0000764�00000005062�11041657735�016546� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Temporary HACK!!!! PREEMPT_RT suffers from the on going problem of running printk in atomic operations. It is very advantageous to do so but with PREEMPT_RT making spin_locks sleep, it can also be devastating. This patch does not solve the problem of printk sleeping in an atomic operation. This patch just makes printk not report that it is. Of course if printk does report that it's sleeping in an atomic operation, then that printing of the report will also print a report, and you go into recursive hell. We need to really sit down and solve the real issue here. 
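Concretely, the hack below is just a per-task flag: it is raised around the console driver call and consulted before the rt_spin_lock debug check fires, so the report-about-the-report recursion never starts. A reduced sketch of the pattern the hunks introduce (current->in_printk and set_printk_might_sleep() are exactly what the patch adds; example_console_write() is only an illustrative wrapper, not part of the patch):

static void example_console_write(struct console *con,
				  const char *s, unsigned int len)
{
	set_printk_might_sleep(1);	/* current->in_printk = 1 on PREEMPT_RT */
	con->write(con, s, len);
	set_printk_might_sleep(0);
}

On the lock side, rt_spin_lock_fastlock() then skips might_sleep() whenever current->in_printk is set, which is the whole extent of the workaround.
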
--- include/linux/sched.h | 13 +++++++++++++ kernel/printk.c | 5 ++++- kernel/rtmutex.c | 4 +++- 3 files changed, 20 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1291,8 +1291,21 @@ struct task_struct { int make_it_fail; #endif struct prop_local_single dirties; +#ifdef CONFIG_PREEMPT_RT + /* + * Temporary hack, until we find a solution to + * handle printk in atomic operations. + */ + int in_printk; +#endif }; +#ifdef CONFIG_PREEMPT_RT +# define set_printk_might_sleep(x) do { current->in_printk = x; } while(0) +#else +# define set_printk_might_sleep(x) do { } while(0) +#endif + /* * Priority of a process goes from 0..MAX_PRIO-1, valid RT * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH Index: linux-2.6.24.7/kernel/printk.c =================================================================== --- linux-2.6.24.7.orig/kernel/printk.c +++ linux-2.6.24.7/kernel/printk.c @@ -436,8 +436,11 @@ static void __call_console_drivers(unsig for (con = console_drivers; con; con = con->next) { if ((con->flags & CON_ENABLED) && con->write && (cpu_online(raw_smp_processor_id()) || - (con->flags & CON_ANYTIME))) + (con->flags & CON_ANYTIME))) { + set_printk_might_sleep(1); con->write(con, &LOG_BUF(start), end - start); + set_printk_might_sleep(0); + } } } Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -631,7 +631,9 @@ static inline void rt_spin_lock_fastlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { - might_sleep(); + /* Temporary HACK! 
*/ + if (!current->in_printk) + might_sleep(); if (likely(rt_mutex_cmpxchg(lock, NULL, current))) rt_mutex_deadlock_account_lock(lock, current); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep-rt-mutex.patch����������������������������������������������������������������������0000664�0000764�0000764�00000011274�11041657732�015475� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: lockdep-rt: annotate PREEMPT_RT DEFINE_MUTEX Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/mutex.h | 16 ++++++---- include/linux/rt_lock.h | 70 ++++++++++++++++++++---------------------------- 2 files changed, 39 insertions(+), 47 deletions(-) Index: linux-2.6.24.7/include/linux/mutex.h =================================================================== --- linux-2.6.24.7.orig/include/linux/mutex.h +++ linux-2.6.24.7/include/linux/mutex.h @@ -18,6 +18,13 @@ #include <asm/atomic.h> +#ifdef CONFIG_DEBUG_LOCK_ALLOC +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ + , .dep_map = { .name = #lockname } +#else +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) +#endif + #ifdef CONFIG_PREEMPT_RT #include <linux/rtmutex.h> @@ -29,9 +36,11 @@ struct mutex { #endif }; + #define __MUTEX_INITIALIZER(mutexname) \ { \ .lock = __RT_MUTEX_INITIALIZER(mutexname.lock) \ + __DEP_MAP_MUTEX_INITIALIZER(mutexname) \ } #define DEFINE_MUTEX(mutexname) \ @@ -141,13 +150,6 @@ do { \ # define mutex_destroy(mutex) do { } while (0) #endif -#ifdef CONFIG_DEBUG_LOCK_ALLOC -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ - , .dep_map = { .name = #lockname } -#else -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) -#endif - #define __MUTEX_INITIALIZER(lockname) \ { .count = ATOMIC_INIT(1) \ , .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \ Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -27,30 +27,31 @@ typedef struct { } spinlock_t; #ifdef CONFIG_DEBUG_RT_MUTEXES -# define __SPIN_LOCK_UNLOCKED(name) \ - (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) \ - , .save_state = 1, .file = __FILE__, .line = __LINE__ }, SPIN_DEP_MAP_INIT(name) } +# define __RT_SPIN_INITIALIZER(name) \ + { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name.wait_lock), \ + .save_state = 1, \ + .file = __FILE__, \ + .line = __LINE__, } #else -# define __SPIN_LOCK_UNLOCKED(name) \ - (spinlock_t) { { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) }, SPIN_DEP_MAP_INIT(name) } +# define __RT_SPIN_INITIALIZER(name) \ + { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) } #endif -# define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(spin_old_style) + +#define __SPIN_LOCK_UNLOCKED(name) (spinlock_t) \ + { .lock = __RT_SPIN_INITIALIZER(name), \ + 
SPIN_DEP_MAP_INIT(name) } + #else /* !PREEMPT_RT */ - typedef raw_spinlock_t spinlock_t; -# ifdef CONFIG_DEBUG_SPINLOCK -# define _SPIN_LOCK_UNLOCKED \ - { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \ - .magic = SPINLOCK_MAGIC, \ - .owner = SPINLOCK_OWNER_INIT, \ - .owner_cpu = -1 } -# else -# define _SPIN_LOCK_UNLOCKED \ - { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED } -# endif -# define SPIN_LOCK_UNLOCKED _SPIN_LOCK_UNLOCKED -# define __SPIN_LOCK_UNLOCKED(name) _SPIN_LOCK_UNLOCKED + +typedef raw_spinlock_t spinlock_t; + +#define __SPIN_LOCK_UNLOCKED _RAW_SPIN_LOCK_UNLOCKED + #endif +#define SPIN_LOCK_UNLOCKED __SPIN_LOCK_UNLOCKED(spin_old_style) + + #define __DEFINE_SPINLOCK(name) \ spinlock_t name = __SPIN_LOCK_UNLOCKED(name) @@ -89,32 +90,20 @@ typedef struct { #endif } rwlock_t; -# ifdef CONFIG_DEBUG_RT_MUTEXES -# define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ - { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name), \ - .save_state = 1, .file = __FILE__, .line = __LINE__ } } -# else -# define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ - { .lock = { .wait_lock = _RAW_SPIN_LOCK_UNLOCKED(name) } } -# endif +#define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ + { .lock = __RT_SPIN_INITIALIZER(name), \ + RW_DEP_MAP_INIT(name) } #else /* !PREEMPT_RT */ - typedef raw_rwlock_t rwlock_t; -# ifdef CONFIG_DEBUG_SPINLOCK -# define _RW_LOCK_UNLOCKED \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \ - .magic = RWLOCK_MAGIC, \ - .owner = SPINLOCK_OWNER_INIT, \ - .owner_cpu = -1 } -# else -# define _RW_LOCK_UNLOCKED \ - (rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED } -# endif -# define __RW_LOCK_UNLOCKED(name) _RW_LOCK_UNLOCKED +typedef raw_rwlock_t rwlock_t; + +#define __RW_LOCK_UNLOCKED _RAW_RW_LOCK_UNLOCKED + #endif #define RW_LOCK_UNLOCKED __RW_LOCK_UNLOCKED(rw_old_style) + #define DEFINE_RWLOCK(name) \ rwlock_t name __cacheline_aligned_in_smp = __RW_LOCK_UNLOCKED(name) @@ -236,7 +225,8 @@ do { \ */ #define __RWSEM_INITIALIZER(name) \ - { .lock = __RT_MUTEX_INITIALIZER(name.lock) } + { .lock = __RT_MUTEX_INITIALIZER(name.lock), \ + RW_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(lockname) \ struct rw_semaphore lockname = __RWSEM_INITIALIZER(lockname) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockstat-rt-hooks.patch���������������������������������������������������������������������0000664�0000764�0000764�00000012341�11041657735�015660� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/lockdep.h | 28 ++++++++++++++++++++++++++++ kernel/rt.c | 25 ++++++++++++++++--------- kernel/rtmutex.c | 4 ++-- 3 files changed, 46 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/include/linux/lockdep.h =================================================================== --- linux-2.6.24.7.orig/include/linux/lockdep.h +++ linux-2.6.24.7/include/linux/lockdep.h @@ -361,6 +361,28 @@ do { \ lock_acquired(&(_lock)->dep_map); \ } while (0) +#define LOCK_CONTENDED_RT(_lock, f_try, f_lock) \ +do { \ + if (!f_try(&(_lock)->lock)) { \ 
+ lock_contended(&(_lock)->dep_map, _RET_IP_); \ + f_lock(&(_lock)->lock); \ + lock_acquired(&(_lock)->dep_map); \ + } \ +} while (0) + + +#define LOCK_CONTENDED_RT_RET(_lock, f_try, f_lock) \ +({ \ + int ret = 0; \ + if (!f_try(&(_lock)->lock)) { \ + lock_contended(&(_lock)->dep_map, _RET_IP_); \ + ret = f_lock(&(_lock)->lock); \ + if (!ret) \ + lock_acquired(&(_lock)->dep_map); \ + } \ + ret; \ +}) + #else /* CONFIG_LOCK_STAT */ #define lock_contended(lockdep_map, ip) do {} while (0) @@ -369,6 +391,12 @@ do { \ #define LOCK_CONTENDED(_lock, try, lock) \ lock(_lock) +#define LOCK_CONTENDED_RT(_lock, f_try, f_lock) \ + f_lock(&(_lock)->lock) + +#define LOCK_CONTENDED_RT_RET(_lock, f_try, f_lock) \ + f_lock(&(_lock)->lock) + #endif /* CONFIG_LOCK_STAT */ #if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_GENERIC_HARDIRQS) Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -98,16 +98,22 @@ EXPORT_SYMBOL(_mutex_init); void __lockfunc _mutex_lock(struct mutex *lock) { mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); - rt_mutex_lock(&lock->lock); + LOCK_CONTENDED_RT(lock, rt_mutex_trylock, rt_mutex_lock); } EXPORT_SYMBOL(_mutex_lock); +static int __lockfunc __rt_mutex_lock_interruptible(struct rt_mutex *lock) +{ + return rt_mutex_lock_interruptible(lock, 0); +} + int __lockfunc _mutex_lock_interruptible(struct mutex *lock) { int ret; mutex_acquire(&lock->dep_map, 0, 0, _RET_IP_); - ret = rt_mutex_lock_interruptible(&lock->lock, 0); + ret = LOCK_CONTENDED_RT_RET(lock, rt_mutex_trylock, + __rt_mutex_lock_interruptible); if (ret) mutex_release(&lock->dep_map, 1, _RET_IP_); return ret; @@ -118,7 +124,7 @@ EXPORT_SYMBOL(_mutex_lock_interruptible) void __lockfunc _mutex_lock_nested(struct mutex *lock, int subclass) { mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); - rt_mutex_lock(&lock->lock); + LOCK_CONTENDED_RT(lock, rt_mutex_trylock, rt_mutex_lock); } EXPORT_SYMBOL(_mutex_lock_nested); @@ -127,7 +133,8 @@ int __lockfunc _mutex_lock_interruptible int ret; mutex_acquire(&lock->dep_map, subclass, 0, _RET_IP_); - ret = rt_mutex_lock_interruptible(&lock->lock, 0); + ret = LOCK_CONTENDED_RT_RET(lock, rt_mutex_trylock, + __rt_mutex_lock_interruptible); if (ret) mutex_release(&lock->dep_map, 1, _RET_IP_); return ret; @@ -203,7 +210,7 @@ EXPORT_SYMBOL(rt_read_trylock); void __lockfunc rt_write_lock(rwlock_t *rwlock) { rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); - __rt_spin_lock(&rwlock->lock); + LOCK_CONTENDED_RT(rwlock, rt_mutex_trylock, __rt_spin_lock); } EXPORT_SYMBOL(rt_write_lock); @@ -223,7 +230,7 @@ void __lockfunc rt_read_lock(rwlock_t *r return; } spin_unlock_irqrestore(&lock->wait_lock, flags); - __rt_spin_lock(lock); + LOCK_CONTENDED_RT(rwlock, rt_mutex_trylock, __rt_spin_lock); } EXPORT_SYMBOL(rt_read_lock); @@ -359,14 +366,14 @@ EXPORT_SYMBOL(rt_down_write_trylock); void fastcall rt_down_write(struct rw_semaphore *rwsem) { rwsem_acquire(&rwsem->dep_map, 0, 0, _RET_IP_); - rt_mutex_lock(&rwsem->lock); + LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); } EXPORT_SYMBOL(rt_down_write); void fastcall rt_down_write_nested(struct rw_semaphore *rwsem, int subclass) { rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_); - rt_mutex_lock(&rwsem->lock); + LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); } EXPORT_SYMBOL(rt_down_write_nested); @@ -411,7 +418,7 @@ static void __rt_down_read(struct rw_sem return; } 
spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rt_mutex_lock(&rwsem->lock); + LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); } void fastcall rt_down_read(struct rw_semaphore *rwsem) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -785,8 +785,8 @@ rt_spin_lock_slowunlock(struct rt_mutex void __lockfunc rt_spin_lock(spinlock_t *lock) { - rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); + LOCK_CONTENDED_RT(lock, rt_mutex_trylock, __rt_spin_lock); } EXPORT_SYMBOL(rt_spin_lock); @@ -800,8 +800,8 @@ EXPORT_SYMBOL(__rt_spin_lock); void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass) { - rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock); spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_); + LOCK_CONTENDED_RT(lock, rt_mutex_trylock, __rt_spin_lock); } EXPORT_SYMBOL(rt_spin_lock_nested); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockstat_bounce_rt.patch��������������������������������������������������������������������0000664�0000764�0000764�00000001547�11041657735�016162� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/lockdep.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/lockdep.h =================================================================== --- linux-2.6.24.7.orig/include/linux/lockdep.h +++ linux-2.6.24.7/include/linux/lockdep.h @@ -366,8 +366,8 @@ do { \ if (!f_try(&(_lock)->lock)) { \ lock_contended(&(_lock)->dep_map, _RET_IP_); \ f_lock(&(_lock)->lock); \ - lock_acquired(&(_lock)->dep_map); \ } \ + lock_acquired(&(_lock)->dep_map); \ } while (0) @@ -377,9 +377,9 @@ do { \ if (!f_try(&(_lock)->lock)) { \ lock_contended(&(_lock)->dep_map, _RET_IP_); \ ret = f_lock(&(_lock)->lock); \ - if (!ret) \ - lock_acquired(&(_lock)->dep_map); \ } \ + if (!ret) \ + lock_acquired(&(_lock)->dep_map); \ ret; \ }) ���������������������������������������������������������������������������������������������������������������������������������������������������������patches/RT_utsname.patch����������������������������������������������������������������������������0000664�0000764�0000764�00000002543�11041657730�014347� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- init/Makefile | 2 +- scripts/mkcompile_h | 4 +++- 2 files changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/init/Makefile =================================================================== --- linux-2.6.24.7.orig/init/Makefile +++ linux-2.6.24.7/init/Makefile @@ -30,4 
+30,4 @@ $(obj)/version.o: include/linux/compile. include/linux/compile.h: FORCE @echo ' CHK $@' $(Q)$(CONFIG_SHELL) $(srctree)/scripts/mkcompile_h $@ \ - "$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT)" "$(CC) $(KBUILD_CFLAGS)" + "$(UTS_MACHINE)" "$(CONFIG_SMP)" "$(CONFIG_PREEMPT)" "$(CONFIG_PREEMPT_RT)" "$(CC) $(KBUILD_CFLAGS)" Index: linux-2.6.24.7/scripts/mkcompile_h =================================================================== --- linux-2.6.24.7.orig/scripts/mkcompile_h +++ linux-2.6.24.7/scripts/mkcompile_h @@ -2,7 +2,8 @@ TARGET=$1 ARCH=$2 SMP=$3 PREEMPT=$4 -CC=$5 +PREEMPT_RT=$5 +CC=$6 # If compile.h exists already and we don't own autoconf.h # (i.e. we're not the same user who did make *config), don't @@ -43,6 +44,7 @@ UTS_VERSION="#$VERSION" CONFIG_FLAGS="" if [ -n "$SMP" ] ; then CONFIG_FLAGS="SMP"; fi if [ -n "$PREEMPT" ] ; then CONFIG_FLAGS="$CONFIG_FLAGS PREEMPT"; fi +if [ -n "$PREEMPT_RT" ] ; then CONFIG_FLAGS="$CONFIG_FLAGS RT"; fi UTS_VERSION="$UTS_VERSION $CONFIG_FLAGS $TIMESTAMP" # Truncate to maximum length �������������������������������������������������������������������������������������������������������������������������������������������������������������patches/preempt-rt-no-slub.patch��������������������������������������������������������������������0000664�0000764�0000764�00000000733�11041657731�015742� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- init/Kconfig | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/init/Kconfig =================================================================== --- linux-2.6.24.7.orig/init/Kconfig +++ linux-2.6.24.7/init/Kconfig @@ -647,6 +647,7 @@ config SLAB config SLUB bool "SLUB (Unqueued Allocator)" + depends on !PREEMPT_RT help SLUB is a slab allocator that minimizes cache line usage instead of managing queues of cached objects (SLAB approach). 
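Looking back at lockstat-rt-hooks.patch and lockstat_bounce_rt.patch a few entries above: the LOCK_CONTENDED_RT wrapper they add and then correct boils down to a try-then-block sequence with the lockstat hooks around the slow path. A plain-C sketch of what the macro expands to for a PREEMPT_RT mutex (rt_mutex_trylock(), rt_mutex_lock(), lock_contended() and lock_acquired() are the symbols those patches use; example_mutex_lock_rt() is only an illustrative name):

static void example_mutex_lock_rt(struct mutex *lock)
{
	if (!rt_mutex_trylock(&lock->lock)) {
		/* fast path failed: record contention, then block */
		lock_contended(&lock->dep_map, _RET_IP_);
		rt_mutex_lock(&lock->lock);
	}
	/* after the bounce fix this fires on every acquisition */
	lock_acquired(&lock->dep_map);
}
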
�������������������������������������patches/paravirt-function-pointer-fix.patch���������������������������������������������������������0000664�0000764�0000764�00000001702�11041657732�020201� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/paravirt_32.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/paravirt_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/paravirt_32.c +++ linux-2.6.24.7/arch/x86/kernel/paravirt_32.c @@ -407,6 +407,16 @@ struct pv_apic_ops pv_apic_ops = { #endif }; +#ifdef CONFIG_HIGHPTE +/* + * kmap_atomic() might be an inline or a macro: + */ +static void *kmap_atomic_func(struct page *page, enum km_type idx) +{ + return kmap_atomic(page, idx); +} +#endif + struct pv_mmu_ops pv_mmu_ops = { .pagetable_setup_start = native_pagetable_setup_start, .pagetable_setup_done = native_pagetable_setup_done, @@ -434,7 +444,7 @@ struct pv_mmu_ops pv_mmu_ops = { .pte_update_defer = paravirt_nop, #ifdef CONFIG_HIGHPTE - .kmap_atomic_pte = kmap_atomic, + .kmap_atomic_pte = kmap_atomic_func, #endif #ifdef CONFIG_X86_PAE ��������������������������������������������������������������patches/quicklist-release-before-free-page.patch����������������������������������������������������0000664�0000764�0000764�00000015250�11041657731�021005� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From peterz@infradead.org Mon Jul 23 21:40:44 2007 Return-Path: <mingo@elte.hu> X-Spam-Checker-Version: SpamAssassin 3.1.7-deb (2006-10-05) on debian X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=ham version=3.1.7-deb Received: from mx2.mail.elte.hu (mx2.mail.elte.hu [157.181.151.9]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.tglx.de (Postfix) with ESMTP id CAC4B65C003 for <tglx@linutronix.de>; Mon, 23 Jul 2007 21:40:44 +0200 (CEST) Received: from elvis.elte.hu ([157.181.1.14]) by mx2.mail.elte.hu with esmtp (Exim) id 1ID3lr-0000tI-MW from <mingo@elte.hu> for <tglx@linutronix.de>; Mon, 23 Jul 2007 21:40:43 +0200 Received: by elvis.elte.hu (Postfix, from userid 1004) id 1D9593E2153; Mon, 23 Jul 2007 21:40:43 +0200 (CEST) Resent-From: Ingo Molnar <mingo@elte.hu> Resent-Date: Mon, 23 Jul 2007 21:40:40 +0200 Resent-Message-ID: <20070723194040.GA7831@elte.hu> Resent-To: Thomas Gleixner <tglx@linutronix.de> X-Original-To: mingo@elvis.elte.hu Delivered-To: mingo@elvis.elte.hu Received: from mx3.mail.elte.hu (mx3.mail.elte.hu [157.181.1.138]) by elvis.elte.hu (Postfix) with ESMTP id 03EA13E214E for <mingo@elvis.elte.hu>; Mon, 23 Jul 2007 18:33:06 +0200 (CEST) Received: from pentafluge.infradead.org ([213.146.154.40]) by mx3.mail.elte.hu with esmtp (Exim) id 1ID0qK-0003mK-9A from <peterz@infradead.org> for <mingo@elte.hu>; Mon, 23 Jul 2007 18:33:08 +0200 Received: from 
i55087.upc-i.chello.nl ([62.195.55.87] helo=[192.168.0.111]) by pentafluge.infradead.org with esmtpsa (Exim 4.63 #1 (Red Hat Linux)) id 1ID0qB-0003Kf-Tf; Mon, 23 Jul 2007 17:33:00 +0100 Subject: Re: [PATCH] release quicklist before free_page From: Peter Zijlstra <peterz@infradead.org> To: Daniel Walker <dwalker@mvista.com> Cc: mingo@elte.hu, paulmck@linux.vnet.ibm.com, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org In-Reply-To: <20070723152129.036573829@mvista.com> References: <20070723152129.036573829@mvista.com> Content-Type: text/plain Date: Mon, 23 Jul 2007 18:32:58 +0200 Message-Id: <1185208378.8197.20.camel@twins> Mime-Version: 1.0 X-Mailer: Evolution 2.10.1 X-ELTE-VirusStatus: clean X-ELTE-SpamScore: -1.0 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-1.0 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.0.3 -1.0 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] Received-SPF: softfail (mx2: transitioning domain of elte.hu does not designate 157.181.1.14 as permitted sender) client-ip=157.181.1.14; envelope-from=mingo@elte.hu; helo=elvis.elte.hu; X-ELTE-VirusStatus: clean X-Evolution-Source: imap://tglx%40linutronix.de@localhost:8993/ Content-Transfer-Encoding: 8bit On Mon, 2007-07-23 at 08:21 -0700, Daniel Walker wrote: > Resolves, > > BUG: sleeping function called from invalid context cc1(29651) at kernel/rtmutex.c:636 > in_atomic():1 [00000001], irqs_disabled():0 > [<c0119f50>] __might_sleep+0xf3/0xf9 > [<c031600e>] __rt_spin_lock+0x21/0x3c > [<c014102c>] get_zone_pcp+0x20/0x29 > [<c0141a40>] free_hot_cold_page+0xdc/0x167 > [<c013a3f4>] add_preempt_count+0x12/0xcc > [<c0110d92>] pgd_dtor+0x0/0x1 > [<c015d865>] quicklist_trim+0xb7/0xe3 > [<c0111025>] check_pgt_cache+0x19/0x1c > [<c0148df5>] free_pgtables+0x54/0x12c > [<c013a3f4>] add_preempt_count+0x12/0xcc > [<c014e5be>] unmap_region+0xeb/0x13b > > > It looks like the quicklist isn't used after a few variables are evaluated. > So no need to keep preemption disabled over the whole function. Not quite, it uses preempt_disable() to avoid migration and stick to a cpu. Without that it might end up freeing pages from another quicklist. How about this - compile tested only --- We cannot call the page allocator with preemption-disabled, use the per_cpu_locked construct to allow preemption while guarding the per cpu data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/quicklist.h | 19 +++++++++++++++---- mm/quicklist.c | 9 +++++---- 2 files changed, 20 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/include/linux/quicklist.h =================================================================== --- linux-2.6.24.7.orig/include/linux/quicklist.h +++ linux-2.6.24.7/include/linux/quicklist.h @@ -18,7 +18,7 @@ struct quicklist { int nr_pages; }; -DECLARE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK]; +DECLARE_PER_CPU_LOCKED(struct quicklist, quicklist)[CONFIG_NR_QUICK]; /* * The two key functions quicklist_alloc and quicklist_free are inline so @@ -30,19 +30,30 @@ DECLARE_PER_CPU(struct quicklist, quickl * The fast patch in quicklist_alloc touched only a per cpu cacheline and * the first cacheline of the page itself. There is minmal overhead involved. 
*/ -static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *)) +static inline void *__quicklist_alloc(int cpu, int nr, gfp_t flags, void (*ctor)(void *)) { struct quicklist *q; void **p = NULL; - q =&get_cpu_var(quicklist)[nr]; + q = &__get_cpu_var_locked(quicklist, cpu)[nr]; p = q->page; if (likely(p)) { q->page = p[0]; p[0] = NULL; q->nr_pages--; } - put_cpu_var(quicklist); + return p; +} + +static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *)) +{ + struct quicklist *q; + void **p = NULL; + int cpu; + + (void)get_cpu_var_locked(quicklist, &cpu)[nr]; + p = __quicklist_alloc(cpu, nr, flags, ctor); + put_cpu_var_locked(quicklist, cpu); if (likely(p)) return p; Index: linux-2.6.24.7/mm/quicklist.c =================================================================== --- linux-2.6.24.7.orig/mm/quicklist.c +++ linux-2.6.24.7/mm/quicklist.c @@ -19,7 +19,7 @@ #include <linux/module.h> #include <linux/quicklist.h> -DEFINE_PER_CPU(struct quicklist, quicklist)[CONFIG_NR_QUICK]; +DEFINE_PER_CPU_LOCKED(struct quicklist, quicklist)[CONFIG_NR_QUICK]; #define FRACTION_OF_NODE_MEM 16 @@ -59,8 +59,9 @@ void quicklist_trim(int nr, void (*dtor) { long pages_to_free; struct quicklist *q; + int cpu; - q = &get_cpu_var(quicklist)[nr]; + q = &get_cpu_var_locked(quicklist, &cpu)[nr]; if (q->nr_pages > min_pages) { pages_to_free = min_pages_to_free(q, min_pages, max_free); @@ -69,7 +70,7 @@ void quicklist_trim(int nr, void (*dtor) * We pass a gfp_t of 0 to quicklist_alloc here * because we will never call into the page allocator. */ - void *p = quicklist_alloc(nr, 0, NULL); + void *p = __quicklist_alloc(cpu, nr, 0, NULL); if (dtor) dtor(p); @@ -77,7 +78,7 @@ void quicklist_trim(int nr, void (*dtor) pages_to_free--; } } - put_cpu_var(quicklist); + put_cpu_var_locked(quicklist, cpu); } unsigned long quicklist_total_size(void) ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/quicklist-release-before-free-page-fix.patch������������������������������������������������0000664�0000764�0000764�00000005102�11041657734�021567� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/quicklist.h | 18 ++++++++---------- mm/quicklist.c | 8 ++------ 2 files changed, 10 insertions(+), 16 deletions(-) Index: linux-2.6.24.7/include/linux/quicklist.h =================================================================== --- linux-2.6.24.7.orig/include/linux/quicklist.h +++ linux-2.6.24.7/include/linux/quicklist.h @@ -30,13 +30,10 @@ DECLARE_PER_CPU_LOCKED(struct quicklist, * The fast patch in quicklist_alloc touched only a per cpu cacheline and * the first cacheline of the page itself. There is minmal overhead involved. 
 */
-static inline void *__quicklist_alloc(int cpu, int nr, gfp_t flags, void (*ctor)(void *))
+static inline void *__quicklist_alloc(struct quicklist *q)
 {
-	struct quicklist *q;
-	void **p = NULL;
+	void **p = q->page;
 
-	q = &__get_cpu_var_locked(quicklist, cpu)[nr];
-	p = q->page;
 	if (likely(p)) {
 		q->page = p[0];
 		p[0] = NULL;
@@ -48,11 +45,11 @@ static inline void *__quicklist_alloc(in
 static inline void *quicklist_alloc(int nr, gfp_t flags, void (*ctor)(void *))
 {
 	struct quicklist *q;
-	void **p = NULL;
+	void **p;
 	int cpu;
 
-	(void)get_cpu_var_locked(quicklist, &cpu)[nr];
-	p = __quicklist_alloc(cpu, nr, flags, ctor);
+	q = &get_cpu_var_locked(quicklist, &cpu)[nr];
+	p = __quicklist_alloc(q);
 	put_cpu_var_locked(quicklist, cpu);
 	if (likely(p))
 		return p;
@@ -67,12 +64,13 @@ static inline void __quicklist_free(int
 				    struct page *page)
 {
 	struct quicklist *q;
+	int cpu;
 
-	q = &get_cpu_var(quicklist)[nr];
+	q = &get_cpu_var_locked(quicklist, &cpu)[nr];
 	*(void **)p = q->page;
 	q->page = p;
 	q->nr_pages++;
-	put_cpu_var(quicklist);
+	put_cpu_var_locked(quicklist, cpu);
 }
 
 static inline void quicklist_free(int nr, void (*dtor)(void *), void *pp)
Index: linux-2.6.24.7/mm/quicklist.c
===================================================================
--- linux-2.6.24.7.orig/mm/quicklist.c
+++ linux-2.6.24.7/mm/quicklist.c
@@ -66,11 +66,7 @@ void quicklist_trim(int nr, void (*dtor)
 		pages_to_free = min_pages_to_free(q, min_pages, max_free);
 
 		while (pages_to_free > 0) {
-			/*
-			 * We pass a gfp_t of 0 to quicklist_alloc here
-			 * because we will never call into the page allocator.
-			 */
-			void *p = __quicklist_alloc(cpu, nr, 0, NULL);
+			void *p = __quicklist_alloc(q);
 
 			if (dtor)
 				dtor(p);
@@ -88,7 +84,7 @@ unsigned long quicklist_total_size(void)
 	struct quicklist *ql, *q;
 
 	for_each_online_cpu(cpu) {
-		ql = per_cpu(quicklist, cpu);
+		ql = per_cpu_var_locked(quicklist, cpu);
 		for (q = ql; q < ql + CONFIG_NR_QUICK; q++)
 			count += q->nr_pages;
 	}

patches/disable-lpptest-on-nonlinux.patch

Sadly people keep wanting to build kernels on non-Linux hosts (cygwin &
solaris) and testlpp really doesn't like to build on those. I have a
separate patch to testlpp.c that fixes this, but it really makes no sense
to build the tool to run on your cygwin host as it's meant to be run on
Linux with the testlpp module loaded.

Even this patch isn't really the right solution b/c you may be
cross-building for another architecture from Linux, in which case you want
a cross-compile, not a host compile, but there's no really easy way to
cross-compile a userland binary from the kernel build w/o some makefile
ugliness AFAICT.
Is there some sort of -rt userland package this could move to instead of being in the kernel itself...? Signed-off-by: Deepak Saxena <dsaxena@mvista.com> --- scripts/Makefile | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/scripts/Makefile =================================================================== --- linux-2.6.24.7.orig/scripts/Makefile +++ linux-2.6.24.7/scripts/Makefile @@ -12,9 +12,12 @@ hostprogs-$(CONFIG_LOGO) += pnmt hostprogs-$(CONFIG_VT) += conmakehash hostprogs-$(CONFIG_PROM_CONSOLE) += conmakehash hostprogs-$(CONFIG_IKCONFIG) += bin2c +HOST_OS := $(shell uname) +ifeq ($(HOST_OS),Linux) ifdef CONFIG_LPPTEST hostprogs-y += testlpp endif +endif always := $(hostprogs-y) $(hostprogs-m) ������������������������������������������������������������������������������������������������������������������������patches/sched-rt-stats.patch������������������������������������������������������������������������0000664�0000764�0000764�00000002447�11041657730�015136� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������On Wed, Jul 25, 2007 at 10:05:04AM +0200, Ingo Molnar wrote: > > * Ankita Garg <ankita@in.ibm.com> wrote: > > > Hi, > > > > This patch adds support to display captured -rt stats under > > /proc/schedstat. > > hm, could you add it to /proc/sched_debug instead? That's where all the > runqueue values are showing up normally. I'm also a bit wary about > introducing a new schedstats version for -rt. So, I have merged my previous patch (to display rt_nr_running info in sched_debug.c) with this one. Signed-off-by: Ankita Garg <ankita@in.ibm.com> [mingo@elte.hu: fix it to work on !SCHEDSTATS too] Signed-off-by: Ingo Molnar <mingo@elte.hu> -- kernel/sched_debug.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) Index: linux-2.6.24.7/kernel/sched_debug.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_debug.c +++ linux-2.6.24.7/kernel/sched_debug.c @@ -186,6 +186,19 @@ static void print_cpu(struct seq_file *m P(cpu_load[2]); P(cpu_load[3]); P(cpu_load[4]); +#ifdef CONFIG_PREEMPT_RT + /* Print rt related rq stats */ + P(rt.rt_nr_running); + P(rt.rt_nr_uninterruptible); +# ifdef CONFIG_SCHEDSTATS + P(rto_schedule); + P(rto_schedule_tail); + P(rto_wakeup); + P(rto_pulled); + P(rto_pushed); +# endif +#endif + #undef P #undef PN �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mitigate-resched-flood.patch����������������������������������������������������������������0000664�0000764�0000764�00000011176�11041657731�016610� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������[PATCH 1/3] mitigate-resched-interrupt-floods Mitigate rescheduling interrupt floods. 
Background: preempt-rt sends a resched interrupt to all other cpus whenever some realtime task gets preempted. This is to give that task a chance to continue running on some other cpu. Unfortunately this can cause 'resched interrupt floods' when there are large numbers of realtime tasks on the system that are continually being preempted. This patch reduces such interrupts by noting that it is not necessary to send rescheduling interrupts to every cpu in the system, just to those cpus in the affinity mask of the task to be migrated. This works well in the real world, as traditionally realtime tasks are carefully targeted to specific cpus or sets of cpus, meaning users often give such tasks reduced affinity masks. Signed-off-by: Joe Korty <joe.korty@ccur.com> --- arch/x86/kernel/smp_32.c | 9 +++++++++ arch/x86/kernel/smp_64.c | 9 +++++++++ include/asm-x86/smp_32.h | 2 ++ include/asm-x86/smp_64.h | 3 +++ include/linux/smp.h | 10 ++++++++++ 5 files changed, 33 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/smp_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smp_32.c +++ linux-2.6.24.7/arch/x86/kernel/smp_32.c @@ -18,6 +18,7 @@ #include <linux/cache.h> #include <linux/interrupt.h> #include <linux/cpu.h> +#include <linux/cpumask.h> #include <linux/module.h> #include <asm/mtrr.h> @@ -485,6 +486,14 @@ void smp_send_reschedule_allbutself(void send_IPI_allbutself(RESCHEDULE_VECTOR); } +void smp_send_reschedule_allbutself_cpumask(cpumask_t mask) +{ + cpu_clear(smp_processor_id(), mask); + cpus_and(mask, mask, cpu_online_map); + if (!cpus_empty(mask)) + send_IPI_mask(mask, RESCHEDULE_VECTOR); +} + /* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. Index: linux-2.6.24.7/arch/x86/kernel/smp_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smp_64.c +++ linux-2.6.24.7/arch/x86/kernel/smp_64.c @@ -15,6 +15,7 @@ #include <linux/delay.h> #include <linux/spinlock.h> #include <linux/smp.h> +#include <linux/cpumask.h> #include <linux/kernel_stat.h> #include <linux/mc146818rtc.h> #include <linux/interrupt.h> @@ -305,6 +306,14 @@ void smp_send_reschedule_allbutself(void send_IPI_allbutself(RESCHEDULE_VECTOR); } +void smp_send_reschedule_allbutself_cpumask(cpumask_t mask) +{ + cpu_clear(smp_processor_id(), mask); + cpus_and(mask, mask, cpu_online_map); + if (!cpus_empty(mask)) + send_IPI_mask(mask, RESCHEDULE_VECTOR); +} + /* * Structure and data for smp_call_function(). This is designed to minimise * static memory requirements. It also looks cleaner. 
Index: linux-2.6.24.7/include/asm-x86/smp_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/smp_32.h +++ linux-2.6.24.7/include/asm-x86/smp_32.h @@ -181,4 +181,6 @@ static __inline int logical_smp_processo #endif #endif +#define HAVE_RESCHEDULE_ALLBUTSELF_CPUMASK 1 + #endif Index: linux-2.6.24.7/include/asm-x86/smp_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/smp_64.h +++ linux-2.6.24.7/include/asm-x86/smp_64.h @@ -126,5 +126,8 @@ static __inline int logical_smp_processo extern unsigned int boot_cpu_id; #define cpu_physical_id(cpu) boot_cpu_id #endif /* !CONFIG_SMP */ + +#define HAVE_RESCHEDULE_ALLBUTSELF_CPUMASK 1 + #endif Index: linux-2.6.24.7/include/linux/smp.h =================================================================== --- linux-2.6.24.7.orig/include/linux/smp.h +++ linux-2.6.24.7/include/linux/smp.h @@ -7,6 +7,7 @@ */ #include <linux/errno.h> +#include <linux/cpumask.h> extern void cpu_idle(void); @@ -43,6 +44,14 @@ extern void smp_send_reschedule_allbutse */ extern void smp_send_reschedule_allbutself(void); +#ifdef HAVE_RESCHEDULE_ALLBUTSELF_CPUMASK +extern void smp_send_reschedule_allbutself_cpumask(cpumask_t); +#else +static inline void smp_send_reschedule_allbutself_cpumask(cpumask_t mask) { + smp_send_reschedule_allbutself(); +} +#endif + /* * Prepare machine for booting other CPUs. @@ -109,6 +118,7 @@ static inline int up_smp_call_function(v }) static inline void smp_send_reschedule(int cpu) { } static inline void smp_send_reschedule_allbutself(void) { } +static inline void smp_send_reschedule_allbutself_cpumask(cpumask_t mask) { } #define num_booting_cpus() 1 #define smp_prepare_boot_cpu() do {} while (0) #define smp_call_function_single(cpuid, func, info, retry, wait) \ ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/genirq-soft-resend.patch��������������������������������������������������������������������0000664�0000764�0000764�00000002531�11041657735�016004� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: x86: activate HARDIRQS_SW_RESEND From: Ingo Molnar <mingo@elte.hu> activate the software-triggered IRQ-resend logic. it appears some chipsets/cpus do not handle local-APIC driven IRQ resends all that well, so always use the soft mechanism to trigger the execution of pending interrupts. 
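As an aside, the enable_irq() hunk below relies on re-enabling bottom halves
running whatever softirq work became pending in the meantime; the software
resend is serviced from softirq context, so a bare disable/enable pair acts
as a flush. A minimal sketch of that idiom (the function name is invented;
assumes process context):

#include <linux/interrupt.h>

/*
 * Sketch only: in process context, local_bh_enable() executes any
 * softirqs raised while bottom halves were disabled, so this empty
 * pair flushes pending softirq work -- here, the software-triggered
 * IRQ resend.
 */
static void flush_pending_softirqs(void)
{
	local_bh_disable();
	local_bh_enable();
}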
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- arch/x86/Kconfig | 4 ++++ kernel/irq/manage.c | 8 ++++++++ 2 files changed, 12 insertions(+) Index: linux-2.6.24.7/arch/x86/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/x86/Kconfig +++ linux-2.6.24.7/arch/x86/Kconfig @@ -1230,6 +1230,10 @@ config OUT_OF_LINE_PFN_TO_PAGE def_bool X86_64 depends on DISCONTIGMEM +config HARDIRQS_SW_RESEND + bool + default y + menu "Power management options" depends on !X86_VOYAGER Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -191,6 +191,14 @@ void enable_irq(unsigned int irq) desc->depth--; } spin_unlock_irqrestore(&desc->lock, flags); +#ifdef CONFIG_HARDIRQS_SW_RESEND + /* + * Do a bh disable/enable pair to trigger any pending + * irq resend logic: + */ + local_bh_disable(); + local_bh_enable(); +#endif } EXPORT_SYMBOL(enable_irq); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-hotplug-hackaround.patch��������������������������������������������������������0000664�0000764�0000764�00000000705�11041657733�020327� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/sched.c | 1 - 1 file changed, 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -7076,7 +7076,6 @@ static void detach_destroy_domains(const for_each_cpu_mask(i, *cpu_map) cpu_attach_domain(NULL, &def_root_domain, i); - synchronize_sched(); arch_destroy_sched_domains(cpu_map); } �����������������������������������������������������������patches/relay-fix.patch�����������������������������������������������������������������������������0000664�0000764�0000764�00000003043�11041657730�014162� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: relay: fix timer madness From: Ingo Molnar <mingo@elte.hu> remove timer calls (!!!) from deep within the tracing infrastructure. This was totally bogus code that can cause lockups and worse. Poll the buffer every 2 jiffies for now. 
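The replacement is plain periodic polling: the timer handler wakes any
readers and immediately re-arms itself. A rough sketch of that self-rearming
pattern with the timer API of this kernel generation (the struct and
function names are illustrative, not the relay code itself):

#include <linux/timer.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

/* Illustrative stand-in for the relay channel buffer. */
struct polled_buf {
	wait_queue_head_t	read_wait;
	struct timer_list	timer;
};

static void poll_readers(unsigned long data)
{
	struct polled_buf *buf = (struct polled_buf *)data;

	wake_up_interruptible(&buf->read_wait);
	/* re-arm: check again on the next jiffy */
	mod_timer(&buf->timer, jiffies + 1);
}

static void polled_buf_init(struct polled_buf *buf)
{
	init_waitqueue_head(&buf->read_wait);
	setup_timer(&buf->timer, poll_readers, (unsigned long)buf);
	mod_timer(&buf->timer, jiffies + 1);
}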
Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/relay.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/kernel/relay.c =================================================================== --- linux-2.6.24.7.orig/kernel/relay.c +++ linux-2.6.24.7/kernel/relay.c @@ -320,6 +320,10 @@ static void wakeup_readers(unsigned long { struct rchan_buf *buf = (struct rchan_buf *)data; wake_up_interruptible(&buf->read_wait); + /* + * Stupid polling for now: + */ + mod_timer(&buf->timer, jiffies + 1); } /** @@ -337,6 +341,7 @@ static void __relay_reset(struct rchan_b init_waitqueue_head(&buf->read_wait); kref_init(&buf->kref); setup_timer(&buf->timer, wakeup_readers, (unsigned long)buf); + mod_timer(&buf->timer, jiffies + 1); } else del_timer_sync(&buf->timer); @@ -606,15 +611,6 @@ size_t relay_switch_subbuf(struct rchan_ buf->subbufs_produced++; buf->dentry->d_inode->i_size += buf->chan->subbuf_size - buf->padding[old_subbuf]; - smp_mb(); - if (waitqueue_active(&buf->read_wait)) - /* - * Calling wake_up_interruptible() from here - * will deadlock if we happen to be logging - * from the scheduler (trying to re-grab - * rq->lock), so defer it. - */ - __mod_timer(&buf->timer, jiffies + 1); } old = buf->data; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/schedule_on_each_cpu-enhance.patch����������������������������������������������������������0000664�0000764�0000764�00000010100�11041657730�017770� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������It always bothered me a bit that on_each_cpu() and schedule_on_each_cpu() had wildly different interfaces. Rectify this and convert the sole in-kernel user to the new interface. 
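After the change the caller passes its data pointer and the wait flag
explicitly, mirroring on_each_cpu(). A small usage sketch against the new
prototype (the callback and wrapper names are invented for illustration):

#include <linux/workqueue.h>

/* Runs from keventd on each online CPU, in process context. */
static void drain_local_state(void *info)
{
	/* per-CPU work goes here; 'info' is the cookie passed by the caller */
}

static int drain_all_cpus(void)
{
	/* retry is ignored; wait=1 blocks until every CPU ran the callback */
	return schedule_on_each_cpu(drain_local_state, NULL, 0, 1);
}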
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> --- include/linux/workqueue.h | 2 - kernel/workqueue.c | 63 ++++++++++++++++++++++++++++++++++++++-------- mm/swap.c | 4 +- 3 files changed, 56 insertions(+), 13 deletions(-) Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -196,7 +196,7 @@ extern int FASTCALL(schedule_delayed_wor extern int schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay); extern int schedule_on_each_cpu_wq(struct workqueue_struct *wq, work_func_t func); -extern int schedule_on_each_cpu(work_func_t func); +extern int schedule_on_each_cpu(void (*func)(void *info), void *info, int retry, int wait); extern int current_is_keventd(void); extern int keventd_up(void); Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -594,9 +594,28 @@ int schedule_delayed_work_on(int cpu, } EXPORT_SYMBOL(schedule_delayed_work_on); +struct schedule_on_each_cpu_work { + struct work_struct work; + void (*func)(void *info); + void *info; +}; + +static void schedule_on_each_cpu_func(struct work_struct *work) +{ + struct schedule_on_each_cpu_work *w; + + w = container_of(work, typeof(*w), work); + w->func(w->info); + + kfree(w); +} + /** * schedule_on_each_cpu - call a function on each online CPU from keventd * @func: the function to call + * @info: data to pass to function + * @retry: ignored + * @wait: wait for completion * * Returns zero on success. * Returns -ve errno on failure. @@ -605,27 +624,51 @@ EXPORT_SYMBOL(schedule_delayed_work_on); * * schedule_on_each_cpu() is very slow. 
*/ -int schedule_on_each_cpu(work_func_t func) +int schedule_on_each_cpu(void (*func)(void *info), void *info, int retry, int wait) { int cpu; - struct work_struct *works; + struct schedule_on_each_cpu_work **works; + int err = 0; - works = alloc_percpu(struct work_struct); + works = kzalloc(sizeof(void *)*nr_cpu_ids, GFP_KERNEL); if (!works) return -ENOMEM; + for_each_possible_cpu(cpu) { + works[cpu] = kmalloc_node(sizeof(struct schedule_on_each_cpu_work), + GFP_KERNEL, cpu_to_node(cpu)); + if (!works[cpu]) { + err = -ENOMEM; + goto out; + } + } + preempt_disable(); /* CPU hotplug */ for_each_online_cpu(cpu) { - struct work_struct *work = per_cpu_ptr(works, cpu); + struct schedule_on_each_cpu_work *work; - INIT_WORK(work, func); - set_bit(WORK_STRUCT_PENDING, work_data_bits(work)); - __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work); + work = works[cpu]; + works[cpu] = NULL; + + work->func = func; + work->info = info; + INIT_WORK(&work->work, schedule_on_each_cpu_func); + set_bit(WORK_STRUCT_PENDING, work_data_bits(&work->work)); + __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), &work->work); } preempt_enable(); - flush_workqueue(keventd_wq); - free_percpu(works); - return 0; + +out: + for_each_possible_cpu(cpu) { + if (works[cpu]) + kfree(works[cpu]); + } + kfree(works); + + if (!err && wait) + flush_workqueue(keventd_wq); + + return err; } /** Index: linux-2.6.24.7/mm/swap.c =================================================================== --- linux-2.6.24.7.orig/mm/swap.c +++ linux-2.6.24.7/mm/swap.c @@ -318,7 +318,7 @@ void lru_add_drain(void) } #ifdef CONFIG_NUMA -static void lru_add_drain_per_cpu(struct work_struct *dummy) +static void lru_add_drain_per_cpu(void *info) { lru_add_drain(); } @@ -328,7 +328,7 @@ static void lru_add_drain_per_cpu(struct */ int lru_add_drain_all(void) { - return schedule_on_each_cpu(lru_add_drain_per_cpu); + return schedule_on_each_cpu(lru_add_drain_per_cpu, NULL, 0, 1); } #else ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/schedule_on_each_cpu-enhance-rt.patch�������������������������������������������������������0000664�0000764�0000764�00000001372�11041657735�020433� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/workqueue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -643,7 +643,7 @@ int schedule_on_each_cpu(void (*func)(vo } } - preempt_disable(); /* CPU hotplug */ + lock_cpu_hotplug(); for_each_online_cpu(cpu) { struct schedule_on_each_cpu_work *work; @@ -656,7 +656,7 @@ int schedule_on_each_cpu(void (*func)(vo set_bit(WORK_STRUCT_PENDING, work_data_bits(&work->work)); __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), 
&work->work); } - preempt_enable(); + unlock_cpu_hotplug(); out: for_each_possible_cpu(cpu) { ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep-rt-recursion-limit-fix.patch��������������������������������������������������������0000664�0000764�0000764�00000004441�11041657731�020241� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� OK, I sent this out once before but it must have slipped under the radar. http://lkml.org/lkml/2007/6/28/325 My config fails miserably with lockdep: kernel/lockdep.c: In function 'find_usage_forwards': kernel/lockdep.c:814: error: 'RECURSION_LIMIT' undeclared (first use in this function) kernel/lockdep.c:814: error: (Each undeclared identifier is reported only once kernel/lockdep.c:814: error: for each function it appears in.) kernel/lockdep.c:815: warning: implicit declaration of function 'print_infinite_recursion_bug' kernel/lockdep.c: In function 'find_usage_backwards': kernel/lockdep.c:856: error: 'RECURSION_LIMIT' undeclared (first use in this function) make[1]: *** [kernel/lockdep.o] Error 1 But this patch fixes it nicely. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- kernel/lockdep.c | 29 +++++++++++++++++------------ 1 file changed, 17 insertions(+), 12 deletions(-) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -817,6 +817,21 @@ out_unlock_set: return class; } +#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_TRACE_IRQFLAGS) + +#define RECURSION_LIMIT 40 + +static int noinline print_infinite_recursion_bug(void) +{ + if (!debug_locks_off_graph_unlock()) + return 0; + + WARN_ON(1); + + return 0; +} +#endif /* CONFIG_PROVE_LOCKING || CONFIG_TRACE_IRQFLAGS */ + #ifdef CONFIG_PROVE_LOCKING /* * Allocate a lockdep entry. (assumes the graph_lock held, returns @@ -947,18 +962,6 @@ static noinline int print_circular_bug_t return 0; } -#define RECURSION_LIMIT 40 - -static int noinline print_infinite_recursion_bug(void) -{ - if (!debug_locks_off_graph_unlock()) - return 0; - - WARN_ON(1); - - return 0; -} - /* * Prove that the dependency graph starting at <entry> can not * lead to <target>. Print an error and return 0 if it does. 
@@ -1076,6 +1079,7 @@ find_usage_backwards(struct lock_class * return 1; } +#ifdef CONFIG_PROVE_LOCKING static int print_bad_irq_dependency(struct task_struct *curr, struct held_lock *prev, @@ -1136,6 +1140,7 @@ print_bad_irq_dependency(struct task_str return 0; } +#endif /* CONFIG_PROVE_LOCKING */ static int check_usage(struct task_struct *curr, struct held_lock *prev, �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cond_resched_softirq-WARN-fix.patch���������������������������������������������������������0000664�0000764�0000764�00000001434�11041657733�020007� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [BUG RT] WARNING: at kernel/sched.c:5071 2.6.23-rc1-rt7 From: Steven Rostedt <rostedt@goodmis.org> The below ifndef, shouldn't that be ifndef CONFIG_PREEMPT_SOFTIRQS ? I hit that warning while I was running !PREEMPT_RT but with both hard and softiqs as threads. --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -5071,7 +5071,7 @@ EXPORT_SYMBOL(__cond_resched_spinlock); */ int __sched cond_resched_softirq(void) { -#ifndef CONFIG_PREEMPT_RT +#ifndef CONFIG_PREEMPT_SOFTIRQS WARN_ON_ONCE(!in_softirq()); #endif if (need_resched() && system_state == SYSTEM_RUNNING) { ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/irq-mask-fix.patch��������������������������������������������������������������������������0000664�0000764�0000764�00000006576�11041657735�014615� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: genirq: fix simple and fasteoi irq handlers From: Jarek Poplawski <jarkao2@o2.pl> After the "genirq: do not mask interrupts by default" patch interrupts should be disabled not immediately upon request, but after they happen. But, handle_simple_irq() and handle_fasteoi_irq() can skip this once or more if an irq is just serviced (IRQ_INPROGRESS), possibly disrupting a driver's work. The main reason of problems here, pointing the broken patch and making the first patch which can fix this was done by Marcin Slusarz. Additional test patches of Thomas Gleixner and Ingo Molnar tested by Marcin Slusarz helped to narrow possible reasons even more. Thanks. PS: this patch fixes only one evident error here, but there could be more places affected by above-mentioned change in irq handling. PS 2: After rethinking, IMHO, there are two most probable scenarios here: 1. 
After hw resend there could be a conflict between retriggered edge type irq and the next level type one: e.g. if this level type irq (io_apic is enabled then) is triggered while retriggered irq is serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably the next such levels are triggered and looping, so probably kind of flood in io_apic until this retriggered edge service has ended. 2. There is something wrong with ioapic_retrigger_irq (less probable because this should be probably seen with 'normal' edge retriggers, but on the other hand, they could be less common). So, if there is #1, this fixed patch should work. But, since level types don't need this retriggers too much I think this "don't mask interrupts by default" idea should be rethinked: is there enough gain to risk such hard to diagnose errors? So, IMHO, there should be at least possibility to turn this off for level types in config (it should be a visible option, so people could find & try this before writing for help or changing a network card). Signed-off-by: Jarek Poplawski <jarkao2@o2.pl> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/irq/chip.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/kernel/irq/chip.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/chip.c +++ linux-2.6.24.7/kernel/irq/chip.c @@ -340,6 +340,8 @@ handle_simple_irq(unsigned int irq, stru spin_lock(&desc->lock); desc->status &= ~IRQ_INPROGRESS; + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); out_unlock: spin_unlock(&desc->lock); } @@ -418,18 +420,16 @@ handle_fasteoi_irq(unsigned int irq, str spin_lock(&desc->lock); - if (unlikely(desc->status & IRQ_INPROGRESS)) - goto out; - desc->status &= ~(IRQ_REPLAY | IRQ_WAITING); kstat_cpu(cpu).irqs[irq]++; /* - * If its disabled or no action available + * If it's running, disabled or no action available * then mask it and get out of here: */ action = desc->action; - if (unlikely(!action || (desc->status & IRQ_DISABLED))) { + if (unlikely(!action || (desc->status & (IRQ_INPROGRESS | + IRQ_DISABLED)))) { desc->status |= IRQ_PENDING; if (desc->chip->mask) desc->chip->mask(irq); @@ -455,6 +455,8 @@ handle_fasteoi_irq(unsigned int irq, str spin_lock(&desc->lock); desc->status &= ~IRQ_INPROGRESS; + if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask) + desc->chip->unmask(irq); out: desc->chip->eoi(irq); spin_unlock(&desc->lock); ����������������������������������������������������������������������������������������������������������������������������������patches/export-schedule-on-each-cpu.patch�����������������������������������������������������������0000664�0000764�0000764�00000000654�11041657735�017504� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� --- kernel/workqueue.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -670,6 +670,7 @@ out: return err; } +EXPORT_SYMBOL(schedule_on_each_cpu); /** * schedule_on_each_cpu_wq - call a function on each online CPU on a per-CPU wq 
������������������������������������������������������������������������������������patches/powerpc-rearrange-thread-flags-to-work-with-andi-instruction.patch��������������������������0000664�0000764�0000764�00000003676�11041657735�026126� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tony@bakeyournoodle.com Wed Sep 26 10:25:29 2007 Date: Tue, 04 Sep 2007 17:09:02 +1000 From: Tony Breeds <tony@bakeyournoodle.com> To: linux-rt-users@vger.kernel.org Subject: [PATCH 1/5] [POWERPC] Rearrange thread flags to work with the "andi" instruction. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> --- include/asm-powerpc/thread_info.h | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/asm-powerpc/thread_info.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/thread_info.h +++ linux-2.6.24.7/include/asm-powerpc/thread_info.h @@ -121,11 +121,11 @@ static inline struct thread_info *curren #define TIF_RESTOREALL 11 /* Restore all regs (implies NOERROR) */ #define TIF_NOERROR 12 /* Force successful syscall return */ #define TIF_RESTORE_SIGMASK 13 /* Restore signal mask in do_signal */ -#define TIF_FREEZE 14 /* Freezing for suspend */ -#define TIF_RUNLATCH 15 /* Is the runlatch enabled? */ -#define TIF_ABI_PENDING 16 /* 32/64 bit switch needed */ #define TIF_NEED_RESCHED_DELAYED \ - 17 /* reschedule on return to userspace */ + 14 /* reschedule on return to userspace */ +#define TIF_FREEZE 15 /* Freezing for suspend */ +#define TIF_RUNLATCH 16 /* Is the runlatch enabled? */ +#define TIF_ABI_PENDING 17 /* 32/64 bit switch needed */ /* as above, but as bit values */ @@ -142,10 +142,10 @@ static inline struct thread_info *curren #define _TIF_RESTOREALL (1<<TIF_RESTOREALL) #define _TIF_NOERROR (1<<TIF_NOERROR) #define _TIF_RESTORE_SIGMASK (1<<TIF_RESTORE_SIGMASK) +#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_FREEZE (1<<TIF_FREEZE) #define _TIF_RUNLATCH (1<<TIF_RUNLATCH) #define _TIF_ABI_PENDING (1<<TIF_ABI_PENDING) -#define _TIF_NEED_RESCHED_DELAYED (1<<TIF_NEED_RESCHED_DELAYED) #define _TIF_SYSCALL_T_OR_A (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP) ������������������������������������������������������������������patches/powerpc-count_active_rt_tasks-is-undefined-for-non-preempt-rt.patch�������������������������0000664�0000764�0000764�00000003215�11041657731�026357� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tony@bakeyournoodle.com Wed Sep 26 10:26:59 2007 Date: Tue, 04 Sep 2007 17:09:02 +1000 From: Tony Breeds <tony@bakeyournoodle.com> To: linux-rt-users@vger.kernel.org Subject: [PATCH 2/5] count_active_rt_tasks() is undefined when CONFIG_PREEMPT_RT is not set. Also, it looks to me that active_rt_tasks[] was never modified. 
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> --- kernel/timer.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -973,21 +973,25 @@ unsigned long avenrun_rt[3]; static inline void calc_load(unsigned long ticks) { unsigned long active_tasks; /* fixed-point */ - unsigned long active_rt_tasks; /* fixed-point */ static int count = LOAD_FREQ; +#ifdef CONFIG_PREEMPT_RT + unsigned long active_rt_tasks; /* fixed-point */ +#endif count -= ticks; if (unlikely(count < 0)) { active_tasks = count_active_tasks(); +#ifdef CONFIG_PREEMPT_RT active_rt_tasks = count_active_rt_tasks(); +#endif do { CALC_LOAD(avenrun[0], EXP_1, active_tasks); CALC_LOAD(avenrun[1], EXP_5, active_tasks); CALC_LOAD(avenrun[2], EXP_15, active_tasks); #ifdef CONFIG_PREEMPT_RT - CALC_LOAD(avenrun_rt[0], EXP_1, active_tasks); - CALC_LOAD(avenrun_rt[1], EXP_5, active_tasks); - CALC_LOAD(avenrun_rt[2], EXP_15, active_tasks); + CALC_LOAD(avenrun_rt[0], EXP_1, active_rt_tasks); + CALC_LOAD(avenrun_rt[1], EXP_5, active_rt_tasks); + CALC_LOAD(avenrun_rt[2], EXP_15, active_rt_tasks); #endif count += LOAD_FREQ; �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/powerpc-match-__rw_yield-function-declaration-to-prototype.patch����������������������������0000664�0000764�0000764�00000001634�11041657735�025747� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tony@bakeyournoodle.com Wed Sep 26 10:29:23 2007 Date: Tue, 04 Sep 2007 17:09:02 +1000 From: Tony Breeds <tony@bakeyournoodle.com> To: linux-rt-users@vger.kernel.org Subject: [PATCH 3/5] [POWERPC] Match __rw_yeild function declaration to prototype. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> --- arch/powerpc/lib/locks.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/lib/locks.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/lib/locks.c +++ linux-2.6.24.7/arch/powerpc/lib/locks.c @@ -55,7 +55,7 @@ void __spin_yield(__raw_spinlock_t *lock * This turns out to be the same for read and write locks, since * we only know the holder if it is write-locked. 
*/ -void __rw_yield(raw_rwlock_t *rw) +void __rw_yield(__raw_rwlock_t *rw) { int lock_value; unsigned int holder_cpu, yield_count; ����������������������������������������������������������������������������������������������������patches/powerpc-flush_tlb_pending-is-no-more.patch��������������������������������������������������0000664�0000764�0000764�00000001560�11041657733�021415� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tony@bakeyournoodle.com Wed Sep 26 10:31:40 2007 Date: Tue, 04 Sep 2007 17:09:02 +1000 From: Tony Breeds <tony@bakeyournoodle.com> To: linux-rt-users@vger.kernel.org Subject: [PATCH 5/5] [POWERPC] flush_tlb_pending() is no more, use __flush_tlb_pending() instead. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> --- arch/powerpc/mm/tlb_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/tlb_64.c +++ linux-2.6.24.7/arch/powerpc/mm/tlb_64.c @@ -215,7 +215,7 @@ void hpte_need_flush(struct mm_struct *m * always flush it on RT to reduce scheduling latency. */ if (machine_is(celleb)) { - flush_tlb_pending(); + __flush_tlb_pending(batch); return; } #endif /* CONFIG_PREEMPT_RT */ ������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-alternate_node_alloc.patch��������������������������������������������������������������0000664�0000764�0000764�00000004704�11041657733�017214� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ak@suse.de Wed Sep 26 10:34:53 2007 Date: Mon, 17 Sep 2007 15:36:59 +0200 From: Andi Kleen <ak@suse.de> To: mingo@elte.hu, Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users@vger.kernel.org Subject: [PATCH] Fix alternate_node_alloc() on RT kernel __do_cache_allow/alternate_node_alloc() need to pass the this_cpu variable from the caller to cache_grow(); otherwise the slab lock for the wrong CPU can be released when a task switches CPUs inside cache_grow(). 
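The underlying bug shape: a callee re-samples the current CPU while the
caller is already holding that CPU's slab lock, so a migration in between
makes lock and unlock disagree. A stripped-down illustration of the
before/after calling convention (all names here are invented, not the slab
internals):

#include <linux/smp.h>

/* Stand-in for the per-CPU allocation work done under the slab lock. */
static void *alloc_from_cpu(int cpu)
{
	(void)cpu;
	return NULL;
}

/*
 * Broken shape: re-reads the CPU id, which may no longer match the
 * CPU whose lock the caller took before calling us.
 */
static void *policy_alloc_broken(void)
{
	int this_cpu = raw_smp_processor_id();

	return alloc_from_cpu(this_cpu);
}

/*
 * Fixed shape: the caller threads the CPU id it locked through the
 * call chain, so lock, allocation and unlock all agree.
 */
static void *policy_alloc_fixed(int *this_cpu)
{
	return alloc_from_cpu(*this_cpu);
}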
Signed-off-by: Andi Kleen <ak@suse.de> --- mm/slab.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -1069,7 +1069,7 @@ cache_free_alien(struct kmem_cache *cach } static inline void *alternate_node_alloc(struct kmem_cache *cachep, - gfp_t flags) + gfp_t flags, int *this_cpu) { return NULL; } @@ -1084,7 +1084,7 @@ static inline void *____cache_alloc_node static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, int *this_cpu); -static void *alternate_node_alloc(struct kmem_cache *, gfp_t); +static void *alternate_node_alloc(struct kmem_cache *, gfp_t, int *); static struct array_cache **alloc_alien_cache(int node, int limit) { @@ -3331,9 +3331,10 @@ ____cache_alloc(struct kmem_cache *cache * If we are in_interrupt, then process context, including cpusets and * mempolicy, may not apply and should not be used for allocation policy. */ -static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags) +static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags, + int *this_cpu) { - int nid_alloc, nid_here, this_cpu = raw_smp_processor_id(); + int nid_alloc, nid_here; if (in_interrupt() || (flags & __GFP_THISNODE)) return NULL; @@ -3343,7 +3344,7 @@ static void *alternate_node_alloc(struct else if (current->mempolicy) nid_alloc = slab_node(current->mempolicy); if (nid_alloc != nid_here) - return ____cache_alloc_node(cachep, flags, nid_alloc, &this_cpu); + return ____cache_alloc_node(cachep, flags, nid_alloc, this_cpu); return NULL; } @@ -3556,7 +3557,7 @@ __do_cache_alloc(struct kmem_cache *cach void *objp; if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) { - objp = alternate_node_alloc(cache, flags); + objp = alternate_node_alloc(cache, flags, this_cpu); if (objp) goto out; } ������������������������������������������������������������patches/fix-compilation-for-non-RT-in-timer.patch���������������������������������������������������0000664�0000764�0000764�00000002166�11041657732�021014� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ak@suse.de Wed Sep 26 10:39:29 2007 Date: Mon, 17 Sep 2007 17:52:37 +0200 From: Andi Kleen <ak@suse.de> To: mingo@elte.hu, Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users@vger.kernel.org Subject: [PATCH] Fix compilation of 2.6.23rc4-rt1 without CONFIG_PREEMPT_RT count_active_rt_tasks() is undefined otherwise. 
Signed-off-by: Andi Kleen <ak@suse.de> --- kernel/timer.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -939,18 +939,20 @@ static unsigned long count_active_tasks( #endif } -#ifdef CONFIG_PREEMPT_RT /* * Nr of active tasks - counted in fixed-point numbers */ static unsigned long count_active_rt_tasks(void) { +#ifdef CONFIG_PREEMPT_RT extern unsigned long rt_nr_running(void); extern unsigned long rt_nr_uninterruptible(void); return (rt_nr_running() + rt_nr_uninterruptible()) * FIXED_1; -} +#else + return 0; #endif +} /* * Hmm.. Changed this, as the GNU make sources (load.c) seems to ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/hack-convert-i_alloc_sem-for-direct_io-craziness.patch��������������������������������������0000664�0000764�0000764�00000004526�11041657730�023643� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From rostedt@goodmis.org Wed Sep 26 11:12:03 2007 Date: Mon, 24 Sep 2007 17:14:26 -0400 (EDT) From: Steven Rostedt <rostedt@goodmis.org> To: LKML <linux-kernel@vger.kernel.org> Cc: linux-rt-users <linux-rt-users@vger.kernel.org>, mingo@goodmis.org, Thomas Gleixner <tglx@linutronix.de> Subject: [HACK] convert i_alloc_sem for direct_io.c craziness! Hopefully I will get some attention from those that are responsible for fs/direct_io.c Ingo and Thomas, This patch converts the i_alloc_sem into a compat_rw_semaphore for the -rt patch. Seems that the code in fs/direct_io.c does some nasty logic with the i_alloc_sem. For DIO_LOCKING, I'm assuming that the i_alloc_sem is used as a reference counter for pending requests. When the request is made, the down_read is performed. When the request is handled by the block softirq, then that softirq does an up on the request. So the owner is not the same between down and up. When all requests are handled, the semaphore counter should be zero. This keeps away any write access while requests are pending. Now this may all be well and dandy for vanilla Linux, but it breaks miserbly when converted to -rt. 1) In RT rw_semaphores must be up'd by the same thread that down's it. 2) We can't do PI on the correct processes. This patch converts (for now) the i_alloc_sem into a compat_rw_semaphore to give back the old features to the sem. This fixes deadlocks that we've been having WRT direct_io. But unfortunately, it now opens up unbonded priority inversion with this semaphore. But really, those that can be affected by this, shouldn't be doing disk IO anyway. The real fix would be to get rid of the read semaphore trickery in direct_io.c. 
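To make the problem concrete: the submit side takes the semaphore for read
once per in-flight request, and the completion side (often the block
softirq, i.e. a different context entirely) drops it, so the rwsem is
really an "outstanding requests" count that keeps writers out. A hedged
sketch of that shape, not the actual fs/direct_io.c code:

#include <linux/rwsem.h>

/* Stands in for inode->i_alloc_sem in this illustration. */
static DECLARE_RWSEM(alloc_sem);

/* Submit path: one read hold per in-flight request blocks writers. */
static void dio_submit_one(void)
{
	down_read(&alloc_sem);
	/* ...queue the request... */
}

/*
 * Completion path: may run from the block softirq, so the up_read()
 * is not done by the task that did the down_read().  An -rt rwsem
 * cannot support that, because it needs a real owner for PI.
 */
static void dio_complete_one(void)
{
	/* ...finish the request... */
	up_read(&alloc_sem);
}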
Signed-off-by: Steve Rostedt <rostedt@goodmis.org> --- include/linux/fs.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -635,7 +635,7 @@ struct inode { umode_t i_mode; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ struct mutex i_mutex; - struct rw_semaphore i_alloc_sem; + struct compat_rw_semaphore i_alloc_sem; const struct inode_operations *i_op; const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ struct super_block *i_sb; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/dont-let-rt-rw_semaphores-do-non_owner-locks.patch������������������������������������������0000664�0000764�0000764�00000010155�11041657732�023026� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From rostedt@goodmis.org Wed Sep 26 11:12:47 2007 Date: Tue, 25 Sep 2007 11:29:51 -0400 (EDT) From: Steven Rostedt <rostedt@goodmis.org> To: Peter Zijlstra <peterz@infradead.org> Cc: LKML <linux-kernel@vger.kernel.org>, linux-rt-users <linux-rt-users@vger.kernel.org>, mingo@goodmis.org, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH RT] Don't let -rt rw_semaphors do _non_owner locks -- On Tue, 25 Sep 2007, Peter Zijlstra wrote: > How about teaching {up,down}_read_non_owner() to barf on rw_semaphore > in -rt? > Sure thing! This patch prevents rw_semaphore in PREEMPT_RT from performing down_read_non_owner and up_read_non_owner. If this must be used, then either convert to a completion or use compat_rw_semaphore. 
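Of the two suggested escapes, a completion is the natural fit whenever the
release happens in a different context than the acquire: completions carry
no owner, so signalling from a softirq or another task is fine. A minimal
sketch of that idea (purely illustrative, not taken from any in-tree
conversion):

#include <linux/completion.h>

static DECLARE_COMPLETION(request_done);

/* Submitter: kick off the work, then sleep until it is signalled. */
static void submit_and_wait(void)
{
	/* ...start the request... */
	wait_for_completion(&request_done);
}

/*
 * Finisher: may run in softirq context or in a different task;
 * complete() has no ownership semantics, unlike an -rt rw_semaphore.
 */
static void request_finished(void)
{
	complete(&request_done);
}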
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- include/linux/rt_lock.h | 15 +++++---------- kernel/rt.c | 45 --------------------------------------------- 2 files changed, 5 insertions(+), 55 deletions(-) Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -241,25 +241,20 @@ do { \ __rt_rwsem_init((sem), #sem, &__key); \ } while (0) +extern void __dont_do_this_in_rt(struct rw_semaphore *rwsem); + +#define rt_down_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem) +#define rt_up_read_non_owner(rwsem) __dont_do_this_in_rt(rwsem) + extern void fastcall rt_down_write(struct rw_semaphore *rwsem); extern void fastcall rt_down_read_nested(struct rw_semaphore *rwsem, int subclass); extern void fastcall rt_down_write_nested(struct rw_semaphore *rwsem, int subclass); extern void fastcall rt_down_read(struct rw_semaphore *rwsem); -#ifdef CONFIG_DEBUG_LOCK_ALLOC -extern void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem); -#else -# define rt_down_read_non_owner(rwsem) rt_down_read(rwsem) -#endif extern int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem); extern int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem); extern void fastcall rt_up_read(struct rw_semaphore *rwsem); -#ifdef CONFIG_DEBUG_LOCK_ALLOC -extern void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem); -#else -# define rt_up_read_non_owner(rwsem) rt_up_read(rwsem) -#endif extern void fastcall rt_up_write(struct rw_semaphore *rwsem); extern void fastcall rt_downgrade_write(struct rw_semaphore *rwsem); Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -324,25 +324,6 @@ void fastcall rt_up_read(struct rw_semap } EXPORT_SYMBOL(rt_up_read); -#ifdef CONFIG_DEBUG_LOCK_ALLOC -void fastcall rt_up_read_non_owner(struct rw_semaphore *rwsem) -{ - unsigned long flags; - /* - * Read locks within the self-held write lock succeed. - */ - spin_lock_irqsave(&rwsem->lock.wait_lock, flags); - if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) { - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rwsem->read_depth--; - return; - } - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rt_mutex_unlock(&rwsem->lock); -} -EXPORT_SYMBOL(rt_up_read_non_owner); -#endif - /* * downgrade a write lock into a read lock * - just wake up any readers at the front of the queue @@ -433,32 +414,6 @@ void fastcall rt_down_read_nested(struct } EXPORT_SYMBOL(rt_down_read_nested); - -#ifdef CONFIG_DEBUG_LOCK_ALLOC - -/* - * Same as rt_down_read() but no lockdep calls: - */ -void fastcall rt_down_read_non_owner(struct rw_semaphore *rwsem) -{ - unsigned long flags; - /* - * Read locks within the write lock succeed. 
- */ - spin_lock_irqsave(&rwsem->lock.wait_lock, flags); - - if (rt_mutex_real_owner(&rwsem->lock) == current) { - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rwsem->read_depth++; - return; - } - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rt_mutex_lock(&rwsem->lock); -} -EXPORT_SYMBOL(rt_down_read_non_owner); - -#endif - void fastcall __rt_rwsem_init(struct rw_semaphore *rwsem, char *name, struct lock_class_key *key) { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-s_files-kill-a-union.patch���������������������������������������������������������������0000664�0000764�0000764�00000001240�11041657732�016625� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������� Remove a dependancy on the size of rcu_head. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/fs.h | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/fs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/fs.h +++ linux-2.6.24.7/include/linux/fs.h @@ -797,11 +797,7 @@ static inline int ra_has_index(struct fi } struct file { - /* - * fu_llist becomes invalid after file_free is called and queued via - * fu_rcuhead for RCU freeing - */ - union { + struct { struct lock_list_head fu_llist; struct rcu_head fu_rcuhead; } f_u; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/loadavg_fixes_weird_loads.patch�������������������������������������������������������������0000664�0000764�0000764�00000001520�11041657734�017453� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������----------> Fixes spurious system load spikes observed in /proc/loadavgrt, as described in: Bug 253103: /proc/loadavgrt issues weird results https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=253103 Signed-off-by: Luis Claudio R. 
Goncalves <lgoncalv@redhat.com>> --- --- kernel/sched.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -2201,6 +2201,13 @@ unsigned long nr_iowait(void) for_each_possible_cpu(i) sum += atomic_read(&cpu_rq(i)->nr_iowait); + /* + * Since we read the counters lockless, it might be slightly + * inaccurate. Do not allow it to go below zero though: + */ + if (unlikely((long)sum < 0)) + sum = 0; + return sum; } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/watchdog_use_timer_and_hpet_on_x86_64.patch�������������������������������������������������0000664�0000764�0000764�00000002113�11041657732�021511� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������This modifies nmi_watchdog_tick behavior for x86_64 arch to consider both timer and hpet IRQs just as the i386 arch does. Signed-off-by: David Bahi <dbahi@novell.com> --- arch/x86/kernel/nmi_64.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -371,7 +371,6 @@ nmi_watchdog_tick(struct pt_regs * regs, touched = 1; } - sum = read_pda(apic_timer_irqs) + read_pda(irq0_irqs); if (__get_cpu_var(nmi_touch)) { __get_cpu_var(nmi_touch) = 0; touched = 1; @@ -387,6 +386,12 @@ nmi_watchdog_tick(struct pt_regs * regs, cpu_clear(cpu, backtrace_mask); } + /* + * Take the local apic timer and PIT/HPET into account. 
We don't + * know which one is active, when we have highres/dyntick on + */ + sum = read_pda(apic_timer_irqs) + kstat_cpu(cpu).irqs[0]; + #ifdef CONFIG_X86_MCE /* Could check oops_in_progress here too, but it's safer not too */ �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/pmtmr-override.patch������������������������������������������������������������������������0000664�0000764�0000764�00000002040�11041657734�015236� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: pmtmr: allow command line override of ioport From: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 21 May 2008 21:14:58 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- drivers/clocksource/acpi_pm.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) Index: linux-2.6.24.7/drivers/clocksource/acpi_pm.c =================================================================== --- linux-2.6.24.7.orig/drivers/clocksource/acpi_pm.c +++ linux-2.6.24.7/drivers/clocksource/acpi_pm.c @@ -215,3 +215,22 @@ pm_good: * but we still need to load before device_initcall */ fs_initcall(init_acpi_pm_clocksource); + +/* + * Allow an override of the IOPort. Stupid BIOSes do not tell us about + * the PMTimer, but we might know where it is. + */ +static int __init parse_pmtmr(char *arg) +{ + unsigned long base; + char *e; + + base = simple_strtoul(arg, &e, 16); + + printk(KERN_INFO "PMTMR IOPort override: 0x%04lx -> 0x%04lx\n", + pmtmr_ioport, base); + pmtmr_ioport = base; + + return 1; +} +__setup("pmtmr=", parse_pmtmr); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/call_rcu_bh-rename-of-call_rcu.patch��������������������������������������������������������0000664�0000764�0000764�00000003504�11041657732�020154� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: [PATCH] just rename call_rcu_bh instead of making it a macro Seems that I found a box that has a config that passes call_rcu_bh as a function pointer (see net/sctp/sm_make_chunk.c), so declaring the call_rcu_bh has a macro function isn't good enough. This patch makes it just another name of call_rcu for rcupreempt. 
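A short sketch of why the rename matters, with stand-in types (the dummy body of call_rcu_preempt and the rcu_queue_fn pointer are inventions for the example; the real prototypes live in rcupdate.h/rcupreempt.h): a function-like macro only expands when the name is followed by an argument list, so code that passes call_rcu_bh around as a bare identifier never sees the expansion, while an object-like rename expands wherever the identifier appears.

/* Stand-in for the real struct in rcupdate.h. */
struct rcu_head { struct rcu_head *next; void (*func)(struct rcu_head *); };

/* Dummy body so the sketch links; the real function queues the callback. */
static void call_rcu_preempt(struct rcu_head *head, void (*func)(struct rcu_head *))
{
        func(head);
}

/*
 * Old form -- expands only when written as call_rcu_bh(head, rcu):
 *
 *      #define call_rcu_bh(head, rcu) call_rcu_preempt(head, rcu)
 *
 * With that definition the assignment below fails to compile, because the
 * bare identifier "call_rcu_bh" is never macro-expanded.
 */

/* New form -- a plain rename, valid wherever the identifier appears: */
#define call_rcu_bh call_rcu_preempt

/* Roughly what net/sctp does: hand the function around as a pointer. */
void (*rcu_queue_fn)(struct rcu_head *, void (*)(struct rcu_head *)) = call_rcu_bh;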
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- include/linux/rcupdate.h | 4 ++-- include/linux/rcupreempt.h | 7 ++++++- 2 files changed, 8 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/rcupdate.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupdate.h +++ linux-2.6.24.7/include/linux/rcupdate.h @@ -221,9 +221,9 @@ extern struct lockdep_map rcu_lock_map; * and may be nested. */ #ifdef CONFIG_CLASSIC_RCU -#define call_rcu(head, func) call_rcu_classic(head, func) +#define call_rcu call_rcu_classic #else /* #ifdef CONFIG_CLASSIC_RCU */ -#define call_rcu(head, func) call_rcu_preempt(head, func) +#define call_rcu call_rcu_preempt #endif /* #else #ifdef CONFIG_CLASSIC_RCU */ /** Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -42,7 +42,12 @@ #include <linux/cpumask.h> #include <linux/seqlock.h> -#define call_rcu_bh(head, rcu) call_rcu(head, rcu) +/* + * Someone might want to pass call_rcu_bh as a function pointer. + * So this needs to just be a rename and not a macro function. + * (no parentheses) + */ +#define call_rcu_bh call_rcu_preempt #define rcu_bh_qsctr_inc(cpu) do { } while (0) #define __rcu_read_lock_bh() { rcu_read_lock(); local_bh_disable(); } #define __rcu_read_unlock_bh() { local_bh_enable(); rcu_read_unlock(); } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/introduce-pick-function-macro.patch���������������������������������������������������������0000664�0000764�0000764�00000022631�11041657734�020134� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dwalker@mvista.com Wed Sep 26 21:44:14 2007 Date: Tue, 28 Aug 2007 14:37:49 -0700 From: Daniel Walker <dwalker@mvista.com> To: mingo@elte.hu Cc: mingo@redhat.com, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, Peter Zijlstra <peterz@infradead.org> Subject: [PATCH -rt 1/8] introduce PICK_FUNCTION PICK_FUNCTION() is similar to the other PICK_OP style macros, and was created to replace them all. I used variable argument macros to handle PICK_FUNC_2ARG/PICK_FUNC_1ARG. Otherwise the marcos are similar to the original macros used for semaphores. The entire system is used to do a compile time switch between two different locking APIs. For example, real spinlocks (raw_spinlock_t) and mutexes (or sleeping spinlocks). This new macro replaces all the duplication from lock type to lock type. The result of this patch, and the next two, is a fairly nice simplification, and consolidation. Although the seqlock changes are larger than the originals I think over all the patchset is worth while. Incorporated peterz's suggestion to not require TYPE_EQUAL() to only use pointers. 
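A self-contained sketch of the dispatch idea, using made-up lock types and functions (raw_lock, sleep_lock and lock_acquire are not kernel names): __builtin_types_compatible_p() turns the static type of the first argument into a compile-time constant, the optimizer discards the dead branch, and an argument of neither type ends up referencing __bad_func_type(), which the kernel deliberately leaves undefined so the mistake is caught at link time.

#include <stdio.h>

/* Made-up stand-ins for the two lock flavours the -rt tree juggles. */
struct raw_lock   { int dummy; };   /* "real" spinning lock */
struct sleep_lock { int dummy; };   /* mutex-backed lock    */

static void raw_lock_acquire(struct raw_lock *l)     { (void)l; printf("raw path\n");   }
static void sleep_lock_acquire(struct sleep_lock *l) { (void)l; printf("sleep path\n"); }

/* The kernel leaves this undefined to catch wrong types at link time; the
 * sketch gives it a body so the example links even without optimization. */
static int __bad_func_type(void) { return 0; }

#define PICK_TYPE_EQUAL(var, type) \
        __builtin_types_compatible_p(typeof(var), type)

#define PICK_FUNCTION(type1, type2, func1, func2, arg0, ...)            \
do {                                                                    \
        if (PICK_TYPE_EQUAL((arg0), type1))                             \
                func1((type1)(arg0), ##__VA_ARGS__);                    \
        else if (PICK_TYPE_EQUAL((arg0), type2))                        \
                func2((type2)(arg0), ##__VA_ARGS__);                    \
        else __bad_func_type();                                         \
} while (0)

/* One public name, two implementations, chosen by the argument's type. */
#define lock_acquire(l)                                                 \
        PICK_FUNCTION(struct raw_lock *, struct sleep_lock *,           \
                      raw_lock_acquire, sleep_lock_acquire, l)

int main(void)
{
        struct raw_lock r;
        struct sleep_lock s;

        lock_acquire(&r);       /* prints "raw path"   */
        lock_acquire(&s);       /* prints "sleep path" */
        return 0;
}

The compat_* versus rt_* pairs in the hunks below plug into exactly this slot through the PICK_SEM_OP()/PICK_RWSEM_OP() redirectors.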
Signed-off-by: Daniel Walker <dwalker@mvista.com> --- include/linux/pickop.h | 36 +++++++++++++ include/linux/rt_lock.h | 129 +++++++++++++++--------------------------------- 2 files changed, 77 insertions(+), 88 deletions(-) Index: linux-2.6.24.7/include/linux/pickop.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/pickop.h @@ -0,0 +1,36 @@ +#ifndef _LINUX_PICKOP_H +#define _LINUX_PICKOP_H + +#undef TYPE_EQUAL +#define TYPE_EQUAL(var, type) \ + __builtin_types_compatible_p(typeof(var), type *) + +#undef PICK_TYPE_EQUAL +#define PICK_TYPE_EQUAL(var, type) \ + __builtin_types_compatible_p(typeof(var), type) + +extern int __bad_func_type(void); + +#define PICK_FUNCTION(type1, type2, func1, func2, arg0, ...) \ +do { \ + if (PICK_TYPE_EQUAL((arg0), type1)) \ + func1((type1)(arg0), ##__VA_ARGS__); \ + else if (PICK_TYPE_EQUAL((arg0), type2)) \ + func2((type2)(arg0), ##__VA_ARGS__); \ + else __bad_func_type(); \ +} while (0) + +#define PICK_FUNCTION_RET(type1, type2, func1, func2, arg0, ...) \ +({ \ + unsigned long __ret; \ + \ + if (PICK_TYPE_EQUAL((arg0), type1)) \ + __ret = func1((type1)(arg0), ##__VA_ARGS__); \ + else if (PICK_TYPE_EQUAL((arg0), type2)) \ + __ret = func2((type2)(arg0), ##__VA_ARGS__); \ + else __ret = __bad_func_type(); \ + \ + __ret; \ +}) + +#endif /* _LINUX_PICKOP_H */ Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -149,76 +149,40 @@ extern void fastcall rt_up(struct semaph extern int __bad_func_type(void); -#undef TYPE_EQUAL -#define TYPE_EQUAL(var, type) \ - __builtin_types_compatible_p(typeof(var), type *) - -#define PICK_FUNC_1ARG(type1, type2, func1, func2, arg) \ -do { \ - if (TYPE_EQUAL((arg), type1)) \ - func1((type1 *)(arg)); \ - else if (TYPE_EQUAL((arg), type2)) \ - func2((type2 *)(arg)); \ - else __bad_func_type(); \ -} while (0) +#include <linux/pickop.h> -#define PICK_FUNC_1ARG_RET(type1, type2, func1, func2, arg) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((arg), type1)) \ - __ret = func1((type1 *)(arg)); \ - else if (TYPE_EQUAL((arg), type2)) \ - __ret = func2((type2 *)(arg)); \ - else __ret = __bad_func_type(); \ - \ - __ret; \ -}) - -#define PICK_FUNC_2ARG(type1, type2, func1, func2, arg0, arg1) \ -do { \ - if (TYPE_EQUAL((arg0), type1)) \ - func1((type1 *)(arg0), arg1); \ - else if (TYPE_EQUAL((arg0), type2)) \ - func2((type2 *)(arg0), arg1); \ - else __bad_func_type(); \ -} while (0) +/* + * PICK_SEM_OP() is a small redirector to allow less typing of the lock + * types struct compat_semaphore, struct semaphore, at the front of the + * PICK_FUNCTION macro. + */ +#define PICK_SEM_OP(...) PICK_FUNCTION(struct compat_semaphore *, \ + struct semaphore *, ##__VA_ARGS__) +#define PICK_SEM_OP_RET(...) 
PICK_FUNCTION_RET(struct compat_semaphore *,\ + struct semaphore *, ##__VA_ARGS__) #define sema_init(sem, val) \ - PICK_FUNC_2ARG(struct compat_semaphore, struct semaphore, \ - compat_sema_init, rt_sema_init, sem, val) + PICK_SEM_OP(compat_sema_init, rt_sema_init, sem, val) -#define init_MUTEX(sem) \ - PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ - compat_init_MUTEX, rt_init_MUTEX, sem) +#define init_MUTEX(sem) PICK_SEM_OP(compat_init_MUTEX, rt_init_MUTEX, sem) #define init_MUTEX_LOCKED(sem) \ - PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ - compat_init_MUTEX_LOCKED, rt_init_MUTEX_LOCKED, sem) + PICK_SEM_OP(compat_init_MUTEX_LOCKED, rt_init_MUTEX_LOCKED, sem) -#define down(sem) \ - PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ - compat_down, rt_down, sem) +#define down(sem) PICK_SEM_OP(compat_down, rt_down, sem) #define down_interruptible(sem) \ - PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ - compat_down_interruptible, rt_down_interruptible, sem) + PICK_SEM_OP_RET(compat_down_interruptible, rt_down_interruptible, sem) #define down_trylock(sem) \ - PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ - compat_down_trylock, rt_down_trylock, sem) + PICK_SEM_OP_RET(compat_down_trylock, rt_down_trylock, sem) -#define up(sem) \ - PICK_FUNC_1ARG(struct compat_semaphore, struct semaphore, \ - compat_up, rt_up, sem) +#define up(sem) PICK_SEM_OP(compat_up, rt_up, sem) #define sem_is_locked(sem) \ - PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ - compat_sem_is_locked, rt_sem_is_locked, sem) + PICK_SEM_OP_RET(compat_sem_is_locked, rt_sem_is_locked, sem) -#define sema_count(sem) \ - PICK_FUNC_1ARG_RET(struct compat_semaphore, struct semaphore, \ - compat_sema_count, rt_sema_count, sem) +#define sema_count(sem) PICK_SEM_OP_RET(compat_sema_count, rt_sema_count, sem) /* * rwsems: @@ -260,58 +224,47 @@ extern void fastcall rt_downgrade_write( # define rt_rwsem_is_locked(rws) (rt_mutex_is_locked(&(rws)->lock)) -#define init_rwsem(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_init_rwsem, rt_init_rwsem, rwsem) - -#define down_read(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_read, rt_down_read, rwsem) +#define PICK_RWSEM_OP(...) PICK_FUNCTION(struct compat_rw_semaphore *, \ + struct rw_semaphore *, ##__VA_ARGS__) +#define PICK_RWSEM_OP_RET(...) 
PICK_FUNCTION_RET(struct compat_rw_semaphore *,\ + struct rw_semaphore *, ##__VA_ARGS__) + +#define init_rwsem(rwsem) PICK_RWSEM_OP(compat_init_rwsem, rt_init_rwsem, rwsem) + +#define down_read(rwsem) PICK_RWSEM_OP(compat_down_read, rt_down_read, rwsem) #define down_read_non_owner(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_read_non_owner, rt_down_read_non_owner, rwsem) + PICK_RWSEM_OP(compat_down_read_non_owner, rt_down_read_non_owner, rwsem) #define down_read_trylock(rwsem) \ - PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_read_trylock, rt_down_read_trylock, rwsem) + PICK_RWSEM_OP_RET(compat_down_read_trylock, rt_down_read_trylock, rwsem) -#define down_write(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_write, rt_down_write, rwsem) +#define down_write(rwsem) PICK_RWSEM_OP(compat_down_write, rt_down_write, rwsem) #define down_read_nested(rwsem, subclass) \ - PICK_FUNC_2ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_read_nested, rt_down_read_nested, rwsem, subclass) - + PICK_RWSEM_OP(compat_down_read_nested, rt_down_read_nested, \ + rwsem, subclass) #define down_write_nested(rwsem, subclass) \ - PICK_FUNC_2ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_write_nested, rt_down_write_nested, rwsem, subclass) + PICK_RWSEM_OP(compat_down_write_nested, rt_down_write_nested, \ + rwsem, subclass) #define down_write_trylock(rwsem) \ - PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_down_write_trylock, rt_down_write_trylock, rwsem) + PICK_RWSEM_OP_RET(compat_down_write_trylock, rt_down_write_trylock,\ + rwsem) -#define up_read(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_up_read, rt_up_read, rwsem) +#define up_read(rwsem) PICK_RWSEM_OP(compat_up_read, rt_up_read, rwsem) #define up_read_non_owner(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_up_read_non_owner, rt_up_read_non_owner, rwsem) + PICK_RWSEM_OP(compat_up_read_non_owner, rt_up_read_non_owner, rwsem) -#define up_write(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_up_write, rt_up_write, rwsem) +#define up_write(rwsem) PICK_RWSEM_OP(compat_up_write, rt_up_write, rwsem) #define downgrade_write(rwsem) \ - PICK_FUNC_1ARG(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_downgrade_write, rt_downgrade_write, rwsem) + PICK_RWSEM_OP(compat_downgrade_write, rt_downgrade_write, rwsem) #define rwsem_is_locked(rwsem) \ - PICK_FUNC_1ARG_RET(struct compat_rw_semaphore, struct rw_semaphore, \ - compat_rwsem_is_locked, rt_rwsem_is_locked, rwsem) + PICK_RWSEM_OP_RET(compat_rwsem_is_locked, rt_rwsem_is_locked, rwsem) #endif /* CONFIG_PREEMPT_RT */ �������������������������������������������������������������������������������������������������������patches/replace-PICK_OP-with-PICK_FUNCTION.patch����������������������������������������������������0000664�0000764�0000764�00000047767�11041657733�020051� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dwalker@mvista.com Wed Sep 26 21:45:42 2007 Date: Tue, 28 Aug 2007 14:37:50 
-0700 From: Daniel Walker <dwalker@mvista.com> To: mingo@elte.hu Cc: mingo@redhat.com, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 2/8] spinlocks/rwlocks: use PICK_FUNCTION() Reaplace old PICK_OP style macros with the new PICK_FUNCTION macro. Signed-off-by: Daniel Walker <dwalker@mvista.com> --- include/linux/sched.h | 13 - include/linux/spinlock.h | 345 ++++++++++++++--------------------------------- kernel/rtmutex.c | 2 lib/dec_and_lock.c | 2 4 files changed, 111 insertions(+), 251 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -2033,17 +2033,8 @@ extern int __cond_resched_raw_spinlock(r extern int __cond_resched_spinlock(spinlock_t *spinlock); #define cond_resched_lock(lock) \ -({ \ - int __ret; \ - \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - __ret = __cond_resched_raw_spinlock((raw_spinlock_t *)lock);\ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - __ret = __cond_resched_spinlock((spinlock_t *)lock); \ - else __ret = __bad_spinlock_type(); \ - \ - __ret; \ -}) + PICK_SPIN_OP_RET(__cond_resched_raw_spinlock, __cond_resched_spinlock,\ + lock) extern int cond_resched_softirq(void); extern int cond_resched_softirq_context(void); Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -91,6 +91,7 @@ #include <linux/stringify.h> #include <linux/bottom_half.h> #include <linux/irqflags.h> +#include <linux/pickop.h> #include <asm/system.h> @@ -162,7 +163,7 @@ extern void __lockfunc rt_spin_unlock_wa extern int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags); extern int __lockfunc rt_spin_trylock(spinlock_t *lock); -extern int _atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock); +extern int _atomic_dec_and_spin_lock(spinlock_t *lock, atomic_t *atomic); /* * lockdep-less calls, for derived types like rwlock: @@ -243,54 +244,6 @@ do { \ # define _spin_trylock_irqsave(l,f) TSNBCONRT(l) #endif -#undef TYPE_EQUAL -#define TYPE_EQUAL(lock, type) \ - __builtin_types_compatible_p(typeof(lock), type *) - -#define PICK_OP(op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - __spin##op((raw_spinlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - _spin##op((spinlock_t *)(lock)); \ - else __bad_spinlock_type(); \ -} while (0) - -#define PICK_OP_RET(op, lock...) 
\ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - __ret = __spin##op((raw_spinlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - __ret = _spin##op((spinlock_t *)(lock)); \ - else __ret = __bad_spinlock_type(); \ - \ - __ret; \ -}) - -#define PICK_OP2(op, lock, flags) \ -do { \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - __spin##op((raw_spinlock_t *)(lock), flags); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - _spin##op((spinlock_t *)(lock), flags); \ - else __bad_spinlock_type(); \ -} while (0) - -#define PICK_OP2_RET(op, lock, flags) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - __ret = __spin##op((raw_spinlock_t *)(lock), flags); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - __ret = _spin##op((spinlock_t *)(lock), flags); \ - else __bad_spinlock_type(); \ - \ - __ret; \ -}) - extern void __lockfunc rt_write_lock(rwlock_t *rwlock); extern void __lockfunc rt_read_lock(rwlock_t *rwlock); extern int __lockfunc rt_write_trylock(rwlock_t *rwlock); @@ -349,76 +302,10 @@ do { \ # define _read_unlock_irqrestore(rwl, f) rt_read_unlock(rwl) # define _write_unlock_irqrestore(rwl, f) rt_write_unlock(rwl) -#define __PICK_RW_OP(optype, op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __##optype##op((raw_rwlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - ##op((rwlock_t *)(lock)); \ - else __bad_rwlock_type(); \ -} while (0) - -#define PICK_RW_OP(optype, op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __##optype##op((raw_rwlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - _##optype##op((rwlock_t *)(lock)); \ - else __bad_rwlock_type(); \ -} while (0) - -#define __PICK_RW_OP_RET(optype, op, lock...) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __ret = __##optype##op((raw_rwlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - __ret = _##optype##op((rwlock_t *)(lock)); \ - else __ret = __bad_rwlock_type(); \ - \ - __ret; \ -}) - -#define PICK_RW_OP_RET(optype, op, lock...) 
\ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __ret = __##optype##op((raw_rwlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - __ret = _##optype##op((rwlock_t *)(lock)); \ - else __ret = __bad_rwlock_type(); \ - \ - __ret; \ -}) - -#define PICK_RW_OP2(optype, op, lock, flags) \ -do { \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __##optype##op((raw_rwlock_t *)(lock), flags); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - _##optype##op((rwlock_t *)(lock), flags); \ - else __bad_rwlock_type(); \ -} while (0) - -#define PICK_RW_OP2_RET(optype, op, lock, flags) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - __ret = __##optype##op((raw_rwlock_t *)(lock), flags); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - __ret = _##optype##op((rwlock_t *)(lock), flags); \ - else __bad_rwlock_type(); \ - \ - __ret; \ -}) - #ifdef CONFIG_DEBUG_SPINLOCK extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, struct lock_class_key *key); -# define _raw_spin_lock_init(lock) \ +# define _raw_spin_lock_init(lock, name, file, line) \ do { \ static struct lock_class_key __key; \ \ @@ -428,25 +315,28 @@ do { \ #else #define __raw_spin_lock_init(lock) \ do { *(lock) = RAW_SPIN_LOCK_UNLOCKED(lock); } while (0) -# define _raw_spin_lock_init(lock) __raw_spin_lock_init(lock) +# define _raw_spin_lock_init(lock, name, file, line) __raw_spin_lock_init(lock) #endif -#define PICK_OP_INIT(op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_spinlock_t)) \ - _raw_spin##op((raw_spinlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - _spin##op((spinlock_t *)(lock), #lock, __FILE__, __LINE__); \ - else __bad_spinlock_type(); \ -} while (0) - +/* + * PICK_SPIN_OP()/PICK_RW_OP() are simple redirectors for PICK_FUNCTION + */ +#define PICK_SPIN_OP(...) \ + PICK_FUNCTION(raw_spinlock_t *, spinlock_t *, ##__VA_ARGS__) +#define PICK_SPIN_OP_RET(...) \ + PICK_FUNCTION_RET(raw_spinlock_t *, spinlock_t *, ##__VA_ARGS__) +#define PICK_RW_OP(...) PICK_FUNCTION(raw_rwlock_t *, rwlock_t *, ##__VA_ARGS__) +#define PICK_RW_OP_RET(...) 
\ + PICK_FUNCTION_RET(raw_rwlock_t *, rwlock_t *, ##__VA_ARGS__) -#define spin_lock_init(lock) PICK_OP_INIT(_lock_init, lock) +#define spin_lock_init(lock) \ + PICK_SPIN_OP(_raw_spin_lock_init, _spin_lock_init, lock, #lock, \ + __FILE__, __LINE__) #ifdef CONFIG_DEBUG_SPINLOCK extern void __raw_rwlock_init(raw_rwlock_t *lock, const char *name, struct lock_class_key *key); -# define _raw_rwlock_init(lock) \ +# define _raw_rwlock_init(lock, name, file, line) \ do { \ static struct lock_class_key __key; \ \ @@ -455,83 +345,82 @@ do { \ #else #define __raw_rwlock_init(lock) \ do { *(lock) = RAW_RW_LOCK_UNLOCKED(lock); } while (0) -# define _raw_rwlock_init(lock) __raw_rwlock_init(lock) +# define _raw_rwlock_init(lock, name, file, line) __raw_rwlock_init(lock) #endif -#define __PICK_RW_OP_INIT(optype, op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_rwlock_t)) \ - _raw_##optype##op((raw_rwlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, rwlock_t)) \ - _##optype##op((rwlock_t *)(lock), #lock, __FILE__, __LINE__);\ - else __bad_spinlock_type(); \ -} while (0) - -#define rwlock_init(lock) __PICK_RW_OP_INIT(rwlock, _init, lock) +#define rwlock_init(lock) \ + PICK_RW_OP(_raw_rwlock_init, _rwlock_init, lock, #lock, \ + __FILE__, __LINE__) #define __spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) -#define spin_is_locked(lock) PICK_OP_RET(_is_locked, lock) +#define spin_is_locked(lock) \ + PICK_SPIN_OP_RET(__spin_is_locked, _spin_is_locked, lock) #define __spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock) -#define spin_unlock_wait(lock) PICK_OP(_unlock_wait, lock) +#define spin_unlock_wait(lock) \ + PICK_SPIN_OP(__spin_unlock_wait, _spin_unlock_wait, lock) + /* * Define the various spin_lock and rw_lock methods. Note we define these * regardless of whether CONFIG_SMP or CONFIG_PREEMPT are set. The various * methods are defined as nops in the case they are not required. 
*/ -// #define spin_trylock(lock) _spin_trylock(lock) -#define spin_trylock(lock) __cond_lock(lock, PICK_OP_RET(_trylock, lock)) +#define spin_trylock(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock, _spin_trylock, lock)) -//#define read_trylock(lock) _read_trylock(lock) -#define read_trylock(lock) __cond_lock(lock, PICK_RW_OP_RET(read, _trylock, lock)) +#define read_trylock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__read_trylock, _read_trylock, lock)) -//#define write_trylock(lock) _write_trylock(lock) -#define write_trylock(lock) __cond_lock(lock, PICK_RW_OP_RET(write, _trylock, lock)) +#define write_trylock(lock) \ + __cond_lock(lock, PICK_RW_OP_RET(__write_trylock, _write_trylock, lock)) #define write_trylock_irqsave(lock, flags) \ - __cond_lock(lock, PICK_RW_OP2_RET(write, _trylock_irqsave, lock, &flags)) + __cond_lock(lock, PICK_RW_OP_RET(__write_trylock_irqsave, \ + _write_trylock_irqsave, lock, &flags)) #define __spin_can_lock(lock) __raw_spin_can_lock(&(lock)->raw_lock) #define __read_can_lock(lock) __raw_read_can_lock(&(lock)->raw_lock) #define __write_can_lock(lock) __raw_write_can_lock(&(lock)->raw_lock) #define spin_can_lock(lock) \ - __cond_lock(lock, PICK_OP_RET(_can_lock, lock)) + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_can_lock, _spin_can_lock,\ + lock)) #define read_can_lock(lock) \ - __cond_lock(lock, PICK_RW_OP_RET(read, _can_lock, lock)) + __cond_lock(lock, PICK_RW_OP_RET(__read_can_lock, _read_can_lock, lock)) #define write_can_lock(lock) \ - __cond_lock(lock, PICK_RW_OP_RET(write, _can_lock, lock)) + __cond_lock(lock, PICK_RW_OP_RET(__write_can_lock, _write_can_lock,\ + lock)) -// #define spin_lock(lock) _spin_lock(lock) -#define spin_lock(lock) PICK_OP(_lock, lock) +#define spin_lock(lock) PICK_SPIN_OP(__spin_lock, _spin_lock, lock) #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define spin_lock_nested(lock, subclass) PICK_OP2(_lock_nested, lock, subclass) +# define spin_lock_nested(lock, subclass) \ + PICK_SPIN_OP(__spin_lock_nested, _spin_lock_nested, lock, subclass) #else # define spin_lock_nested(lock, subclass) spin_lock(lock) #endif -//#define write_lock(lock) _write_lock(lock) -#define write_lock(lock) PICK_RW_OP(write, _lock, lock) +#define write_lock(lock) PICK_RW_OP(__write_lock, _write_lock, lock) -// #define read_lock(lock) _read_lock(lock) -#define read_lock(lock) PICK_RW_OP(read, _lock, lock) +#define read_lock(lock) PICK_RW_OP(__read_lock, _read_lock, lock) # define spin_lock_irqsave(lock, flags) \ do { \ BUILD_CHECK_IRQ_FLAGS(flags); \ - flags = PICK_OP_RET(_lock_irqsave, lock); \ + flags = PICK_SPIN_OP_RET(__spin_lock_irqsave, _spin_lock_irqsave, \ + lock); \ } while (0) #ifdef CONFIG_DEBUG_LOCK_ALLOC # define spin_lock_irqsave_nested(lock, flags, subclass) \ do { \ BUILD_CHECK_IRQ_FLAGS(flags); \ - flags = PICK_OP2_RET(_lock_irqsave_nested, lock, subclass); \ + flags = PICK_SPIN_OP_RET(__spin_lock_irqsave_nested, \ + _spin_lock_irqsave_nested, lock, subclass); \ } while (0) #else # define spin_lock_irqsave_nested(lock, flags, subclass) \ @@ -541,112 +430,92 @@ do { \ # define read_lock_irqsave(lock, flags) \ do { \ BUILD_CHECK_IRQ_FLAGS(flags); \ - flags = PICK_RW_OP_RET(read, _lock_irqsave, lock); \ + flags = PICK_RW_OP_RET(__read_lock_irqsave, _read_lock_irqsave, lock);\ } while (0) # define write_lock_irqsave(lock, flags) \ do { \ BUILD_CHECK_IRQ_FLAGS(flags); \ - flags = PICK_RW_OP_RET(write, _lock_irqsave, lock); \ + flags = PICK_RW_OP_RET(__write_lock_irqsave, _write_lock_irqsave,lock);\ } while (0) -// #define spin_lock_irq(lock) 
_spin_lock_irq(lock) -// #define spin_lock_bh(lock) _spin_lock_bh(lock) -#define spin_lock_irq(lock) PICK_OP(_lock_irq, lock) -#define spin_lock_bh(lock) PICK_OP(_lock_bh, lock) - -// #define read_lock_irq(lock) _read_lock_irq(lock) -// #define read_lock_bh(lock) _read_lock_bh(lock) -#define read_lock_irq(lock) PICK_RW_OP(read, _lock_irq, lock) -#define read_lock_bh(lock) PICK_RW_OP(read, _lock_bh, lock) - -// #define write_lock_irq(lock) _write_lock_irq(lock) -// #define write_lock_bh(lock) _write_lock_bh(lock) -#define write_lock_irq(lock) PICK_RW_OP(write, _lock_irq, lock) -#define write_lock_bh(lock) PICK_RW_OP(write, _lock_bh, lock) - -// #define spin_unlock(lock) _spin_unlock(lock) -// #define write_unlock(lock) _write_unlock(lock) -// #define read_unlock(lock) _read_unlock(lock) -#define spin_unlock(lock) PICK_OP(_unlock, lock) -#define read_unlock(lock) PICK_RW_OP(read, _unlock, lock) -#define write_unlock(lock) PICK_RW_OP(write, _unlock, lock) +#define spin_lock_irq(lock) PICK_SPIN_OP(__spin_lock_irq, _spin_lock_irq, lock) -// #define spin_unlock(lock) _spin_unlock_no_resched(lock) -#define spin_unlock_no_resched(lock) \ - PICK_OP(_unlock_no_resched, lock) +#define spin_lock_bh(lock) PICK_SPIN_OP(__spin_lock_bh, _spin_lock_bh, lock) -//#define spin_unlock_irqrestore(lock, flags) -// _spin_unlock_irqrestore(lock, flags) -//#define spin_unlock_irq(lock) _spin_unlock_irq(lock) -//#define spin_unlock_bh(lock) _spin_unlock_bh(lock) -#define spin_unlock_irqrestore(lock, flags) \ -do { \ - BUILD_CHECK_IRQ_FLAGS(flags); \ - PICK_OP2(_unlock_irqrestore, lock, flags); \ -} while (0) +#define read_lock_irq(lock) PICK_RW_OP(__read_lock_irq, _read_lock_irq, lock) -#define spin_unlock_irq(lock) PICK_OP(_unlock_irq, lock) -#define spin_unlock_bh(lock) PICK_OP(_unlock_bh, lock) +#define read_lock_bh(lock) PICK_RW_OP(__read_lock_bh, _read_lock_bh, lock) -// #define read_unlock_irqrestore(lock, flags) -// _read_unlock_irqrestore(lock, flags) -// #define read_unlock_irq(lock) _read_unlock_irq(lock) -// #define read_unlock_bh(lock) _read_unlock_bh(lock) -#define read_unlock_irqrestore(lock, flags) \ -do { \ - BUILD_CHECK_IRQ_FLAGS(flags); \ - PICK_RW_OP2(read, _unlock_irqrestore, lock, flags); \ +#define write_lock_irq(lock) PICK_RW_OP(__write_lock_irq, _write_lock_irq, lock) + +#define write_lock_bh(lock) PICK_RW_OP(__write_lock_bh, _write_lock_bh, lock) + +#define spin_unlock(lock) PICK_SPIN_OP(__spin_unlock, _spin_unlock, lock) + +#define read_unlock(lock) PICK_RW_OP(__read_unlock, _read_unlock, lock) + +#define write_unlock(lock) PICK_RW_OP(__write_unlock, _write_unlock, lock) + +#define spin_unlock_no_resched(lock) \ + PICK_SPIN_OP(__spin_unlock_no_resched, _spin_unlock_no_resched, lock) + +#define spin_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_SPIN_OP(__spin_unlock_irqrestore, _spin_unlock_irqrestore, \ + lock, flags); \ } while (0) -#define read_unlock_irq(lock) PICK_RW_OP(read, _unlock_irq, lock) -#define read_unlock_bh(lock) PICK_RW_OP(read, _unlock_bh, lock) +#define spin_unlock_irq(lock) \ + PICK_SPIN_OP(__spin_unlock_irq, _spin_unlock_irq, lock) +#define spin_unlock_bh(lock) \ + PICK_SPIN_OP(__spin_unlock_bh, _spin_unlock_bh, lock) -// #define write_unlock_irqrestore(lock, flags) -// _write_unlock_irqrestore(lock, flags) -// #define write_unlock_irq(lock) _write_unlock_irq(lock) -// #define write_unlock_bh(lock) _write_unlock_bh(lock) -#define write_unlock_irqrestore(lock, flags) \ -do { \ - BUILD_CHECK_IRQ_FLAGS(flags); \ - PICK_RW_OP2(write, 
_unlock_irqrestore, lock, flags); \ +#define read_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP(__read_unlock_irqrestore, _read_unlock_irqrestore, \ + lock, flags); \ } while (0) -#define write_unlock_irq(lock) PICK_RW_OP(write, _unlock_irq, lock) -#define write_unlock_bh(lock) PICK_RW_OP(write, _unlock_bh, lock) -// #define spin_trylock_bh(lock) _spin_trylock_bh(lock) -#define spin_trylock_bh(lock) __cond_lock(lock, PICK_OP_RET(_trylock_bh, lock)) +#define read_unlock_irq(lock) \ + PICK_RW_OP(__read_unlock_irq, _read_unlock_irq, lock) +#define read_unlock_bh(lock) PICK_RW_OP(__read_unlock_bh, _read_unlock_bh, lock) -// #define spin_trylock_irq(lock) +#define write_unlock_irqrestore(lock, flags) \ +do { \ + BUILD_CHECK_IRQ_FLAGS(flags); \ + PICK_RW_OP(__write_unlock_irqrestore, _write_unlock_irqrestore, \ + lock, flags); \ +} while (0) +#define write_unlock_irq(lock) \ + PICK_RW_OP(__write_unlock_irq, _write_unlock_irq, lock) -#define spin_trylock_irq(lock) __cond_lock(lock, PICK_OP_RET(_trylock_irq, lock)) +#define write_unlock_bh(lock) \ + PICK_RW_OP(__write_unlock_bh, _write_unlock_bh, lock) -// #define spin_trylock_irqsave(lock, flags) +#define spin_trylock_bh(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_bh, _spin_trylock_bh,\ + lock)) + +#define spin_trylock_irq(lock) \ + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq, \ + __spin_trylock_irq, lock)) #define spin_trylock_irqsave(lock, flags) \ - __cond_lock(lock, PICK_OP2_RET(_trylock_irqsave, lock, &flags)) + __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave, \ + _spin_trylock_irqsave, lock, &flags)) /* "lock on reference count zero" */ #ifndef ATOMIC_DEC_AND_LOCK # include <asm/atomic.h> - extern int __atomic_dec_and_spin_lock(atomic_t *atomic, raw_spinlock_t *lock); + extern int __atomic_dec_and_spin_lock(raw_spinlock_t *lock, atomic_t *atomic); #endif #define atomic_dec_and_lock(atomic, lock) \ -__cond_lock(lock, ({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL(lock, raw_spinlock_t)) \ - __ret = __atomic_dec_and_spin_lock(atomic, \ - (raw_spinlock_t *)(lock)); \ - else if (TYPE_EQUAL(lock, spinlock_t)) \ - __ret = _atomic_dec_and_spin_lock(atomic, \ - (spinlock_t *)(lock)); \ - else __ret = __bad_spinlock_type(); \ - \ - __ret; \ -})) - + __cond_lock(lock, PICK_SPIN_OP_RET(__atomic_dec_and_spin_lock, \ + _atomic_dec_and_spin_lock, lock, atomic)) /* * bit-based spin_lock() Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -857,7 +857,7 @@ int __lockfunc rt_spin_trylock_irqsave(s } EXPORT_SYMBOL(rt_spin_trylock_irqsave); -int _atomic_dec_and_spin_lock(atomic_t *atomic, spinlock_t *lock) +int _atomic_dec_and_spin_lock(spinlock_t *lock, atomic_t *atomic) { /* Subtract 1 from counter unless that drops it to 0 (ie. it was 1) */ if (atomic_add_unless(atomic, -1, 1)) Index: linux-2.6.24.7/lib/dec_and_lock.c =================================================================== --- linux-2.6.24.7.orig/lib/dec_and_lock.c +++ linux-2.6.24.7/lib/dec_and_lock.c @@ -17,7 +17,7 @@ * because the spin-lock and the decrement must be * "atomic". */ -int __atomic_dec_and_spin_lock(atomic_t *atomic, raw_spinlock_t *lock) +int __atomic_dec_and_spin_lock(raw_spinlock_t *lock, atomic_t *atomic) { #ifdef CONFIG_SMP /* Subtract 1 from counter unless that drops it to 0 (ie. 
it was 1) */ ���������patches/fix-PICK_FUNCTION-spin_trylock_irq.patch����������������������������������������������������0000664�0000764�0000764�00000003177�11041657730�020502� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From sebastien.dugue@bull.net Thu Oct 11 11:32:58 2007 Date: Thu, 11 Oct 2007 14:24:17 +0200 From: "[UTF-8] Sébastien Dugué" <sebastien.dugue@bull.net> To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, Steven Rostedt <rostedt@goodmis.org> Cc: Linux RT Users <linux-rt-users@vger.kernel.org>, linux-kernel <linux-kernel@vger.kernel.org> Subject: [PATCH] RT: fix spin_trylock_irq [ The following text is in the "UTF-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] This patch fixes a bug in spin_trylock_irq() where __spin_trylock_irq() is picked for regular (non-raw) spinlocks instead of _spin_trylock_irq(). This results in systematic boot hangs and may have been going unnoticed for quite some time as it only manifests (aside from a compile warning) when booting with a NUMA config or when using the Chelsio T3 (cxgb3) driver as these seems to be the sole users. Signed-off-by: Sébastien Dugué <sebastien.dugue@bull.net> --- include/linux/spinlock.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -501,7 +501,7 @@ do { \ #define spin_trylock_irq(lock) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irq, \ - __spin_trylock_irq, lock)) + _spin_trylock_irq, lock)) #define spin_trylock_irqsave(lock, flags) \ __cond_lock(lock, PICK_SPIN_OP_RET(__spin_trylock_irqsave, \ �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/seqlocks-use-PICK_FUNCTION.patch������������������������������������������������������������0000664�0000764�0000764�00000022255�11041657731�016740� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dwalker@mvista.com Wed Sep 26 22:16:38 2007 Date: Tue, 28 Aug 2007 14:37:51 -0700 From: Daniel Walker <dwalker@mvista.com> To: mingo@elte.hu Cc: mingo@redhat.com, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 3/8] seqlocks: use PICK_FUNCTION Replace the old PICK_OP style macros with PICK_FUNCTION. Although, seqlocks has some alien code, which I also replaced as can be seen from the line count below. 
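A compressed mirror of the "alien" part, with toy types (the struct layouts and pretend_local_irq_save() are placeholders, not the real seqlock internals): both flavours need an _irqsave entry point of identical shape so PICK_SEQ_OP_RET() can choose between them, so the preemptible seqlock gains a wrapper that keeps the prototype, never touches interrupts, and reports 0 as the saved flags, while the raw flavour really disables interrupts and hands the flags back.

/* Toy flavours; the real ones wrap a lock plus a sequence counter. */
typedef struct { unsigned sequence; } seqlock_t;       /* preemptible flavour */
typedef struct { unsigned sequence; } raw_seqlock_t;   /* raw, irq-disabling  */

static unsigned long pretend_local_irq_save(void) { return 0x42; /* stand-in */ }

static void __write_seqlock(seqlock_t *sl)         { sl->sequence++; }
static void __write_seqlock_raw(raw_seqlock_t *sl) { sl->sequence++; }

/* Raw flavour: really saves interrupt state and returns the flags. */
static unsigned long __write_seqlock_irqsave_raw(raw_seqlock_t *sl)
{
        unsigned long flags = pretend_local_irq_save();

        __write_seqlock_raw(sl);
        return flags;
}

/* Preemptible flavour: same shape, interrupts untouched, "flags" are 0. */
static unsigned long __write_seqlock_irqsave(seqlock_t *sl)
{
        __write_seqlock(sl);
        return 0;
}

int main(void)
{
        seqlock_t sl = { 0 };
        raw_seqlock_t rsl = { 0 };
        unsigned long f1 = __write_seqlock_irqsave(&sl);
        unsigned long f2 = __write_seqlock_irqsave_raw(&rsl);

        (void)f1; (void)f2;
        return 0;
}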
Signed-off-by: Daniel Walker <dwalker@mvista.com> --- include/linux/pickop.h | 4 include/linux/seqlock.h | 235 +++++++++++++++++++++++++++--------------------- 2 files changed, 135 insertions(+), 104 deletions(-) Index: linux-2.6.24.7/include/linux/pickop.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pickop.h +++ linux-2.6.24.7/include/linux/pickop.h @@ -1,10 +1,6 @@ #ifndef _LINUX_PICKOP_H #define _LINUX_PICKOP_H -#undef TYPE_EQUAL -#define TYPE_EQUAL(var, type) \ - __builtin_types_compatible_p(typeof(var), type *) - #undef PICK_TYPE_EQUAL #define PICK_TYPE_EQUAL(var, type) \ __builtin_types_compatible_p(typeof(var), type) Index: linux-2.6.24.7/include/linux/seqlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/seqlock.h +++ linux-2.6.24.7/include/linux/seqlock.h @@ -90,6 +90,12 @@ static inline void __write_seqlock(seqlo smp_wmb(); } +static __always_inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) +{ + __write_seqlock(sl); + return 0; +} + static inline void __write_sequnlock(seqlock_t *sl) { smp_wmb(); @@ -97,6 +103,8 @@ static inline void __write_sequnlock(seq spin_unlock(&sl->lock); } +#define __write_sequnlock_irqrestore(sl, flags) __write_sequnlock(sl) + static inline int __write_tryseqlock(seqlock_t *sl) { int ret = spin_trylock(&sl->lock); @@ -149,6 +157,28 @@ static __always_inline void __write_seql smp_wmb(); } +static __always_inline unsigned long +__write_seqlock_irqsave_raw(raw_seqlock_t *sl) +{ + unsigned long flags; + + local_irq_save(flags); + __write_seqlock_raw(sl); + return flags; +} + +static __always_inline void __write_seqlock_irq_raw(raw_seqlock_t *sl) +{ + local_irq_disable(); + __write_seqlock_raw(sl); +} + +static __always_inline void __write_seqlock_bh_raw(raw_seqlock_t *sl) +{ + local_bh_disable(); + __write_seqlock_raw(sl); +} + static __always_inline void __write_sequnlock_raw(raw_seqlock_t *sl) { smp_wmb(); @@ -156,6 +186,27 @@ static __always_inline void __write_sequ spin_unlock(&sl->lock); } +static __always_inline void +__write_sequnlock_irqrestore_raw(raw_seqlock_t *sl, unsigned long flags) +{ + __write_sequnlock_raw(sl); + local_irq_restore(flags); + preempt_check_resched(); +} + +static __always_inline void __write_sequnlock_irq_raw(raw_seqlock_t *sl) +{ + __write_sequnlock_raw(sl); + local_irq_enable(); + preempt_check_resched(); +} + +static __always_inline void __write_sequnlock_bh_raw(raw_seqlock_t *sl) +{ + __write_sequnlock_raw(sl); + local_bh_enable(); +} + static __always_inline int __write_tryseqlock_raw(raw_seqlock_t *sl) { int ret = spin_trylock(&sl->lock); @@ -182,60 +233,93 @@ static __always_inline int __read_seqret extern int __bad_seqlock_type(void); -#define PICK_SEQOP(op, lock) \ +/* + * PICK_SEQ_OP() is a small redirector to allow less typing of the lock + * types raw_seqlock_t, seqlock_t, at the front of the PICK_FUNCTION + * macro. + */ +#define PICK_SEQ_OP(...) \ + PICK_FUNCTION(raw_seqlock_t *, seqlock_t *, ##__VA_ARGS__) +#define PICK_SEQ_OP_RET(...) 
\ + PICK_FUNCTION_RET(raw_seqlock_t *, seqlock_t *, ##__VA_ARGS__) + +#define write_seqlock(sl) PICK_SEQ_OP(__write_seqlock_raw, __write_seqlock, sl) + +#define write_sequnlock(sl) \ + PICK_SEQ_OP(__write_sequnlock_raw, __write_sequnlock, sl) + +#define write_tryseqlock(sl) \ + PICK_SEQ_OP_RET(__write_tryseqlock_raw, __write_tryseqlock, sl) + +#define read_seqbegin(sl) \ + PICK_SEQ_OP_RET(__read_seqbegin_raw, __read_seqbegin, sl) + +#define read_seqretry(sl, iv) \ + PICK_SEQ_OP_RET(__read_seqretry_raw, __read_seqretry, sl, iv) + +#define write_seqlock_irqsave(lock, flags) \ do { \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - op##_raw((raw_seqlock_t *)(lock)); \ - else if (TYPE_EQUAL((lock), seqlock_t)) \ - op((seqlock_t *)(lock)); \ - else __bad_seqlock_type(); \ + flags = PICK_SEQ_OP_RET(__write_seqlock_irqsave_raw, \ + __write_seqlock_irqsave, lock); \ } while (0) -#define PICK_SEQOP_RET(op, lock) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - __ret = op##_raw((raw_seqlock_t *)(lock)); \ - else if (TYPE_EQUAL((lock), seqlock_t)) \ - __ret = op((seqlock_t *)(lock)); \ - else __ret = __bad_seqlock_type(); \ - \ - __ret; \ -}) - -#define PICK_SEQOP_CONST_RET(op, lock) \ -({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - __ret = op##_raw((const raw_seqlock_t *)(lock));\ - else if (TYPE_EQUAL((lock), seqlock_t)) \ - __ret = op((seqlock_t *)(lock)); \ - else __ret = __bad_seqlock_type(); \ - \ - __ret; \ -}) - -#define PICK_SEQOP2_CONST_RET(op, lock, arg) \ - ({ \ - unsigned long __ret; \ - \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - __ret = op##_raw((const raw_seqlock_t *)(lock), (arg)); \ - else if (TYPE_EQUAL((lock), seqlock_t)) \ - __ret = op((seqlock_t *)(lock), (arg)); \ - else __ret = __bad_seqlock_type(); \ - \ - __ret; \ -}) - - -#define write_seqlock(sl) PICK_SEQOP(__write_seqlock, sl) -#define write_sequnlock(sl) PICK_SEQOP(__write_sequnlock, sl) -#define write_tryseqlock(sl) PICK_SEQOP_RET(__write_tryseqlock, sl) -#define read_seqbegin(sl) PICK_SEQOP_CONST_RET(__read_seqbegin, sl) -#define read_seqretry(sl, iv) PICK_SEQOP2_CONST_RET(__read_seqretry, sl, iv) +#define write_seqlock_irq(lock) \ + PICK_SEQ_OP(__write_seqlock_irq_raw, __write_seqlock, lock) + +#define write_seqlock_bh(lock) \ + PICK_SEQ_OP(__write_seqlock_bh_raw, __write_seqlock, lock) + +#define write_sequnlock_irqrestore(lock, flags) \ + PICK_SEQ_OP(__write_sequnlock_irqrestore_raw, \ + __write_sequnlock_irqrestore, lock, flags) + +#define write_sequnlock_bh(lock) \ + PICK_SEQ_OP(__write_sequnlock_bh_raw, __write_sequnlock, lock) + +#define write_sequnlock_irq(lock) \ + PICK_SEQ_OP(__write_sequnlock_irq_raw, __write_sequnlock, lock) + +static __always_inline +unsigned long __read_seqbegin_irqsave_raw(raw_seqlock_t *sl) +{ + unsigned long flags; + + local_irq_save(flags); + __read_seqbegin_raw(sl); + return flags; +} + +static __always_inline unsigned long __read_seqbegin_irqsave(seqlock_t *sl) +{ + __read_seqbegin(sl); + return 0; +} + +#define read_seqbegin_irqsave(lock, flags) \ +do { \ + flags = PICK_SEQ_OP_RET(__read_seqbegin_irqsave_raw, \ + __read_seqbegin_irqsave, lock); \ +} while (0) + +static __always_inline int +__read_seqretry_irqrestore(seqlock_t *sl, unsigned iv, unsigned long flags) +{ + return __read_seqretry(sl, iv); +} + +static __always_inline int +__read_seqretry_irqrestore_raw(raw_seqlock_t *sl, unsigned iv, + unsigned long flags) +{ + int ret = read_seqretry(sl, iv); + local_irq_restore(flags); + preempt_check_resched(); + return 
ret; +} + +#define read_seqretry_irqrestore(lock, iv, flags) \ + PICK_SEQ_OP_RET(__read_seqretry_irqrestore_raw, \ + __read_seqretry_irqrestore, lock, iv, flags) /* * Version using sequence counter only. @@ -286,53 +370,4 @@ static inline void write_seqcount_end(se smp_wmb(); s->sequence++; } - -#define PICK_IRQOP(op, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - op(); \ - else if (TYPE_EQUAL((lock), seqlock_t)) \ - { /* nothing */ } \ - else __bad_seqlock_type(); \ -} while (0) - -#define PICK_IRQOP2(op, arg, lock) \ -do { \ - if (TYPE_EQUAL((lock), raw_seqlock_t)) \ - op(arg); \ - else if (TYPE_EQUAL(lock, seqlock_t)) \ - { /* nothing */ } \ - else __bad_seqlock_type(); \ -} while (0) - - - -/* - * Possible sw/hw IRQ protected versions of the interfaces. - */ -#define write_seqlock_irqsave(lock, flags) \ - do { PICK_IRQOP2(local_irq_save, flags, lock); write_seqlock(lock); } while (0) -#define write_seqlock_irq(lock) \ - do { PICK_IRQOP(local_irq_disable, lock); write_seqlock(lock); } while (0) -#define write_seqlock_bh(lock) \ - do { PICK_IRQOP(local_bh_disable, lock); write_seqlock(lock); } while (0) - -#define write_sequnlock_irqrestore(lock, flags) \ - do { write_sequnlock(lock); PICK_IRQOP2(local_irq_restore, flags, lock); preempt_check_resched(); } while(0) -#define write_sequnlock_irq(lock) \ - do { write_sequnlock(lock); PICK_IRQOP(local_irq_enable, lock); preempt_check_resched(); } while(0) -#define write_sequnlock_bh(lock) \ - do { write_sequnlock(lock); PICK_IRQOP(local_bh_enable, lock); } while(0) - -#define read_seqbegin_irqsave(lock, flags) \ - ({ PICK_IRQOP2(local_irq_save, flags, lock); read_seqbegin(lock); }) - -#define read_seqretry_irqrestore(lock, iv, flags) \ - ({ \ - int ret = read_seqretry(lock, iv); \ - PICK_IRQOP2(local_irq_restore, flags, lock); \ - preempt_check_resched(); \ - ret; \ - }) - #endif /* __LINUX_SEQLOCK_H */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fork-desched_thread-comment-rework.patch����������������������������������������������������0000664�0000764�0000764�00000002121�11041657731�021113� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dwalker@mvista.com Wed Sep 26 22:18:23 2007 Date: Tue, 28 Aug 2007 14:37:52 -0700 From: Daniel Walker <dwalker@mvista.com> To: mingo@elte.hu Cc: mingo@redhat.com, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 4/8] fork: desched_thread comment rework. Lines are too long.. Signed-off-by: Daniel Walker <dwalker@mvista.com> --- kernel/fork.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1839,8 +1839,10 @@ static int desched_thread(void * __bind_ continue; schedule(); - /* This must be called from time to time on ia64, and is a no-op on other archs. 
- * Used to be in cpu_idle(), but with the new -rt semantics it can't stay there. + /* + * This must be called from time to time on ia64, and is a + * no-op on other archs. Used to be in cpu_idle(), but with + * the new -rt semantics it can't stay there. */ check_pgt_cache(); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/disable-ist-x86_64.patch��������������������������������������������������������������������0000664�0000764�0000764�00000007437�11041657734�015435� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ak@suse.de Thu Oct 4 11:22:57 2007 Date: Tue, 2 Oct 2007 10:24:27 +0200 From: Andi Kleen <ak@suse.de> To: linux-rt-users@vger.kernel.org Cc: mingo@elte.hu, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH] Disable IST stacks for debug/int 3/stack fault for PREEMPT_RT Normally the x86-64 trap handlers for debug/int 3/stack fault run on a special interrupt stack to make them more robust when dealing with kernel code. The PREEMPT_RT kernel can sleep in locks even while allocating GFP_ATOMIC memory. When one of these trap handlers needs to send real time signals for ptrace it allocates memory and could then try to to schedule. But it is not allowed to schedule on a IST stack. This can cause warnings and hangs. This patch disables the IST stacks for these handlers for PREEMPT_RT kernel. Instead let them run on the normal process stack. The kernel only really needs the ISTs here to make kernel debuggers more robust in case someone sets a break point somewhere where the stack is invalid. But there are no kernel debuggers in the standard kernel that do this. It also means kprobes cannot be set in situations with invalid stack; but that sounds like a reasonable restriction. The stack fault change could minimally impact oops quality, but not very much because stack faults are fairly rare. A better solution would be to use similar logic as the NMI "paranoid" path: check if signal is for user space, if yes go back to entry.S, switch stack, call sync_regs, then do the signal sending etc. But this patch is much simpler and should work too with minimal impact. Signed-off-by: Andi Kleen <ak@suse.de> --- arch/x86/kernel/setup64.c | 2 ++ arch/x86/kernel/traps_64.c | 4 ++++ include/asm-x86/page_64.h | 9 +++++++++ 3 files changed, 15 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/setup64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/setup64.c +++ linux-2.6.24.7/arch/x86/kernel/setup64.c @@ -248,7 +248,9 @@ void __cpuinit cpu_init (void) for (v = 0; v < N_EXCEPTION_STACKS; v++) { static const unsigned int order[N_EXCEPTION_STACKS] = { [0 ... 
N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, +#if DEBUG_STACK > 0 [DEBUG_STACK - 1] = DEBUG_STACK_ORDER +#endif }; if (cpu) { estacks = (char *)__get_free_pages(GFP_ATOMIC, order[v]); Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -131,10 +131,14 @@ static unsigned long *in_exception_stack unsigned *usedp, char **idp) { static char ids[][8] = { +#if DEBUG_STACK > 0 [DEBUG_STACK - 1] = "#DB", +#endif [NMI_STACK - 1] = "NMI", [DOUBLEFAULT_STACK - 1] = "#DF", +#if STACKFAULT_STACK > 0 [STACKFAULT_STACK - 1] = "#SS", +#endif [MCE_STACK - 1] = "#MC", #if DEBUG_STKSZ > EXCEPTION_STKSZ [N_EXCEPTION_STACKS ... N_EXCEPTION_STACKS + DEBUG_STKSZ / EXCEPTION_STKSZ - 2] = "#DB[?]" Index: linux-2.6.24.7/include/asm-x86/page_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/page_64.h +++ linux-2.6.24.7/include/asm-x86/page_64.h @@ -22,12 +22,21 @@ #define IRQSTACK_ORDER 2 #define IRQSTACKSIZE (PAGE_SIZE << IRQSTACK_ORDER) +#ifdef CONFIG_PREEMPT_RT +#define STACKFAULT_STACK 0 +#define DOUBLEFAULT_STACK 1 +#define NMI_STACK 2 +#define DEBUG_STACK 0 +#define MCE_STACK 3 +#define N_EXCEPTION_STACKS 3 /* hw limit: 7 */ +#else #define STACKFAULT_STACK 1 #define DOUBLEFAULT_STACK 2 #define NMI_STACK 3 #define DEBUG_STACK 4 #define MCE_STACK 5 #define N_EXCEPTION_STACKS 5 /* hw limit: 7 */ +#endif #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) #define LARGE_PAGE_SIZE (_AC(1,UL) << PMD_SHIFT) ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-trace-fix-free.patch��������������������������������������������������������������������0000664�0000764�0000764�00000001340�11041657733�015653� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcupreempt_trace.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt_trace.c +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -309,11 +309,16 @@ out: static int __init rcupreempt_trace_init(void) { + int ret; + mutex_init(&rcupreempt_trace_mutex); rcupreempt_trace_buf = kmalloc(RCUPREEMPT_TRACE_BUF_SIZE, GFP_KERNEL); if (!rcupreempt_trace_buf) return 1; - return rcupreempt_debugfs_init(); + ret = rcupreempt_debugfs_init(); + if (ret) + kfree(rcupreempt_trace_buf); + return ret; } static void __exit rcupreempt_trace_cleanup(void) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-fix-bad-dyntick-accounting.patch������������������������������������������������0000664�0000764�0000764�00000002415�11041657731�021633� 
0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/rcupreempt.h | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -88,7 +88,13 @@ DECLARE_PER_CPU(long, dynticks_progress_ static inline void rcu_enter_nohz(void) { __get_cpu_var(dynticks_progress_counter)++; - WARN_ON(__get_cpu_var(dynticks_progress_counter) & 0x1); + if (unlikely(__get_cpu_var(dynticks_progress_counter) & 0x1)) { + printk("BUG: bad accounting of dynamic ticks\n"); + printk(" will try to fix, but it is best to reboot\n"); + WARN_ON(1); + /* try to fix it */ + __get_cpu_var(dynticks_progress_counter)++; + } mb(); } @@ -96,7 +102,13 @@ static inline void rcu_exit_nohz(void) { mb(); __get_cpu_var(dynticks_progress_counter)++; - WARN_ON(!(__get_cpu_var(dynticks_progress_counter) & 0x1)); + if (unlikely(!(__get_cpu_var(dynticks_progress_counter) & 0x1))) { + printk("BUG: bad accounting of dynamic ticks\n"); + printk(" will try to fix, but it is best to reboot\n"); + WARN_ON(1); + /* try to fix it */ + __get_cpu_var(dynticks_progress_counter)++; + } } #else /* CONFIG_NO_HZ */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-boost-sdr.patch�����������������������������������������������������������������0000664�0000764�0000764�00000064444�11041657733�016456� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/init_task.h | 13 + include/linux/rcupdate.h | 44 +++ include/linux/rcupreempt.h | 20 + include/linux/sched.h | 22 + kernel/Kconfig.preempt | 13 + kernel/Makefile | 1 kernel/fork.c | 8 kernel/rcupdate.c | 3 kernel/rcupreempt-boost.c | 549 +++++++++++++++++++++++++++++++++++++++++++++ kernel/rcupreempt.c | 2 kernel/rcupreempt_trace.c | 7 kernel/rtmutex.c | 7 kernel/sched.c | 2 13 files changed, 687 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/include/linux/init_task.h =================================================================== --- linux-2.6.24.7.orig/include/linux/init_task.h +++ linux-2.6.24.7/include/linux/init_task.h @@ -88,6 +88,17 @@ extern struct nsproxy init_nsproxy; .signalfd_wqh = __WAIT_QUEUE_HEAD_INITIALIZER(sighand.signalfd_wqh), \ } +#ifdef CONFIG_PREEMPT_RCU_BOOST +#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO, +#define INIT_PREEMPT_RCU_BOOST(tsk) \ + .rcub_rbdp = NULL, \ + .rcub_state = RCU_BOOST_IDLE, \ + .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry), +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ +#define INIT_RCU_BOOST_PRIO +#define INIT_PREEMPT_RCU_BOOST(tsk) +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ + extern 
struct group_info init_groups; #define INIT_STRUCT_PID { \ @@ -130,6 +141,7 @@ extern struct group_info init_groups; .static_prio = MAX_PRIO-20, \ .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ + INIT_RCU_BOOST_PRIO \ .cpus_allowed = CPU_MASK_ALL, \ .nr_cpus_allowed = NR_CPUS, \ .mm = NULL, \ @@ -176,6 +188,7 @@ extern struct group_info init_groups; .dirties = INIT_PROP_LOCAL_SINGLE(dirties), \ INIT_TRACE_IRQFLAGS \ INIT_LOCKDEP \ + INIT_PREEMPT_RCU_BOOST(tsk) \ } Index: linux-2.6.24.7/include/linux/rcupdate.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupdate.h +++ linux-2.6.24.7/include/linux/rcupdate.h @@ -275,5 +275,49 @@ static inline void rcu_qsctr_inc(int cpu per_cpu(rcu_data_passed_quiesc, cpu) = 1; } +struct dentry; + +#ifdef CONFIG_PREEMPT_RCU_BOOST +extern void init_rcu_boost_late(void); +extern void rcu_boost_readers(void); +extern void rcu_unboost_readers(void); +extern void __rcu_preempt_boost(void); +#ifdef CONFIG_RCU_TRACE +extern int rcu_trace_boost_create(struct dentry *rcudir); +extern void rcu_trace_boost_destroy(void); +#endif /* CONFIG_RCU_TRACE */ +#define rcu_preempt_boost() /* cpp to avoid #include hell. */ \ + do { \ + if (unlikely(current->rcu_read_lock_nesting > 0)) \ + __rcu_preempt_boost(); \ + } while (0) +extern void __rcu_preempt_unboost(void); +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ +static inline void init_rcu_boost_late(void) +{ +} +static inline void rcu_preempt_boost(void) +{ +} +static inline void __rcu_preempt_unboost(void) +{ +} +static inline void rcu_boost_readers(void) +{ +} +static inline void rcu_unboost_readers(void) +{ +} +#ifdef CONFIG_RCU_TRACE +static inline int rcu_trace_boost_create(struct dentry *rcudir) +{ + return 0; +} +static inline void rcu_trace_boost_destroy(void) +{ +} +#endif /* CONFIG_RCU_TRACE */ +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPDATE_H */ Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -42,6 +42,26 @@ #include <linux/cpumask.h> #include <linux/seqlock.h> +#ifdef CONFIG_PREEMPT_RCU_BOOST +/* + * Task state with respect to being RCU-boosted. This state is changed + * by the task itself in response to the following three events: + * 1. Preemption (or block on lock) while in RCU read-side critical section. + * 2. Outermost rcu_read_unlock() for blocked RCU read-side critical section. + * + * The RCU-boost task also updates the state when boosting priority. + */ +enum rcu_boost_state { + RCU_BOOST_IDLE = 0, /* Not yet blocked if in RCU read-side. */ + RCU_BOOST_BLOCKED = 1, /* Blocked from RCU read-side. */ + RCU_BOOSTED = 2, /* Boosting complete. */ + RCU_BOOST_INVALID = 3, /* For bogus state sightings. */ +}; + +#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1) + +#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ + /* * Someone might want to pass call_rcu_bh as a function pointer. * So this needs to just be a rename and not a macro function. 
Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -585,6 +585,19 @@ struct signal_struct { #define SIGNAL_STOP_CONTINUED 0x00000004 /* SIGCONT since WCONTINUED reap */ #define SIGNAL_GROUP_EXIT 0x00000008 /* group exit in progress */ +#ifdef CONFIG_PREEMPT_RCU_BOOST +#define set_rcu_prio(p, prio) /* cpp to avoid #include hell */ \ + do { \ + (p)->rcu_prio = (prio); \ + } while (0) +#define get_rcu_prio(p) (p)->rcu_prio /* cpp to avoid #include hell */ +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ +static inline void set_rcu_prio(struct task_struct *p, int prio) +{ +} +#define get_rcu_prio(p) (MAX_PRIO) /* cpp to use MAX_PRIO before it's defined */ +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ + /* * Some day this will be a full-fledged user tracking system.. */ @@ -1008,6 +1021,9 @@ struct task_struct { #endif int prio, static_prio, normal_prio; +#ifdef CONFIG_PREEMPT_RCU_BOOST + int rcu_prio; +#endif struct list_head run_list; const struct sched_class *sched_class; struct sched_entity se; @@ -1045,6 +1061,12 @@ struct task_struct { #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif +#ifdef CONFIG_PREEMPT_RCU_BOOST + struct rcu_boost_dat *rcub_rbdp; + enum rcu_boost_state rcub_state; + struct list_head rcub_entry; + unsigned long rcu_preempt_counter; +#endif struct list_head tasks; /* Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -157,6 +157,19 @@ config PREEMPT_RCU endchoice +config PREEMPT_RCU_BOOST + bool "Enable priority boosting of RCU read-side critical sections" + depends on PREEMPT_RCU + help + This option permits priority boosting of RCU read-side critical + sections tat have been preempted and a RT process is waiting + on a synchronize_rcu. + + An RCU thread is also created that periodically wakes up and + performs a synchronize_rcu to make sure that all readers eventually + do complete to prevent an indefinite delay of grace periods and + possible OOM problems. 
+ config RCU_TRACE bool "Enable tracing for RCU - currently stats in debugfs" select DEBUG_FS Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -70,6 +70,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o obj-$(CONFIG_PREEMPT_RCU) += rcuclassic.o rcupreempt.o +obj-$(CONFIG_PREEMPT_RCU_BOOST) += rcupreempt-boost.o ifeq ($(CONFIG_PREEMPT_RCU),y) obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o endif Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1089,7 +1089,13 @@ static struct task_struct *copy_process( #ifdef CONFIG_PREEMPT_RCU p->rcu_read_lock_nesting = 0; p->rcu_flipctr_idx = 0; -#endif /* #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_PREEMPT_RCU_BOOST + p->rcu_prio = MAX_PRIO; + p->rcub_rbdp = NULL; + p->rcub_state = RCU_BOOST_IDLE; + INIT_LIST_HEAD(&p->rcub_entry); +#endif +#endif /* CONFIG_PREEMPT_RCU */ p->vfork_done = NULL; spin_lock_init(&p->alloc_lock); Index: linux-2.6.24.7/kernel/rcupdate.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupdate.c +++ linux-2.6.24.7/kernel/rcupdate.c @@ -91,8 +91,11 @@ void synchronize_rcu(void) /* Will wake me after RCU finished */ call_rcu(&rcu.head, wakeme_after_rcu); + rcu_boost_readers(); + /* Wait for it */ wait_for_completion(&rcu.completion); + rcu_unboost_readers(); } EXPORT_SYMBOL_GPL(synchronize_rcu); Index: linux-2.6.24.7/kernel/rcupreempt-boost.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rcupreempt-boost.c @@ -0,0 +1,549 @@ +/* + * Read-Copy Update preempt priority boosting + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright Red Hat Inc, 2007 + * + * Authors: Steven Rostedt <srostedt@redhat.com> + * + * Based on the original work by Paul McKenney <paulmck@us.ibm.com>. 
+ * + */ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/debugfs.h> +#include <linux/module.h> +#include <linux/syscalls.h> +#include <linux/kthread.h> + +DEFINE_RAW_SPINLOCK(rcu_boost_wake_lock); +static int rcu_boost_prio = MAX_PRIO; /* Prio to set preempted RCU readers */ +static long rcu_boost_counter; /* used to keep track of who boosted */ +static int rcu_preempt_thread_secs = 3; /* Seconds between waking rcupreemptd thread */ + +struct rcu_boost_dat { + raw_spinlock_t rbs_lock; /* Sync changes to this struct */ + int rbs_prio; /* CPU copy of rcu_boost_prio */ + struct list_head rbs_toboost; /* Preempted RCU readers */ + struct list_head rbs_boosted; /* RCU readers that have been boosted */ +#ifdef CONFIG_RCU_TRACE + /* The rest are for statistics */ + unsigned long rbs_stat_task_boost_called; + unsigned long rbs_stat_task_boosted; + unsigned long rbs_stat_boost_called; + unsigned long rbs_stat_try_boost; + unsigned long rbs_stat_boosted; + unsigned long rbs_stat_unboost_called; + unsigned long rbs_stat_unboosted; + unsigned long rbs_stat_try_boost_readers; + unsigned long rbs_stat_boost_readers; + unsigned long rbs_stat_try_unboost_readers; + unsigned long rbs_stat_unboost_readers; + unsigned long rbs_stat_over_taken; +#endif /* CONFIG_RCU_TRACE */ +}; + +static DEFINE_PER_CPU(struct rcu_boost_dat, rcu_boost_data); +#define RCU_BOOST_ME &__get_cpu_var(rcu_boost_data) + +#ifdef CONFIG_RCU_TRACE + +#define RCUPREEMPT_BOOST_TRACE_BUF_SIZE 4096 +static char rcupreempt_boost_trace_buf[RCUPREEMPT_BOOST_TRACE_BUF_SIZE]; + +static ssize_t rcuboost_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + static DEFINE_MUTEX(mutex); + int cnt = 0; + int cpu; + struct rcu_boost_dat *rbd; + ssize_t bcount; + unsigned long task_boost_called = 0; + unsigned long task_boosted = 0; + unsigned long boost_called = 0; + unsigned long try_boost = 0; + unsigned long boosted = 0; + unsigned long unboost_called = 0; + unsigned long unboosted = 0; + unsigned long try_boost_readers = 0; + unsigned long boost_readers = 0; + unsigned long try_unboost_readers = 0; + unsigned long unboost_readers = 0; + unsigned long over_taken = 0; + + mutex_lock(&mutex); + + for_each_online_cpu(cpu) { + rbd = &per_cpu(rcu_boost_data, cpu); + + task_boost_called += rbd->rbs_stat_task_boost_called; + task_boosted += rbd->rbs_stat_task_boosted; + boost_called += rbd->rbs_stat_boost_called; + try_boost += rbd->rbs_stat_try_boost; + boosted += rbd->rbs_stat_boosted; + unboost_called += rbd->rbs_stat_unboost_called; + unboosted += rbd->rbs_stat_unboosted; + try_boost_readers += rbd->rbs_stat_try_boost_readers; + boost_readers += rbd->rbs_stat_boost_readers; + try_unboost_readers += rbd->rbs_stat_try_boost_readers; + unboost_readers += rbd->rbs_stat_boost_readers; + over_taken += rbd->rbs_stat_over_taken; + } + + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "task_boost_called = %ld\n", + task_boost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "task_boosted = %ld\n", + task_boosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boost_called = %ld\n", + boost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_boost = %ld\n", + try_boost); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boosted = %ld\n", + 
boosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboost_called = %ld\n", + unboost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboosted = %ld\n", + unboosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_boost_readers = %ld\n", + try_boost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boost_readers = %ld\n", + boost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_unboost_readers = %ld\n", + try_unboost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboost_readers = %ld\n", + unboost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "over_taken = %ld\n", + over_taken); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "rcu_boost_prio = %d\n", + rcu_boost_prio); + bcount = simple_read_from_buffer(buffer, count, ppos, + rcupreempt_boost_trace_buf, strlen(rcupreempt_boost_trace_buf)); + mutex_unlock(&mutex); + + return bcount; +} + +static struct file_operations rcuboost_fops = { + .read = rcuboost_read, +}; + +static struct dentry *rcuboostdir; +int rcu_trace_boost_create(struct dentry *rcudir) +{ + rcuboostdir = debugfs_create_file("rcuboost", 0444, rcudir, + NULL, &rcuboost_fops); + if (!rcuboostdir) + return 1; + + return 0; +} +EXPORT_SYMBOL_GPL(rcu_trace_boost_create); + +void rcu_trace_boost_destroy(void) +{ + if (rcuboostdir) + debugfs_remove(rcuboostdir); + rcuboostdir = NULL; +} +EXPORT_SYMBOL_GPL(rcu_trace_boost_destroy); + +#define RCU_BOOST_TRACE_FUNC_DECL(type) \ + static void rcu_trace_boost_##type(struct rcu_boost_dat *rbd) \ + { \ + rbd->rbs_stat_##type++; \ + } +RCU_BOOST_TRACE_FUNC_DECL(task_boost_called) +RCU_BOOST_TRACE_FUNC_DECL(task_boosted) +RCU_BOOST_TRACE_FUNC_DECL(boost_called) +RCU_BOOST_TRACE_FUNC_DECL(try_boost) +RCU_BOOST_TRACE_FUNC_DECL(boosted) +RCU_BOOST_TRACE_FUNC_DECL(unboost_called) +RCU_BOOST_TRACE_FUNC_DECL(unboosted) +RCU_BOOST_TRACE_FUNC_DECL(try_boost_readers) +RCU_BOOST_TRACE_FUNC_DECL(boost_readers) +RCU_BOOST_TRACE_FUNC_DECL(try_unboost_readers) +RCU_BOOST_TRACE_FUNC_DECL(unboost_readers) +RCU_BOOST_TRACE_FUNC_DECL(over_taken) +#else /* CONFIG_RCU_TRACE */ +/* These were created by the above macro "RCU_BOOST_TRACE_FUNC_DECL" */ +# define rcu_trace_boost_task_boost_called(rbd) do { } while (0) +# define rcu_trace_boost_task_boosted(rbd) do { } while (0) +# define rcu_trace_boost_boost_called(rbd) do { } while (0) +# define rcu_trace_boost_try_boost(rbd) do { } while (0) +# define rcu_trace_boost_boosted(rbd) do { } while (0) +# define rcu_trace_boost_unboost_called(rbd) do { } while (0) +# define rcu_trace_boost_unboosted(rbd) do { } while (0) +# define rcu_trace_boost_try_boost_readers(rbd) do { } while (0) +# define rcu_trace_boost_boost_readers(rbd) do { } while (0) +# define rcu_trace_boost_try_unboost_readers(rbd) do { } while (0) +# define rcu_trace_boost_unboost_readers(rbd) do { } while (0) +# define rcu_trace_boost_over_taken(rbd) do { } while (0) +#endif /* CONFIG_RCU_TRACE */ + +/* + * Helper function to boost a task's prio. 
+ */ +static void rcu_boost_task(struct task_struct *task) +{ + WARN_ON(!irqs_disabled()); + WARN_ON_SMP(!spin_is_locked(&task->pi_lock)); + + rcu_trace_boost_task_boost_called(RCU_BOOST_ME); + + if (task->rcu_prio < task->prio) { + rcu_trace_boost_task_boosted(RCU_BOOST_ME); + rt_mutex_setprio(task, task->rcu_prio); + } +} + +/** + * __rcu_preepmt_boost - Called by sleeping RCU readers. + * + * When the RCU read-side critical section is preempted + * (or schedules out due to RT mutex) + * it places itself onto a list to notify that it is sleeping + * while holding a RCU read lock. If there is already a + * synchronize_rcu happening, then it will increase its + * priority (if necessary). + */ +void __rcu_preempt_boost(void) +{ + struct task_struct *curr = current; + struct rcu_boost_dat *rbd; + int prio; + unsigned long flags; + + WARN_ON(!current->rcu_read_lock_nesting); + + rcu_trace_boost_boost_called(RCU_BOOST_ME); + + /* check to see if we are already boosted */ + if (unlikely(curr->rcub_rbdp)) + return; + + /* + * To keep us from preempting between grabing + * the rbd and locking it, we use local_irq_save + */ + local_irq_save(flags); + rbd = &__get_cpu_var(rcu_boost_data); + spin_lock(&rbd->rbs_lock); + + spin_lock(&curr->pi_lock); + + curr->rcub_rbdp = rbd; + + rcu_trace_boost_try_boost(rbd); + + prio = rt_mutex_getprio(curr); + + if (list_empty(&curr->rcub_entry)) + list_add_tail(&curr->rcub_entry, &rbd->rbs_toboost); + if (prio <= rbd->rbs_prio) + goto out; + + rcu_trace_boost_boosted(curr->rcub_rbdp); + + curr->rcu_prio = rbd->rbs_prio; + rcu_boost_task(curr); + + out: + spin_unlock(&curr->pi_lock); + spin_unlock_irqrestore(&rbd->rbs_lock, flags); +} + +/** + * __rcu_preempt_unboost - called when releasing the RCU read lock + * + * When releasing the RCU read lock, a check is made to see if + * the task was preempted. If it was, it removes itself from the + * RCU data lists and if necessary, sets its priority back to + * normal. + */ +void __rcu_preempt_unboost(void) +{ + struct task_struct *curr = current; + struct rcu_boost_dat *rbd; + int prio; + unsigned long flags; + + rcu_trace_boost_unboost_called(RCU_BOOST_ME); + + /* if not boosted, then ignore */ + if (likely(!curr->rcub_rbdp)) + return; + + rbd = curr->rcub_rbdp; + + spin_lock_irqsave(&rbd->rbs_lock, flags); + list_del_init(&curr->rcub_entry); + + rcu_trace_boost_unboosted(curr->rcub_rbdp); + + curr->rcu_prio = MAX_PRIO; + + spin_lock(&curr->pi_lock); + prio = rt_mutex_getprio(curr); + rt_mutex_setprio(curr, prio); + + curr->rcub_rbdp = NULL; + + spin_unlock(&curr->pi_lock); + spin_unlock_irqrestore(&rbd->rbs_lock, flags); +} + +/* + * For each rcu_boost_dat structure, update all the tasks that + * are on the lists to the priority of the caller of + * synchronize_rcu. + */ +static int __rcu_boost_readers(struct rcu_boost_dat *rbd, int prio, unsigned long flags) +{ + struct task_struct *curr = current; + struct task_struct *p; + + spin_lock(&rbd->rbs_lock); + + rbd->rbs_prio = prio; + + /* + * Move the already boosted readers onto the list and reboost + * them. + */ + list_splice_init(&rbd->rbs_boosted, + &rbd->rbs_toboost); + + while (!list_empty(&rbd->rbs_toboost)) { + p = list_entry(rbd->rbs_toboost.next, + struct task_struct, rcub_entry); + list_move_tail(&p->rcub_entry, + &rbd->rbs_boosted); + p->rcu_prio = prio; + spin_lock(&p->pi_lock); + rcu_boost_task(p); + spin_unlock(&p->pi_lock); + + /* + * Now we release the lock to allow for a higher + * priority task to come in and boost the readers + * even higher. 
Or simply to let a higher priority + * task to run now. + */ + spin_unlock(&rbd->rbs_lock); + spin_unlock_irqrestore(&rcu_boost_wake_lock, flags); + + cpu_relax(); + spin_lock_irqsave(&rcu_boost_wake_lock, flags); + /* + * Another task may have taken over. + */ + if (curr->rcu_preempt_counter != rcu_boost_counter) { + rcu_trace_boost_over_taken(rbd); + return 1; + } + + spin_lock(&rbd->rbs_lock); + } + + spin_unlock(&rbd->rbs_lock); + + return 0; +} + +/** + * rcu_boost_readers - called by synchronize_rcu to boost sleeping RCU readers. + * + * This function iterates over all the per_cpu rcu_boost_data descriptors + * and boosts any sleeping (or slept) RCU readers. + */ +void rcu_boost_readers(void) +{ + struct task_struct *curr = current; + struct rcu_boost_dat *rbd; + unsigned long flags; + int prio; + int cpu; + int ret; + + spin_lock_irqsave(&rcu_boost_wake_lock, flags); + + prio = rt_mutex_getprio(curr); + + rcu_trace_boost_try_boost_readers(RCU_BOOST_ME); + + if (prio >= rcu_boost_prio) { + /* already boosted */ + spin_unlock_irqrestore(&rcu_boost_wake_lock, flags); + return; + } + + rcu_boost_prio = prio; + + rcu_trace_boost_boost_readers(RCU_BOOST_ME); + + /* Flag that we are the one to unboost */ + curr->rcu_preempt_counter = ++rcu_boost_counter; + + for_each_online_cpu(cpu) { + rbd = &per_cpu(rcu_boost_data, cpu); + ret = __rcu_boost_readers(rbd, prio, flags); + if (ret) + break; + } + + spin_unlock_irqrestore(&rcu_boost_wake_lock, flags); + +} + +/** + * rcu_unboost_readers - set the boost level back to normal. + * + * This function DOES NOT change the priority of any RCU reader + * that was boosted. The RCU readers do that when they release + * the RCU lock. This function only sets the global + * rcu_boost_prio to MAX_PRIO so that new RCU readers that sleep + * do not increase their priority. + */ +void rcu_unboost_readers(void) +{ + struct rcu_boost_dat *rbd; + unsigned long flags; + int cpu; + + spin_lock_irqsave(&rcu_boost_wake_lock, flags); + + rcu_trace_boost_try_unboost_readers(RCU_BOOST_ME); + + if (current->rcu_preempt_counter != rcu_boost_counter) + goto out; + + rcu_trace_boost_unboost_readers(RCU_BOOST_ME); + + /* + * We could also put in something that + * would allow other synchronize_rcu callers + * of lower priority that are still waiting + * to boost the prio. + */ + rcu_boost_prio = MAX_PRIO; + + for_each_online_cpu(cpu) { + rbd = &per_cpu(rcu_boost_data, cpu); + + spin_lock(&rbd->rbs_lock); + rbd->rbs_prio = rcu_boost_prio; + spin_unlock(&rbd->rbs_lock); + } + + out: + spin_unlock_irqrestore(&rcu_boost_wake_lock, flags); +} + +/* + * The krcupreemptd wakes up every "rcu_preempt_thread_secs" + * seconds at the minimum priority of 1 to do a + * synchronize_rcu. This ensures that grace periods finish + * and that we do not starve the system. If there are RT + * tasks above priority 1 that are hogging the system and + * preventing release of memory, then its the fault of the + * system designer running RT tasks too aggressively and the + * system is flawed regardless. 
+ */ +static int krcupreemptd(void *data) +{ + struct sched_param param = { .sched_priority = 1 }; + int ret; + int prio; + + ret = sched_setscheduler(current, SCHED_FIFO, ¶m); + printk("krcupreemptd setsched %d\n", ret); + prio = current->prio; + printk(" prio = %d\n", prio); + set_current_state(TASK_INTERRUPTIBLE); + + while (!kthread_should_stop()) { + schedule_timeout(rcu_preempt_thread_secs * HZ); + + __set_current_state(TASK_RUNNING); + if (prio != current->prio) { + prio = current->prio; + printk("krcupreemptd new prio is %d??\n",prio); + } + + synchronize_rcu(); + + set_current_state(TASK_INTERRUPTIBLE); + } + __set_current_state(TASK_RUNNING); + return 0; +} + +static int __init rcu_preempt_boost_init(void) +{ + struct rcu_boost_dat *rbd; + struct task_struct *p; + int cpu; + + for_each_possible_cpu(cpu) { + rbd = &per_cpu(rcu_boost_data, cpu); + + spin_lock_init(&rbd->rbs_lock); + rbd->rbs_prio = MAX_PRIO; + INIT_LIST_HEAD(&rbd->rbs_toboost); + INIT_LIST_HEAD(&rbd->rbs_boosted); + } + + p = kthread_create(krcupreemptd, NULL, + "krcupreemptd"); + + if (IS_ERR(p)) { + printk("krcupreemptd failed\n"); + return NOTIFY_BAD; + } + wake_up_process(p); + + return 0; +} + +core_initcall(rcu_preempt_boost_init); Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -310,6 +310,8 @@ void __rcu_read_unlock(void) ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])--; local_irq_restore(oldirq); + + __rcu_preempt_unboost(); } } EXPORT_SYMBOL_GPL(__rcu_read_unlock); Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt_trace.c +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -296,8 +296,14 @@ static int rcupreempt_debugfs_init(void) NULL, &rcuctrs_fops); if (!ctrsdir) goto free_out; + + if (!rcu_trace_boost_create(rcudir)) + goto free_out; + return 0; free_out: + if (ctrsdir) + debugfs_remove(ctrsdir); if (statdir) debugfs_remove(statdir); if (gpdir) @@ -323,6 +329,7 @@ static int __init rcupreempt_trace_init( static void __exit rcupreempt_trace_cleanup(void) { + rcu_trace_boost_destroy(); debugfs_remove(statdir); debugfs_remove(gpdir); debugfs_remove(ctrsdir); Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -121,11 +121,12 @@ static inline void init_lists(struct rt_ */ int rt_mutex_getprio(struct task_struct *task) { + int prio = min(task->normal_prio, get_rcu_prio(task)); + if (likely(!task_has_pi_waiters(task))) - return task->normal_prio; + return prio; - return min(task_top_pi_waiter(task)->pi_list_entry.prio, - task->normal_prio); + return min(task_top_pi_waiter(task)->pi_list_entry.prio, prio); } /* Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3850,6 +3850,8 @@ asmlinkage void __sched __schedule(void) struct rq *rq; int cpu; + rcu_preempt_boost(); + preempt_disable(); cpu = smp_processor_id(); rq = cpu_rq(cpu); 
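The net effect of the rtmutex.c hunk above is easiest to see in isolation: a task's effective priority becomes the best (numerically lowest) of its normal priority, its top PI waiter, and whatever priority a synchronize_rcu() caller asked preempted readers to be boosted to. Below is a minimal userspace sketch of that folding; MAX_PRIO, the helper names and the sample numbers are illustrative stand-ins, not the kernel API.

#include <stdio.h>

#define MAX_PRIO 140        /* illustrative; stands for "no RCU boost requested" */

static int min_prio(int a, int b)
{
        return a < b ? a : b;        /* lower number == higher priority */
}

/*
 * Sketch of the priority folding rt_mutex_getprio() gains in the patch
 * above: an RCU boost request is honoured exactly like PI boosting.
 */
static int effective_prio(int normal_prio, int top_pi_waiter_prio, int rcu_prio)
{
        int prio = min_prio(normal_prio, rcu_prio);

        return min_prio(top_pi_waiter_prio, prio);
}

int main(void)
{
        /*
         * A nice-0 reader (prio 120) with no PI waiters was preempted inside
         * rcu_read_lock() while a prio-10 task waits in synchronize_rcu():
         * it runs at 10 until the outermost rcu_read_unlock(), where
         * __rcu_preempt_unboost() drops it back to 120.
         */
        printf("%d\n", effective_prio(120, MAX_PRIO, 10));        /* prints 10 */
        return 0;
}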
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-boost-default.patch�������������������������������������������������������������0000664�0000764�0000764�00000001114�11041657730�017270� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/Kconfig.preempt | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -160,6 +160,7 @@ endchoice config PREEMPT_RCU_BOOST bool "Enable priority boosting of RCU read-side critical sections" depends on PREEMPT_RCU + default y if PREEMPT_RT help This option permits priority boosting of RCU read-side critical sections tat have been preempted and a RT process is waiting ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-boost-fix.patch�����������������������������������������������������������������0000664�0000764�0000764�00000005224�11041657731�016441� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcupreempt-boost.c | 39 ++++++++++++++++++++++++++++++++++++--- kernel/rcupreempt.c | 1 + 2 files changed, 37 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/rcupreempt-boost.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt-boost.c +++ linux-2.6.24.7/kernel/rcupreempt-boost.c @@ -221,6 +221,11 @@ RCU_BOOST_TRACE_FUNC_DECL(over_taken) # define rcu_trace_boost_over_taken(rbd) do { } while (0) #endif /* CONFIG_RCU_TRACE */ +static inline int rcu_is_boosted(struct task_struct *task) +{ + return !list_empty(&task->rcub_entry); +} + /* * Helper function to boost a task's prio. */ @@ -259,7 +264,7 @@ void __rcu_preempt_boost(void) rcu_trace_boost_boost_called(RCU_BOOST_ME); /* check to see if we are already boosted */ - if (unlikely(curr->rcub_rbdp)) + if (unlikely(rcu_is_boosted(curr))) return; /* @@ -311,15 +316,42 @@ void __rcu_preempt_unboost(void) rcu_trace_boost_unboost_called(RCU_BOOST_ME); /* if not boosted, then ignore */ - if (likely(!curr->rcub_rbdp)) + if (likely(!rcu_is_boosted(curr))) return; + /* + * Need to be very careful with NMIs. + * If we take the lock and an NMI comes in + * and it may try to unboost us if curr->rcub_rbdp + * is still set. So we zero it before grabbing the lock. 
+ * But this also means that we might be boosted again + * so the boosting code needs to be aware of this. + */ rbd = curr->rcub_rbdp; + curr->rcub_rbdp = NULL; + + /* + * Now an NMI might have came in after we grab + * the below lock. This check makes sure that + * the NMI doesn't try grabbing the lock + * while we already have it. + */ + if (unlikely(!rbd)) + return; spin_lock_irqsave(&rbd->rbs_lock, flags); + /* + * It is still possible that an NMI came in + * between the "is_boosted" check and setting + * the rcu_rbdp to NULL. This would mean that + * the NMI already dequeued us. + */ + if (unlikely(!rcu_is_boosted(curr))) + goto out; + list_del_init(&curr->rcub_entry); - rcu_trace_boost_unboosted(curr->rcub_rbdp); + rcu_trace_boost_unboosted(rbd); curr->rcu_prio = MAX_PRIO; @@ -330,6 +362,7 @@ void __rcu_preempt_unboost(void) curr->rcub_rbdp = NULL; spin_unlock(&curr->pi_lock); + out: spin_unlock_irqrestore(&rbd->rbs_lock, flags); } Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -309,6 +309,7 @@ void __rcu_read_unlock(void) */ ACCESS_ONCE(__get_cpu_var(rcu_flipctr)[idx])--; + local_irq_restore(oldirq); __rcu_preempt_unboost(); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-torture-preempt-update.patch������������������������������������������������������������0000664�0000764�0000764�00000012152�11041657732�017512� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcutorture.c | 69 ++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 54 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/kernel/rcutorture.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcutorture.c +++ linux-2.6.24.7/kernel/rcutorture.c @@ -52,6 +52,7 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck static int nreaders = -1; /* # reader threads, defaults to 2*ncpus */ static int nfakewriters = 4; /* # fake writer threads */ +static int npreempthogs = -1; /* # preempt hogs to run (defaults to ncpus-1) or 1 */ static int stat_interval; /* Interval between stats, in seconds. */ /* Defaults to "only at end of test". */ static int verbose; /* Print more debug info. 
*/ @@ -88,9 +89,11 @@ MODULE_PARM_DESC(torture_type, "Type of static char printk_buf[4096]; static int nrealreaders; +static int nrealpreempthogs; static struct task_struct *writer_task; static struct task_struct **fakewriter_tasks; static struct task_struct **reader_tasks; +static struct task_struct **rcu_preempt_tasks; static struct task_struct *stats_task; static struct task_struct *shuffler_task; @@ -260,7 +263,6 @@ static void rcu_torture_deferred_free(st call_rcu(&p->rtort_rcu, rcu_torture_cb); } -static struct task_struct *rcu_preeempt_task; static unsigned long rcu_torture_preempt_errors; static int rcu_torture_preempt(void *arg) @@ -270,7 +272,7 @@ static int rcu_torture_preempt(void *arg time_t gcstart; struct sched_param sp; - sp.sched_priority = MAX_RT_PRIO - 1; + sp.sched_priority = 1; err = sched_setscheduler(current, SCHED_RR, &sp); if (err != 0) printk(KERN_ALERT "rcu_torture_preempt() priority err: %d\n", @@ -293,24 +295,43 @@ static int rcu_torture_preempt(void *arg static long rcu_preempt_start(void) { long retval = 0; + int i; - rcu_preeempt_task = kthread_run(rcu_torture_preempt, NULL, - "rcu_torture_preempt"); - if (IS_ERR(rcu_preeempt_task)) { - VERBOSE_PRINTK_ERRSTRING("Failed to create preempter"); - retval = PTR_ERR(rcu_preeempt_task); - rcu_preeempt_task = NULL; + rcu_preempt_tasks = kzalloc(nrealpreempthogs * sizeof(rcu_preempt_tasks[0]), + GFP_KERNEL); + if (rcu_preempt_tasks == NULL) { + VERBOSE_PRINTK_ERRSTRING("out of memory"); + retval = -ENOMEM; + goto out; } + + for (i=0; i < nrealpreempthogs; i++) { + rcu_preempt_tasks[i] = kthread_run(rcu_torture_preempt, NULL, + "rcu_torture_preempt"); + if (IS_ERR(rcu_preempt_tasks[i])) { + VERBOSE_PRINTK_ERRSTRING("Failed to create preempter"); + retval = PTR_ERR(rcu_preempt_tasks[i]); + rcu_preempt_tasks[i] = NULL; + break; + } + } + out: return retval; } static void rcu_preempt_end(void) { - if (rcu_preeempt_task != NULL) { - VERBOSE_PRINTK_STRING("Stopping rcu_preempt task"); - kthread_stop(rcu_preeempt_task); + int i; + if (rcu_preempt_tasks) { + for (i=0; i < nrealpreempthogs; i++) { + if (rcu_preempt_tasks[i] != NULL) { + VERBOSE_PRINTK_STRING("Stopping rcu_preempt task"); + kthread_stop(rcu_preempt_tasks[i]); + } + rcu_preempt_tasks[i] = NULL; + } + kfree(rcu_preempt_tasks); } - rcu_preeempt_task = NULL; } static int rcu_preempt_stats(char *page) @@ -605,10 +626,20 @@ rcu_torture_writer(void *arg) static int rcu_torture_fakewriter(void *arg) { + struct sched_param sp; + long id = (long) arg; + int err; DEFINE_RCU_RANDOM(rand); VERBOSE_PRINTK_STRING("rcu_torture_fakewriter task started"); - set_user_nice(current, 19); + /* + * Set up at a higher prio than the readers. 
+ */ + sp.sched_priority = 1 + id; + err = sched_setscheduler(current, SCHED_RR, &sp); + if (err != 0) + printk(KERN_ALERT "rcu_torture_writer() priority err: %d\n", + err); do { schedule_timeout_uninterruptible(1 + rcu_random(&rand)%10); @@ -841,9 +872,11 @@ rcu_torture_print_module_parms(char *tag { printk(KERN_ALERT "%s" TORTURE_FLAG "--- %s: nreaders=%d nfakewriters=%d " + "npreempthogs=%d " "stat_interval=%d verbose=%d test_no_idle_hz=%d " "shuffle_interval=%d preempt_torture=%d\n", torture_type, tag, nrealreaders, nfakewriters, + nrealpreempthogs, stat_interval, verbose, test_no_idle_hz, shuffle_interval, preempt_torture); } @@ -917,7 +950,7 @@ rcu_torture_cleanup(void) static int __init rcu_torture_init(void) { - int i; + long i; int cpu; int firsterr = 0; static struct rcu_torture_ops *torture_ops[] = @@ -945,6 +978,12 @@ rcu_torture_init(void) rcu_torture_print_module_parms("Start of test"); fullstop = 0; + if (npreempthogs >= 0) + nrealpreempthogs = npreempthogs; + else + nrealpreempthogs = num_online_cpus() == 1 ? 1 : + num_online_cpus() - 1; + /* Set up the freelist. */ INIT_LIST_HEAD(&rcu_torture_freelist); @@ -992,7 +1031,7 @@ rcu_torture_init(void) } for (i = 0; i < nfakewriters; i++) { VERBOSE_PRINTK_STRING("Creating rcu_torture_fakewriter task"); - fakewriter_tasks[i] = kthread_run(rcu_torture_fakewriter, NULL, + fakewriter_tasks[i] = kthread_run(rcu_torture_fakewriter, (void*)i, "rcu_torture_fakewriter"); if (IS_ERR(fakewriter_tasks[i])) { firsterr = PTR_ERR(fakewriter_tasks[i]); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcupreempt-boost-early-init.patch�����������������������������������������������������������0000664�0000764�0000764�00000005600�11041657733�017653� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/rcuclassic.h | 1 + include/linux/rcupreempt.h | 8 +++++++- kernel/rcupreempt-boost.c | 16 +++++++++++----- kernel/rcupreempt.c | 1 + 4 files changed, 20 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/include/linux/rcuclassic.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcuclassic.h +++ linux-2.6.24.7/include/linux/rcuclassic.h @@ -88,6 +88,7 @@ static inline void rcu_bh_qsctr_inc(int #define rcu_process_callbacks_rt(unused) do { } while (0) #define rcu_enter_nohz() do { } while (0) #define rcu_exit_nohz() do { } while (0) +#define rcu_preempt_boost_init() do { } while (0) extern void FASTCALL(call_rcu_classic(struct rcu_head *head, void (*func)(struct rcu_head *head))); Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -60,7 +60,13 @@ enum rcu_boost_state { #define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1) -#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST 
*/ +int __init rcu_preempt_boost_init(void); + +#else /* CONFIG_PREEPMT_RCU_BOOST */ + +#define rcu_preempt_boost_init() do { } while (0) + +#endif /* CONFIG_PREEMPT_RCU_BOOST */ /* * Someone might want to pass call_rcu_bh as a function pointer. Index: linux-2.6.24.7/kernel/rcupreempt-boost.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt-boost.c +++ linux-2.6.24.7/kernel/rcupreempt-boost.c @@ -174,9 +174,9 @@ int rcu_trace_boost_create(struct dentry rcuboostdir = debugfs_create_file("rcuboost", 0444, rcudir, NULL, &rcuboost_fops); if (!rcuboostdir) - return 1; + return 0; - return 0; + return 1; } EXPORT_SYMBOL_GPL(rcu_trace_boost_create); @@ -552,10 +552,9 @@ static int krcupreemptd(void *data) return 0; } -static int __init rcu_preempt_boost_init(void) +int __init rcu_preempt_boost_init(void) { struct rcu_boost_dat *rbd; - struct task_struct *p; int cpu; for_each_possible_cpu(cpu) { @@ -567,6 +566,13 @@ static int __init rcu_preempt_boost_init INIT_LIST_HEAD(&rbd->rbs_boosted); } + return 0; +} + +static int __init rcu_preempt_start_krcupreemptd(void) +{ + struct task_struct *p; + p = kthread_create(krcupreemptd, NULL, "krcupreemptd"); @@ -579,4 +585,4 @@ static int __init rcu_preempt_boost_init return 0; } -core_initcall(rcu_preempt_boost_init); +__initcall(rcu_preempt_start_krcupreemptd); Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -995,6 +995,7 @@ void __init rcu_init_rt(void) rdp->donelist = NULL; rdp->donetail = &rdp->donelist; } + rcu_preempt_boost_init(); } /* ��������������������������������������������������������������������������������������������������������������������������������patches/plist-debug.patch���������������������������������������������������������������������������0000664�0000764�0000764�00000003505�11041657734�014510� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From acme@ghostprotocols.net Tue Oct 23 16:01:53 2007 Date: Mon, 22 Oct 2007 14:43:02 -0200 From: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-rt-users@vger.kernel.org Subject: [PATCH][DEBUG_PI_LIST]: Set plist.lock to NULL on PREEMPT_RT On RT struct plist_head->lock is a raw_spinlock_t, but struct futex_hash_bucket->lock, that is set to plist_head->lock is a spinlock, which becomes a mutex on RT. Later in plist_check_head spin_is_locked can't figure out what is the right type, triggering a WARN_ON_SMP. As we were already special casing PREEMPT_RT on plist_check_head.. 
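The failure mode is easier to see against the shape of the debug check itself. The sketch below is illustrative only (the real plist_check_head() lives in lib/plist.c under CONFIG_DEBUG_PI_LIST and differs in detail): once the recorded lock pointer is NULL the assertion is simply skipped, instead of spin_is_locked() being applied to something that is secretly a sleeping mutex on PREEMPT_RT.

/*
 * Illustrative sketch, not the kernel code: the consistency check is only
 * meaningful when the recorded lock really is a spinning lock.  Storing
 * NULL (as the patch below does for PREEMPT_RT) turns it into a no-op for
 * that list head.
 */
struct plist_head_sketch {
        void *lock;                        /* debug-only back pointer, may be NULL */
        int (*is_locked)(void *lock);      /* e.g. spin_is_locked() */
};

static int plist_check_head_sketch(struct plist_head_sketch *head)
{
        if (!head->lock)                   /* nothing recorded: nothing to assert */
                return 0;

        /* corresponds to WARN_ON_SMP(!spin_is_locked(head->lock)) */
        return !head->is_locked(head->lock);
}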
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> --- kernel/futex.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/futex.c =================================================================== --- linux-2.6.24.7.orig/kernel/futex.c +++ linux-2.6.24.7/kernel/futex.c @@ -951,9 +951,13 @@ static int futex_requeue(u32 __user *uad plist_del(&this->list, &hb1->chain); plist_add(&this->list, &hb2->chain); this->lock_ptr = &hb2->lock; -#if defined(CONFIG_DEBUG_PI_LIST) && !defined(CONFIG_PREEMPT_RT) +#ifdef CONFIG_DEBUG_PI_LIST +#ifdef CONFIG_PREEMPT_RT + this->list.plist.lock = NULL; +#else this->list.plist.lock = &hb2->lock; #endif +#endif } this->key = key2; get_futex_key_refs(&key2); @@ -1012,9 +1016,13 @@ static inline void __queue_me(struct fut prio = min(current->normal_prio, MAX_RT_PRIO); plist_node_init(&q->list, prio); -#if defined(CONFIG_DEBUG_PI_LIST) && !defined(CONFIG_PREEMPT_RT) +#ifdef CONFIG_DEBUG_PI_LIST +#ifdef CONFIG_PREEMPT_RT + q->list.plist.lock = NULL; +#else q->list.plist.lock = &hb->lock; #endif +#endif plist_add(&q->list, &hb->chain); q->task = current; spin_unlock(&hb->lock); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/seq-irqsave.patch���������������������������������������������������������������������������0000664�0000764�0000764�00000004677�11041657733�014543� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From dwalker@mvista.com Tue Oct 23 16:15:26 2007 Date: Mon, 22 Oct 2007 11:53:03 -0700 From: Daniel Walker <dwalker@mvista.com> To: Steven Rostedt <rostedt@goodmis.org> Cc: Remy Bohmer <linux@bohmer.net>, Ingo Molnar <mingo@elte.hu>, LKML <linux-kernel@vger.kernel.org>, RT <linux-rt-users@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de> Subject: Re: [RT] seqlocks: use of PICK_FUNCTION breaks kernel compile when CONFIG_GENERIC_TIME is NOT set On Wed, 2007-10-17 at 11:34 -0400, Steven Rostedt wrote: > > Hmm, what about a __seq_irqsave_raw and __seq_nop? > > That way it spells out that irqs are NOT touched if it is not a raw lock. I took out the nop , and just did a save flags which makes sense.. There is still more cleanup to do in that regard. 
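For readers new to the PICK_* machinery: the choice between the two helpers is made at compile time from the static type of the lock, so the raw variant really masks interrupts while the ordinary -rt seqlock variant only records the current flags. A standalone sketch of that dispatch follows; the stand-in types, the dummy helper bodies and the _sketch/_pick names are assumptions for illustration, not the kernel definitions.

/*
 * Standalone sketch of the compile-time type dispatch behind
 * PICK_SEQ_OP_RET().  Types and helpers are stand-ins, not the kernel ones.
 */
typedef struct { unsigned sequence; } raw_seqlock_t;
typedef struct { unsigned sequence; } seqlock_t;

static unsigned long seq_irqsave_raw_sketch(raw_seqlock_t *sl)
{
        return 1UL;        /* the real helper does local_irq_save(flags) */
}

static unsigned long seq_irqsave_sketch(seqlock_t *sl)
{
        return 0UL;        /* the real helper only does local_save_flags(flags) */
}

#define TYPE_EQUAL(lock, type) \
        __builtin_types_compatible_p(__typeof__(lock), type *)

#define seq_irqsave_pick(lock)                                                \
        __builtin_choose_expr(TYPE_EQUAL(lock, raw_seqlock_t),                \
                seq_irqsave_raw_sketch((raw_seqlock_t *)(void *)(lock)),      \
                seq_irqsave_sketch((seqlock_t *)(void *)(lock)))

Given raw_seqlock_t rl and seqlock_t sl, seq_irqsave_pick(&rl) resolves to the IRQ-disabling variant and seq_irqsave_pick(&sl) to the flag-saving one, with no runtime type check involved.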
Signed-off-by: Daniel Walker <dwalker@mvista.com> --- include/linux/seqlock.h | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/include/linux/seqlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/seqlock.h +++ linux-2.6.24.7/include/linux/seqlock.h @@ -92,8 +92,11 @@ static inline void __write_seqlock(seqlo static __always_inline unsigned long __write_seqlock_irqsave(seqlock_t *sl) { + unsigned long flags; + + local_save_flags(flags); __write_seqlock(sl); - return 0; + return flags; } static inline void __write_sequnlock(seqlock_t *sl) @@ -280,26 +283,27 @@ do { \ PICK_SEQ_OP(__write_sequnlock_irq_raw, __write_sequnlock, lock) static __always_inline -unsigned long __read_seqbegin_irqsave_raw(raw_seqlock_t *sl) +unsigned long __seq_irqsave_raw(raw_seqlock_t *sl) { unsigned long flags; local_irq_save(flags); - __read_seqbegin_raw(sl); return flags; } -static __always_inline unsigned long __read_seqbegin_irqsave(seqlock_t *sl) +static __always_inline unsigned long __seq_irqsave(seqlock_t *sl) { - __read_seqbegin(sl); - return 0; + unsigned long flags; + + local_save_flags(flags); + return flags; } -#define read_seqbegin_irqsave(lock, flags) \ -do { \ - flags = PICK_SEQ_OP_RET(__read_seqbegin_irqsave_raw, \ - __read_seqbegin_irqsave, lock); \ -} while (0) +#define read_seqbegin_irqsave(lock, flags) \ +({ \ + flags = PICK_SEQ_OP_RET(__seq_irqsave_raw, __seq_irqsave, lock);\ + read_seqbegin(lock); \ +}) static __always_inline int __read_seqretry_irqrestore(seqlock_t *sl, unsigned iv, unsigned long flags) �����������������������������������������������������������������patches/numa-slab-freeing.patch���������������������������������������������������������������������0000664�0000764�0000764�00000003573�11041657730�015566� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ak@suse.de Tue Oct 23 16:24:16 2007 Date: Tue, 23 Oct 2007 19:13:03 +0200 From: Andi Kleen <ak@suse.de> To: linux-rt-users@vger.kernel.org Subject: [PATCH] Fix rt preempt slab NUMA freeing When this_cpu changes in the free path node needs to change too. Otherwise the slab can end up in the wrong node's list and this eventually leads to WARN_ONs and of course worse NUMA performace. This patch is likely not complete (the NUMA slab code is *very* hairy), but seems to make the make -j128 test survive for at least two hours. But at least it fixes one case that regularly triggered during testing, resulting in slabs in the wrong node lists and triggering WARN_ONs in slab_put/get_obj I tried a complete audit of keeping this_cpu/node/slabp in sync when needed, but it is very hairy code and I likely missed some cases. This so far fixes only the simple free path; but it seems to be good enough to not trigger easily anymore on a NUMA system with memory pressure. Longer term the only good fix is probably to migrate to slub. 
Or disable NUMA slab for PREEMPT_RT (its value has been disputed in some benchmarks anyways) Signed-off-by: Andi Kleen <ak@suse.de> --- mm/slab.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -1192,7 +1192,7 @@ cache_free_alien(struct kmem_cache *cach struct array_cache *alien = NULL; int node; - node = numa_node_id(); + node = cpu_to_node(*this_cpu); /* * Make sure we are not freeing a object from another node to the array @@ -4215,6 +4215,8 @@ static void cache_reap(struct work_struc work_done += reap_alien(searchp, l3, &this_cpu); + node = cpu_to_node(this_cpu); + work_done += drain_array(searchp, l3, cpu_cache_get(searchp, this_cpu), 0, node); �������������������������������������������������������������������������������������������������������������������������������������patches/rt_mutex_setprio.patch����������������������������������������������������������������������0000664�0000764�0000764�00000005271�11041657734�015707� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: rename rt_mutex_setprio to task_setprio With there being multiple non-mutex users of this function its past time it got renamed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/sched.h | 7 ++++++- kernel/rcupreempt-boost.c | 4 ++-- kernel/sched.c | 8 ++------ 3 files changed, 10 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1634,9 +1634,14 @@ int sched_nr_latency_handler(struct ctl_ extern unsigned int sysctl_sched_compat_yield; +extern void task_setprio(struct task_struct *p, int prio); + #ifdef CONFIG_RT_MUTEXES extern int rt_mutex_getprio(struct task_struct *p); -extern void rt_mutex_setprio(struct task_struct *p, int prio); +static inline void rt_mutex_setprio(struct task_struct *p, int prio) +{ + task_setprio(p, prio); +} extern void rt_mutex_adjust_pi(struct task_struct *p); #else static inline int rt_mutex_getprio(struct task_struct *p) Index: linux-2.6.24.7/kernel/rcupreempt-boost.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt-boost.c +++ linux-2.6.24.7/kernel/rcupreempt-boost.c @@ -238,7 +238,7 @@ static void rcu_boost_task(struct task_s if (task->rcu_prio < task->prio) { rcu_trace_boost_task_boosted(RCU_BOOST_ME); - rt_mutex_setprio(task, task->rcu_prio); + task_setprio(task, task->rcu_prio); } } @@ -357,7 +357,7 @@ void __rcu_preempt_unboost(void) spin_lock(&curr->pi_lock); prio = rt_mutex_getprio(curr); - rt_mutex_setprio(curr, prio); + task_setprio(curr, prio); curr->rcub_rbdp = NULL; Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -4330,10 +4330,8 @@ long __sched sleep_on_timeout(wait_queue } EXPORT_SYMBOL(sleep_on_timeout); -#ifdef CONFIG_RT_MUTEXES - /* - * rt_mutex_setprio - set the current priority of a task + 
* task_setprio - set the current priority of a task * @p: task * @prio: prio value (kernel-internal form) * @@ -4342,7 +4340,7 @@ EXPORT_SYMBOL(sleep_on_timeout); * * Used by the rt_mutex code to implement priority inheritance logic. */ -void rt_mutex_setprio(struct task_struct *p, int prio) +void task_setprio(struct task_struct *p, int prio) { unsigned long flags; int oldprio, prev_resched, on_rq, running; @@ -4403,8 +4401,6 @@ out_unlock: task_rq_unlock(rq, &flags); } -#endif - void set_user_nice(struct task_struct *p, long nice) { int old_prio, delta, on_rq; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-list-mods.patch��������������������������������������������������������������������������0000664�0000764�0000764�00000011034�11041657735�014624� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: list_splice2 Introduce list_splice2{,_tail}() which will splice a sub-list denoted by two list items instead of the full list. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- drivers/dma/ioat_dma.c | 2 - drivers/usb/host/ehci-q.c | 2 - include/linux/list.h | 66 ++++++++++++++++++++++++++++++++++++++++------ lib/lock_list.c | 2 - 4 files changed, 61 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/drivers/dma/ioat_dma.c =================================================================== --- linux-2.6.24.7.orig/drivers/dma/ioat_dma.c +++ linux-2.6.24.7/drivers/dma/ioat_dma.c @@ -297,7 +297,7 @@ static dma_cookie_t ioat1_tx_submit(stru /* write address into NextDescriptor field of last desc in chain */ to_ioat_desc(ioat_chan->used_desc.prev)->hw->next = first->async_tx.phys; - __list_splice(&new_chain, ioat_chan->used_desc.prev); + list_splice_tail(&new_chain, &ioat_chan->used_desc); ioat_chan->dmacount += desc_count; ioat_chan->pending += desc_count; Index: linux-2.6.24.7/drivers/usb/host/ehci-q.c =================================================================== --- linux-2.6.24.7.orig/drivers/usb/host/ehci-q.c +++ linux-2.6.24.7/drivers/usb/host/ehci-q.c @@ -887,7 +887,7 @@ static struct ehci_qh *qh_append_tds ( list_del (&qtd->qtd_list); list_add (&dummy->qtd_list, qtd_list); - __list_splice (qtd_list, qh->qtd_list.prev); + list_splice_tail (qtd_list, &qh->qtd_list); ehci_qtd_init(ehci, qtd, qtd->qtd_dma); qh->dummy = qtd; Index: linux-2.6.24.7/include/linux/list.h =================================================================== --- linux-2.6.24.7.orig/include/linux/list.h +++ linux-2.6.24.7/include/linux/list.h @@ -320,17 +320,17 @@ static inline int list_empty_careful(con } static inline void __list_splice(struct list_head *list, - struct list_head *head) + struct list_head *prev, + struct list_head *next) { struct list_head *first = list->next; struct list_head *last = list->prev; - struct list_head *at = head->next; - first->prev = head; - head->next = first; + first->prev = prev; + prev->next = first; - last->next = at; - at->prev = last; + last->next = next; + next->prev = 
last; } /** @@ -341,7 +341,13 @@ static inline void __list_splice(struct static inline void list_splice(struct list_head *list, struct list_head *head) { if (!list_empty(list)) - __list_splice(list, head); + __list_splice(list, head, head->next); +} + +static inline void list_splice_tail(struct list_head *list, struct list_head *head) +{ + if (!list_empty(list)) + __list_splice(list, head->prev, head); } /** @@ -355,11 +361,55 @@ static inline void list_splice_init(stru struct list_head *head) { if (!list_empty(list)) { - __list_splice(list, head); + __list_splice(list, head, head->next); + INIT_LIST_HEAD(list); + } +} + +static inline void list_splice_tail_init(struct list_head *list, + struct list_head *head) +{ + if (!list_empty(list)) { + __list_splice(list, head->prev, head); INIT_LIST_HEAD(list); } } +static inline void __list_splice2(struct list_head *first, + struct list_head *last, + struct list_head *prev, + struct list_head *next) +{ + first->prev->next = last->next; + last->next->prev = first->prev; + + first->prev = prev; + prev->next = first; + + last->next = next; + next->prev = last; +} + +/** + * list_splice2 - join [first, last] to head + * @first: list item + * @last: list item further on the same list + * @head: the place to add it on another list + */ +static inline void list_splice2(struct list_head *first, + struct list_head *last, + struct list_head *head) +{ + __list_splice2(first, last, head, head->next); +} + +static inline void list_splice2_tail(struct list_head *first, + struct list_head *last, + struct list_head *head) +{ + __list_splice2(first, last, head->prev, head); +} + /** * list_splice_init_rcu - splice an RCU-protected list into an existing list. * @list: the RCU-protected list to splice Index: linux-2.6.24.7/lib/lock_list.c =================================================================== --- linux-2.6.24.7.orig/lib/lock_list.c +++ linux-2.6.24.7/lib/lock_list.c @@ -128,7 +128,7 @@ void lock_list_splice_init(struct lock_l lock = __lock_list_reverse(list); if (!list_empty(&list->head)) { spin_lock_nested(&head->lock, LOCK_LIST_NESTING_NEXT); - __list_splice(&list->head, &head->head); + __list_splice(&list->head, &head->head, head->head.next); INIT_LIST_HEAD(&list->head); spin_unlock(&head->lock); } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-plist-mods.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000006701�11041657732�015006� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: plist_head_splice merge-sort two plists together Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/plist.h | 2 + lib/plist.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 68 insertions(+), 2 deletions(-) Index: 
linux-2.6.24.7/include/linux/plist.h =================================================================== --- linux-2.6.24.7.orig/include/linux/plist.h +++ linux-2.6.24.7/include/linux/plist.h @@ -148,6 +148,8 @@ static inline void plist_node_init(struc extern void plist_add(struct plist_node *node, struct plist_head *head); extern void plist_del(struct plist_node *node, struct plist_head *head); +extern void plist_head_splice(struct plist_head *src, struct plist_head *dst); + /** * plist_for_each - iterate over the plist * @pos: the type * to use as a loop counter Index: linux-2.6.24.7/lib/plist.c =================================================================== --- linux-2.6.24.7.orig/lib/plist.c +++ linux-2.6.24.7/lib/plist.c @@ -66,6 +66,30 @@ static void plist_check_head(struct plis # define plist_check_head(h) do { } while (0) #endif +static inline struct plist_node *prev_node(struct plist_node *iter) +{ + return list_entry(iter->plist.node_list.prev, struct plist_node, + plist.node_list); +} + +static inline struct plist_node *next_node(struct plist_node *iter) +{ + return list_entry(iter->plist.node_list.next, struct plist_node, + plist.node_list); +} + +static inline struct plist_node *prev_prio(struct plist_node *iter) +{ + return list_entry(iter->plist.prio_list.prev, struct plist_node, + plist.prio_list); +} + +static inline struct plist_node *next_prio(struct plist_node *iter) +{ + return list_entry(iter->plist.prio_list.next, struct plist_node, + plist.prio_list); +} + /** * plist_add - add @node to @head * @@ -83,8 +107,7 @@ void plist_add(struct plist_node *node, if (node->prio < iter->prio) goto lt_prio; else if (node->prio == iter->prio) { - iter = list_entry(iter->plist.prio_list.next, - struct plist_node, plist.prio_list); + iter = next_prio(iter); goto eq_prio; } } @@ -118,3 +141,44 @@ void plist_del(struct plist_node *node, plist_check_head(head); } + +void plist_head_splice(struct plist_head *src, struct plist_head *dst) +{ + struct plist_node *src_iter_first, *src_iter_last, *dst_iter; + struct plist_node *tail = container_of(dst, struct plist_node, plist); + + dst_iter = next_prio(tail); + + while (!plist_head_empty(src) && dst_iter != tail) { + src_iter_first = plist_first(src); + + src_iter_last = next_prio(src_iter_first); + src_iter_last = prev_node(src_iter_last); + + WARN_ON(src_iter_first->prio != src_iter_last->prio); + WARN_ON(list_empty(&src_iter_first->plist.prio_list)); + + while (src_iter_first->prio > dst_iter->prio) { + dst_iter = next_prio(dst_iter); + if (dst_iter == tail) + goto tail; + } + + list_del_init(&src_iter_first->plist.prio_list); + + if (src_iter_first->prio < dst_iter->prio) { + list_add_tail(&src_iter_first->plist.prio_list, + &dst_iter->plist.prio_list); + } else if (src_iter_first->prio == dst_iter->prio) { + dst_iter = next_prio(dst_iter); + } else BUG(); + + list_splice2_tail(&src_iter_first->plist.node_list, + &src_iter_last->plist.node_list, + &dst_iter->plist.node_list); + } + +tail: + list_splice_tail_init(&src->prio_list, &dst->prio_list); + list_splice_tail_init(&src->node_list, &dst->node_list); +} ���������������������������������������������������������������patches/rt-workqeue-prio.patch����������������������������������������������������������������������0000664�0000764�0000764�00000015431�11041657734�015526� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: PI-workqueue support From: Daniel Walker <dwalker@mvista.com> Add support for priority queueing and priority inheritance to the workqueue infrastructure. This is done by replacing the linear linked worklist with a priority sorted plist. The drawback is that this breaks the workqueue barrier, needed to support flush_workqueue() and wait_on_work(). Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/workqueue.h | 9 +++++---- kernel/power/poweroff.c | 1 + kernel/workqueue.c | 40 +++++++++++++++++++++++++--------------- 3 files changed, 31 insertions(+), 19 deletions(-) Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -9,6 +9,7 @@ #include <linux/linkage.h> #include <linux/bitops.h> #include <linux/lockdep.h> +#include <linux/plist.h> #include <asm/atomic.h> struct workqueue_struct; @@ -27,7 +28,7 @@ struct work_struct { #define WORK_STRUCT_PENDING 0 /* T if work item pending execution */ #define WORK_STRUCT_FLAG_MASK (3UL) #define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK) - struct list_head entry; + struct plist_node entry; work_func_t func; #ifdef CONFIG_LOCKDEP struct lockdep_map lockdep_map; @@ -59,7 +60,7 @@ struct execute_work { #define __WORK_INITIALIZER(n, f) { \ .data = WORK_DATA_INIT(), \ - .entry = { &(n).entry, &(n).entry }, \ + .entry = PLIST_NODE_INIT(n.entry, MAX_PRIO), \ .func = (f), \ __WORK_INIT_LOCKDEP_MAP(#n, &(n)) \ } @@ -100,14 +101,14 @@ struct execute_work { \ (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \ lockdep_init_map(&(_work)->lockdep_map, #_work, &__key, 0);\ - INIT_LIST_HEAD(&(_work)->entry); \ + plist_node_init(&(_work)->entry, -1); \ PREPARE_WORK((_work), (_func)); \ } while (0) #else #define INIT_WORK(_work, _func) \ do { \ (_work)->data = (atomic_long_t) WORK_DATA_INIT(); \ - INIT_LIST_HEAD(&(_work)->entry); \ + plist_node_init(&(_work)->entry, -1); \ PREPARE_WORK((_work), (_func)); \ } while (0) #endif Index: linux-2.6.24.7/kernel/power/poweroff.c =================================================================== --- linux-2.6.24.7.orig/kernel/power/poweroff.c +++ linux-2.6.24.7/kernel/power/poweroff.c @@ -8,6 +8,7 @@ #include <linux/sysrq.h> #include <linux/init.h> #include <linux/pm.h> +#include <linux/sched.h> #include <linux/workqueue.h> #include <linux/reboot.h> Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -45,7 +45,7 @@ struct cpu_workqueue_struct { spinlock_t lock; - struct list_head worklist; + struct plist_head worklist; wait_queue_head_t more_work; struct work_struct *current_work; @@ -131,16 +131,19 @@ struct cpu_workqueue_struct *get_wq_data static void insert_work(struct cpu_workqueue_struct *cwq, struct work_struct *work, int tail) { + int prio = current->normal_prio; + set_wq_data(work, cwq); /* * Ensure that we get the right work->data if we see the * result of list_add() below, see try_to_grab_pending(). 
*/ smp_wmb(); - if (tail) - list_add_tail(&work->entry, &cwq->worklist); - else - list_add(&work->entry, &cwq->worklist); + plist_node_init(&work->entry, prio); + plist_add(&work->entry, &cwq->worklist); + + if (prio < cwq->thread->prio) + task_setprio(cwq->thread, prio); wake_up(&cwq->more_work); } @@ -172,7 +175,7 @@ int fastcall queue_work(struct workqueue int ret = 0, cpu = raw_smp_processor_id(); if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) { - BUG_ON(!list_empty(&work->entry)); + BUG_ON(!plist_node_empty(&work->entry)); __queue_work(wq_per_cpu(wq, cpu), work); ret = 1; } @@ -226,7 +229,7 @@ int queue_delayed_work_on(int cpu, struc if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) { BUG_ON(timer_pending(timer)); - BUG_ON(!list_empty(&work->entry)); + BUG_ON(!plist_node_empty(&work->entry)); /* This stores cwq for the moment, for the timer_fn */ set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id())); @@ -268,8 +271,8 @@ static void run_workqueue(struct cpu_wor __FUNCTION__, cwq->run_depth); dump_stack(); } - while (!list_empty(&cwq->worklist)) { - struct work_struct *work = list_entry(cwq->worklist.next, + while (!plist_head_empty(&cwq->worklist)) { + struct work_struct *work = plist_first_entry(&cwq->worklist, struct work_struct, entry); work_func_t f = work->func; #ifdef CONFIG_LOCKDEP @@ -284,8 +287,12 @@ static void run_workqueue(struct cpu_wor struct lockdep_map lockdep_map = work->lockdep_map; #endif + if (likely(cwq->thread->prio != work->entry.prio)) + task_setprio(cwq->thread, work->entry.prio); + cwq->current_work = work; - list_del_init(cwq->worklist.next); + plist_del(&work->entry, &cwq->worklist); + plist_node_init(&work->entry, MAX_PRIO); spin_unlock_irq(&cwq->lock); BUG_ON(get_wq_data(work) != cwq); @@ -301,6 +308,7 @@ static void run_workqueue(struct cpu_wor spin_lock_irq(&cwq->lock); cwq->current_work = NULL; } + task_setprio(cwq->thread, current->normal_prio); cwq->run_depth--; spin_unlock_irq(&cwq->lock); } @@ -319,7 +327,7 @@ static int worker_thread(void *__cwq) prepare_to_wait(&cwq->more_work, &wait, TASK_INTERRUPTIBLE); if (!freezing(current) && !kthread_should_stop() && - list_empty(&cwq->worklist)) + plist_head_empty(&cwq->worklist)) schedule(); finish_wait(&cwq->more_work, &wait); @@ -372,7 +380,8 @@ static int flush_cpu_workqueue(struct cp active = 0; spin_lock_irq(&cwq->lock); - if (!list_empty(&cwq->worklist) || cwq->current_work != NULL) { + if (!plist_head_empty(&cwq->worklist) || + cwq->current_work != NULL) { insert_wq_barrier(cwq, &barr, 1); active = 1; } @@ -433,7 +442,7 @@ static int try_to_grab_pending(struct wo return ret; spin_lock_irq(&cwq->lock); - if (!list_empty(&work->entry)) { + if (!plist_node_empty(&work->entry)) { /* * This work is queued, but perhaps we locked the wrong cwq. 
* In that case we must see the new value after rmb(), see @@ -441,7 +450,8 @@ static int try_to_grab_pending(struct wo */ smp_rmb(); if (cwq == get_wq_data(work)) { - list_del_init(&work->entry); + plist_del(&work->entry, &cwq->worklist); + plist_node_init(&work->entry, MAX_PRIO); ret = 1; } } @@ -770,7 +780,7 @@ init_cpu_workqueue(struct workqueue_stru cwq->wq = wq; spin_lock_init(&cwq->lock); - INIT_LIST_HEAD(&cwq->worklist); + plist_head_init(&cwq->worklist, NULL); init_waitqueue_head(&cwq->more_work); return cwq; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-workqueue-barrier.patch������������������������������������������������������������������0000664�0000764�0000764�00000020142�11041657732�016361� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: PI-workqueue: fix barriers The plist change to the workqueues left the barrier functionality broken. The barrier is used for two things: - wait_on_work(), and - flush_cpu_workqueue(). wait_on_work() - uses the barrier to wait on the completion of the currently worklet. This was done by inserting a completion barrier at the very head of the worklist. With plist this would be the head of the highest prio. In order to do that, we extend the priority range to exceed the normal range and enqueue it higher than anything else. Another noteworthy point is that this high prio worklet must not boost the prio further than the waiting task's prio, even though we enqueue it at prio 100. flush_cpu_workqueue() - is a full ordering barrier, although as the name suggests usually used to wait for the worklist to drain. We'll support the full ordering semantics currently present. This means that: W10, W22, W65, B, W80, B, W99 [ where Wn is a worklet at prio n, and B a barrier ] would most likely execute in the following order: W10@99, W65@99, W22@99, W80@99, W99 [ Wn@m is Wn executed at prio m ] [ W10 would be first because it can start executing while the others are being added ] Whereas without the barriers it would be: W10@99, W99, W80, W65, W22 The prio ordering of the plist makes it hard to impose an extra order on top. The solution used is to nest plist structures. The example will look like: W10, B(B(W65, W22), W80), W99 That is, the barrier will splice the worklist into itself, and enqueue itself as the next item to run (very first item, highest prio). The barrier will then run its own plist to completion before 'popping' back to the regular worklist. To avoid callstack nesting, run_workqueue is taught about this barrier stack. 
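For illustration, a minimal user-space sketch of the priority clamp described above (plain C, not part of the patch; all names are invented, and unlike the W<n> notation above it uses the kernel-internal convention where a lower value means a higher priority): while a nested barrier list is being drained, the worker runs at the minimum of the current worklet's priority, the barrier's saved previous priority, the flusher's priority and the best priority still pending on the main worklist, clamped to no less than 0.

#include <stdio.h>

/* Kernel-internal convention: lower value == higher priority. */
static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Model of the priority the worker thread is set to while executing a
 * worklet from a nested barrier list: work_prio is the worklet's own
 * priority, prev_prio and waiter_prio come from the barrier, and
 * first_pending_prio is the head of the main worklist.
 */
static int effective_prio(int work_prio, int prev_prio,
			  int waiter_prio, int first_pending_prio)
{
	int prio = work_prio;

	prio = min_int(prio, prev_prio);
	prio = min_int(prio, waiter_prio);
	prio = min_int(prio, first_pending_prio);
	if (prio < 0)		/* the prio -1 wait barrier must not leak out */
		prio = 0;
	return prio;
}

int main(void)
{
	/* worklet at 35 behind a barrier whose flusher runs at 20: boosted to 20 */
	printf("%d\n", effective_prio(35, 100, 20, 100));
	/* a prio -1 barrier worklet is clamped back into the valid range */
	printf("%d\n", effective_prio(-1, 100, 40, 100));
	return 0;
}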
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/workqueue.c | 111 +++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 95 insertions(+), 16 deletions(-) Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -37,6 +37,8 @@ #include <asm/uaccess.h> +struct wq_full_barrier; + /* * The per-CPU workqueue (if single thread, we always use the first * possible cpu). @@ -53,6 +55,8 @@ struct cpu_workqueue_struct { struct task_struct *thread; int run_depth; /* Detect run_workqueue() recursion depth */ + + struct wq_full_barrier *barrier; } ____cacheline_aligned; /* @@ -129,10 +133,8 @@ struct cpu_workqueue_struct *get_wq_data } static void insert_work(struct cpu_workqueue_struct *cwq, - struct work_struct *work, int tail) + struct work_struct *work, int prio, int boost_prio) { - int prio = current->normal_prio; - set_wq_data(work, cwq); /* * Ensure that we get the right work->data if we see the @@ -142,8 +144,8 @@ static void insert_work(struct cpu_workq plist_node_init(&work->entry, prio); plist_add(&work->entry, &cwq->worklist); - if (prio < cwq->thread->prio) - task_setprio(cwq->thread, prio); + if (boost_prio < cwq->thread->prio) + task_setprio(cwq->thread, boost_prio); wake_up(&cwq->more_work); } @@ -154,7 +156,7 @@ static void __queue_work(struct cpu_work unsigned long flags; spin_lock_irqsave(&cwq->lock, flags); - insert_work(cwq, work, 1); + insert_work(cwq, work, current->normal_prio, current->normal_prio); spin_unlock_irqrestore(&cwq->lock, flags); } @@ -261,8 +263,20 @@ static void leak_check(void *func) dump_stack(); } +struct wq_full_barrier { + struct work_struct work; + struct plist_head worklist; + struct wq_full_barrier *prev_barrier; + int prev_prio; + int waiter_prio; + struct cpu_workqueue_struct *cwq; + struct completion done; +}; + static void run_workqueue(struct cpu_workqueue_struct *cwq) { + struct plist_head *worklist = &cwq->worklist; + spin_lock_irq(&cwq->lock); cwq->run_depth++; if (cwq->run_depth > 3) { @@ -271,8 +285,11 @@ static void run_workqueue(struct cpu_wor __FUNCTION__, cwq->run_depth); dump_stack(); } - while (!plist_head_empty(&cwq->worklist)) { - struct work_struct *work = plist_first_entry(&cwq->worklist, + +again: + while (!plist_head_empty(worklist)) { + int prio; + struct work_struct *work = plist_first_entry(worklist, struct work_struct, entry); work_func_t f = work->func; #ifdef CONFIG_LOCKDEP @@ -287,11 +304,19 @@ static void run_workqueue(struct cpu_wor struct lockdep_map lockdep_map = work->lockdep_map; #endif - if (likely(cwq->thread->prio != work->entry.prio)) - task_setprio(cwq->thread, work->entry.prio); + prio = work->entry.prio; + if (unlikely(worklist != &cwq->worklist)) { + prio = min(prio, cwq->barrier->prev_prio); + prio = min(prio, cwq->barrier->waiter_prio); + prio = min(prio, plist_first(&cwq->worklist)->prio); + } + prio = max(prio, 0); + + if (likely(cwq->thread->prio != prio)) + task_setprio(cwq->thread, prio); cwq->current_work = work; - plist_del(&work->entry, &cwq->worklist); + plist_del(&work->entry, worklist); plist_node_init(&work->entry, MAX_PRIO); spin_unlock_irq(&cwq->lock); @@ -307,7 +332,27 @@ static void run_workqueue(struct cpu_wor spin_lock_irq(&cwq->lock); cwq->current_work = NULL; + + if (unlikely(cwq->barrier)) + worklist = &cwq->barrier->worklist; + } + + if (unlikely(worklist != &cwq->worklist)) { + struct wq_full_barrier *barrier = 
cwq->barrier; + + BUG_ON(!barrier); + cwq->barrier = barrier->prev_barrier; + complete(&barrier->done); + + if (unlikely(cwq->barrier)) + worklist = &cwq->barrier->worklist; + else + worklist = &cwq->worklist; + + if (!plist_head_empty(worklist)) + goto again; } + task_setprio(cwq->thread, current->normal_prio); cwq->run_depth--; spin_unlock_irq(&cwq->lock); @@ -354,14 +399,47 @@ static void wq_barrier_func(struct work_ } static void insert_wq_barrier(struct cpu_workqueue_struct *cwq, - struct wq_barrier *barr, int tail) + struct wq_barrier *barr, int prio) { INIT_WORK(&barr->work, wq_barrier_func); __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work)); init_completion(&barr->done); - insert_work(cwq, &barr->work, tail); + insert_work(cwq, &barr->work, prio, current->prio); +} + +static void wq_full_barrier_func(struct work_struct *work) +{ + struct wq_full_barrier *barrier = + container_of(work, struct wq_full_barrier, work); + struct cpu_workqueue_struct *cwq = barrier->cwq; + int prio = MAX_PRIO; + + spin_lock_irq(&cwq->lock); + barrier->prev_barrier = cwq->barrier; + if (cwq->barrier) { + prio = min(prio, cwq->barrier->waiter_prio); + prio = min(prio, plist_first(&cwq->barrier->worklist)->prio); + } + barrier->prev_prio = prio; + cwq->barrier = barrier; + spin_unlock_irq(&cwq->lock); +} + +static void insert_wq_full_barrier(struct cpu_workqueue_struct *cwq, + struct wq_full_barrier *barr) +{ + INIT_WORK(&barr->work, wq_full_barrier_func); + __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work)); + + plist_head_init(&barr->worklist, NULL); + plist_head_splice(&cwq->worklist, &barr->worklist); + barr->cwq = cwq; + init_completion(&barr->done); + barr->waiter_prio = current->prio; + + insert_work(cwq, &barr->work, 0, current->prio); } static int flush_cpu_workqueue(struct cpu_workqueue_struct *cwq) @@ -376,13 +454,13 @@ static int flush_cpu_workqueue(struct cp run_workqueue(cwq); active = 1; } else { - struct wq_barrier barr; + struct wq_full_barrier barr; active = 0; spin_lock_irq(&cwq->lock); if (!plist_head_empty(&cwq->worklist) || cwq->current_work != NULL) { - insert_wq_barrier(cwq, &barr, 1); + insert_wq_full_barrier(cwq, &barr); active = 1; } spin_unlock_irq(&cwq->lock); @@ -468,7 +546,7 @@ static void wait_on_cpu_work(struct cpu_ spin_lock_irq(&cwq->lock); if (unlikely(cwq->current_work == work)) { - insert_wq_barrier(cwq, &barr, 0); + insert_wq_barrier(cwq, &barr, -1); running = 1; } spin_unlock_irq(&cwq->lock); @@ -782,6 +860,7 @@ init_cpu_workqueue(struct workqueue_stru spin_lock_init(&cwq->lock); plist_head_init(&cwq->worklist, NULL); init_waitqueue_head(&cwq->more_work); + cwq->barrier = NULL; return cwq; } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-wq-barrier-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000011521�11041657730�015544� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: PI-workqueue: wait_on_work() fixup Oleg noticed that the new wait_on_work() barrier does not properly interact with the nesting barrier. The problem is that a wait_on_work() targeted at a worklet in a nested list will complete too late. Fix this by using a wait_queue instead. [ will be folded into the previous patch on next posting ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/workqueue.c | 74 ++++++++++++++++++++--------------------------------- 1 file changed, 29 insertions(+), 45 deletions(-) Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -34,10 +34,11 @@ #include <linux/kallsyms.h> #include <linux/debug_locks.h> #include <linux/lockdep.h> +#include <linux/wait.h> #include <asm/uaccess.h> -struct wq_full_barrier; +struct wq_barrier; /* * The per-CPU workqueue (if single thread, we always use the first @@ -56,7 +57,8 @@ struct cpu_workqueue_struct { int run_depth; /* Detect run_workqueue() recursion depth */ - struct wq_full_barrier *barrier; + wait_queue_head_t work_done; + struct wq_barrier *barrier; } ____cacheline_aligned; /* @@ -263,10 +265,10 @@ static void leak_check(void *func) dump_stack(); } -struct wq_full_barrier { +struct wq_barrier { struct work_struct work; struct plist_head worklist; - struct wq_full_barrier *prev_barrier; + struct wq_barrier *prev_barrier; int prev_prio; int waiter_prio; struct cpu_workqueue_struct *cwq; @@ -332,13 +334,13 @@ again: spin_lock_irq(&cwq->lock); cwq->current_work = NULL; - + wake_up_all(&cwq->work_done); if (unlikely(cwq->barrier)) worklist = &cwq->barrier->worklist; } if (unlikely(worklist != &cwq->worklist)) { - struct wq_full_barrier *barrier = cwq->barrier; + struct wq_barrier *barrier = cwq->barrier; BUG_ON(!barrier); cwq->barrier = barrier->prev_barrier; @@ -387,32 +389,10 @@ static int worker_thread(void *__cwq) return 0; } -struct wq_barrier { - struct work_struct work; - struct completion done; -}; - static void wq_barrier_func(struct work_struct *work) { - struct wq_barrier *barr = container_of(work, struct wq_barrier, work); - complete(&barr->done); -} - -static void insert_wq_barrier(struct cpu_workqueue_struct *cwq, - struct wq_barrier *barr, int prio) -{ - INIT_WORK(&barr->work, wq_barrier_func); - __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work)); - - init_completion(&barr->done); - - insert_work(cwq, &barr->work, prio, current->prio); -} - -static void wq_full_barrier_func(struct work_struct *work) -{ - struct wq_full_barrier *barrier = - container_of(work, struct wq_full_barrier, work); + struct wq_barrier *barrier = + container_of(work, struct wq_barrier, work); struct cpu_workqueue_struct *cwq = barrier->cwq; int prio = MAX_PRIO; @@ -427,10 +407,10 @@ static void wq_full_barrier_func(struct spin_unlock_irq(&cwq->lock); } -static void insert_wq_full_barrier(struct cpu_workqueue_struct *cwq, - struct wq_full_barrier *barr) +static void insert_wq_barrier(struct cpu_workqueue_struct *cwq, + struct wq_barrier *barr) { - INIT_WORK(&barr->work, wq_full_barrier_func); + INIT_WORK(&barr->work, wq_barrier_func); __set_bit(WORK_STRUCT_PENDING, work_data_bits(&barr->work)); 
plist_head_init(&barr->worklist, NULL); @@ -454,13 +434,13 @@ static int flush_cpu_workqueue(struct cp run_workqueue(cwq); active = 1; } else { - struct wq_full_barrier barr; + struct wq_barrier barr; active = 0; spin_lock_irq(&cwq->lock); if (!plist_head_empty(&cwq->worklist) || cwq->current_work != NULL) { - insert_wq_full_barrier(cwq, &barr); + insert_wq_barrier(cwq, &barr); active = 1; } spin_unlock_irq(&cwq->lock); @@ -538,21 +518,24 @@ static int try_to_grab_pending(struct wo return ret; } -static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq, - struct work_struct *work) +static inline +int is_current_work(struct cpu_workqueue_struct *cwq, struct work_struct *work) { - struct wq_barrier barr; - int running = 0; + int ret; spin_lock_irq(&cwq->lock); - if (unlikely(cwq->current_work == work)) { - insert_wq_barrier(cwq, &barr, -1); - running = 1; - } + ret = (cwq->current_work == work); spin_unlock_irq(&cwq->lock); - if (unlikely(running)) - wait_for_completion(&barr.done); + return ret; +} + +static void wait_on_cpu_work(struct cpu_workqueue_struct *cwq, + struct work_struct *work) +{ + DEFINE_WAIT(wait); + + wait_event(cwq->work_done, !is_current_work(cwq, work)); } static void wait_on_work(struct work_struct *work) @@ -861,6 +844,7 @@ init_cpu_workqueue(struct workqueue_stru plist_head_init(&cwq->worklist, NULL); init_waitqueue_head(&cwq->more_work); cwq->barrier = NULL; + init_waitqueue_head(&cwq->work_done); return cwq; } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-delayed-prio.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000006030�11041657734�015266� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: PI-workqueue: propagate prio for delayed work Delayed work looses its enqueue priority, and will be enqueued on the prio of the softirq thread. Ammend this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/workqueue.h | 1 + kernel/workqueue.c | 16 ++++++++++------ 2 files changed, 11 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -40,6 +40,7 @@ struct work_struct { struct delayed_work { struct work_struct work; struct timer_list timer; + int prio; }; struct execute_work { Index: linux-2.6.24.7/kernel/workqueue.c =================================================================== --- linux-2.6.24.7.orig/kernel/workqueue.c +++ linux-2.6.24.7/kernel/workqueue.c @@ -153,12 +153,12 @@ static void insert_work(struct cpu_workq /* Preempt must be disabled. 
*/ static void __queue_work(struct cpu_workqueue_struct *cwq, - struct work_struct *work) + struct work_struct *work, int prio) { unsigned long flags; spin_lock_irqsave(&cwq->lock, flags); - insert_work(cwq, work, current->normal_prio, current->normal_prio); + insert_work(cwq, work, prio, prio); spin_unlock_irqrestore(&cwq->lock, flags); } @@ -180,7 +180,7 @@ int fastcall queue_work(struct workqueue if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) { BUG_ON(!plist_node_empty(&work->entry)); - __queue_work(wq_per_cpu(wq, cpu), work); + __queue_work(wq_per_cpu(wq, cpu), work, current->normal_prio); ret = 1; } return ret; @@ -193,7 +193,8 @@ void delayed_work_timer_fn(unsigned long struct cpu_workqueue_struct *cwq = get_wq_data(&dwork->work); struct workqueue_struct *wq = cwq->wq; - __queue_work(wq_per_cpu(wq, raw_smp_processor_id()), &dwork->work); + __queue_work(wq_per_cpu(wq, raw_smp_processor_id()), + &dwork->work, dwork->prio); } /** @@ -236,6 +237,7 @@ int queue_delayed_work_on(int cpu, struc BUG_ON(!plist_node_empty(&work->entry)); /* This stores cwq for the moment, for the timer_fn */ + dwork->prio = current->normal_prio; set_wq_data(work, wq_per_cpu(wq, raw_smp_processor_id())); timer->expires = jiffies + delay; timer->data = (unsigned long)dwork; @@ -725,7 +727,8 @@ int schedule_on_each_cpu(void (*func)(vo work->info = info; INIT_WORK(&work->work, schedule_on_each_cpu_func); set_bit(WORK_STRUCT_PENDING, work_data_bits(&work->work)); - __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), &work->work); + __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), + &work->work, current->normal_prio); } unlock_cpu_hotplug(); @@ -772,7 +775,8 @@ int schedule_on_each_cpu_wq(struct workq INIT_WORK(work, func); set_bit(WORK_STRUCT_PENDING, work_data_bits(work)); - __queue_work(per_cpu_ptr(wq->cpu_wq, cpu), work); + __queue_work(per_cpu_ptr(wq->cpu_wq, cpu), work, + current->normal_prio); } flush_workqueue(wq); free_percpu(works); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched_prio.patch����������������������������������������������������������������������������0000664�0000764�0000764�00000005572�11041657733�014415� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������factor out priorities for workqueue.h and sched.h From: Clark Williams <williams@redhat.com> This fixes a circular dependency between sched.h and workqueue.h by factoring out the common defines to a new header which is included by both Signed-off-by: Clark Williams <williams@redhat.com> --- include/linux/sched.h | 19 +------------------ include/linux/sched_prio.h | 23 +++++++++++++++++++++++ include/linux/workqueue.h | 1 + 3 files changed, 25 insertions(+), 18 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h 
=================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1328,24 +1328,7 @@ struct task_struct { # define set_printk_might_sleep(x) do { } while(0) #endif -/* - * Priority of a process goes from 0..MAX_PRIO-1, valid RT - * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH - * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority - * values are inverted: lower p->prio value means higher priority. - * - * The MAX_USER_RT_PRIO value allows the actual maximum - * RT priority to be separate from the value exported to - * user-space. This allows kernel threads to set their - * priority to a value higher than any user task. Note: - * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO. - */ - -#define MAX_USER_RT_PRIO 100 -#define MAX_RT_PRIO MAX_USER_RT_PRIO - -#define MAX_PRIO (MAX_RT_PRIO + 40) -#define DEFAULT_PRIO (MAX_RT_PRIO + 20) +#include <linux/sched_prio.h> static inline int rt_prio(int prio) { Index: linux-2.6.24.7/include/linux/sched_prio.h =================================================================== --- /dev/null +++ linux-2.6.24.7/include/linux/sched_prio.h @@ -0,0 +1,23 @@ +#ifndef __SCHED_PRIO_H +#define __SCHED_PRIO_H + +/* + * Priority of a process goes from 0..MAX_PRIO-1, valid RT + * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH + * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority + * values are inverted: lower p->prio value means higher priority. + * + * The MAX_USER_RT_PRIO value allows the actual maximum + * RT priority to be separate from the value exported to + * user-space. This allows kernel threads to set their + * priority to a value higher than any user task. Note: + * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO. 
+ */ + +#define MAX_USER_RT_PRIO 100 +#define MAX_RT_PRIO MAX_USER_RT_PRIO + +#define MAX_PRIO (MAX_RT_PRIO + 40) +#define DEFAULT_PRIO (MAX_RT_PRIO + 20) + +#endif Index: linux-2.6.24.7/include/linux/workqueue.h =================================================================== --- linux-2.6.24.7.orig/include/linux/workqueue.h +++ linux-2.6.24.7/include/linux/workqueue.h @@ -10,6 +10,7 @@ #include <linux/bitops.h> #include <linux/lockdep.h> #include <linux/plist.h> +#include <linux/sched_prio.h> #include <asm/atomic.h> struct workqueue_struct; ��������������������������������������������������������������������������������������������������������������������������������������patches/lock-init-plist-fix.patch�������������������������������������������������������������������0000664�0000764�0000764�00000005251�11041657734�016077� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From jan.kiszka@siemens.com Fri Oct 26 22:37:46 2007 Date: Fri, 26 Oct 2007 17:38:19 +0200 From: Jan Kiszka <jan.kiszka@siemens.com> To: linux-kernel@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu> Subject: [PATCH] Fix plist initialisation for CONFIG_DEBUG_PI_LIST Resent-Date: Fri, 26 Oct 2007 18:38:07 +0200 (CEST) Resent-From: Thomas Gleixner <tglx@linutronix.de> Resent-To: Steven Rostedt <rostedt@goodmis.org> Resent-Subject: [PATCH] Fix plist initialisation for CONFIG_DEBUG_PI_LIST [ The following text is in the "ISO-8859-15" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] PLIST_NODE_INIT (once used, only in -rt ATM) will fail when CONFIG_DEBUG_PI_LIST is enabled as it then generates a &NULL statement. This patch fixes the issue indirectly by turning the _lock argument of PLIST_HEAD_INIT into a pointer and adopting its users. 
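The shape of the fix can be seen in a stand-alone user-space model (illustrative only, no kernel headers, all names invented; in the real macro the lock back-pointer only exists when CONFIG_DEBUG_PI_LIST is set): when the initializer macro takes a ready-made pointer instead of applying the address-of operator itself, a caller that has no lock can pass plain NULL, whereas the old form would have expanded to the ill-formed expression &NULL.

#include <stdio.h>

struct fake_lock {
	int dummy;
};

struct fake_plist_head {
	struct fake_lock *lock;		/* debug back-pointer in this model */
};

/* Fixed form: store the pointer argument verbatim, no '&' in the macro. */
#define FAKE_PLIST_HEAD_INIT(head, _lockptr)	{ .lock = (_lockptr) }

static struct fake_lock my_lock;

/* A head protected by a lock: the caller supplies &my_lock itself... */
static struct fake_plist_head locked_head =
	FAKE_PLIST_HEAD_INIT(locked_head, &my_lock);

/* ...and a lockless head (the PLIST_NODE_INIT case): NULL stays NULL. */
static struct fake_plist_head lockless_head =
	FAKE_PLIST_HEAD_INIT(lockless_head, NULL);

int main(void)
{
	printf("locked_head.lock   = %p\n", (void *)locked_head.lock);
	printf("lockless_head.lock = %p\n", (void *)lockless_head.lock);
	return 0;
}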
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> --- include/linux/plist.h | 4 ++-- include/linux/rtmutex.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/include/linux/plist.h =================================================================== --- linux-2.6.24.7.orig/include/linux/plist.h +++ linux-2.6.24.7/include/linux/plist.h @@ -99,13 +99,13 @@ struct plist_node { /** * PLIST_HEAD_INIT - static struct plist_head initializer * @head: struct plist_head variable name - * @_lock: lock to initialize for this list + * @_lock: lock * to initialize for this list */ #define PLIST_HEAD_INIT(head, _lock) \ { \ .prio_list = LIST_HEAD_INIT((head).prio_list), \ .node_list = LIST_HEAD_INIT((head).node_list), \ - PLIST_HEAD_LOCK_INIT(&(_lock)) \ + PLIST_HEAD_LOCK_INIT(_lock) \ } /** Index: linux-2.6.24.7/include/linux/rtmutex.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rtmutex.h +++ linux-2.6.24.7/include/linux/rtmutex.h @@ -64,7 +64,7 @@ struct hrtimer_sleeper; #define __RT_MUTEX_INITIALIZER(mutexname) \ { .wait_lock = RAW_SPIN_LOCK_UNLOCKED(mutexname) \ - , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, mutexname.wait_lock) \ + , .wait_list = PLIST_HEAD_INIT(mutexname.wait_list, &mutexname.wait_lock) \ , .owner = NULL \ __DEBUG_RT_MUTEX_INITIALIZER(mutexname)} @@ -98,7 +98,7 @@ extern void rt_mutex_unlock(struct rt_mu #ifdef CONFIG_RT_MUTEXES # define INIT_RT_MUTEXES(tsk) \ - .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, tsk.pi_lock), \ + .pi_waiters = PLIST_HEAD_INIT(tsk.pi_waiters, &tsk.pi_lock), \ INIT_RT_MUTEX_DEBUG(tsk) #else # define INIT_RT_MUTEXES(tsk) �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ntfs-local-irq-save-nort.patch��������������������������������������������������������������0000664�0000764�0000764�00000005247�11041657733�017044� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From efault@gmx.de Sat Oct 27 10:28:42 2007 Date: Sat, 27 Oct 2007 12:17:49 +0200 From: Mike Galbraith <efault@gmx.de> To: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Steven Rostedt <rostedt@goodmis.org>, LKML <linux-kernel@vger.kernel.org>, RT <linux-rt-users@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de> Subject: Re: [2.6.23-rt3] NMI watchdog trace of deadlock On Sat, 2007-10-27 at 11:44 +0200, Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > [10138.175796] [<c0105de3>] show_trace+0x12/0x14 > > > [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18 > > > [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d > > > [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24 > > > [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50 > > > [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20 > > > [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417 > > > [10138.210118] [<c011ec88>] kmap+0x4d/0x4f > > > [10138.214102] [<c026a9d8>] 
ntfs_end_buffer_async_read+0x228/0x2f9 > > > [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f > > > [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d > > > [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac > > > [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa > > > [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a > > > [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac > > > [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0 > > > [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9 > > > > Looks like ntfs is kmap()ing from interrupt context. Should be using > > kmap_atomic instead, I think. > > it's not atomic interrupt context but irq thread context - and -rt > remaps kmap_atomic() to kmap() internally. Hm. Looking at the change to mm/bounce.c, perhaps I should do this instead? --- fs/ntfs/aops.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/fs/ntfs/aops.c =================================================================== --- linux-2.6.24.7.orig/fs/ntfs/aops.c +++ linux-2.6.24.7/fs/ntfs/aops.c @@ -139,13 +139,13 @@ static void ntfs_end_buffer_async_read(s recs = PAGE_CACHE_SIZE / rec_size; /* Should have been verified before we got here... */ BUG_ON(!recs); - local_irq_save(flags); + local_irq_save_nort(flags); kaddr = kmap_atomic(page, KM_BIO_SRC_IRQ); for (i = 0; i < recs; i++) post_read_mst_fixup((NTFS_RECORD*)(kaddr + i * rec_size), rec_size); kunmap_atomic(kaddr, KM_BIO_SRC_IRQ); - local_irq_restore(flags); + local_irq_restore_nort(flags); flush_dcache_page(page); if (likely(page_uptodate && !PageError(page))) SetPageUptodate(page); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/dont-disable-preemption-without-IST.patch���������������������������������������������������0000664�0000764�0000764�00000006647�11041657734�021166� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ak@suse.de Sat Oct 27 10:32:13 2007 Date: Sat, 27 Oct 2007 12:39:33 +0200 From: Andi Kleen <ak@suse.de> To: linux-rt-users@vger.kernel.org Subject: [PATCH] Don't disable preemption in exception handlers without IST Some of the exception handlers that run on an IST in a normal kernel still disable preemption. This causes might_sleep warning when sending signals for debugging in PREEMPT-RT because sending signals can take a lock. Since the ISTs are disabled now for those don't disable the preemption. This completes the remove IST patch I sent some time ago and fixes another case where using gdb caused warnings. Also it will likely improve latency a little bit. 
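A rough user-space model of the behavioural change (illustrative only; the counter and function names are invented, and it assumes, as the description above implies, that the stack argument evaluates to 0 once the IST is removed): the enter/exit pair now touches the preemption count only when the handler really runs on a private exception stack, so the IST-less handlers may sleep while delivering a signal.

#include <assert.h>
#include <stdio.h>

static int preempt_count;	/* stand-in for the per-CPU preemption counter */

/* Disable preemption only when running on a private exception stack. */
static void model_conditional_enter(int ist_stack)
{
	if (ist_stack)
		preempt_count++;
	/* ...interrupts would be conditionally re-enabled here... */
}

static void model_conditional_exit(int ist_stack)
{
	/* ...interrupts would be conditionally disabled again here... */
	if (ist_stack)
		preempt_count--;
}

int main(void)
{
	/* Handler whose IST has been removed: preemption stays enabled,
	 * so taking a sleeping lock for signal delivery is fine. */
	model_conditional_enter(0);
	assert(preempt_count == 0);
	model_conditional_exit(0);

	/* A handler still running on an IST keeps the old behaviour. */
	model_conditional_enter(1);
	assert(preempt_count == 1);
	model_conditional_exit(1);

	printf("final preempt_count = %d\n", preempt_count);
	return 0;
}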
Signed-off-by: Andi Kleen <ak@suse.de> --- arch/x86/kernel/traps_64.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -82,20 +82,22 @@ static inline void conditional_sti(struc local_irq_enable(); } -static inline void preempt_conditional_sti(struct pt_regs *regs) +static inline void preempt_conditional_sti(struct pt_regs *regs, int stack) { - preempt_disable(); + if (stack) + preempt_disable(); if (regs->eflags & X86_EFLAGS_IF) local_irq_enable(); } -static inline void preempt_conditional_cli(struct pt_regs *regs) +static inline void preempt_conditional_cli(struct pt_regs *regs, int stack) { if (regs->eflags & X86_EFLAGS_IF) local_irq_disable(); /* Make sure to not schedule here because we could be running on an exception stack. */ - preempt_enable_no_resched(); + if (stack) + preempt_enable_no_resched(); } int kstack_depth_to_print = 12; @@ -669,9 +671,9 @@ asmlinkage void do_stack_segment(struct if (notify_die(DIE_TRAP, "stack segment", regs, error_code, 12, SIGBUS) == NOTIFY_STOP) return; - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, STACKFAULT_STACK); do_trap(12, SIGBUS, "stack segment", regs, error_code, NULL); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, STACKFAULT_STACK); } asmlinkage void do_double_fault(struct pt_regs * regs, long error_code) @@ -831,9 +833,9 @@ asmlinkage void __kprobes do_int3(struct if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP) == NOTIFY_STOP) { return; } - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, DEBUG_STACK); do_trap(3, SIGTRAP, "int3", regs, error_code, NULL); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); } /* Help handler running on IST stack to switch back to user stack @@ -873,7 +875,7 @@ asmlinkage void __kprobes do_debug(struc SIGTRAP) == NOTIFY_STOP) return; - preempt_conditional_sti(regs); + preempt_conditional_sti(regs, DEBUG_STACK); /* Mask out spurious debug traps due to lazy DR7 setting */ if (condition & (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3)) { @@ -918,13 +920,13 @@ asmlinkage void __kprobes do_debug(struc clear_dr7: set_debugreg(0UL, 7); - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); return; clear_TF_reenable: set_tsk_thread_flag(tsk, TIF_SINGLESTEP); regs->eflags &= ~TF_MASK; - preempt_conditional_cli(regs); + preempt_conditional_cli(regs, DEBUG_STACK); } static int kernel_math_error(struct pt_regs *regs, const char *str, int trapnr) �����������������������������������������������������������������������������������������patches/irq-flags-unsigned-long.patch���������������������������������������������������������������0000664�0000764�0000764�00000002130�11041657733�016715� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- drivers/media/video/zoran_driver.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/drivers/media/video/zoran_driver.c =================================================================== --- 
linux-2.6.24.7.orig/drivers/media/video/zoran_driver.c +++ linux-2.6.24.7/drivers/media/video/zoran_driver.c @@ -1174,7 +1174,7 @@ zoran_close_end_session (struct file *fi /* v4l capture */ if (fh->v4l_buffers.active != ZORAN_FREE) { - long flags; + unsigned long flags; spin_lock_irqsave(&zr->spinlock, flags); zr36057_set_memgrab(zr, 0); @@ -3447,7 +3447,7 @@ zoran_do_ioctl (struct inode *inode, /* unload capture */ if (zr->v4l_memgrab_active) { - long flags; + unsigned long flags; spin_lock_irqsave(&zr->spinlock, flags); zr36057_set_memgrab(zr, 0); @@ -4387,7 +4387,7 @@ zoran_vm_close (struct vm_area_struct *v mutex_lock(&zr->resource_lock); if (fh->v4l_buffers.active != ZORAN_FREE) { - long flags; + unsigned long flags; spin_lock_irqsave(&zr->spinlock, flags); zr36057_set_memgrab(zr, 0); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/filemap-dont-bug-non-atomic.patch�����������������������������������������������������������0000664�0000764�0000764�00000000733�11041657732�017463� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- mm/filemap.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -1763,7 +1763,9 @@ size_t iov_iter_copy_from_user_atomic(st char *kaddr; size_t copied; +#ifndef CONFIG_PREEMPT_RT BUG_ON(!in_atomic()); +#endif kaddr = kmap_atomic(page, KM_USER0); if (likely(i->nr_segs == 1)) { int left; �������������������������������������patches/fix-bug-on-in-filemap.patch�����������������������������������������������������������������0000664�0000764�0000764�00000001454�11041657734�016264� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: Change bug_on for atomic to pagefault_disabled. The lockless page changes decoupled the pagefault disabled from preempt count. The bug_on in filemap.c is now not correct. This patch uses the proper check. 
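A stand-alone sketch of the distinction (user-space C with invented names, not the kernel implementation): once page-fault disabling is tracked by a per-task counter rather than by the preempt count, an assertion meaning "page faults are disabled here" has to test that counter, because the task may legitimately not be atomic at that point.

#include <assert.h>
#include <stdio.h>

struct task {
	int pagefault_disabled;		/* per-task counter, as in -rt */
	int preempt_count;		/* what an in_atomic() test would see */
};

static struct task current_task;	/* stand-in for 'current' */

static void pagefault_disable(struct task *t)
{
	t->pagefault_disabled++;
}

static void pagefault_enable(struct task *t)
{
	t->pagefault_disabled--;
}

/* The corrected check: assert on the per-task state, not on atomicity. */
static void copy_from_user_atomic_model(struct task *t)
{
	assert(t->pagefault_disabled);	/* stands in for the BUG_ON() */
	/* ...the kmap_atomic() plus copy would happen here... */
}

int main(void)
{
	pagefault_disable(&current_task);
	/* preempt_count can legitimately be 0 here, so the old
	 * BUG_ON(!in_atomic()) would trigger even though faults really
	 * are disabled for this task. */
	copy_from_user_atomic_model(&current_task);
	pagefault_enable(&current_task);

	printf("pagefault_disabled back to %d\n",
	       current_task.pagefault_disabled);
	return 0;
}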
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/mm/filemap.c =================================================================== --- linux-2.6.24.7.orig/mm/filemap.c +++ linux-2.6.24.7/mm/filemap.c @@ -1764,7 +1764,7 @@ size_t iov_iter_copy_from_user_atomic(st size_t copied; #ifndef CONFIG_PREEMPT_RT - BUG_ON(!in_atomic()); + BUG_ON(!current->pagefault_disabled); #endif kaddr = kmap_atomic(page, KM_USER0); if (likely(i->nr_segs == 1)) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-sched-groups.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000001316�11041657734�015315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/cgroup.c =================================================================== --- linux-2.6.24.7.orig/kernel/cgroup.c +++ linux-2.6.24.7/kernel/cgroup.c @@ -168,7 +168,7 @@ list_for_each_entry(_root, &roots, root_ /* the list of cgroups eligible for automatic release. Protected by * release_list_lock */ static LIST_HEAD(release_list); -static DEFINE_SPINLOCK(release_list_lock); +static DEFINE_RAW_SPINLOCK(release_list_lock); static void cgroup_release_agent(struct work_struct *work); static DECLARE_WORK(release_agent_work, cgroup_release_agent); static void check_for_release(struct cgroup *cgrp); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/send-nmi-all-preempt-disable.patch����������������������������������������������������������0000664�0000764�0000764�00000001755�11041657733�017630� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/nmi_32.c | 2 ++ arch/x86/kernel/nmi_64.c | 2 ++ 2 files changed, 4 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -538,9 +538,11 @@ void smp_send_nmi_allbutself(void) { #ifdef CONFIG_SMP cpumask_t mask = cpu_online_map; + preempt_disable(); cpu_clear(safe_smp_processor_id(), mask); if (!cpus_empty(mask)) send_IPI_mask(mask, NMI_VECTOR); + preempt_enable(); #endif } Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ 
linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -539,7 +539,9 @@ void __trigger_all_cpu_backtrace(void) void smp_send_nmi_allbutself(void) { #ifdef CONFIG_SMP + preempt_disable(); send_IPI_allbutself(NMI_VECTOR); + preempt_enable(); #endif } �������������������patches/printk-dont-bug-on-sched.patch��������������������������������������������������������������0000664�0000764�0000764�00000002172�11041657734�017012� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rtmutex.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -14,6 +14,7 @@ #include <linux/module.h> #include <linux/sched.h> #include <linux/timer.h> +#include <linux/hardirq.h> #include "rtmutex_common.h" @@ -635,6 +636,9 @@ rt_spin_lock_fastlock(struct rt_mutex *l /* Temporary HACK! */ if (!current->in_printk) might_sleep(); + else if (in_atomic() || irqs_disabled()) + /* don't grab locks for printk in atomic */ + return; if (likely(rt_mutex_cmpxchg(lock, NULL, current))) rt_mutex_deadlock_account_lock(lock, current); @@ -646,6 +650,11 @@ static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock, void fastcall (*slowfn)(struct rt_mutex *lock)) { + /* Temporary HACK! */ + if (current->in_printk && (in_atomic() || irqs_disabled())) + /* don't grab locks for printk in atomic */ + return; + if (likely(rt_mutex_cmpxchg(lock, current, NULL))) rt_mutex_deadlock_account_unlock(current); else ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/user-no-irq-disable.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001220�11041657733�016042� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/user.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/user.c =================================================================== --- linux-2.6.24.7.orig/kernel/user.c +++ linux-2.6.24.7/kernel/user.c @@ -225,14 +225,14 @@ static void remove_user_sysfs_dir(struct */ uids_mutex_lock(); - local_irq_save(flags); + local_irq_save_nort(flags); if (atomic_dec_and_lock(&up->__count, &uidhash_lock)) { uid_hash_remove(up); remove_user = 1; spin_unlock_irqrestore(&uidhash_lock, flags); } else { - local_irq_restore(flags); + local_irq_restore_nort(flags); } if (!remove_user) 
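Aside: the patch above (and local_irq_save_nort-in-swap.patch further down) relies on the *_nort() wrappers. The demo below is a compileable stand-in, not the -rt tree's actual macro definitions; it only illustrates the intended behaviour, namely that the wrapper really disables interrupts on a non-RT build but degrades to a plain flags save under PREEMPT_RT so the section stays preemptible, the data being protected by a sleeping lock instead.

/*
 * Stand-in demo for the *_nort idea; these are NOT the -rt tree's macros.
 * A global flag plays the part of the CPU interrupt-enable state.
 */
#include <stdio.h>

static int irqs_enabled = 1;            /* stand-in for the CPU IRQ flag */

#define local_irq_save(f)       do { (f) = irqs_enabled; irqs_enabled = 0; } while (0)
#define local_irq_restore(f)    do { irqs_enabled = (f); } while (0)

#ifdef CONFIG_PREEMPT_RT
/* RT: only remember the flags, leave "interrupts" alone. */
# define local_irq_save_nort(f)         do { (f) = irqs_enabled; } while (0)
# define local_irq_restore_nort(f)      do { (void)(f); } while (0)
#else
/* non-RT: behave exactly like the plain IRQ-disabling versions. */
# define local_irq_save_nort(f)         local_irq_save(f)
# define local_irq_restore_nort(f)      local_irq_restore(f)
#endif

int main(void)
{
        int flags;

        local_irq_save_nort(flags);
        printf("inside the section: irqs_enabled=%d\n", irqs_enabled);
        local_irq_restore_nort(flags);
        printf("after the section:  irqs_enabled=%d\n", irqs_enabled);
        return 0;
}

Built plainly it prints 0 inside the section; built with -DCONFIG_PREEMPT_RT it prints 1, which is the point of the conversion.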
��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/drain-all-local-pages-via-sched.patch�������������������������������������������������������0000664�0000764�0000764�00000003215�11041657734�020160� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) Index: linux-2.6.24.7/mm/page_alloc.c =================================================================== --- linux-2.6.24.7.orig/mm/page_alloc.c +++ linux-2.6.24.7/mm/page_alloc.c @@ -1020,6 +1020,38 @@ void smp_drain_local_pages(void *arg) */ void drain_all_local_pages(void) { +#ifdef CONFIG_PREEMPT_RT + /* + * HACK!!!!! + * For RT we can't use IPIs to run drain_local_pages, since + * that code will call spin_locks that will now sleep. + * But, schedule_on_each_cpu will call kzalloc, which will + * call page_alloc which was what calls this. + * + * Luckily, there's a condition to get here, and that is if + * the order passed in to alloc_pages is greater than 0 + * (alloced more than a page size). The slabs only allocate + * what is needed, and the allocation made by schedule_on_each_cpu + * does an alloc of "sizeof(void *)*nr_cpu_ids". + * + * So we can safely call schedule_on_each_cpu if that number + * is less than a page. Otherwise don't bother. At least warn of + * this issue. + * + * And yes, this is one big hack. 
Please fix ;-) + */ + if (sizeof(void *)*nr_cpu_ids < PAGE_SIZE) + schedule_on_each_cpu(smp_drain_local_pages, NULL, 0, 1); + else { + static int once; + if (!once) { + printk(KERN_ERR "Can't drain all CPUS due to possible recursion\n"); + once = 1; + } + drain_local_pages(); + } + +#else unsigned long flags; local_irq_save(flags); @@ -1027,6 +1059,7 @@ void drain_all_local_pages(void) local_irq_restore(flags); smp_call_function(smp_drain_local_pages, NULL, 0, 1); +#endif } /* �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/local_irq_save_nort-in-swap.patch�����������������������������������������������������������0000664�0000764�0000764�00000001114�11041657734�017664� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- mm/swap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/mm/swap.c =================================================================== --- linux-2.6.24.7.orig/mm/swap.c +++ linux-2.6.24.7/mm/swap.c @@ -302,9 +302,9 @@ static void drain_cpu_pagevecs(int cpu) unsigned long flags; /* No harm done if a racing interrupt already did this */ - local_irq_save(flags); + local_irq_save_nort(flags); pagevec_move_tail(pvec); - local_irq_restore(flags); + local_irq_restore_nort(flags); } swap_per_cpu_unlock(lru_rotate_pvecs, cpu); } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/proportions-raw-locks.patch�����������������������������������������������������������������0000664�0000764�0000764�00000001727�11041657734�016573� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- include/linux/proportions.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/proportions.h =================================================================== --- linux-2.6.24.7.orig/include/linux/proportions.h +++ linux-2.6.24.7/include/linux/proportions.h @@ -58,7 +58,7 @@ struct prop_local_percpu { */ int shift; unsigned long period; - spinlock_t lock; /* protect the snapshot state */ + raw_spinlock_t lock; /* protect the snapshot state */ }; int prop_local_init_percpu(struct prop_local_percpu *pl); @@ -93,11 +93,11 @@ struct prop_local_single { */ int shift; unsigned long period; - spinlock_t lock; /* protect the 
snapshot state */ + raw_spinlock_t lock; /* protect the snapshot state */ }; #define INIT_PROP_LOCAL_SINGLE(name) \ -{ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \ +{ .lock = RAW_SPIN_LOCK_UNLOCKED(name.lock), \ } int prop_local_init_single(struct prop_local_single *pl); �����������������������������������������patches/arm-compile-fix.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000001443�11041657734�015261� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Kevin Hilman <khilman@mvista.com> Subject: [PATCH -rt] ARM: compile fix for event tracing The cycles/usecs conversion macros should be dependent on CONFIG_EVENT_TRACE instead of CONFIG_LATENCY_TIMING. Signed-off-by: Kevin Hilman <khilman@mvista.com> --- include/asm-arm/timex.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/asm-arm/timex.h =================================================================== --- linux-2.6.24.7.orig/include/asm-arm/timex.h +++ linux-2.6.24.7/include/asm-arm/timex.h @@ -18,7 +18,7 @@ typedef unsigned long cycles_t; #ifndef mach_read_cycles #define mach_read_cycles() (0) -#ifdef CONFIG_LATENCY_TIMING +#ifdef CONFIG_EVENT_TRACE #define mach_cycles_to_usecs(d) (d) #define mach_usecs_to_cycles(d) (d) #endif �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/no-warning-for-irqs-disabled-in-local-bh-enable.patch���������������������������������������0000664�0000764�0000764�00000001771�11041657733�023166� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Kevin Hilman <kevin@hilman.org> Subject: [PATCH/RFC -rt] local_bh_enable() is safe for irqs_disabled() In local_bh_enable() there is a WARN_ON(irqs_disabled()), but looking at the rest of the code, it seems it expects to be used with interrupts off, so is this warning really necessary? I hit this warning in the ads7846 touchscreen driver timer function where enable_irq() may be called with interrupts disabled. Since enable_irq now calls local_bh_disable/enable for IRQ resend, this warning is triggered. 
Patch against 2.6.23.9-rt12 Signed-off-by: Kevin Hilman <khilman@mvista.com> --- kernel/softirq.c | 1 - 1 file changed, 1 deletion(-) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -207,7 +207,6 @@ void local_bh_enable(void) WARN_ON_ONCE(in_irq()); #endif - WARN_ON_ONCE(irqs_disabled()); #ifdef CONFIG_TRACE_IRQFLAGS local_irq_save(flags); �������patches/page-alloc-use-real-time-pcp-locking-for-page-draining.patch��������������������������������0000664�0000764�0000764�00000002116�11041657734�024434� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Andi Kleen <ak@suse.de> Subject: [PATCH for 2.6.24rc2-rt1] Use real time pcp locking for page draining during cpu unplug Use real time pcp locking for page draining during cpu unplug Looks like a merging mistake that happened at some point. This is the only place in the file that disables interrupts directly. This fixes one case of CPU hotunplug failing on RT, but there are still more. Signed-off-by: Andi Kleen <ak@suse.de> --- mm/page_alloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/mm/page_alloc.c =================================================================== --- linux-2.6.24.7.orig/mm/page_alloc.c +++ linux-2.6.24.7/mm/page_alloc.c @@ -4058,10 +4058,11 @@ static int page_alloc_cpu_notify(struct int cpu = (unsigned long)hcpu; if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) { - local_irq_disable(); + unsigned long flags; + __lock_cpu_pcp(&flags, cpu); __drain_pages(cpu); vm_events_fold_cpu(cpu); - local_irq_enable(); + unlock_cpu_pcp(flags, cpu); refresh_cpu_vm_stats(cpu); } return NOTIFY_OK; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/handle-pending-in-simple-irq.patch����������������������������������������������������������0000664�0000764�0000764�00000002230�11041657732�017622� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: handle IRQ_PENDING for simple irq handler With the IO-APIC pcix hack (level=>edge masking), we can receive interrupts while masked. But these interrupts might be missed. Also, normal "simple" interrupts might be missed too on leaving of thread handler. 
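Aside: the replay loop added by this patch can be modelled outside the kernel. The program below is illustrative only (no real interrupt handling, simplified names); it shows the ordering that matters: clear IRQ_PENDING before running the handler, then loop while the flag was raised again, so an edge that arrives mid-handling is replayed rather than lost.

/*
 * User-space model of the replay loop (illustrative names, no real IRQs).
 * A "new edge" is simulated during the first pass through the handler.
 */
#include <stdio.h>

#define IRQ_PENDING     0x1
#define IRQ_INPROGRESS  0x2

static unsigned int status = IRQ_PENDING | IRQ_INPROGRESS;
static int runs;

/* Models handle_IRQ_event(); the device raises the line again on the first run. */
static void handle_irq_event(void)
{
        runs++;
        if (runs == 1)
                status |= IRQ_PENDING;
}

static void thread_simple_irq(void)
{
        do {
                status &= ~IRQ_PENDING;         /* accept this edge before handling it */
                handle_irq_event();
        } while (status & IRQ_PENDING);         /* replay any edge that arrived meanwhile */

        status &= ~IRQ_INPROGRESS;
}

int main(void)
{
        thread_simple_irq();
        printf("handler ran %d times, status=%#x\n", runs, status);
        return 0;
}

With the pre-patch "run once" shape the handler would run a single time and leave PENDING set; with the do/while it runs twice and exits with the flag clear.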
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/irq/manage.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -652,14 +652,17 @@ static void thread_simple_irq(irq_desc_t unsigned int irq = desc - irq_desc; irqreturn_t action_ret; - if (action && !desc->depth) { + do { + if (!action || desc->depth) + break; + desc->status &= ~IRQ_PENDING; spin_unlock(&desc->lock); action_ret = handle_IRQ_event(irq, action); cond_resched_hardirq_context(); spin_lock_irq(&desc->lock); if (!noirqdebug) note_interrupt(irq, desc, action_ret); - } + } while (desc->status & IRQ_PENDING); desc->status &= ~IRQ_INPROGRESS; } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/use-edge-triggered-irq-handler-instead-of-simple-irq.patch����������������������������������0000664�0000764�0000764�00000005161�11041657734�024254� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Remy Bohmer <linux@bohmer.net> Subject: [AT91: PATCH]: Use edge triggered interrupt handling for AT91-GPIO instead of simple_irq-handler On ARM there is a problem where the interrupt handler stalls when they are coming faster than the kernel can handle. The problem seems to occur on RT primarily, but the problem is also valid for non-RT kernels. The problem is twofold: * the handle_simple_irq() mechanism is used for GPIO, but because the GPIO interrupt source is actually an edge triggered interrupt source, the handle_edge_irq() mechanism must be used. While using the simple_irq() mechanisms edges can be missed for either mainline as RT kernels. The simple_irq mechanism is *never* meant to be used for these types of interrupts. See the thread at: http://lkml.org/lkml/2007/11/26/73 * The RT kernels has a problem that the interrupt get masked forever while the interrupt thread is running and a new interrupt arrives. In the interrupt threads there is masking done in the handle_simple_irq() path, while a simple_irq typically cannot be masked. This patch only solves the first bullet, which is enough for AT91, by moving the GPIO interrupt handler towards the handle_edge_irq(). To solve the problem in the simple_irq() path a seperate fix has to be done, but as it is no longer used by AT91, that fix will not affect AT91. Tested on: * AT91rm9200-ek, and proprietary board * AT91SAM9261-ek. 
(This patches also solves the problem that the DM9000 does not work on this board while using PREEMPT-RT) Signed-off-by: Remy Bohmer <linux@bohmer.net> --- arch/arm/mach-at91/gpio.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/arm/mach-at91/gpio.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/mach-at91/gpio.c +++ linux-2.6.24.7/arch/arm/mach-at91/gpio.c @@ -362,12 +362,18 @@ static int gpio_irq_type(unsigned pin, u return (type == IRQT_BOTHEDGE) ? 0 : -EINVAL; } +static void gpio_irq_ack_noop(unsigned int irq) +{ + /* Dummy function. */ +} + static struct irq_chip gpio_irqchip = { .name = "GPIO", .mask = gpio_irq_mask, .unmask = gpio_irq_unmask, .set_type = gpio_irq_type, .set_wake = gpio_irq_set_wake, + .ack = gpio_irq_ack_noop, }; static void gpio_irq_handler(unsigned irq, struct irq_desc *desc) @@ -442,7 +448,7 @@ void __init at91_gpio_irq_setup(void) * shorter, and the AIC handles interrupts sanely. */ set_irq_chip(pin, &gpio_irqchip); - set_irq_handler(pin, handle_simple_irq); + set_irq_handler(pin, handle_edge_irq); set_irq_flags(pin, IRQF_VALID); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/dev-queue-xmit-preempt-fix.patch������������������������������������������������������������0000664�0000764�0000764�00000011144�11041657732�017402� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From mingo@elte.hu Fri Jan 11 14:56:57 2008 Date: Thu, 3 Jan 2008 09:22:03 +0100 From: Ingo Molnar <mingo@elte.hu> To: Steven Rostedt <rostedt@goodmis.org> Subject: [mbeauch@cox.net: FW: [PATCH -rt] Preemption problem in kernel RT Patch] ----- Forwarded message from mbeauch <mbeauch@cox.net> ----- Date: Wed, 02 Jan 2008 20:27:09 -0500 From: mbeauch <mbeauch@cox.net> To: mingo@elte.hu Subject: FW: [PATCH -rt] Preemption problem in kernel RT Patch Here's the updated patch: Changed the real-time patch code to detect recursive calls to dev_queue_xmit and drop the packet when detected. Signed-off-by: Mark Beauchemin <mark.beauchemin@sycamorenet.com> --- include/linux/netdevice.h | 20 ++++++++++---------- net/core/dev.c | 14 +++----------- net/sched/sch_generic.c | 4 ++-- 3 files changed, 15 insertions(+), 23 deletions(-) Index: linux-2.6.24.7/include/linux/netdevice.h =================================================================== --- linux-2.6.24.7.orig/include/linux/netdevice.h +++ linux-2.6.24.7/include/linux/netdevice.h @@ -629,7 +629,7 @@ struct net_device /* cpu id of processor entered to hard_start_xmit or -1, if nobody entered there. 
*/ - int xmit_lock_owner; + void *xmit_lock_owner; void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); @@ -1341,46 +1341,46 @@ static inline void netif_rx_complete(str * * Get network device transmit lock */ -static inline void __netif_tx_lock(struct net_device *dev, int cpu) +static inline void __netif_tx_lock(struct net_device *dev) { spin_lock(&dev->_xmit_lock); - dev->xmit_lock_owner = cpu; + dev->xmit_lock_owner = (void *)current; } static inline void netif_tx_lock(struct net_device *dev) { - __netif_tx_lock(dev, raw_smp_processor_id()); + __netif_tx_lock(dev); } static inline void netif_tx_lock_bh(struct net_device *dev) { spin_lock_bh(&dev->_xmit_lock); - dev->xmit_lock_owner = raw_smp_processor_id(); + dev->xmit_lock_owner = (void *)current; } static inline int netif_tx_trylock(struct net_device *dev) { int ok = spin_trylock(&dev->_xmit_lock); if (likely(ok)) - dev->xmit_lock_owner = raw_smp_processor_id(); + dev->xmit_lock_owner = (void *)current; return ok; } static inline void netif_tx_unlock(struct net_device *dev) { - dev->xmit_lock_owner = -1; + dev->xmit_lock_owner = (void *)-1; spin_unlock(&dev->_xmit_lock); } static inline void netif_tx_unlock_bh(struct net_device *dev) { - dev->xmit_lock_owner = -1; + dev->xmit_lock_owner = (void *)-1; spin_unlock_bh(&dev->_xmit_lock); } -#define HARD_TX_LOCK(dev, cpu) { \ +#define HARD_TX_LOCK(dev) { \ if ((dev->features & NETIF_F_LLTX) == 0) { \ - __netif_tx_lock(dev, cpu); \ + __netif_tx_lock(dev); \ } \ } Index: linux-2.6.24.7/net/core/dev.c =================================================================== --- linux-2.6.24.7.orig/net/core/dev.c +++ linux-2.6.24.7/net/core/dev.c @@ -1692,18 +1692,10 @@ gso: Either shot noqueue qdisc, it is even simpler 8) */ if (dev->flags & IFF_UP) { - int cpu = raw_smp_processor_id(); /* ok because BHs are off */ - /* - * No need to check for recursion with threaded interrupts: - */ -#ifdef CONFIG_PREEMPT_RT - if (1) { -#else - if (dev->xmit_lock_owner != cpu) { -#endif + if (dev->xmit_lock_owner != (void *)current) { - HARD_TX_LOCK(dev, cpu); + HARD_TX_LOCK(dev); if (!netif_queue_stopped(dev) && !netif_subqueue_stopped(dev, skb)) { @@ -3634,7 +3626,7 @@ int register_netdevice(struct net_device spin_lock_init(&dev->queue_lock); spin_lock_init(&dev->_xmit_lock); netdev_set_lockdep_class(&dev->_xmit_lock, dev->type); - dev->xmit_lock_owner = -1; + dev->xmit_lock_owner = (void *)-1; spin_lock_init(&dev->ingress_lock); dev->iflink = -1; Index: linux-2.6.24.7/net/sched/sch_generic.c =================================================================== --- linux-2.6.24.7.orig/net/sched/sch_generic.c +++ linux-2.6.24.7/net/sched/sch_generic.c @@ -89,7 +89,7 @@ static inline int handle_dev_cpu_collisi { int ret; - if (unlikely(dev->xmit_lock_owner == raw_smp_processor_id())) { + if (unlikely(dev->xmit_lock_owner == (void *)current)) { /* * Same CPU holding the lock. It may be a transient * configuration error, when hard_start_xmit() recurses. 
We @@ -146,7 +146,7 @@ static inline int qdisc_restart(struct n /* And release queue */ spin_unlock(&dev->queue_lock); - HARD_TX_LOCK(dev, raw_smp_processor_id()); + HARD_TX_LOCK(dev); if (!netif_subqueue_stopped(dev, skb)) ret = dev_hard_start_xmit(skb, dev); HARD_TX_UNLOCK(dev); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/dynamically-update-root-domain-span-online-maps.patch���������������������������������������0000664�0000764�0000764�00000007772�11041657730�023472� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Fri Jan 11 14:52:37 2008 Date: Mon, 17 Dec 2007 21:40:32 -0500 From: Gregory Haskins <ghaskins@novell.com> To: srostedt@redhat.com Cc: mingo@elte.hu, linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org, ghaskins@novell.com Subject: [PATCH] sched: dynamically update the root-domain span/online maps [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] Hi Steven, I posted a suspend-to-ram fix to sched-devel earlier today: http://lkml.org/lkml/2007/12/17/445 This fix should also be applied to -rt as I introduced the same regression there. Here is a version of the fix for 23-rt13. I can submit a version for 24-rc5-rt1 at your request. Regards, -Greg --------------------------------- The baseline code statically builds the span maps when the domain is formed. Previous attempts at dynamically updating the maps caused a suspend-to-ram regression, which should now be fixed. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Gautham R Shenoy <ego@in.ibm.com> --- kernel/sched.c | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -336,8 +336,6 @@ struct rt_rq { * exclusive cpuset is created, we also create and attach a new root-domain * object. * - * By default the system creates a single root-domain with all cpus as - * members (mimicking the global state we have today). */ struct root_domain { atomic_t refcount; @@ -355,6 +353,10 @@ struct root_domain { #endif }; +/* + * By default the system creates a single root-domain with all cpus as + * members (mimicking the global state we have today). 
+ */ static struct root_domain def_root_domain; #endif @@ -6344,6 +6346,10 @@ static void rq_attach_root(struct rq *rq atomic_inc(&rd->refcount); rq->rd = rd; + cpu_set(rq->cpu, rd->span); + if (cpu_isset(rq->cpu, cpu_online_map)) + cpu_set(rq->cpu, rd->online); + for (class = sched_class_highest; class; class = class->next) { if (class->join_domain) class->join_domain(rq); @@ -6352,12 +6358,12 @@ static void rq_attach_root(struct rq *rq spin_unlock_irqrestore(&rq->lock, flags); } -static void init_rootdomain(struct root_domain *rd, const cpumask_t *map) +static void init_rootdomain(struct root_domain *rd) { memset(rd, 0, sizeof(*rd)); - rd->span = *map; - cpus_and(rd->online, rd->span, cpu_online_map); + cpus_clear(rd->span); + cpus_clear(rd->online); cpupri_init(&rd->cpupri); @@ -6365,13 +6371,11 @@ static void init_rootdomain(struct root_ static void init_defrootdomain(void) { - cpumask_t cpus = CPU_MASK_ALL; - - init_rootdomain(&def_root_domain, &cpus); + init_rootdomain(&def_root_domain); atomic_set(&def_root_domain.refcount, 1); } -static struct root_domain *alloc_rootdomain(const cpumask_t *map) +static struct root_domain *alloc_rootdomain(void) { struct root_domain *rd; @@ -6379,7 +6383,7 @@ static struct root_domain *alloc_rootdom if (!rd) return NULL; - init_rootdomain(rd, map); + init_rootdomain(rd); return rd; } @@ -6800,7 +6804,7 @@ static int build_sched_domains(const cpu sched_group_nodes_bycpu[first_cpu(*cpu_map)] = sched_group_nodes; #endif - rd = alloc_rootdomain(cpu_map); + rd = alloc_rootdomain(); if (!rd) { printk(KERN_WARNING "Cannot alloc root domain\n"); return -ENOMEM; @@ -7356,7 +7360,6 @@ void __init sched_init(void) #ifdef CONFIG_SMP rq->sd = NULL; rq->rd = NULL; - rq_attach_root(rq, &def_root_domain); rq->active_balance = 0; rq->next_balance = jiffies; rq->push_cpu = 0; @@ -7365,6 +7368,7 @@ void __init sched_init(void) INIT_LIST_HEAD(&rq->migration_queue); rq->rt.highest_prio = MAX_RT_PRIO; rq->rt.overloaded = 0; + rq_attach_root(rq, &def_root_domain); #endif atomic_set(&rq->nr_iowait, 0); ������patches/ppc-hacks-to-allow-rt-to-run-kernbench.patch������������������������������������������������0000664�0000764�0000764�00000017052�11041657732�021500� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From paulmck@linux.vnet.ibm.com Fri Jan 11 14:00:39 2008 Date: Wed, 12 Dec 2007 22:10:29 -0800 From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org, tony@bakeyournoodle.com, paulus@samba.org, benh@kernel.crashing.org, dino@in.ibm.com, tytso@us.ibm.com, dvhltc@us.ibm.com, antonb@us.ibm.com Subject: Re: [PATCH, RFC] hacks to allow -rt to run kernbench on POWER On Wed, Dec 12, 2007 at 10:56:12PM -0500, Steven Rostedt wrote: > > On Mon, 29 Oct 2007, Paul E. 
McKenney wrote: > > diff -urpNa -X dontdiff linux-2.6.23.1-rt4/mm/memory.c linux-2.6.23.1-rt4-fix/mm/memory.c > > --- linux-2.6.23.1-rt4/mm/memory.c 2007-10-27 22:20:57.000000000 -0700 > > +++ linux-2.6.23.1-rt4-fix/mm/memory.c 2007-10-28 15:40:36.000000000 -0700 > > @@ -664,6 +664,7 @@ static unsigned long zap_pte_range(struc > > int anon_rss = 0; > > > > pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > > + preempt_disable(); > > arch_enter_lazy_mmu_mode(); > > do { > > pte_t ptent = *pte; > > @@ -732,6 +733,7 @@ static unsigned long zap_pte_range(struc > > > > add_mm_rss(mm, file_rss, anon_rss); > > arch_leave_lazy_mmu_mode(); > > + preempt_enable(); > > pte_unmap_unlock(pte - 1, ptl); > > > > return addr; > > I'm pulling your patch for the above added code. Took me a few hours to > find the culprit, but I was getting scheduling in atomic bugs. Turns out > that this code you put "preempt_disable" in calls sleeping spinlocks. > > Might want to run with DEBUG_PREEMPT. I thought that you had already pulled the above version... Here is the replacement that I posted on November 9th (with much help from Ben H): http://lkml.org/lkml/2007/11/9/114 Thanx, Paul Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- --- arch/powerpc/kernel/process.c | 22 ++++++++++++++++++++++ arch/powerpc/kernel/prom.c | 2 +- arch/powerpc/mm/tlb_64.c | 5 ++++- arch/powerpc/platforms/pseries/eeh.c | 2 +- drivers/of/base.c | 2 +- include/asm-powerpc/tlb.h | 5 ++++- include/asm-powerpc/tlbflush.h | 15 ++++++++++----- 7 files changed, 43 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/process.c +++ linux-2.6.24.7/arch/powerpc/kernel/process.c @@ -267,6 +267,10 @@ struct task_struct *__switch_to(struct t struct thread_struct *new_thread, *old_thread; unsigned long flags; struct task_struct *last; +#ifdef CONFIG_PREEMPT_RT + struct ppc64_tlb_batch *batch; + int hadbatch; +#endif /* #ifdef CONFIG_PREEMPT_RT */ #ifdef CONFIG_SMP /* avoid complexity of lazy save/restore of fpu @@ -347,6 +351,17 @@ struct task_struct *__switch_to(struct t } #endif +#ifdef CONFIG_PREEMPT_RT + batch = &__get_cpu_var(ppc64_tlb_batch); + if (batch->active) { + hadbatch = 1; + if (batch->index) { + __flush_tlb_pending(batch); + } + batch->active = 0; + } +#endif /* #ifdef CONFIG_PREEMPT_RT */ + local_irq_save(flags); account_system_vtime(current); @@ -357,6 +372,13 @@ struct task_struct *__switch_to(struct t local_irq_restore(flags); +#ifdef CONFIG_PREEMPT_RT + if (hadbatch) { + batch = &__get_cpu_var(ppc64_tlb_batch); + batch->active = 1; + } +#endif /* #ifdef CONFIG_PREEMPT_RT */ + return last; } Index: linux-2.6.24.7/arch/powerpc/kernel/prom.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/prom.c +++ linux-2.6.24.7/arch/powerpc/kernel/prom.c @@ -79,7 +79,7 @@ struct boot_param_header *initial_boot_p extern struct device_node *allnodes; /* temporary while merging */ -extern rwlock_t devtree_lock; /* temporary while merging */ +extern raw_rwlock_t devtree_lock; /* temporary while merging */ /* export that to outside world */ struct device_node *of_chosen; Index: linux-2.6.24.7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/mm/tlb_64.c +++ linux-2.6.24.7/arch/powerpc/mm/tlb_64.c @@ -131,7 +131,7 @@ void pgtable_free_tlb(struct mmu_gather void 
hpte_need_flush(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long pte, int huge) { - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch); unsigned long vsid, vaddr; unsigned int psize; int ssize; @@ -182,6 +182,7 @@ void hpte_need_flush(struct mm_struct *m */ if (!batch->active) { flush_hash_page(vaddr, rpte, psize, ssize, 0); + put_cpu_var(ppc64_tlb_batch); return; } @@ -216,12 +217,14 @@ void hpte_need_flush(struct mm_struct *m */ if (machine_is(celleb)) { __flush_tlb_pending(batch); + put_cpu_var(ppc64_tlb_batch); return; } #endif /* CONFIG_PREEMPT_RT */ if (i >= PPC64_TLB_BATCH_NR) __flush_tlb_pending(batch); + put_cpu_var(ppc64_tlb_batch); } /* Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/eeh.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/eeh.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/eeh.c @@ -97,7 +97,7 @@ int eeh_subsystem_enabled; EXPORT_SYMBOL(eeh_subsystem_enabled); /* Lock to avoid races due to multiple reports of an error */ -static DEFINE_SPINLOCK(confirm_error_lock); +static DEFINE_RAW_SPINLOCK(confirm_error_lock); /* Buffer for reporting slot-error-detail rtas calls. Its here * in BSS, and not dynamically alloced, so that it ends up in Index: linux-2.6.24.7/drivers/of/base.c =================================================================== --- linux-2.6.24.7.orig/drivers/of/base.c +++ linux-2.6.24.7/drivers/of/base.c @@ -25,7 +25,7 @@ struct device_node *allnodes; /* use when traversing tree through the allnext, child, sibling, * or parent members of struct device_node. */ -DEFINE_RWLOCK(devtree_lock); +DEFINE_RAW_RWLOCK(devtree_lock); int of_n_addr_cells(struct device_node *np) { Index: linux-2.6.24.7/include/asm-powerpc/tlb.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/tlb.h +++ linux-2.6.24.7/include/asm-powerpc/tlb.h @@ -46,8 +46,11 @@ static inline void tlb_flush(struct mmu_ * pages are going to be freed and we really don't want to have a CPU * access a freed page because it has a stale TLB */ - if (tlbbatch->index) + if (tlbbatch->index) { + preempt_disable(); __flush_tlb_pending(tlbbatch); + preempt_enable(); + } pte_free_finish(); } Index: linux-2.6.24.7/include/asm-powerpc/tlbflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/tlbflush.h +++ linux-2.6.24.7/include/asm-powerpc/tlbflush.h @@ -109,18 +109,23 @@ extern void hpte_need_flush(struct mm_st static inline void arch_enter_lazy_mmu_mode(void) { - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch); batch->active = 1; + put_cpu_var(ppc64_tlb_batch); } static inline void arch_leave_lazy_mmu_mode(void) { - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch); - if (batch->index) - __flush_tlb_pending(batch); - batch->active = 0; + if (batch->active) { + if (batch->index) { + __flush_tlb_pending(batch); + } + batch->active = 0; + } + put_cpu_var(ppc64_tlb_batch); } #define arch_flush_lazy_mmu_mode() do {} while (0) 
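Aside: much of the churn in the hunks above is converting __get_cpu_var() users of the TLB batch into get_cpu_var()/put_cpu_var() pairs. The sketch below is a user-space model of that discipline; none of it is the kernel's implementation and the function names are invented for the example. The rule being illustrated is that every path which takes the per-CPU batch with preemption disabled must drop it again, including the early-return paths the patch has to annotate.

/*
 * Model of the get_cpu_var()/put_cpu_var() discipline, not kernel code.
 * A counter plays the part of the preempt count; the assert at the end
 * checks that every path dropped the reference it took.
 */
#include <assert.h>
#include <stdio.h>

#define NR_CPUS 4

struct tlb_batch { int active; int index; };

static struct tlb_batch batches[NR_CPUS];
static int preempt_count;
static int current_cpu;                 /* stand-in for smp_processor_id() */

static struct tlb_batch *get_cpu_batch(void)
{
        preempt_count++;                /* models preempt_disable() in get_cpu_var() */
        return &batches[current_cpu];
}

static void put_cpu_batch(void)
{
        assert(preempt_count > 0);
        preempt_count--;                /* models preempt_enable() in put_cpu_var() */
}

/* Mirrors the early-return shape of hpte_need_flush() after the patch. */
static void need_flush(void)
{
        struct tlb_batch *batch = get_cpu_batch();

        if (!batch->active) {
                put_cpu_batch();        /* the early return must drop the reference too */
                return;
        }
        batch->index++;
        put_cpu_batch();
}

int main(void)
{
        need_flush();                   /* inactive batch: early-return path */
        batches[0].active = 1;
        need_flush();                   /* active batch: normal path */

        printf("index=%d preempt_count=%d\n", batches[0].index, preempt_count);
        assert(preempt_count == 0);     /* balanced on both paths */
        return 0;
}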
��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc64-non-smp-compile-fix-per-cpu.patch�����������������������������������������������������0000664�0000764�0000764�00000002421�11041657733�020370� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From estarkov@ru.mvista.com Fri Jan 11 14:54:21 2008 Date: Thu, 20 Dec 2007 17:15:38 +0300 From: Egor Starkov <estarkov@ru.mvista.com> To: mingo@elte.hu Cc: rostedt@goodmis.org, linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@ozlabs.org Subject: PPC64 doesn't compile with CONFIG_SMP=n Hello Ingo I've found out that real-time tree doesn't compile for PPC64 with CONFIG_SMP=n. Think this is due to patch-2.6.21.4-rt10 patch. It has definitions of following symbols missing: __get_cpu_lock, __get_cpu_var_locked. I've attached the patch to fix the problem. Egor Starkov [ Part 2: "Attached Text" ] Signed-off-by: Egor Starkov <estarkov@ru.mvista.com> --- include/asm-powerpc/percpu.h | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/include/asm-powerpc/percpu.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/percpu.h +++ linux-2.6.24.7/include/asm-powerpc/percpu.h @@ -68,6 +68,8 @@ extern void setup_per_cpu_areas(void); #define __get_cpu_var(var) per_cpu__##var #define __raw_get_cpu_var(var) per_cpu__##var +#define __get_cpu_lock(var, cpu) per_cpu_lock__##var##_locked +#define __get_cpu_var_locked(var, cpu) per_cpu__##var##_locked #endif /* SMP */ �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-trace-markers-1.patch�����������������������������������������������������������0000664�0000764�0000764�00000044202�11041657734�017427� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From prasad@linux.vnet.ibm.com Fri Jan 11 14:55:27 2008 Date: Tue, 8 Jan 2008 01:25:09 +0530 From: K. Prasad <prasad@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org, mingo@elte.hu Cc: Gautham R Shenoy <ego@in.ibm.com>, K. Prasad <prasad@linux.vnet.ibm.com>, mathieu.desnoyers@polymtl.ca, linux-rt-users@vger.kernel.org, dipankar@in.ibm.com, paulmck@linux.vnet.ibm.com Subject: [PATCH 1/2] Markers Implementation for RCU Preempt Tracing - Ver II This patch converts Preempt RCU Tracing code infrastructure to implement markers. 
- The rcupreempt_trace structure has been moved to the tracing infrastructure and de-linked from the rcupreempt.c code. A per-cpu instance of rcupreempt_trace structure will be maintained in rcupreempt_trace.c - The above change also renders a few macro definitions unused (such as RCU_TRACE_CPU, RCU_TRACE_ME and RCU_TRACE_RDP) which have been removed. - Some of the helper functions in rcupreempt.c which were exported only when CONFIG_RCU_TRACE was set are now exported unconditionally. These functions operate on per-cpu variables that are used both by the RCU and RCU Tracing code. The changes help in making RCU Tracing code operate as a kernel module also. - The references to rcupreempt-boost tracing in the module initialisation and cleanup have been removed here to enable kernel build, but will be brought in after enclosing them inside a #ifdef CONFIG_PREEMPT_RCU_BOOST. Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> --- include/linux/rcupreempt.h | 10 ---- include/linux/rcupreempt_trace.h | 50 ++++++++++++------------ kernel/Kconfig.preempt | 7 +-- kernel/rcupreempt.c | 77 ++++++++++---------------------------- kernel/rcupreempt_trace.c | 79 +++++++++++++++++++++++++++++++++++++-- 5 files changed, 125 insertions(+), 98 deletions(-) Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -96,16 +96,6 @@ extern int rcu_pending_rt(int cpu); struct softirq_action; extern void rcu_process_callbacks_rt(struct softirq_action *unused); -#ifdef CONFIG_RCU_TRACE -struct rcupreempt_trace; -extern int *rcupreempt_flipctr(int cpu); -extern long rcupreempt_data_completed(void); -extern int rcupreempt_flip_flag(int cpu); -extern int rcupreempt_mb_flag(int cpu); -extern char *rcupreempt_try_flip_state_name(void); -extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu); -#endif - struct softirq_action; #ifdef CONFIG_NO_HZ Index: linux-2.6.24.7/include/linux/rcupreempt_trace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt_trace.h +++ linux-2.6.24.7/include/linux/rcupreempt_trace.h @@ -69,32 +69,32 @@ struct rcupreempt_trace { long rcu_try_flip_m2; }; -#ifdef CONFIG_RCU_TRACE -#define RCU_TRACE(fn, arg) fn(arg); -#else -#define RCU_TRACE(fn, arg) -#endif +struct rcupreempt_probe_data { + const char *name; + const char *format; + marker_probe_func *probe_func; +}; + +#define DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_worker) \ +void rcupreempt_trace_worker##_callback(const struct marker *mdata, \ + void *private_data, const char *format, ...) 
\ +{ \ + struct rcupreempt_trace *trace; \ + trace = (&per_cpu(trace_data, smp_processor_id())); \ + rcupreempt_trace_worker(trace); \ +} + +#define INIT_RCUPREEMPT_PROBE(rcupreempt_trace_worker) \ +{ \ + .name = __stringify(rcupreempt_trace_worker), \ + .probe_func = rcupreempt_trace_worker##_callback \ +} -extern void rcupreempt_trace_move2done(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_move2wait(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_e1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_i1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_ie1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_g1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_a1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_ae1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_a2(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_z1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_ze1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_z2(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_m1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_me1(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_try_flip_m2(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_check_callbacks(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_done_remove(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_invoke(struct rcupreempt_trace *trace); -extern void rcupreempt_trace_next_add(struct rcupreempt_trace *trace); +extern int *rcupreempt_flipctr(int cpu); +extern long rcupreempt_data_completed(void); +extern int rcupreempt_flip_flag(int cpu); +extern int rcupreempt_mb_flag(int cpu); +extern char *rcupreempt_try_flip_state_name(void); #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPREEMPT_TRACE_H */ Index: linux-2.6.24.7/kernel/Kconfig.preempt =================================================================== --- linux-2.6.24.7.orig/kernel/Kconfig.preempt +++ linux-2.6.24.7/kernel/Kconfig.preempt @@ -172,14 +172,15 @@ config PREEMPT_RCU_BOOST possible OOM problems. config RCU_TRACE - bool "Enable tracing for RCU - currently stats in debugfs" + tristate "Enable tracing for RCU - currently stats in debugfs" select DEBUG_FS - default y + select MARKERS + default m help This option provides tracing in RCU which presents stats in debugfs for debugging RCU implementation. - Say Y here if you want to enable RCU tracing + Say Y/M here if you want to enable RCU tracing in-kernel/module. Say N if you are unsure. config SPINLOCK_BKL Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -54,7 +54,6 @@ #include <linux/delay.h> #include <linux/byteorder/swabb.h> #include <linux/cpumask.h> -#include <linux/rcupreempt_trace.h> /* * PREEMPT_RCU data structures. @@ -71,9 +70,6 @@ struct rcu_data { struct rcu_head **waittail[GP_STAGES]; struct rcu_head *donelist; struct rcu_head **donetail; -#ifdef CONFIG_RCU_TRACE - struct rcupreempt_trace trace; -#endif /* #ifdef CONFIG_RCU_TRACE */ }; struct rcu_ctrlblk { raw_spinlock_t fliplock; /* Protect state-machine transitions. 
*/ @@ -97,10 +93,8 @@ enum rcu_try_flip_states { rcu_try_flip_waitmb_state /* "M" */ }; static enum rcu_try_flip_states rcu_try_flip_state = rcu_try_flip_idle_state; -#ifdef CONFIG_RCU_TRACE static char *rcu_try_flip_state_names[] = { "idle", "waitack", "waitzero", "waitmb" }; -#endif /* #ifdef CONFIG_RCU_TRACE */ /* * Enum and per-CPU flag to determine when each CPU has seen @@ -147,24 +141,6 @@ static cpumask_t rcu_cpu_online_map = CP #define RCU_DATA_CPU(cpu) (&per_cpu(rcu_data, cpu)) /* - * Helper macro for tracing when the appropriate rcu_data is not - * cached in a local variable, but where the CPU number is so cached. - */ -#define RCU_TRACE_CPU(f, cpu) RCU_TRACE(f, &(RCU_DATA_CPU(cpu)->trace)); - -/* - * Helper macro for tracing when the appropriate rcu_data is not - * cached in a local variable. - */ -#define RCU_TRACE_ME(f) RCU_TRACE(f, &(RCU_DATA_ME()->trace)); - -/* - * Helper macro for tracing when the appropriate rcu_data is pointed - * to by a local variable. - */ -#define RCU_TRACE_RDP(f, rdp) RCU_TRACE(f, &((rdp)->trace)); - -/* * Return the number of RCU batches processed thus far. Useful * for debug and statistics. */ @@ -332,7 +308,7 @@ static void __rcu_advance_callbacks(stru if (rdp->waitlist[GP_STAGES - 1] != NULL) { *rdp->donetail = rdp->waitlist[GP_STAGES - 1]; rdp->donetail = rdp->waittail[GP_STAGES - 1]; - RCU_TRACE_RDP(rcupreempt_trace_move2done, rdp); + trace_mark(rcupreempt_trace_move2done, "NULL"); } for (i = GP_STAGES - 2; i >= 0; i--) { if (rdp->waitlist[i] != NULL) { @@ -351,7 +327,7 @@ static void __rcu_advance_callbacks(stru wlc++; rdp->nextlist = NULL; rdp->nexttail = &rdp->nextlist; - RCU_TRACE_RDP(rcupreempt_trace_move2wait, rdp); + trace_mark(rcupreempt_trace_move2wait, "NULL"); } else { rdp->waitlist[0] = NULL; rdp->waittail[0] = &rdp->waitlist[0]; @@ -595,9 +571,9 @@ rcu_try_flip_idle(void) { int cpu; - RCU_TRACE_ME(rcupreempt_trace_try_flip_i1); + trace_mark(rcupreempt_trace_try_flip_i1, "NULL"); if (!rcu_pending(smp_processor_id())) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1); + trace_mark(rcupreempt_trace_try_flip_ie1, "NULL"); return 0; } @@ -605,7 +581,7 @@ rcu_try_flip_idle(void) * Do the flip. */ - RCU_TRACE_ME(rcupreempt_trace_try_flip_g1); + trace_mark(rcupreempt_trace_try_flip_g1, "NULL"); rcu_ctrlblk.completed++; /* stands in for rcu_try_flip_g2 */ /* @@ -635,11 +611,11 @@ rcu_try_flip_waitack(void) { int cpu; - RCU_TRACE_ME(rcupreempt_trace_try_flip_a1); + trace_mark(rcupreempt_trace_try_flip_a1, "NULL"); for_each_cpu_mask(cpu, rcu_cpu_online_map) if (rcu_try_flip_waitack_needed(cpu) && per_cpu(rcu_flip_flag, cpu) != rcu_flip_seen) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_ae1); + trace_mark(rcupreempt_trace_try_flip_ae1, "NULL"); return 0; } @@ -649,7 +625,7 @@ rcu_try_flip_waitack(void) */ smp_mb(); /* see above block comment. */ - RCU_TRACE_ME(rcupreempt_trace_try_flip_a2); + trace_mark(rcupreempt_trace_try_flip_a2, "NULL"); return 1; } @@ -667,11 +643,11 @@ rcu_try_flip_waitzero(void) /* Check to see if the sum of the "last" counters is zero. 
*/ - RCU_TRACE_ME(rcupreempt_trace_try_flip_z1); + trace_mark(rcupreempt_trace_try_flip_z1, "NULL"); for_each_possible_cpu(cpu) sum += per_cpu(rcu_flipctr, cpu)[lastidx]; if (sum != 0) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_ze1); + trace_mark(rcupreempt_trace_try_flip_ze1, "NULL"); return 0; } @@ -684,7 +660,7 @@ rcu_try_flip_waitzero(void) dyntick_save_progress_counter(cpu); } - RCU_TRACE_ME(rcupreempt_trace_try_flip_z2); + trace_mark(rcupreempt_trace_try_flip_z2, "NULL"); return 1; } @@ -698,16 +674,16 @@ rcu_try_flip_waitmb(void) { int cpu; - RCU_TRACE_ME(rcupreempt_trace_try_flip_m1); + trace_mark(rcupreempt_trace_try_flip_m1, "NULL"); for_each_cpu_mask(cpu, rcu_cpu_online_map) if (rcu_try_flip_waitmb_needed(cpu) && per_cpu(rcu_mb_flag, cpu) != rcu_mb_done) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_me1); + trace_mark(rcupreempt_trace_try_flip_me1, "NULL"); return 0; } smp_mb(); /* Ensure that the above checks precede any following flip. */ - RCU_TRACE_ME(rcupreempt_trace_try_flip_m2); + trace_mark(rcupreempt_trace_try_flip_m2, "NULL"); return 1; } @@ -724,9 +700,9 @@ static void rcu_try_flip(void) { unsigned long oldirq; - RCU_TRACE_ME(rcupreempt_trace_try_flip_1); + trace_mark(rcupreempt_trace_try_flip_1, "NULL"); if (unlikely(!spin_trylock_irqsave(&rcu_ctrlblk.fliplock, oldirq))) { - RCU_TRACE_ME(rcupreempt_trace_try_flip_e1); + trace_mark(rcupreempt_trace_try_flip_e1, "NULL"); return; } @@ -778,7 +754,7 @@ void rcu_check_callbacks_rt(int cpu, int if (rcu_ctrlblk.completed == rdp->completed) rcu_try_flip(); spin_lock_irqsave(&rdp->lock, oldirq); - RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp); + trace_mark(rcupreempt_trace_check_callbacks, "NULL"); __rcu_advance_callbacks(rdp); spin_unlock_irqrestore(&rdp->lock, oldirq); } @@ -798,7 +774,7 @@ void rcu_advance_callbacks_rt(int cpu, i return; } spin_lock_irqsave(&rdp->lock, oldirq); - RCU_TRACE_RDP(rcupreempt_trace_check_callbacks, rdp); + trace_mark(rcupreempt_trace_check_callbacks, "NULL"); __rcu_advance_callbacks(rdp); spin_unlock_irqrestore(&rdp->lock, oldirq); } @@ -900,13 +876,13 @@ void rcu_process_callbacks_rt(struct sof } rdp->donelist = NULL; rdp->donetail = &rdp->donelist; - RCU_TRACE_RDP(rcupreempt_trace_done_remove, rdp); + trace_mark(rcupreempt_trace_done_remove, "NULL"); spin_unlock_irqrestore(&rdp->lock, flags); while (list) { next = list->next; list->func(list); list = next; - RCU_TRACE_ME(rcupreempt_trace_invoke); + trace_mark(rcupreempt_trace_invoke, "NULL"); } } @@ -924,7 +900,7 @@ void fastcall call_rcu_preempt(struct rc __rcu_advance_callbacks(rdp); *rdp->nexttail = head; rdp->nexttail = &head->next; - RCU_TRACE_RDP(rcupreempt_trace_next_add, rdp); + trace_mark(rcupreempt_trace_next_add, "NULL"); spin_unlock(&rdp->lock); local_irq_restore(oldirq); } @@ -1006,7 +982,6 @@ void synchronize_kernel(void) synchronize_rcu(); } -#ifdef CONFIG_RCU_TRACE int *rcupreempt_flipctr(int cpu) { return &per_cpu(rcu_flipctr, cpu)[0]; @@ -1030,13 +1005,3 @@ char *rcupreempt_try_flip_state_name(voi return rcu_try_flip_state_names[rcu_try_flip_state]; } EXPORT_SYMBOL_GPL(rcupreempt_try_flip_state_name); - -struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu) -{ - struct rcu_data *rdp = RCU_DATA_CPU(cpu); - - return &rdp->trace; -} -EXPORT_SYMBOL_GPL(rcupreempt_trace_cpu); - -#endif /* #ifdef RCU_TRACE */ Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt_trace.c +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -43,11 
+43,19 @@ #include <linux/mutex.h> #include <linux/rcupreempt_trace.h> #include <linux/debugfs.h> +#include <linux/percpu.h> static struct mutex rcupreempt_trace_mutex; static char *rcupreempt_trace_buf; #define RCUPREEMPT_TRACE_BUF_SIZE 4096 +static DEFINE_PER_CPU(struct rcupreempt_trace, trace_data); + +struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu) +{ + return &per_cpu(trace_data, cpu); +} + void rcupreempt_trace_move2done(struct rcupreempt_trace *trace) { trace->done_length += trace->wait_length; @@ -135,6 +143,51 @@ void rcupreempt_trace_next_add(struct rc trace->next_length++; } +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_move2done); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_move2wait); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_e1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_i1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_ie1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_g1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_a1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_ae1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_a2); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_z1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_ze1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_z2); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_m1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_me1); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_try_flip_m2); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_check_callbacks); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_done_remove); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_invoke); +DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_next_add); + +static struct rcupreempt_probe_data rcupreempt_probe_array[] = +{ + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_move2done), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_move2wait), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_e1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_i1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_ie1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_g1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_a1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_ae1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_a2), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_z1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_ze1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_z2), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_m1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_me1), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_try_flip_m2), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_check_callbacks), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_done_remove), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_invoke), + INIT_RCUPREEMPT_PROBE(rcupreempt_trace_next_add) +}; + static void rcupreempt_trace_sum(struct rcupreempt_trace *sp) { struct rcupreempt_trace *cp; @@ -297,9 +350,6 @@ static int rcupreempt_debugfs_init(void) if (!ctrsdir) goto free_out; - if (!rcu_trace_boost_create(rcudir)) - goto free_out; - return 0; free_out: if (ctrsdir) @@ -316,6 +366,21 @@ out: static int __init rcupreempt_trace_init(void) { int ret; + int i; + + for (i = 0; i < ARRAY_SIZE(rcupreempt_probe_array); i++) { + struct rcupreempt_probe_data *p = 
&rcupreempt_probe_array[i]; + ret = marker_probe_register(p->name, p->format, + p->probe_func, p); + if (ret) + printk(KERN_INFO "Unable to register rcupreempt \ + probe %s\n", rcupreempt_probe_array[i].name); + ret = marker_arm(p->name); + if (ret) + printk(KERN_INFO "Unable to arm rcupreempt probe %s\n", + p->name); + } + printk(KERN_INFO "RCU Preempt markers registered\n"); mutex_init(&rcupreempt_trace_mutex); rcupreempt_trace_buf = kmalloc(RCUPREEMPT_TRACE_BUF_SIZE, GFP_KERNEL); @@ -329,7 +394,12 @@ static int __init rcupreempt_trace_init( static void __exit rcupreempt_trace_cleanup(void) { - rcu_trace_boost_destroy(); + int i; + + for (i = 0; i < ARRAY_SIZE(rcupreempt_probe_array); i++) + marker_probe_unregister(rcupreempt_probe_array[i].name); + printk(KERN_INFO "RCU Preempt markers unregistered\n"); + debugfs_remove(statdir); debugfs_remove(gpdir); debugfs_remove(ctrsdir); @@ -337,6 +407,7 @@ static void __exit rcupreempt_trace_clea kfree(rcupreempt_trace_buf); } +MODULE_LICENSE("GPL"); module_init(rcupreempt_trace_init); module_exit(rcupreempt_trace_cleanup); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-preempt-trace-markers-2.patch�����������������������������������������������������������0000664�0000764�0000764�00000047350�11041657732�017435� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From prasad@linux.vnet.ibm.com Fri Jan 11 14:55:40 2008 Date: Tue, 8 Jan 2008 01:26:57 +0530 From: K. Prasad <prasad@linux.vnet.ibm.com> To: linux-kernel@vger.kernel.org, mingo@elte.hu Cc: Gautham R Shenoy <ego@in.ibm.com>, K. Prasad <prasad@linux.vnet.ibm.com>, mathieu.desnoyers@polymtl.ca, linux-rt-users@vger.kernel.org, dipankar@in.ibm.com, paulmck@linux.vnet.ibm.com Subject: [PATCH 2/2] Markers Implementation for Preempt RCU Boost Tracing - Ver II This patch converts the tracing mechanism of Preempt RCU boosting into markers. The handler functions for these markers are included inside rcupreempt_trace.c and will be included only when PREEMPT_RCU_BOOST is chosen. 
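For reference, the marker pattern this patch builds on looks roughly like the sketch below; example_event, example_probe and example_hits are illustrative placeholders rather than identifiers from the patch, while trace_mark(), marker_probe_register() and marker_arm() are the marker calls the patch itself uses.

#include <linux/marker.h>
#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/init.h>

static DEFINE_PER_CPU(unsigned long, example_hits);

/* probe callback: the signature the DEFINE_*_MARKER_HANDLER macros expand to */
static void example_probe(const struct marker *mdata, void *private_data,
			  const char *format, ...)
{
	per_cpu(example_hits, smp_processor_id())++;
}

static int __init example_probe_init(void)
{
	int ret;

	ret = marker_probe_register("example_event", "NULL", example_probe, NULL);
	if (!ret)
		ret = marker_arm("example_event");
	return ret;
}

/* instrumentation site: close to a NOP until the probe above is armed */
void example_hotpath(void)
{
	trace_mark(example_event, "NULL");
}

The point of the conversion is exactly this split: the boost hot paths only carry the trace_mark() sites, while the per-CPU statistics and the probe callbacks live in the tracing code that registers for the markers it cares about.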
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> --- include/linux/rcupreempt_trace.h | 40 +++++++ kernel/rcupreempt-boost.c | 211 ++++----------------------------------- kernel/rcupreempt_trace.c | 183 +++++++++++++++++++++++++++++++++ 3 files changed, 245 insertions(+), 189 deletions(-) Index: linux-2.6.24.7/include/linux/rcupreempt_trace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt_trace.h +++ linux-2.6.24.7/include/linux/rcupreempt_trace.h @@ -96,5 +96,45 @@ extern int rcupreempt_flip_flag(int cpu) extern int rcupreempt_mb_flag(int cpu); extern char *rcupreempt_try_flip_state_name(void); +#ifdef CONFIG_PREEMPT_RCU_BOOST +struct preempt_rcu_boost_trace { + unsigned long rbs_stat_task_boost_called; + unsigned long rbs_stat_task_boosted; + unsigned long rbs_stat_boost_called; + unsigned long rbs_stat_try_boost; + unsigned long rbs_stat_boosted; + unsigned long rbs_stat_unboost_called; + unsigned long rbs_stat_unboosted; + unsigned long rbs_stat_try_boost_readers; + unsigned long rbs_stat_boost_readers; + unsigned long rbs_stat_try_unboost_readers; + unsigned long rbs_stat_unboost_readers; + unsigned long rbs_stat_over_taken; +}; + +#define DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(preempt_rcu_boost_var) \ +void preempt_rcu_boost_var##_callback(const struct marker *mdata, \ + void *private_data, const char *format, ...) \ +{ \ + struct preempt_rcu_boost_trace *boost_trace; \ + boost_trace = (&per_cpu(boost_trace_data, smp_processor_id())); \ + boost_trace->rbs_stat_##preempt_rcu_boost_var++; \ +} + +struct preempt_rcu_boost_probe { + const char *name; + const char *format; + marker_probe_func *probe_func; +}; + +#define INIT_PREEMPT_RCU_BOOST_PROBE(preempt_rcu_boost_probe_worker) \ +{ \ + .name = __stringify(preempt_rcu_boost_probe_worker), \ + .probe_func = preempt_rcu_boost_probe_worker##_callback \ +} + +extern int read_rcu_boost_prio(void); +#endif /* CONFIG_PREEMPT_RCU_BOOST */ + #endif /* __KERNEL__ */ #endif /* __LINUX_RCUPREEMPT_TRACE_H */ Index: linux-2.6.24.7/kernel/rcupreempt-boost.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt-boost.c +++ linux-2.6.24.7/kernel/rcupreempt-boost.c @@ -40,186 +40,9 @@ struct rcu_boost_dat { int rbs_prio; /* CPU copy of rcu_boost_prio */ struct list_head rbs_toboost; /* Preempted RCU readers */ struct list_head rbs_boosted; /* RCU readers that have been boosted */ -#ifdef CONFIG_RCU_TRACE - /* The rest are for statistics */ - unsigned long rbs_stat_task_boost_called; - unsigned long rbs_stat_task_boosted; - unsigned long rbs_stat_boost_called; - unsigned long rbs_stat_try_boost; - unsigned long rbs_stat_boosted; - unsigned long rbs_stat_unboost_called; - unsigned long rbs_stat_unboosted; - unsigned long rbs_stat_try_boost_readers; - unsigned long rbs_stat_boost_readers; - unsigned long rbs_stat_try_unboost_readers; - unsigned long rbs_stat_unboost_readers; - unsigned long rbs_stat_over_taken; -#endif /* CONFIG_RCU_TRACE */ }; static DEFINE_PER_CPU(struct rcu_boost_dat, rcu_boost_data); -#define RCU_BOOST_ME &__get_cpu_var(rcu_boost_data) - -#ifdef CONFIG_RCU_TRACE - -#define RCUPREEMPT_BOOST_TRACE_BUF_SIZE 4096 -static char rcupreempt_boost_trace_buf[RCUPREEMPT_BOOST_TRACE_BUF_SIZE]; - -static ssize_t rcuboost_read(struct file *filp, char __user *buffer, - size_t count, loff_t *ppos) -{ - static DEFINE_MUTEX(mutex); - int cnt = 0; - int cpu; - struct rcu_boost_dat *rbd; - ssize_t bcount; - unsigned long 
task_boost_called = 0; - unsigned long task_boosted = 0; - unsigned long boost_called = 0; - unsigned long try_boost = 0; - unsigned long boosted = 0; - unsigned long unboost_called = 0; - unsigned long unboosted = 0; - unsigned long try_boost_readers = 0; - unsigned long boost_readers = 0; - unsigned long try_unboost_readers = 0; - unsigned long unboost_readers = 0; - unsigned long over_taken = 0; - - mutex_lock(&mutex); - - for_each_online_cpu(cpu) { - rbd = &per_cpu(rcu_boost_data, cpu); - - task_boost_called += rbd->rbs_stat_task_boost_called; - task_boosted += rbd->rbs_stat_task_boosted; - boost_called += rbd->rbs_stat_boost_called; - try_boost += rbd->rbs_stat_try_boost; - boosted += rbd->rbs_stat_boosted; - unboost_called += rbd->rbs_stat_unboost_called; - unboosted += rbd->rbs_stat_unboosted; - try_boost_readers += rbd->rbs_stat_try_boost_readers; - boost_readers += rbd->rbs_stat_boost_readers; - try_unboost_readers += rbd->rbs_stat_try_boost_readers; - unboost_readers += rbd->rbs_stat_boost_readers; - over_taken += rbd->rbs_stat_over_taken; - } - - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "task_boost_called = %ld\n", - task_boost_called); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "task_boosted = %ld\n", - task_boosted); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "boost_called = %ld\n", - boost_called); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "try_boost = %ld\n", - try_boost); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "boosted = %ld\n", - boosted); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "unboost_called = %ld\n", - unboost_called); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "unboosted = %ld\n", - unboosted); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "try_boost_readers = %ld\n", - try_boost_readers); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "boost_readers = %ld\n", - boost_readers); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "try_unboost_readers = %ld\n", - try_unboost_readers); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "unboost_readers = %ld\n", - unboost_readers); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "over_taken = %ld\n", - over_taken); - cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], - RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, - "rcu_boost_prio = %d\n", - rcu_boost_prio); - bcount = simple_read_from_buffer(buffer, count, ppos, - rcupreempt_boost_trace_buf, strlen(rcupreempt_boost_trace_buf)); - mutex_unlock(&mutex); - - return bcount; -} - -static struct file_operations rcuboost_fops = { - .read = rcuboost_read, -}; - -static struct dentry *rcuboostdir; -int rcu_trace_boost_create(struct dentry *rcudir) -{ - rcuboostdir = debugfs_create_file("rcuboost", 0444, rcudir, - NULL, &rcuboost_fops); - if (!rcuboostdir) - return 0; - - return 1; -} -EXPORT_SYMBOL_GPL(rcu_trace_boost_create); - -void rcu_trace_boost_destroy(void) -{ - if (rcuboostdir) - debugfs_remove(rcuboostdir); - rcuboostdir = NULL; -} -EXPORT_SYMBOL_GPL(rcu_trace_boost_destroy); - 
-#define RCU_BOOST_TRACE_FUNC_DECL(type) \ - static void rcu_trace_boost_##type(struct rcu_boost_dat *rbd) \ - { \ - rbd->rbs_stat_##type++; \ - } -RCU_BOOST_TRACE_FUNC_DECL(task_boost_called) -RCU_BOOST_TRACE_FUNC_DECL(task_boosted) -RCU_BOOST_TRACE_FUNC_DECL(boost_called) -RCU_BOOST_TRACE_FUNC_DECL(try_boost) -RCU_BOOST_TRACE_FUNC_DECL(boosted) -RCU_BOOST_TRACE_FUNC_DECL(unboost_called) -RCU_BOOST_TRACE_FUNC_DECL(unboosted) -RCU_BOOST_TRACE_FUNC_DECL(try_boost_readers) -RCU_BOOST_TRACE_FUNC_DECL(boost_readers) -RCU_BOOST_TRACE_FUNC_DECL(try_unboost_readers) -RCU_BOOST_TRACE_FUNC_DECL(unboost_readers) -RCU_BOOST_TRACE_FUNC_DECL(over_taken) -#else /* CONFIG_RCU_TRACE */ -/* These were created by the above macro "RCU_BOOST_TRACE_FUNC_DECL" */ -# define rcu_trace_boost_task_boost_called(rbd) do { } while (0) -# define rcu_trace_boost_task_boosted(rbd) do { } while (0) -# define rcu_trace_boost_boost_called(rbd) do { } while (0) -# define rcu_trace_boost_try_boost(rbd) do { } while (0) -# define rcu_trace_boost_boosted(rbd) do { } while (0) -# define rcu_trace_boost_unboost_called(rbd) do { } while (0) -# define rcu_trace_boost_unboosted(rbd) do { } while (0) -# define rcu_trace_boost_try_boost_readers(rbd) do { } while (0) -# define rcu_trace_boost_boost_readers(rbd) do { } while (0) -# define rcu_trace_boost_try_unboost_readers(rbd) do { } while (0) -# define rcu_trace_boost_unboost_readers(rbd) do { } while (0) -# define rcu_trace_boost_over_taken(rbd) do { } while (0) -#endif /* CONFIG_RCU_TRACE */ static inline int rcu_is_boosted(struct task_struct *task) { @@ -234,10 +57,10 @@ static void rcu_boost_task(struct task_s WARN_ON(!irqs_disabled()); WARN_ON_SMP(!spin_is_locked(&task->pi_lock)); - rcu_trace_boost_task_boost_called(RCU_BOOST_ME); + trace_mark(task_boost_called, "NULL"); if (task->rcu_prio < task->prio) { - rcu_trace_boost_task_boosted(RCU_BOOST_ME); + trace_mark(task_boosted, "NULL"); task_setprio(task, task->rcu_prio); } } @@ -261,7 +84,7 @@ void __rcu_preempt_boost(void) WARN_ON(!current->rcu_read_lock_nesting); - rcu_trace_boost_boost_called(RCU_BOOST_ME); + trace_mark(boost_called, "NULL"); /* check to see if we are already boosted */ if (unlikely(rcu_is_boosted(curr))) @@ -279,7 +102,7 @@ void __rcu_preempt_boost(void) curr->rcub_rbdp = rbd; - rcu_trace_boost_try_boost(rbd); + trace_mark(try_boost, "NULL"); prio = rt_mutex_getprio(curr); @@ -288,7 +111,7 @@ void __rcu_preempt_boost(void) if (prio <= rbd->rbs_prio) goto out; - rcu_trace_boost_boosted(curr->rcub_rbdp); + trace_mark(boosted, "NULL"); curr->rcu_prio = rbd->rbs_prio; rcu_boost_task(curr); @@ -313,7 +136,7 @@ void __rcu_preempt_unboost(void) int prio; unsigned long flags; - rcu_trace_boost_unboost_called(RCU_BOOST_ME); + trace_mark(unboost_called, "NULL"); /* if not boosted, then ignore */ if (likely(!rcu_is_boosted(curr))) @@ -351,7 +174,7 @@ void __rcu_preempt_unboost(void) list_del_init(&curr->rcub_entry); - rcu_trace_boost_unboosted(rbd); + trace_mark(unboosted, "NULL"); curr->rcu_prio = MAX_PRIO; @@ -412,7 +235,7 @@ static int __rcu_boost_readers(struct rc * Another task may have taken over. 
*/ if (curr->rcu_preempt_counter != rcu_boost_counter) { - rcu_trace_boost_over_taken(rbd); + trace_mark(over_taken, "NULL"); return 1; } @@ -443,7 +266,7 @@ void rcu_boost_readers(void) prio = rt_mutex_getprio(curr); - rcu_trace_boost_try_boost_readers(RCU_BOOST_ME); + trace_mark(try_boost_readers, "NULL"); if (prio >= rcu_boost_prio) { /* already boosted */ @@ -453,7 +276,7 @@ void rcu_boost_readers(void) rcu_boost_prio = prio; - rcu_trace_boost_boost_readers(RCU_BOOST_ME); + trace_mark(boost_readers, "NULL"); /* Flag that we are the one to unboost */ curr->rcu_preempt_counter = ++rcu_boost_counter; @@ -486,12 +309,12 @@ void rcu_unboost_readers(void) spin_lock_irqsave(&rcu_boost_wake_lock, flags); - rcu_trace_boost_try_unboost_readers(RCU_BOOST_ME); + trace_mark(try_unboost_readers, "NULL"); if (current->rcu_preempt_counter != rcu_boost_counter) goto out; - rcu_trace_boost_unboost_readers(RCU_BOOST_ME); + trace_mark(unboost_readers, "NULL"); /* * We could also put in something that @@ -514,6 +337,16 @@ void rcu_unboost_readers(void) } /* + * This function exports the rcu_boost_prio variable for use by + * modules that need it e.g. RCU_TRACE module + */ +int read_rcu_boost_prio(void) +{ + return rcu_boost_prio; +} +EXPORT_SYMBOL_GPL(read_rcu_boost_prio); + +/* * The krcupreemptd wakes up every "rcu_preempt_thread_secs" * seconds at the minimum priority of 1 to do a * synchronize_rcu. This ensures that grace periods finish Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt_trace.c +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -51,6 +51,163 @@ static char *rcupreempt_trace_buf; static DEFINE_PER_CPU(struct rcupreempt_trace, trace_data); +#ifdef CONFIG_PREEMPT_RCU_BOOST +#define RCUPREEMPT_BOOST_TRACE_BUF_SIZE 4096 +static char rcupreempt_boost_trace_buf[RCUPREEMPT_BOOST_TRACE_BUF_SIZE]; +static DEFINE_PER_CPU(struct preempt_rcu_boost_trace, boost_trace_data); + +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(task_boost_called); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(task_boosted); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(boost_called); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(try_boost); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(boosted); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(unboost_called); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(unboosted); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(try_boost_readers); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(boost_readers); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(try_unboost_readers); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(unboost_readers); +DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(over_taken); + +static struct preempt_rcu_boost_probe preempt_rcu_boost_probe_array[] = +{ + INIT_PREEMPT_RCU_BOOST_PROBE(task_boost_called), + INIT_PREEMPT_RCU_BOOST_PROBE(task_boosted), + INIT_PREEMPT_RCU_BOOST_PROBE(boost_called), + INIT_PREEMPT_RCU_BOOST_PROBE(try_boost), + INIT_PREEMPT_RCU_BOOST_PROBE(boosted), + INIT_PREEMPT_RCU_BOOST_PROBE(unboost_called), + INIT_PREEMPT_RCU_BOOST_PROBE(unboosted), + INIT_PREEMPT_RCU_BOOST_PROBE(try_boost_readers), + INIT_PREEMPT_RCU_BOOST_PROBE(boost_readers), + INIT_PREEMPT_RCU_BOOST_PROBE(try_unboost_readers), + INIT_PREEMPT_RCU_BOOST_PROBE(unboost_readers), + INIT_PREEMPT_RCU_BOOST_PROBE(over_taken) +}; + +static ssize_t rcuboost_read(struct file *filp, char __user *buffer, + size_t count, loff_t *ppos) +{ + static DEFINE_MUTEX(mutex); + int cnt = 0; + int cpu; + struct preempt_rcu_boost_trace *prbt; + ssize_t bcount; + 
unsigned long task_boost_called = 0; + unsigned long task_boosted = 0; + unsigned long boost_called = 0; + unsigned long try_boost = 0; + unsigned long boosted = 0; + unsigned long unboost_called = 0; + unsigned long unboosted = 0; + unsigned long try_boost_readers = 0; + unsigned long boost_readers = 0; + unsigned long try_unboost_readers = 0; + unsigned long unboost_readers = 0; + unsigned long over_taken = 0; + + mutex_lock(&mutex); + + for_each_online_cpu(cpu) { + prbt = &per_cpu(boost_trace_data, cpu); + + task_boost_called += prbt->rbs_stat_task_boost_called; + task_boosted += prbt->rbs_stat_task_boosted; + boost_called += prbt->rbs_stat_boost_called; + try_boost += prbt->rbs_stat_try_boost; + boosted += prbt->rbs_stat_boosted; + unboost_called += prbt->rbs_stat_unboost_called; + unboosted += prbt->rbs_stat_unboosted; + try_boost_readers += prbt->rbs_stat_try_boost_readers; + boost_readers += prbt->rbs_stat_boost_readers; + try_unboost_readers += prbt->rbs_stat_try_boost_readers; + unboost_readers += prbt->rbs_stat_boost_readers; + over_taken += prbt->rbs_stat_over_taken; + } + + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "task_boost_called = %ld\n", + task_boost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "task_boosted = %ld\n", + task_boosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boost_called = %ld\n", + boost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_boost = %ld\n", + try_boost); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boosted = %ld\n", + boosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboost_called = %ld\n", + unboost_called); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboosted = %ld\n", + unboosted); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_boost_readers = %ld\n", + try_boost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "boost_readers = %ld\n", + boost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "try_unboost_readers = %ld\n", + try_unboost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "unboost_readers = %ld\n", + unboost_readers); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "over_taken = %ld\n", + over_taken); + cnt += snprintf(&rcupreempt_boost_trace_buf[cnt], + RCUPREEMPT_BOOST_TRACE_BUF_SIZE - cnt, + "rcu_boost_prio = %d\n", + read_rcu_boost_prio()); + bcount = simple_read_from_buffer(buffer, count, ppos, + rcupreempt_boost_trace_buf, strlen(rcupreempt_boost_trace_buf)); + mutex_unlock(&mutex); + + return bcount; +} + +static struct file_operations rcuboost_fops = { + .read = rcuboost_read, +}; + +static struct dentry *rcuboostdir; +int rcu_trace_boost_create(struct dentry *rcudir) +{ + rcuboostdir = debugfs_create_file("rcuboost", 0444, rcudir, + NULL, &rcuboost_fops); + if (!rcuboostdir) + return 0; + + return 1; +} + +void rcu_trace_boost_destroy(void) +{ + if (rcuboostdir) + debugfs_remove(rcuboostdir); + rcuboostdir = NULL; +} + +#endif /* CONFIG_PREEMPT_RCU_BOOST */ + struct 
rcupreempt_trace *rcupreempt_trace_cpu(int cpu) { return &per_cpu(trace_data, cpu); @@ -350,6 +507,10 @@ static int rcupreempt_debugfs_init(void) if (!ctrsdir) goto free_out; +#ifdef CONFIG_PREEMPT_RCU_BOOST + if (!rcu_trace_boost_create(rcudir)) + goto free_out; +#endif /* CONFIG_PREEMPT_RCU_BOOST */ return 0; free_out: if (ctrsdir) @@ -382,6 +543,22 @@ static int __init rcupreempt_trace_init( } printk(KERN_INFO "RCU Preempt markers registered\n"); +#ifdef CONFIG_PREEMPT_RCU_BOOST + for (i = 0; i < ARRAY_SIZE(preempt_rcu_boost_probe_array); i++) { + struct preempt_rcu_boost_probe *p = \ + &preempt_rcu_boost_probe_array[i]; + ret = marker_probe_register(p->name, p->format, + p->probe_func, p); + if (ret) + printk(KERN_INFO "Unable to register Preempt RCU Boost \ + probe %s\n", preempt_rcu_boost_probe_array[i].name); + ret = marker_arm(p->name); + if (ret) + printk(KERN_INFO "Unable to arm Preempt RCU Boost \ + markers %s\n", p->name); +} +#endif /* CONFIG_PREEMPT_RCU_BOOST */ + mutex_init(&rcupreempt_trace_mutex); rcupreempt_trace_buf = kmalloc(RCUPREEMPT_TRACE_BUF_SIZE, GFP_KERNEL); if (!rcupreempt_trace_buf) @@ -400,6 +577,12 @@ static void __exit rcupreempt_trace_clea marker_probe_unregister(rcupreempt_probe_array[i].name); printk(KERN_INFO "RCU Preempt markers unregistered\n"); +#ifdef CONFIG_PREEMPT_RCU_BOOST + rcu_trace_boost_destroy(); + for (i = 0; i < ARRAY_SIZE(preempt_rcu_boost_probe_array); i++) + marker_probe_unregister(preempt_rcu_boost_probe_array[i].name); + printk(KERN_INFO "Preempt RCU Boost markers unregistered\n"); +#endif /* CONFIG_PREEMPT_RCU_BOOST */ debugfs_remove(statdir); debugfs_remove(gpdir); debugfs_remove(ctrsdir); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kernel-bug-after-entering-something-from-login.patch����������������������������������������0000664�0000764�0000764�00000005034�11041657732�023273� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From r.schwebel@pengutronix.de Fri Jan 11 20:50:39 2008 Date: Fri, 11 Jan 2008 23:35:49 +0100 From: Robert Schwebel <r.schwebel@pengutronix.de> To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-rt-users@vger.kernel.org Subject: lost patch for mpc52xx spinlock [ The following text is in the "iso-8859-15" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some special characters may be displayed incorrectly. ] Hi Steven, this patch from tglx seems to got lost, can you add it to the next release? 
Robert -- Robert Schwebel | http://www.pengutronix.de OSADL Testlab @ Pengutronix | http://www.osadl.org ----------8<---------- Subject: Re: Kernel Bug when entering something after login From: Thomas Gleixner <tglx@linutronix.de> To: Juergen Beisert <juergen127@kreuzholzen.de> Cc: linux-rt-users@vger.kernel.org In-Reply-To: <200707251900.47704.juergen127@kreuzholzen.de> References: <200707251900.47704.juergen127@kreuzholzen.de> Date: Wed, 25 Jul 2007 21:06:38 +0200 Message-Id: <1185390398.3227.8.camel@chaos> On Wed, 2007-07-25 at 19:00 +0200, Juergen Beisert wrote: > [c0245db0] [c01bdb98] rt_spin_lock_slowlock+0x4c/0x224 (unreliable) > [c0245e10] [c011823c] uart_start+0x24/0x48 > [c0245e30] [c0113ff4] n_tty_receive_buf+0x170/0xfd4 > [c0245ef0] [c010f0dc] flush_to_ldisc+0xe0/0x130 > [c0245f20] [c011b51c] mpc52xx_uart_int+0x194/0x350 > [c0245f50] [c0046dfc] handle_IRQ_event+0x6c/0x110 > [c0245f80] [c00475ec] thread_simple_irq+0x90/0xf8 > [c0245fa0] [c00479a0] do_irqd+0x34c/0x3cc > [c0245fd0] [c0033380] kthread+0x48/0x84 > [c0245ff0] [c00104ac] kernel_thread+0x44/0x60 > Instruction dump: > 70090008 40820144 80010064 bb410048 38210060 7c0803a6 4e800020 801c0010 > 5400003a 7c001278 7c000034 5400d97e <0f000000> 39600004 91610008 80010008 > note: IRQ-131[93] exited with preempt_count 1 Yup. That's a deadlock. In mainline this does not happen, as the spinlock is a NOP. Turn on CONFIG_PROVE_LOCKING in mainline and you see the problem as well. Solution below tglx --- drivers/serial/mpc52xx_uart.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/drivers/serial/mpc52xx_uart.c =================================================================== --- linux-2.6.24.7.orig/drivers/serial/mpc52xx_uart.c +++ linux-2.6.24.7/drivers/serial/mpc52xx_uart.c @@ -501,7 +501,9 @@ mpc52xx_uart_int_rx_chars(struct uart_po } } + spin_unlock(&port->lock); tty_flip_buffer_push(tty); + spin_lock(&port->lock); return in_be16(&PSC(port)->mpc52xx_psc_status) & MPC52xx_PSC_SR_RXRDY; } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc-make-tlb-batch-64-only.patch������������������������������������������������������������0000664�0000764�0000764�00000004467�11041657733�017041� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From r.schwebel@pengutronix.de Fri Jan 11 21:04:50 2008 Date: Sat, 12 Jan 2008 00:01:22 +0100 From: Robert Schwebel <r.schwebel@pengutronix.de> To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-rt-users@vger.kernel.org Subject: [patch 2.6.24-rc7-rt1-pre1] per_cpu__ppc64_tlb_batch is only for 64 bit [ The following text is in the "iso-8859-15" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some special characters may be displayed incorrectly. 
] Fix the following compile error for powerpc32: arch/powerpc/kernel/process.c: In function '__switch_to': arch/powerpc/kernel/process.c:355: error: 'per_cpu__ppc64_tlb_batch' undeclared (first use in this function) Not sure what the code actually does, but as it was not there in -rc5, somebody else might find something ... Signed-off-by: Robert Schwebel <r.schwebel@pengutronix.de> --- arch/powerpc/kernel/process.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/process.c +++ linux-2.6.24.7/arch/powerpc/kernel/process.c @@ -267,10 +267,10 @@ struct task_struct *__switch_to(struct t struct thread_struct *new_thread, *old_thread; unsigned long flags; struct task_struct *last; -#ifdef CONFIG_PREEMPT_RT +#if defined(CONFIG_PPC64) && defined (CONFIG_PREEMPT_RT) struct ppc64_tlb_batch *batch; int hadbatch; -#endif /* #ifdef CONFIG_PREEMPT_RT */ +#endif #ifdef CONFIG_SMP /* avoid complexity of lazy save/restore of fpu @@ -349,7 +349,6 @@ struct task_struct *__switch_to(struct t old_thread->accum_tb += (current_tb - start_tb); new_thread->start_tb = current_tb; } -#endif #ifdef CONFIG_PREEMPT_RT batch = &__get_cpu_var(ppc64_tlb_batch); @@ -361,6 +360,7 @@ struct task_struct *__switch_to(struct t batch->active = 0; } #endif /* #ifdef CONFIG_PREEMPT_RT */ +#endif local_irq_save(flags); @@ -372,12 +372,12 @@ struct task_struct *__switch_to(struct t local_irq_restore(flags); -#ifdef CONFIG_PREEMPT_RT +#if defined(CONFIG_PPC64) && defined(CONFIG_PREEMPT_RT) if (hadbatch) { batch = &__get_cpu_var(ppc64_tlb_batch); batch->active = 1; } -#endif /* #ifdef CONFIG_PREEMPT_RT */ +#endif return last; } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc-chpr-set-rtc-lock.patch�����������������������������������������������������������������0000664�0000764�0000764�00000001567�11041657731�016315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/platforms/chrp/time.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/powerpc/platforms/chrp/time.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/chrp/time.c +++ linux-2.6.24.7/arch/powerpc/platforms/chrp/time.c @@ -27,7 +27,7 @@ #include <asm/sections.h> #include <asm/time.h> -extern raw_spinlock_t rtc_lock; +extern spinlock_t rtc_lock; static int nvram_as1 = NVRAM_AS1; static int nvram_as0 = NVRAM_AS0; @@ -83,7 +83,12 @@ int chrp_set_rtc_time(struct rtc_time *t unsigned char save_control, save_freq_select; struct rtc_time tm = *tmarg; +#if CONFIG_PREEMPT_RT + if (!spin_trylock(&rtc_lock)) + return -1; +#else spin_lock(&rtc_lock); +#endif save_control = chrp_cmos_clock_read(RTC_CONTROL); /* tell the clock it's being set */ 
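The chrp hunk above is the usual -rt pattern for a lock that has become a sleeping spinlock but can be reached from a context that must not block: try to take it and back off instead of sleeping. (The hunk spells the guard as "#if CONFIG_PREEMPT_RT"; "#ifdef" is the more conventional form, although #if also works as long as the option is defined to 1.) A minimal sketch of the pattern, with set_hw_clock as an illustrative placeholder rather than code from the patch:

#include <linux/spinlock.h>
#include <linux/rtc.h>

extern spinlock_t rtc_lock;	/* a sleeping lock on PREEMPT_RT */

static int set_hw_clock(struct rtc_time *tm)
{
#ifdef CONFIG_PREEMPT_RT
	/* may be reached where sleeping is not allowed: give up, caller retries */
	if (!spin_trylock(&rtc_lock))
		return -1;
#else
	spin_lock(&rtc_lock);
#endif
	/* ... program the RTC registers from *tm ... */
	spin_unlock(&rtc_lock);
	return 0;
}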
�����������������������������������������������������������������������������������������������������������������������������������������patches/disable-run-softirq-from-hardirq-completely.patch�������������������������������������������0000664�0000764�0000764�00000007136�11041657732�022731� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: Disable running softirqs from hardirqs completely! There's too many problems with running softirqs from the hardirq context. Softirqs are not allowed to migrate, and hardirqs might. Perhaps this will be better when softirqs can migrate. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/irq/manage.c | 27 +-------------------------- kernel/softirq.c | 21 --------------------- 2 files changed, 1 insertion(+), 47 deletions(-) Index: linux-2.6.24.7/kernel/irq/manage.c =================================================================== --- linux-2.6.24.7.orig/kernel/irq/manage.c +++ linux-2.6.24.7/kernel/irq/manage.c @@ -786,28 +786,11 @@ static int do_irqd(void * __desc) { struct sched_param param = { 0, }; struct irq_desc *desc = __desc; - int run_softirq = 1; #ifdef CONFIG_SMP cpumask_t cpus_allowed; cpus_allowed = desc->affinity; - /* - * If the irqd is bound to one CPU we let it run softirqs - * that have the same priority as the irqd thread. We do - * not run it if the irqd is bound to more than one CPU - * due to the fact that it can - * 1) migrate to other CPUS while running the softirqd - * 2) if we pin the irqd to a CPU to run the softirqd, then - * we risk a high priority process from waking up and - * preempting the irqd. Although the irqd may be able to - * run on other CPUS due to its irq affinity, it will not - * be able to since we bound it to a CPU to run softirqs. - * So a RT hog could starve the irqd from running on - * other CPUS that it's allowed to run on. - */ - if (cpus_weight(cpus_allowed) != 1) - run_softirq = 0; /* turn it off */ #endif current->flags |= PF_NOFREEZE | PF_HARDIRQ; @@ -823,8 +806,6 @@ static int do_irqd(void * __desc) do { set_current_state(TASK_INTERRUPTIBLE); do_hardirq(desc); - if (run_softirq) - do_softirq_from_hardirq(); } while (current->state == TASK_RUNNING); local_irq_enable_nort(); @@ -832,14 +813,8 @@ static int do_irqd(void * __desc) /* * Did IRQ affinities change? */ - if (!cpus_equal(cpus_allowed, desc->affinity)) { + if (!cpus_equal(cpus_allowed, desc->affinity)) cpus_allowed = desc->affinity; - /* - * Only allow the irq thread to run the softirqs - * if it is bound to a single CPU. - */ - run_softirq = (cpus_weight(cpus_allowed) == 1); - } #endif schedule(); } Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -103,27 +103,6 @@ static void wakeup_softirqd(int softirq) if (unlikely(!tsk)) return; -#if defined(CONFIG_PREEMPT_SOFTIRQS) && defined(CONFIG_PREEMPT_HARDIRQS) - /* - * Optimization: if we are in a hardirq thread context, and - * if the priority of the softirq thread is the same as the - * priority of the hardirq thread, then 'merge' softirq - * processing into the hardirq context. 
(it will later on - * execute softirqs via do_softirq_from_hardirq()). - * So here we can skip the wakeup and can rely on the hardirq - * context processing it later on. - */ - if ((current->flags & PF_HARDIRQ) && !hardirq_count() && - (tsk->normal_prio == current->normal_prio) && - /* - * The hard irq thread must be bound to a single CPU to run - * a softirq. Don't worry about locking, the irq thread - * should be the only one to modify the cpus_allowed, when - * the irq affinity changes. - */ - (cpus_weight(current->cpus_allowed) == 1)) - return; -#endif /* * Wake up the softirq task: */ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/hack-fix-rt-migration.patch�����������������������������������������������������������������0000664�0000764�0000764�00000003641�11041657730�016372� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From efault@gmx.de Mon Jan 14 23:35:16 2008 Date: Mon, 14 Jan 2008 09:27:40 +0100 From: Mike Galbraith <efault@gmx.de> To: Steven Rostedt <rostedt@goodmis.org> Cc: Mariusz Kozlowski <m.kozlowski@tuxland.pl>, LKML <linux-kernel@vger.kernel.org>, RT <linux-rt-users@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Subject: Re: 2.6.24-rc7-rt1 [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] On Sun, 2008-01-13 at 15:54 -0500, Steven Rostedt wrote: > OK, -rt2 will take a bit more beating from me before I release it, so it > might take some time to get it out (expect it out on Monday). Ah, that reminds me (tests, yup) I still need the patchlet below to resume from ram without black screen of death. No idea why my P4 box seems to be the only box in the rt galaxy affected. 
(haven't poked at it since the holidays) --- kernel/sched_rt.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -33,6 +33,9 @@ static inline void rt_clear_overload(str static void update_rt_migration(struct rq *rq) { + if (unlikely(num_online_cpus() == 1)) + return; + if (rq->rt.rt_nr_migratory && (rq->rt.rt_nr_running > 1)) { if (!rq->rt.overloaded) { rt_set_overload(rq); @@ -105,8 +108,10 @@ static inline void dec_rt_tasks(struct t } /* otherwise leave rq->highest prio alone */ } else rq->rt.highest_prio = MAX_RT_PRIO; - if (p->nr_cpus_allowed > 1) + if (p->nr_cpus_allowed > 1) { + BUG_ON(!rq->rt.rt_nr_migratory); rq->rt.rt_nr_migratory--; + } if (rq->rt.highest_prio != highest_prio) cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio); �����������������������������������������������������������������������������������������������patches/mips-remove-conlicting-rtc-lock-declaration.patch�������������������������������������������0000664�0000764�0000764�00000001722�11041657735�022662� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Frank Rowand <frank.rowand@am.sony.com> Subject: [PATCH 2/4] RT: remove conflicting rtc_lock declaration To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, tglx@linutronix.de Date: Tue, 15 Jan 2008 14:19:55 -0800 From: Frank Rowand <frank.rowand@am.sony.com> Declaration of rtc_lock in arch/mips/kernel/time.c conflicts with time.h, remove from include/asm-mips/time.h. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> --- include/asm-mips/time.h | 2 -- 1 file changed, 2 deletions(-) Index: linux-2.6.24.7/include/asm-mips/time.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/time.h +++ linux-2.6.24.7/include/asm-mips/time.h @@ -19,8 +19,6 @@ #include <linux/clockchips.h> #include <linux/clocksource.h> -extern raw_spinlock_t rtc_lock; - /* * RTC ops. By default, they point to weak no-op RTC functions. * rtc_mips_set_time - reverse the above translation and set time to RTC. ����������������������������������������������patches/mips-remove-finish-arch-switch.patch��������������������������������������������������������0000664�0000764�0000764�00000004361�11041657735�020226� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Frank Rowand <frank.rowand@am.sony.com> Subject: [PATCH 3/4] RT: remove finish_arch_switch To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, tglx@linutronix.de Date: Tue, 15 Jan 2008 14:20:46 -0800 From: Frank Rowand <frank.rowand@am.sony.com> This is probably just a temporary workaround for one procssor - the MIPS community will most likely want to architect a solution to this issue. Make of preempt kernel barfs in kernel/sched.c ifdef finish_arch_switch. 
Remove the finish_arch_switch() for boards with TX49xx MIPS processor. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> --- include/asm-mips/mach-tx49xx/cpu-feature-overrides.h | 7 +++++++ include/asm-mips/system.h | 3 +++ 2 files changed, 10 insertions(+) Index: linux-2.6.24.7/include/asm-mips/mach-tx49xx/cpu-feature-overrides.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/mach-tx49xx/cpu-feature-overrides.h +++ linux-2.6.24.7/include/asm-mips/mach-tx49xx/cpu-feature-overrides.h @@ -1,6 +1,13 @@ #ifndef __ASM_MACH_TX49XX_CPU_FEATURE_OVERRIDES_H #define __ASM_MACH_TX49XX_CPU_FEATURE_OVERRIDES_H +/* finish_arch_switch_empty is defined if we know finish_arch_switch() will + * be empty, based on the lack of features defined in this file. This is + * needed because config preempt will barf in kernel/sched.c ifdef + * finish_arch_switch + */ +#define finish_arch_switch_empty + #define cpu_has_llsc 1 #define cpu_has_64bits 1 #define cpu_has_inclusive_pcaches 0 Index: linux-2.6.24.7/include/asm-mips/system.h =================================================================== --- linux-2.6.24.7.orig/include/asm-mips/system.h +++ linux-2.6.24.7/include/asm-mips/system.h @@ -70,6 +70,8 @@ do { \ (last) = resume(prev, next, task_thread_info(next)); \ } while (0) +/* preempt kernel barfs in kernel/sched.c ifdef finish_arch_switch */ +#ifndef finish_arch_switch_empty #define finish_arch_switch(prev) \ do { \ if (cpu_has_dsp) \ @@ -77,6 +79,7 @@ do { \ if (cpu_has_userlocal) \ write_c0_userlocal(current_thread_info()->tp_value); \ } while (0) +#endif static inline unsigned long __xchg_u32(volatile int * m, unsigned int val) { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mips-change-raw-spinlock-type.patch���������������������������������������������������������0000664�0000764�0000764�00000001776�11041657734�020062� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Frank Rowand <frank.rowand@am.sony.com> Subject: [PATCH 4/4] RT: change from raw_spinlock_t to __raw_spinlock_t To: linux-kernel@vger.kernel.org Cc: mingo@redhat.com, tglx@linutronix.de Date: Tue, 15 Jan 2008 14:21:46 -0800 From: Frank Rowand <frank.rowand@am.sony.com> Fix compile warning (which becomes compile error due to -Werror), by changing from raw_spinlock_t to __raw_spinlock_t. Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> --- arch/mips/kernel/gdb-stub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/arch/mips/kernel/gdb-stub.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/gdb-stub.c +++ linux-2.6.24.7/arch/mips/kernel/gdb-stub.c @@ -177,7 +177,7 @@ int kgdb_enabled; * spin locks for smp case */ static DEFINE_SPINLOCK(kgdb_lock); -static raw_spinlock_t kgdb_cpulock[NR_CPUS] = { +static __raw_spinlock_t kgdb_cpulock[NR_CPUS] = { [0 ... 
NR_CPUS-1] = __RAW_SPIN_LOCK_UNLOCKED, }; ��patches/ppc32-latency-compile-hack-fixes.patch������������������������������������������������������0000664�0000764�0000764�00000001434�11041657734�020322� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/powerpc/kernel/setup_32.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) Index: linux-2.6.24.7/arch/powerpc/kernel/setup_32.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/setup_32.c +++ linux-2.6.24.7/arch/powerpc/kernel/setup_32.c @@ -296,3 +296,22 @@ void __init setup_arch(char **cmdline_p) paging_init(); } + +#ifdef CONFIG_STACKTRACE +#include <linux/stacktrace.h> +void notrace save_stack_trace(struct stack_trace *trace) +{ +} +#endif /* CONFIG_STACKTRACE */ + +#ifdef CONFIG_EARLY_PRINTK +void notrace early_printk(const char *fmt, ...) +{ + BUG(); +} +#endif /* CONFIG_EARLY_PRINTK */ + +#ifdef CONFIG_MCOUNT +extern void _mcount(void); +EXPORT_SYMBOL(_mcount); +#endif /* CONFIG_MCOUNT */ ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/mips-remove-duplicate-kconfig.patch���������������������������������������������������������0000664�0000764�0000764�00000003241�11041657734�020117� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From frank.rowand@am.sony.com Wed Jan 16 20:24:36 2008 Date: Wed, 16 Jan 2008 16:46:38 -0800 From: Frank Rowand <frank.rowand@am.sony.com> To: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, tglx@linutronix.de Subject: Re: [PATCH 1/4] RT: remove duplicate time/Kconfig On Tue, 2008-01-15 at 19:40 -0500, Steven Rostedt wrote: > On Tue, Jan 15, 2008 at 02:18:45PM -0800, Frank Rowand wrote: > > > > Index: linux-2.6.24-rc7/arch/mips/Kconfig > > =================================================================== > > --- linux-2.6.24-rc7.orig/arch/mips/Kconfig > > +++ linux-2.6.24-rc7/arch/mips/Kconfig > > @@ -1775,8 +1775,6 @@ config NR_CPUS > > performance should round up your number of processors to the next > > power of two. > > > > -source "kernel/time/Kconfig" > > - > > This doesn't apply with -rt2. Are you sure you have the right tree? > > -- Steve As you suspected, I pulled this one from the wrong tree. The correct patch is below. 
-Frank > > > # > > # Timer Interrupt Frequency Configuration > time/Kconfig added by preempt-realtime-mips.patch duplicates other entry, resulting in kernel make error: Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> --- arch/mips/Kconfig | 2 -- 1 file changed, 2 deletions(-) Index: linux-2.6.24.7/arch/mips/Kconfig =================================================================== --- linux-2.6.24.7.orig/arch/mips/Kconfig +++ linux-2.6.24.7/arch/mips/Kconfig @@ -1777,8 +1777,6 @@ config NR_CPUS performance should round up your number of processors to the next power of two. -source "kernel/time/Kconfig" - # # Timer Interrupt Frequency Configuration # ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc32_notrace_init_functions.patch����������������������������������������������������������0000664�0000764�0000764�00000003074�11043037203�020030� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: don't trace early init functions for ppc32 By: Luotao Fu <l.fu@pengutronix.de> If the latency tracer is turned on in the kernel config, _mcount calls are added automatically to every function call during compiling since -pg compiling flag is set. _mcount() checks first the variable mcount_enabled. (see implementation of _mcount() in arch/powerpc/kernel/entry_32.S) This will stuck forever if _mcount is called before mcount_enabled is initialized. Hence we mark some init functions as notrace, so that _mcount calls are not added to these functions. 
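An illustrative sketch of what the annotation buys (not taken from the patch): with -pg, gcc emits a profiling call at every function entry, on ppc32 roughly "mflr r0; stw r0,4(r1); bl _mcount", and _mcount() consults mcount_enabled before doing anything else. The kernel's notrace marker (roughly __attribute__((no_instrument_function))) keeps that call out of functions that can run before mcount_enabled is initialized; early_board_fixup below is a placeholder name.

#include <linux/init.h>

notrace void __init early_board_fixup(void)
{
	/* no "bl _mcount" is emitted for this function, so it is safe to run
	 * before mcount_enabled has been set up */
}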
Signed-off-by: Luotao Fu <l.fu@pengutronix.de> --- arch/powerpc/kernel/cputable.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/cputable.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/cputable.c +++ linux-2.6.24.7/arch/powerpc/kernel/cputable.c @@ -1333,7 +1333,7 @@ static struct cpu_spec __initdata cpu_sp static struct cpu_spec the_cpu_spec; -struct cpu_spec * __init identify_cpu(unsigned long offset, unsigned int pvr) +notrace struct cpu_spec * __init identify_cpu(unsigned long offset, unsigned int pvr) { struct cpu_spec *s = cpu_specs; struct cpu_spec *t = &the_cpu_spec; @@ -1380,7 +1380,7 @@ struct cpu_spec * __init identify_cpu(un return NULL; } -void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) +notrace void do_feature_fixups(unsigned long value, void *fixup_start, void *fixup_end) { struct fixup_entry { unsigned long mask; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/apic-level-smp-affinity.patch���������������������������������������������������������������0000664�0000764�0000764�00000001420�11041657734�016710� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/io_apic_64.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux-2.6.24.7/arch/x86/kernel/io_apic_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/io_apic_64.c +++ linux-2.6.24.7/arch/x86/kernel/io_apic_64.c @@ -1509,6 +1509,15 @@ static void ack_apic_level(unsigned int move_masked_irq(irq); unmask_IO_APIC_irq(irq); } +#if (defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE)) && \ + defined(CONFIG_PREEMPT_HARDIRQS) + /* + * With threaded interrupts, we always have IRQ_INPROGRESS + * when acking. 
+ */ + else if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING)) + move_masked_irq(irq); +#endif } static struct irq_chip ioapic_chip __read_mostly = { ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/timer-warning-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000002777�11041657731�015647� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From shiwh@cn.fujitsu.com Wed Feb 20 14:37:52 2008 Date: Thu, 14 Feb 2008 18:02:14 +0800 From: Shi Weihua <shiwh@cn.fujitsu.com> To: linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, khilman@mvista.com, rostedt@goodmis.org, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@elte.hu> Subject: [PATCH 2.6.24-rt1] timer:fix build warning in timer.c [ The following text is in the "UTF-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] Fix the following compile warning without CONFIG_PREEMPT_RT: kernel/timer.c:937: warning: �^�^�count_active_rt_tasks�^�^� defined but not used Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> --- --- kernel/timer.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -939,20 +939,18 @@ static unsigned long count_active_tasks( #endif } +#ifdef CONFIG_PREEMPT_RT /* * Nr of active tasks - counted in fixed-point numbers */ static unsigned long count_active_rt_tasks(void) { -#ifdef CONFIG_PREEMPT_RT extern unsigned long rt_nr_running(void); extern unsigned long rt_nr_uninterruptible(void); return (rt_nr_running() + rt_nr_uninterruptible()) * FIXED_1; -#else - return 0; -#endif } +#endif /* * Hmm.. 
Changed this, as the GNU make sources (load.c) seems to �patches/printk-in-atomic.patch����������������������������������������������������������������������0000664�0000764�0000764�00000005632�11041657735�015462� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������rostedt's patch to make early printk safe on RT From: Clark Williams <williams@redhat.com> --- arch/x86/kernel/early_printk.c | 6 +++--- drivers/char/vt.c | 2 +- include/linux/console.h | 11 +++++++++++ kernel/printk.c | 1 + 4 files changed, 16 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/early_printk.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/early_printk.c +++ linux-2.6.24.7/arch/x86/kernel/early_printk.c @@ -51,7 +51,7 @@ static void early_vga_write(struct conso static struct console early_vga_console = { .name = "earlyvga", .write = early_vga_write, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; @@ -147,7 +147,7 @@ static __init void early_serial_init(cha static struct console early_serial_console = { .name = "earlyser", .write = early_serial_write, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; @@ -188,7 +188,7 @@ static void simnow_write(struct console static struct console simnow_console = { .name = "simnow", .write = simnow_write, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; Index: linux-2.6.24.7/drivers/char/vt.c =================================================================== --- linux-2.6.24.7.orig/drivers/char/vt.c +++ linux-2.6.24.7/drivers/char/vt.c @@ -2496,7 +2496,7 @@ static struct console vt_console_driver .write = vt_console_print, .device = vt_console_device, .unblank = unblank_screen, - .flags = CON_PRINTBUFFER, + .flags = CON_PRINTBUFFER | CON_ATOMIC, .index = -1, }; #endif Index: linux-2.6.24.7/include/linux/console.h =================================================================== --- linux-2.6.24.7.orig/include/linux/console.h +++ linux-2.6.24.7/include/linux/console.h @@ -92,6 +92,17 @@ void give_up_console(const struct consw #define CON_ENABLED (4) #define CON_BOOT (8) #define CON_ANYTIME (16) /* Safe to call when cpu is offline */ +#define CON_ATOMIC (32) /* Safe to call in PREEMPT_RT atomic */ + +#ifdef CONFIG_PREEMPT_RT +# define console_atomic_safe(con) \ + (((con)->flags & CON_ATOMIC) || \ + (!in_atomic() && !irqs_disabled()) || \ + (system_state != SYSTEM_RUNNING) || \ + oops_in_progress) +#else +# define console_atomic_safe(con) (1) +#endif struct console { char name[16]; Index: linux-2.6.24.7/kernel/printk.c =================================================================== --- linux-2.6.24.7.orig/kernel/printk.c +++ linux-2.6.24.7/kernel/printk.c @@ -435,6 +435,7 @@ static void __call_console_drivers(unsig for (con = console_drivers; con; con = con->next) { if ((con->flags & CON_ENABLED) && con->write && + console_atomic_safe(con) && (cpu_online(raw_smp_processor_id()) || (con->flags & CON_ANYTIME))) { set_printk_might_sleep(1); 
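With this in place, printk keeps calling only those consoles that advertise CON_ATOMIC while the caller is atomic or has interrupts disabled on PREEMPT_RT; everything else is skipped until sleeping locks are safe again. A console whose write path needs no sleeping locks opts in by setting the flag, e.g. (illustrative sketch; early_dbg_write and early_dbg_console are placeholders, not part of the patch):

#include <linux/console.h>

static void early_dbg_write(struct console *con, const char *s, unsigned int n)
{
	/* placeholder: poll a UART or copy to a fixed buffer, no sleeping locks */
}

static struct console early_dbg_console = {
	.name	= "earlydbg",
	.write	= early_dbg_write,
	.flags	= CON_PRINTBUFFER | CON_ATOMIC,	/* safe from atomic context on -rt */
	.index	= -1,
};

register_console(&early_dbg_console) then hooks it up as usual; the CON_ATOMIC bit only changes whether __call_console_drivers() is willing to call it from an atomic section.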
������������������������������������������������������������������������������������������������������patches/root-domain-kfree-in-atomic.patch�����������������������������������������������������������0000664�0000764�0000764�00000003114�11041657733�017464� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Tue Feb 26 13:05:53 2008 Date: Tue, 26 Feb 2008 12:14:27 -0500 From: Gregory Haskins <ghaskins@novell.com> To: rostedt@goodmis.org Cc: ghaskins@novell.com Subject: [PATCH] fix oops in root-domain code during repartitioning [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] we cannot kfree while in_atomic in -rt, and we currently hold the (raw_spinlock_t)rq->lock while we try. So defer the operation until we are out of the critical section. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/sched.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -6325,6 +6325,7 @@ static void rq_attach_root(struct rq *rq { unsigned long flags; const struct sched_class *class; + struct root_domain *reap = NULL; spin_lock_irqsave(&rq->lock, flags); @@ -6340,7 +6341,7 @@ static void rq_attach_root(struct rq *rq cpu_clear(rq->cpu, old_rd->online); if (atomic_dec_and_test(&old_rd->refcount)) - kfree(old_rd); + reap = old_rd; } atomic_inc(&rd->refcount); @@ -6356,6 +6357,10 @@ static void rq_attach_root(struct rq *rq } spin_unlock_irqrestore(&rq->lock, flags); + + /* Don't try to free the memory while in-atomic() */ + if (unlikely(reap)) + kfree(reap); } static void init_rootdomain(struct root_domain *rd) ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-balance-check-rq.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001415�11041657734�015772� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/sched_rt.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -840,9 +840,11 @@ static void prio_changed_rt(struct rq *r pull_rt_task(rq); /* * If there's a higher priority task waiting to run - * then reschedule. + * then reschedule. 
Note, the above pull_rt_task + * can release the rq lock and p could migrate. + * Only reschedule if p is still on the same runqueue. */ - if (p->prio > rq->rt.highest_prio) + if (p->prio > rq->rt.highest_prio && task_rq(p) == rq) resched_task(p); #else /* For UP simply resched on drop of prio */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/printk-in-atomic-hack-fix.patch�������������������������������������������������������������0000664�0000764�0000764�00000003334�11041657735�017147� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: fix printk in atomic hack The printk in atomic hack had a slight bug. This but was triggered by debug locking options. The hack prevents grabbing sleeping spin locks in printk console drivers if we are in atomic (can't sleep). But the unlock had a bug where it incorrectely assumed that if we are in printk and atomic, that we didn't grab the lock. The debug locking can encapsulate these options and cause unlocks to be in atomic when the lock was not. This means we would not release the lock after it was taken. The patch only skips releasing the lock if in printk - atomic *and* not the lock owner. Special thanks goes to Jon Masters for digging his head deep into this crap and narrowing it down to a problem with printks and locks. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -634,7 +634,7 @@ rt_spin_lock_fastlock(struct rt_mutex *l void fastcall (*slowfn)(struct rt_mutex *lock)) { /* Temporary HACK! */ - if (!current->in_printk) + if (likely(!current->in_printk)) might_sleep(); else if (in_atomic() || irqs_disabled()) /* don't grab locks for printk in atomic */ @@ -651,7 +651,7 @@ rt_spin_lock_fastunlock(struct rt_mutex void fastcall (*slowfn)(struct rt_mutex *lock)) { /* Temporary HACK! 
*/ - if (current->in_printk && (in_atomic() || irqs_disabled())) + if (unlikely(rt_mutex_owner(lock) != current) && current->in_printk) /* don't grab locks for printk in atomic */ return; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/slab-irq-nopreempt-fix.patch����������������������������������������������������������������0000664�0000764�0000764�00000017446�11041657735�016610� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Mon Mar 24 16:41:44 2008 Date: Thu, 20 Mar 2008 11:00:27 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de>, Steven Rostedt <rostedt@goodmis.org>, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt] rt-slab: fix cpu inconsistency case Hi Ingo, I encountered BUG at mm/slab.c. It tells me there is an inconsistency case. One of the panics, it shows obviously exclusive control in slab is not right. The console log; Unable to handle kernel paging request at ffff8108cf81d3c8 RIP: [<ffffffff8029e92f>] kmem_cache_alloc+0x5e/0xdb PGD 8063 PUD 0 Oops: 0000 [1] PREEMPT SMP CPU 1 Modules linked in: Pid: 32268, comm: dcachebench Not tainted 2.6.24.3-rt3 #3 RIP: 0010:[<ffffffff8029e92f>] [<ffffffff8029e92f>] kmem_cache_alloc+0x5e/0xdb RSP: 0018:ffff810048895e58 EFLAGS: 00010097 RAX: 00000000ffffffff RBX: 0000000000000246 RCX: 0000000000000000 RDX: ffff8100cf81d3b8 RSI: 0000000000000000 RDI: ffff8100cf80ee40 RBP: ffff8100cf80ee40 R08: 0000000000000c62 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000000d0 R13: ffffffff802aa83d R14: 0000000000000006 R15: 0000000000000000 FS: 00002adc7b6686f0(0000) GS:ffff8100cf81f340(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff8108cf81d3c8 CR3: 00000000b36e0000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dcachebench (pid: 32268, threadinfo ffff810048894000, task ffff8100cd248080) Stack: 0000000000000000 ffff810097d1c000 0000000100000282 0000000000000006 00000000ffffff9c 0000000000000000 00000000005038b0 ffffffff802aa83d ffff8100cf811080 0000000000000006 00000000ffffff9c 0000000000000000 Call Trace: [<ffffffff802aa83d>] getname+0x1e/0x1f0 [<ffffffff802ac38c>] do_rmdir+0x17/0xfd [<ffffffff8026fb5f>] audit_syscall_entry+0x163/0x189 [<ffffffff8020c2f3>] tracesys+0x71/0xe1 [<ffffffff8020c35e>] tracesys+0xdc/0xe1 --------------------------- | preempt count: 00000001 ] | 1-level deep critical section nesting: ---------------------------------------- ... [<ffffffff80640737>] .... __spin_trylock+0xe/0x49 ......[<00000000>] .. 
( <= 0x0) Code: 48 8b 54 c2 18 53 9d 4c 89 e9 44 89 e6 48 89 ef e8 f1 e1 ff RIP [<ffffffff8029e92f>] kmem_cache_alloc+0x5e/0xdb RSP <ffff810048895e58> The kernel dump; (gdb) info thr 4 process 32262 flush_tlb_others (cpumask={bits = {4}}, mm=0xffff8100cd16d450, va=46966178668544) at arch/x86/kernel/smp_64.c:195 3 process 32325 0xffffffff8020cad0 in invalidate_interrupt3 () 2 process 32268 0xffffffff8029e92f in kmem_cache_alloc (cachep=0xffff8100cf80ee40, flags=208) at mm/slab.c:3320 * 1 process 32326 0xffffffff806408c4 in __spin_lock (lock=0xffff8100cdc9a4d0) at kernel/spinlock.c:333 (gdb) thr 2 [Switching to thread 2 (process 32268)]#0 0xffffffff8029e92f in kmem_cache_alloc ( cachep=0xffff8100cf80ee40, flags=208) at mm/slab.c:3320 3320 mm/slab.c: No such file or directory. in mm/slab.c 0xffffffff8029e903 <kmem_cache_alloc+50>: callq 0xffffffff8029c292 <check_irq_off> 0xffffffff8029e908 <kmem_cache_alloc+55>: movslq 0x14(%rsp),%rax 0xffffffff8029e90d <kmem_cache_alloc+60>: mov 0x0(%rbp,%rax,8),%rdx 0xffffffff8029e912 <kmem_cache_alloc+65>: mov (%rdx),%eax 0xffffffff8029e914 <kmem_cache_alloc+67>: test %eax,%eax 0xffffffff8029e916 <kmem_cache_alloc+69>: je 0xffffffff8029e990 <kmem_cache_alloc+191> 1st. %eax, ac->avail is not 0. 0xffffffff8029e918 <kmem_cache_alloc+71>: lock incl 0xf8(%rbp) 0xffffffff8029e91f <kmem_cache_alloc+78>: movl $0x1,0xc(%rdx) 0xffffffff8029e926 <kmem_cache_alloc+85>: mov (%rdx),%eax 0xffffffff8029e928 <kmem_cache_alloc+87>: sub $0x1,%eax 0xffffffff8029e92b <kmem_cache_alloc+90>: mov %eax,(%rdx) 0xffffffff8029e92d <kmem_cache_alloc+92>: mov %eax,%eax 0xffffffff8029e92f <kmem_cache_alloc+94>: mov 0x18(%rdx,%rax,8),%rdx 2nd. %eax, ac->avail = 0xffffffff, it causes this panic. 0xffffffff8029e934 <kmem_cache_alloc+99>: push %rbx 0xffffffff8029e935 <kmem_cache_alloc+100>: popfq static inline void * ____cache_alloc(struct kmem_cache *cachep, gfp_t flags, int *this_cpu) { void *objp; struct array_cache *ac; check_irq_off(); ac = cpu_cache_get(cachep, *this_cpu); if (likely(ac->avail)) { ---------------------------> 1st. check STATS_INC_ALLOCHIT(cachep); ac->touched = 1; objp = ac->entry[--ac->avail]; -------------> PANIC HERE } else { STATS_INC_ALLOCMISS(cachep); objp = cache_alloc_refill(cachep, flags, this_cpu); } return objp; } It means that ac->avail is changed by the other cpu. So I looked for who touch another cpu's data and found the root cause. I think, calling cache_grow() in cache_alloc_refill() can change cpu, but cache_grow() doesn't update this_cpu variable in spite of changing cpu. It makes the kernel access other cpu's data in cache_alloc_refill(). I'm testing this patch now. --- From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> On !PREEMPT_RT, an inconsistency case may be occurred. After releasing cpu by slab_irq_enable_nort() the cpu can be changed. In slab_irq_disable_nort() new cpu id should be gotten. If not, it makes data of other cpu corrupted. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- mm/slab.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -138,8 +138,8 @@ * * (On PREEMPT_RT, these are NOPs, but we have to drop/get the irq locks.) 
*/ -# define slab_irq_disable_nort() local_irq_disable() -# define slab_irq_enable_nort() local_irq_enable() +# define slab_irq_disable_nort(cpu) slab_irq_disable(cpu) +# define slab_irq_enable_nort(cpu) slab_irq_enable(cpu) # define slab_irq_disable_rt(flags) do { (void)(flags); } while (0) # define slab_irq_enable_rt(flags) do { (void)(flags); } while (0) # define slab_spin_lock_irq(lock, cpu) \ @@ -160,8 +160,8 @@ DEFINE_PER_CPU_LOCKED(int, slab_irq_lock do { slab_irq_enable(cpu); (void) (flags); } while (0) # define slab_irq_disable_rt(cpu) slab_irq_disable(cpu) # define slab_irq_enable_rt(cpu) slab_irq_enable(cpu) -# define slab_irq_disable_nort() do { } while (0) -# define slab_irq_enable_nort() do { } while (0) +# define slab_irq_disable_nort(cpu) do { } while (0) +# define slab_irq_enable_nort(cpu) do { } while (0) # define slab_spin_lock_irq(lock, cpu) \ do { slab_irq_disable(cpu); spin_lock(lock); } while (0) # define slab_spin_unlock_irq(lock, cpu) \ @@ -2899,7 +2899,7 @@ static int cache_grow(struct kmem_cache offset *= cachep->colour_off; if (local_flags & __GFP_WAIT) - slab_irq_enable_nort(); + slab_irq_enable_nort(*this_cpu); slab_irq_enable_rt(*this_cpu); /* @@ -2932,7 +2932,7 @@ static int cache_grow(struct kmem_cache slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(); + slab_irq_disable_nort(*this_cpu); check_irq_off(); spin_lock(&l3->list_lock); @@ -2948,7 +2948,7 @@ opps1: failed: slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(); + slab_irq_disable_nort(*this_cpu); return 0; } @@ -3396,7 +3396,7 @@ retry: * set and go into memory reserves if necessary. */ if (local_flags & __GFP_WAIT) - slab_irq_enable_nort(); + slab_irq_enable_nort(*this_cpu); slab_irq_enable_rt(*this_cpu); kmem_flagcheck(cache, flags); @@ -3404,7 +3404,7 @@ retry: slab_irq_disable_rt(*this_cpu); if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(); + slab_irq_disable_nort(*this_cpu); if (obj) { /* ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sysctl-compile-fix.patch��������������������������������������������������������������������0000664�0000764�0000764�00000002011�11041657732�016011� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From leon.woestenberg@gmail.com Mon Mar 24 17:38:54 2008 Date: Fri, 29 Feb 2008 23:11:40 +0100 From: Leon Woestenberg <leon.woestenberg@gmail.com> To: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Subject: [PATCH] Fix build, missing profile.h include in kernel/sysctl.c (against 2.6.24.3-rt3) Build fix for kernel/sysctl.c:356: error: 'prof_pid' undeclared here (not in a function). The fix is to include <linux/profile.h> which defines prof_pid as external. Patch is against 2.6.24.3-rt3. 
Signed-off-by: Leon Woestenberg <leon@sidebranch.com> --- kernel/sysctl.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/sysctl.c =================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -47,6 +47,7 @@ #include <linux/acpi.h> #include <linux/reboot.h> #include <linux/ftrace.h> +#include <linux/profile.h> #include <asm/uaccess.h> #include <asm/processor.h> �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/kthread-cpus-allowed-init.patch�������������������������������������������������������������0000664�0000764�0000764�00000003162�11041657733�017247� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Mon Mar 24 17:40:01 2008 Date: Fri, 07 Mar 2008 07:11:43 -0500 From: Gregory Haskins <ghaskins@novell.com> To: rostedt@goodmis.org, mingo@elte.he Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, ghaskins@novell.com Subject: [PATCH] RESEND: fix cpus_allowed settings [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] Hi Ingo, Steve, I sent this patch a few weeks ago along with the migration_disable series. I think the controversy with the migration_disable feature may have resulted in this fix being overlooked. This patch is against -rt, but the bug theoretically affects both -rt and sched-devel/mainline. I can also whip up a sched-devel based patch if you like, but I think it will apply trivially to both places. Please consider it for inclusion. Regards, -Greg ------------------------------- Subject: fix cpus_allowed settings We miss setting nr_cpus_allowed for the kthread case since the normal set_cpus_allowed() function is not used. 
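
For context (illustration, not part of this patch): the RT push/pull code decides whether a task may be migrated from nr_cpus_allowed rather than by recounting cpus_allowed, so a bound kthread with a stale nr_cpus_allowed can still be treated as a migration candidate even though its mask contains a single CPU. Any code that narrows the mask by hand therefore has to keep both fields in sync, which is what the hunk below does for kthread_bind(). A hypothetical helper showing the invariant:

#include <linux/sched.h>
#include <linux/cpumask.h>

/*
 * Hypothetical helper (not in the tree): anything that pins a task to
 * one CPU must update both fields, because the balancer only consults
 * nr_cpus_allowed (checks of the "if (p->nr_cpus_allowed > 1)" kind)
 * when deciding whether the task is pushable.
 */
static void pin_task_to_cpu(struct task_struct *p, unsigned int cpu)
{
        p->cpus_allowed    = cpumask_of_cpu(cpu);
        p->nr_cpus_allowed = 1;         /* must match the mask above */
}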
Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/kthread.c | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/kthread.c =================================================================== --- linux-2.6.24.7.orig/kernel/kthread.c +++ linux-2.6.24.7/kernel/kthread.c @@ -170,6 +170,7 @@ void kthread_bind(struct task_struct *k, wait_task_inactive(k); set_task_cpu(k, cpu); k->cpus_allowed = cpumask_of_cpu(cpu); + k->nr_cpus_allowed = 1; } EXPORT_SYMBOL(kthread_bind); ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc-tlbflush-preempt.patch������������������������������������������������������������������0000664�0000764�0000764�00000005772�11041657732�016354� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From estarkov@ru.mvista.com Mon Mar 24 17:41:35 2008 Date: Wed, 12 Mar 2008 18:37:42 +0300 From: Egor Starkov <estarkov@ru.mvista.com> To: mingo@elte.hu Subject: Memory corruption fixes Resent-Date: Wed, 12 Mar 2008 17:06:59 +0100 Resent-Date: Wed, 12 Mar 2008 12:10:04 -0400 Resent-From: Ingo Molnar <mingo@elte.hu> Resent-To: Steven Rostedt <rostedt@goodmis.org> Hi Ingo, I have found out that functions __flush_tlb_pending and hpte_need_flush must be called from within some kind of spinlock/non-preempt region. Fix "flush_hash_page_fix.patch" is attached. Also debug version of function add_preempt_count can be called on early stage of boot when current is not set and is 0. So we can have memory corruption. I had it as stack pointer exception after "Freeing unused kernel memory" message. Fix "preempt_debug_trace_fix.patch" is attached. 
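
The rule the hunks below enforce is the usual one for per-CPU data: the per-CPU TLB batch may only be touched with preemption disabled, otherwise the task can migrate between fetching this CPU's batch and using it, and end up writing into another CPU's batch. On mainline the callers hold spinlocks that already imply this; on PREEMPT_RT those locks are sleeping locks and do not disable preemption, hence the explicit preempt_disable()/preempt_enable() pairs. The generic shape of the pattern, with a made-up structure standing in for ppc64_tlb_batch:

#include <linux/percpu.h>
#include <linux/preempt.h>

struct example_batch {                  /* stand-in for ppc64_tlb_batch */
        int index;
};
static DEFINE_PER_CPU(struct example_batch, example_batch);

static void example_batch_add(void)
{
        struct example_batch *b;

        preempt_disable();              /* no migration from here on */
        b = &__get_cpu_var(example_batch);
        b->index++;                     /* provably this CPU's batch */
        preempt_enable();
}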
Egor Starkov [ Part 2: "Attached Text" ] Signed-off-by: Egor Starkov <estarkov@ru.mvista.com> Description: Functions __flush_tlb_pending and hpte_need_flush must be called from within some kind of spinlock/non-preempt region --- include/asm-powerpc/pgtable-ppc64.h | 9 ++++++++- include/asm-powerpc/tlbflush.h | 20 ++++++++++++++++++-- 2 files changed, 26 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/asm-powerpc/pgtable-ppc64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/pgtable-ppc64.h +++ linux-2.6.24.7/include/asm-powerpc/pgtable-ppc64.h @@ -277,8 +277,15 @@ static inline unsigned long pte_update(s : "r" (ptep), "r" (clr), "m" (*ptep), "i" (_PAGE_BUSY) : "cc" ); - if (old & _PAGE_HASHPTE) + if (old & _PAGE_HASHPTE) { +#ifdef CONFIG_PREEMPT_RT + preempt_disable(); +#endif hpte_need_flush(mm, addr, ptep, old, huge); +#ifdef CONFIG_PREEMPT_RT + preempt_enable(); +#endif + } return old; } Index: linux-2.6.24.7/include/asm-powerpc/tlbflush.h =================================================================== --- linux-2.6.24.7.orig/include/asm-powerpc/tlbflush.h +++ linux-2.6.24.7/include/asm-powerpc/tlbflush.h @@ -109,7 +109,15 @@ extern void hpte_need_flush(struct mm_st static inline void arch_enter_lazy_mmu_mode(void) { - struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *batch; +#ifdef CONFIG_PREEMPT_RT + preempt_disable(); +#endif + batch = &get_cpu_var(ppc64_tlb_batch); + +#ifdef CONFIG_PREEMPT_RT + preempt_enable(); +#endif batch->active = 1; put_cpu_var(ppc64_tlb_batch); @@ -117,7 +125,12 @@ static inline void arch_enter_lazy_mmu_m static inline void arch_leave_lazy_mmu_mode(void) { - struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *batch; + +#ifdef CONFIG_PREEMPT_RT + preempt_disable(); +#endif + batch = &get_cpu_var(ppc64_tlb_batch); if (batch->active) { if (batch->index) { @@ -125,6 +138,9 @@ static inline void arch_leave_lazy_mmu_m } batch->active = 0; } +#ifdef CONFIG_PREEMPT_RT + preempt_enable(); +#endif put_cpu_var(ppc64_tlb_batch); } ������patches/swap-spinlock-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000022252�11041657732�015645� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Mon Mar 24 17:44:11 2008 Date: Mon, 17 Mar 2008 17:14:51 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: linux-rt-users@vger.kernel.org, Ingo Molnar <mingo@elte.hu>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de> Subject: Re: deadlock on 2.6.24.3-rt3 Hiroshi Shimamoto wrote: > Hi, > > I got a soft lockup message on 2.6.24.3-rt3. > I attached the .config. > > I think there is a deadlock scenario, I explain later. > > Here is the console log; > BUG: soft lockup - CPU#2 stuck for 11s! 
[bash:2175] > CPU 2: > Modules linked in: > Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1 > RIP: 0010:[<ffffffff8063f052>] [<ffffffff8063f052>] __spin_lock+0x57/0x67 > RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202 > RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5 > RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001 > RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458 > R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8 > R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff > FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Call Trace: > [<ffffffff8063f024>] __spin_lock+0x29/0x67 > [<ffffffff80296597>] swap_info_get+0x65/0xdd > [<ffffffff80296c0c>] can_share_swap_page+0x39/0x84 > [<ffffffff8028ae6e>] do_wp_page+0x2f9/0x519 > [<ffffffff8028c8d2>] handle_mm_fault+0x615/0x7cf > [<ffffffff802db823>] proc_flush_task+0x171/0x29c > [<ffffffff8024a64d>] recalc_sigpending+0xe/0x3c > [<ffffffff8022645e>] do_page_fault+0x162/0x754 > [<ffffffff8026fe81>] audit_syscall_exit+0x31c/0x37a > [<ffffffff8063f449>] error_exit+0x0/0x51 > --------------------------- > | preempt count: 00010002 ] > | 2-level deep critical section nesting: > ---------------------------------------- > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460] > CPU 3: > Modules linked in: > Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1 > RIP: 0010:[<ffffffff8027c974>] [<ffffffff8027c974>] find_get_page+0xad/0xbe > RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202 > 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002 > RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910 > RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003 > R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880 > R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880 > FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Call Trace: > [<ffffffff8027c8ea>] find_get_page+0x23/0xbe > [<ffffffff80296f83>] free_swap_and_cache+0x46/0xdd > [<ffffffff8028b9b7>] unmap_vmas+0x626/0x8ce > [<ffffffff8028fa4c>] exit_mmap+0xac/0x147 > [<ffffffff8023ced7>] mmput+0x32/0xae > [<ffffffff80242f00>] do_exit+0x199/0x914 > [<ffffffff8024ab3b>] __dequeue_signal+0x19/0x1b7 > [<ffffffff802436a7>] do_group_exit+0x2c/0x7e > [<ffffffff8024c47b>] get_signal_to_deliver+0x2ef/0x4aa > [<ffffffff8020b5dc>] do_notify_resume+0xa8/0x7cd > [<ffffffff80239320>] add_preempt_count+0x14/0x111 > [<ffffffff8038b292>] __up_read+0x13/0x8d > [<ffffffff80226483>] do_page_fault+0x187/0x754 > [<ffffffff802335ae>] __dequeue_entity+0x2d/0x34 > [<ffffffff8020a6d5>] __switch_to+0x27/0x2c9 > [<ffffffff802264f0>] do_page_fault+0x1f4/0x754 > [<ffffffff8020c7be>] retint_signal+0x3d/0x7f > --------------------------- > | preempt count: 00010005 ] > | 5-level deep critical 
section nesting: > ---------------------------------------- > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > .. [<ffffffff8027c8db>] .... find_get_page+0x14/0xbe > .....[<00000000>] .. ( <= 0x0) > .. [<ffffffff8063f009>] .... __spin_lock+0xe/0x67 > .....[<00000000>] .. ( <= 0x0) > > > I also got a kernel core. > (gdb) info thr > 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>, > offset=18446744071570598016) at include/asm/processor_64.h:385 > 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333 > 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22 > * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333 > > CPU3(thread 4) is in find_get_page(), and the others in __spin_lock(). > The thread 4 is waiting to turn PG_nonewrefs bit off in wait_on_page_ref() which is > called from page_cache_get_speculative(), and the thread 4 holds the swap_lock. > The other threads waiting the swap_lock. > On the other hand, the thread 1 turned PG_nonewrefs bit on by calling > lock_page_ref_irq() in remove_mapping(), and then waiting the swap_lock. > So if the target page of remove_mapping() is in the exiting process memory, > the kernel is deadlock. > > (gdb) bt > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333 > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>) > at mm/swapfile.c:253 > #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300 > #3 0xffffffff80286acd in remove_mapping (mapping=<value optimized out>, > page=0xffff810005ad8910) at mm/vmscan.c:423 > ... > > (gdb) thr 2 > (gdb) bt > #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22 > #1 0xffffffff80296374 in valid_swaphandles (entry=<value optimized out>, > offset=0xffff81001e22bc78) at mm/swapfile.c:1783 > #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1) > at mm/memory.c:2054 > #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0, > pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089 > ... > > (gdb) thr 3 > (gdb) bt > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333 > #1 0xffffffff80296597 in swap_info_get (entry=<value optimized out>) > at mm/swapfile.c:253 > #2 0xffffffff80296c0c in can_share_swap_page (page=<value optimized out>) > at mm/swapfile.c:317 > #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00, > address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018, > ptl=0xffff810007c95458, orig_pte=<value optimized out>) at mm/memory.c:1606 > ... 
> > (gdb) thr 4 > (gdb) bt > #0 0xffffffff8027c974 in find_get_page (mapping=<value optimized out>, > offset=18446744071570598016) at include/asm/processor_64.h:385 > #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403 > #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678, > start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0, > details=0x0) at mm/memory.c:728 > #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048 > #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443 > #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997 > ... > > > I think it came from the lockless speculative get page patch. > I found the newer version of this patch in linux-mm. > http://marc.info/?l=linux-mm&m=119477111927364&w=2 > > I haven't tested it because it looks big change and hard to apply. > But it seems to fix this deadlock issue. > Any other patch to fix this issue is welcome. > Is this patch good? --- From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Subject: [PATCH] avoid deadlock related with PG_nonewrefs and swap_lock There is a deadlock scenario; remove_mapping() vs free_swap_and_cache(). remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock. free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit off in find_get_page(). swap_lock can be unlocked before calling find_get_page(). Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- mm/swapfile.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/mm/swapfile.c =================================================================== --- linux-2.6.24.7.orig/mm/swapfile.c +++ linux-2.6.24.7/mm/swapfile.c @@ -400,13 +400,14 @@ void free_swap_and_cache(swp_entry_t ent p = swap_info_get(entry); if (p) { if (swap_entry_free(p, swp_offset(entry)) == 1) { + spin_unlock(&swap_lock); page = find_get_page(&swapper_space, entry.val); if (page && unlikely(TestSetPageLocked(page))) { page_cache_release(page); page = NULL; } - } - spin_unlock(&swap_lock); + } else + spin_unlock(&swap_lock); } if (page) { int one_user; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/remove-spinlock-define.patch����������������������������������������������������������������0000664�0000764�0000764�00000004007�11041657733�016633� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Mon Mar 24 17:45:51 2008 Date: Fri, 07 Mar 2008 09:06:35 -0500 From: Gregory Haskins <ghaskins@novell.com> To: mingo@elte.hu, rostedt@goodmis.org, tglx@linutronix.de, linux-rt-users@vger.kernel.org Cc: ghaskins@novell.com, linux-kernel@vger.kernel.org Subject: [PATCH] RT: fix spinlock preemption feature when PREEMPT_RT is enabled [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. 
] [ Some characters may be displayed incorrectly. ] kernel/spinlock.c implements two versions of spinlock wrappers around the arch-specific implementations: 1) A simple passthrough which implies disabled preemption while spinning 2) A "preemptible waiter" version which uses trylock. Currently, PREEMPT && SMP will turn on the preemptible feature, and lockdep or PREEMPT_RT will disable it. Disabling the feature for lockdep makes perfect sense, but PREEMPT_RT is counter-intuitive. My guess is that this was inadvertent, so this patch once again enables the feature for PREEMPT_RT. (Since PREEMPT is set for PREEMPT_RT, we simply get rid of the extra condition). I have tested the PREEMPT_RT kernel with this patch and all seems well. Therefore, if there *is* an issue with running preemptible versions of these spinlocks under PREEMPT_RT, it is not immediately apparent why. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/spinlock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/spinlock.c =================================================================== --- linux-2.6.24.7.orig/kernel/spinlock.c +++ linux-2.6.24.7/kernel/spinlock.c @@ -117,7 +117,7 @@ EXPORT_SYMBOL(__write_trylock_irqsave); * not re-enabled during lock-acquire (which the preempt-spin-ops do): */ #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \ - defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_PREEMPT_RT) + defined(CONFIG_DEBUG_LOCK_ALLOC) void __lockfunc __read_lock(raw_rwlock_t *lock) { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/migrate-dying.patch�������������������������������������������������������������������������0000664�0000764�0000764�00000004555�11041657734�015037� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Mon Mar 24 17:47:08 2008 Date: Mon, 10 Mar 2008 12:20:57 -0400 From: Gregory Haskins <ghaskins@novell.com> To: rostedt@goodmis.org, mingo@elte.hu, tglx@linutronix.de Cc: linux-rt-users@vger.kernel.org, ghaskins@novell.com Subject: [PATCH] keep rd->online and cpu_online_map in sync [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] Hi Ingo, Steve, Thomas, Ingo pointed me at an issue on LKML/mainline regarding cpu-hotplug. It turns out that the issue (I believe) was related to the root-domain code that I added a few months ago. This code is also in -rt, so the bug should exist there as well. This may fix the root-cause of that s2ram bug that Mike Galbraith found and patched a while back as well. If that is the case, we can probably get rid of the "num_online_cpus()" in the update_migration code. 
That will require confirmation from Mike, however, as I do not have a P4 machine like his that exhibits the bug. For now, both patches should probably co-exist together. This applies to 24.3-rt3 Regards, -Greg ------------------------------ keep rd->online and cpu_online_map in sync It is possible to allow the root-domain cache of online cpus to become out of sync with the global cpu_online_map. This is because we currently trigger removal of cpus too early in the notifier chain. Other DOWN_PREPARE handlers may in fact run and reconfigure the root-domain topology, thereby stomping on our own offline handling. The end result is that rd->online may become out of sync with cpu_online_map, which results in potential task misrouting. So change the offline handling to be more tightly coupled with the global offline process by triggering on CPU_DYING intead of CPU_DOWN_PREPARE. Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -6120,7 +6120,7 @@ migration_call(struct notifier_block *nf spin_unlock_irq(&rq->lock); break; - case CPU_DOWN_PREPARE: + case CPU_DYING: /* Update our root-domain */ rq = cpu_rq(cpu); spin_lock_irqsave(&rq->lock, flags); ���������������������������������������������������������������������������������������������������������������������������������������������������patches/nmi-watchdog-fix-1.patch��������������������������������������������������������������������0000664�0000764�0000764�00000002177�11041657732�015576� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Thu May 15 10:14:15 2008 Date: Mon, 28 Apr 2008 11:14:39 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Ingo Molnar <mingo@elte.hu>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 1/4] x86_64: send NMI after nmi_show_regs on From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> The flags nmi_show_regs should be set before send NMI. 
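
Together with patches 2-4 of this series the code converges on the usual publish-then-notify pattern: publish the request flags, send the NMIs, then wait for every CPU to clear its flag after it has finished show_regs(). Roughly (the array name here is hypothetical; the tree uses nmi_show_regs[], and the usual NMI-watchdog includes are assumed to be in scope):

static int show_regs_request[NR_CPUS];

/* requesting CPU */
static void request_regs_from_others(int self)
{
        int i;

        for_each_online_cpu(i)          /* 1) publish the request first */
                show_regs_request[i] = 1;

        smp_send_nmi_allbutself();      /* 2) then send the NMIs        */

        for_each_online_cpu(i) {
                if (i == self)
                        continue;
                while (show_regs_request[i])    /* 3) wait for the ack  */
                        cpu_relax();
        }
}

/*
 * NMI handler side: print the registers, then clear the flag, so the
 * requester cannot run on before the output is complete (see patch 4).
 */

If the NMI were sent before step 1, a fast CPU could take it, find its flag still clear, do nothing, and the wait loop in step 3 would spin forever.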
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- arch/x86/kernel/nmi_64.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -327,11 +327,11 @@ void nmi_show_all_regs(void) if (system_state == SYSTEM_BOOTING) return; - smp_send_nmi_allbutself(); - for_each_online_cpu(i) nmi_show_regs[i] = 1; + smp_send_nmi_allbutself(); + for_each_online_cpu(i) { while (nmi_show_regs[i] == 1) barrier(); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nmi-watchdog-fix-2.patch��������������������������������������������������������������������0000664�0000764�0000764�00000007636�11041657734�015606� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Thu May 15 10:14:48 2008 Date: Mon, 28 Apr 2008 11:16:31 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Ingo Molnar <mingo@elte.hu>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 2/4] x86: return true for NMI handled From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> NMI for show_regs causes unknown NMI when nmi_watchdog is local APIC mode. Because lapic_wd_event() will fail due to still running perfctr. If NMI is for show_regs, nmi_watchdog_tick() should return 1. On x86_32, call irq_show_regs_callback() is moved to top of the nmi_watchdog_tick() same as x86_64. 
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- arch/x86/kernel/nmi_32.c | 10 +++++----- arch/x86/kernel/nmi_64.c | 9 +++++---- include/linux/sched.h | 2 +- 3 files changed, 11 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -350,10 +350,10 @@ void nmi_show_all_regs(void) static DEFINE_RAW_SPINLOCK(nmi_print_lock); -notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) +notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) { if (!nmi_show_regs[cpu]) - return; + return 0; nmi_show_regs[cpu] = 0; spin_lock(&nmi_print_lock); @@ -362,6 +362,7 @@ notrace void irq_show_regs_callback(int per_cpu(irq_stat, cpu).apic_timer_irqs); show_regs(regs); spin_unlock(&nmi_print_lock); + return 1; } notrace __kprobes int @@ -376,8 +377,9 @@ nmi_watchdog_tick(struct pt_regs * regs, unsigned int sum; int touched = 0; int cpu = smp_processor_id(); - int rc=0; + int rc; + rc = irq_show_regs_callback(cpu, regs); __profile_tick(CPU_PROFILING, regs); /* check for other users first */ @@ -404,8 +406,6 @@ nmi_watchdog_tick(struct pt_regs * regs, sum = per_cpu(irq_stat, cpu).apic_timer_irqs + per_cpu(irq_stat, cpu).irq0_irqs; - irq_show_regs_callback(cpu, regs); - /* if the apic timer isn't firing, this cpu isn't doing much */ /* if the none of the timers isn't firing, this cpu isn't doing much */ if (!touched && last_irq_sums[cpu] == sum) { Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -340,10 +340,10 @@ void nmi_show_all_regs(void) static DEFINE_RAW_SPINLOCK(nmi_print_lock); -notrace void irq_show_regs_callback(int cpu, struct pt_regs *regs) +notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) { if (!nmi_show_regs[cpu]) - return; + return 0; nmi_show_regs[cpu] = 0; spin_lock(&nmi_print_lock); @@ -351,6 +351,7 @@ notrace void irq_show_regs_callback(int printk(KERN_WARNING "apic_timer_irqs: %d\n", read_pda(apic_timer_irqs)); show_regs(regs); spin_unlock(&nmi_print_lock); + return 1; } notrace int __kprobes @@ -359,9 +360,9 @@ nmi_watchdog_tick(struct pt_regs * regs, int sum; int touched = 0; int cpu = smp_processor_id(); - int rc = 0; + int rc; - irq_show_regs_callback(cpu, regs); + rc = irq_show_regs_callback(cpu, regs); __profile_tick(CPU_PROFILING, regs); /* check for other users first */ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -292,7 +292,7 @@ static inline void show_state(void) } extern void show_regs(struct pt_regs *); -extern void irq_show_regs_callback(int cpu, struct pt_regs *regs); +extern int irq_show_regs_callback(int cpu, struct pt_regs *regs); /* * TASK is a pointer to the task whose backtrace we want to see (or NULL for current ��������������������������������������������������������������������������������������������������patches/nmi-watchdog-fix-3.patch��������������������������������������������������������������������0000664�0000764�0000764�00000003557�11041657733�015604� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Thu May 15 10:15:00 2008 Date: Mon, 28 Apr 2008 11:17:48 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Ingo Molnar <mingo@elte.hu>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 3/4] x86: nmi_watchdog NMI needed for irq_show_regs_callback() From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> The -rt kernel doesn't panic immediately when NMI lockup detected. Because the kernel waits show_regs on all cpus, but NMI is not come so frequently. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- arch/x86/kernel/nmi_32.c | 7 +++++++ arch/x86/kernel/nmi_64.c | 8 +++++++- 2 files changed, 14 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -428,6 +428,13 @@ nmi_watchdog_tick(struct pt_regs * regs, if (i == cpu) continue; nmi_show_regs[i] = 1; + } + + smp_send_nmi_allbutself(); + + for_each_online_cpu(i) { + if (i == cpu) + continue; while (nmi_show_regs[i] == 1) cpu_relax(); } Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -413,10 +413,16 @@ nmi_watchdog_tick(struct pt_regs * regs, if (i == cpu) continue; nmi_show_regs[i] = 1; + } + + smp_send_nmi_allbutself(); + + for_each_online_cpu(i) { + if (i == cpu) + continue; while (nmi_show_regs[i] == 1) cpu_relax(); } - die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs, panic_on_timeout); } �������������������������������������������������������������������������������������������������������������������������������������������������patches/nmi-watchdog-fix-4.patch��������������������������������������������������������������������0000664�0000764�0000764�00000004531�11041657731�015574� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Thu May 15 10:15:22 2008 Date: Mon, 28 Apr 2008 11:19:21 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Ingo Molnar <mingo@elte.hu>, Steven Rostedt <rostedt@goodmis.org>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt 4/4] wait for finish show_regs() before panic From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> It might cause kdump failure that the kernel doesn't wait for finish show_regs(). The nmi_show_regs variable for show_regs() flag is cleared before show_regs() is really called. This flag should be cleared after show_regs(). kdump stops all CPUs other than crashing CPU by NMI handler, but if show_regs() takes a bit time, kdump cannot wait and will continue process. 
It means that the 2nd kernel and the old kernel run simultaneously and it might cause unexpected behavior, such as randomly reboot. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Maxim Uvarov <muvarov@ru.mvista.com> --- arch/x86/kernel/nmi_32.c | 2 +- arch/x86/kernel/nmi_64.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -355,13 +355,13 @@ notrace int irq_show_regs_callback(int c if (!nmi_show_regs[cpu]) return 0; - nmi_show_regs[cpu] = 0; spin_lock(&nmi_print_lock); printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); printk(KERN_WARNING "apic_timer_irqs: %d\n", per_cpu(irq_stat, cpu).apic_timer_irqs); show_regs(regs); spin_unlock(&nmi_print_lock); + nmi_show_regs[cpu] = 0; return 1; } Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -345,12 +345,12 @@ notrace int irq_show_regs_callback(int c if (!nmi_show_regs[cpu]) return 0; - nmi_show_regs[cpu] = 0; spin_lock(&nmi_print_lock); printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); printk(KERN_WARNING "apic_timer_irqs: %d\n", read_pda(apic_timer_irqs)); show_regs(regs); spin_unlock(&nmi_print_lock); + nmi_show_regs[cpu] = 0; return 1; } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-avoid-deadlock-in-swap.patch�������������������������������������������������������������0000664�0000764�0000764�00000004221�11041657730�017126� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Thu May 15 09:57:50 2008 Date: Thu, 17 Apr 2008 16:57:20 +0200 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org>, linux-rt-users <linux-rt-users@vger.kernel.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, LKML <linux-kernel@vger.kernel.org> Subject: [PATCH -rt] avoid deadlock related with PG_nonewrefs and swap_lock Resent-Date: Thu, 17 Apr 2008 14:57:28 +0000 (UTC) Resent-From: Peter Zijlstra <peterz@infradead.org> Resent-To: Steven Rostedt <rostedt@goodmis.org> Hi Peter, I've updated the patch. Could you please review it? I'm also thinking that it can be in the mainline because it makes the lock period shorter, correct? --- From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> There is a deadlock scenario; remove_mapping() vs free_swap_and_cache(). remove_mapping() turns PG_nonewrefs bit on, then locks swap_lock. free_swap_and_cache() locks swap_lock, then wait to turn PG_nonewrefs bit off in find_get_page(). swap_lock can be unlocked before calling find_get_page(). In remove_exclusive_swap_page(), there is similar lock sequence; swap_lock, then PG_nonewrefs bit. swap_lock can be unlocked before turning PG_nonewrefs bit on. 
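
In generic terms this is an ABBA inversion, with A = swap_lock and B = the PG_nonewrefs bit:

        remove_mapping()                   free_swap_and_cache()
          takes B  (lock_page_ref)           takes A  (swap_lock)
          waits on A (swap_lock)             waits on B (find_get_page)

This fix and the earlier swap-spinlock-fix.patch apply the same rule: never hold A while waiting on B, i.e. sample whatever state is needed under swap_lock, drop it, and only then touch the page side. A sketch of the fixed ordering (the helper name is made up):

#include <linux/swap.h>
#include <linux/pagemap.h>

static int entry_still_in_swap_cache(swp_entry_t entry)
{
        struct page *page;

        /* swap_lock is no longer held at this point */
        page = find_get_page(&swapper_space, entry.val);
        if (page)
                page_cache_release(page);
        return page != NULL;
}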
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- mm/swapfile.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/mm/swapfile.c =================================================================== --- linux-2.6.24.7.orig/mm/swapfile.c +++ linux-2.6.24.7/mm/swapfile.c @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct pa /* Is the only swap cache user the cache itself? */ retval = 0; if (p->swap_map[swp_offset(entry)] == 1) { + spin_unlock(&swap_lock); /* Recheck the page count with the swapcache lock held.. */ lock_page_ref_irq(page); if ((page_count(page) == 2) && !PageWriteback(page)) { @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct pa retval = 1; } unlock_page_ref_irq(page); - } - spin_unlock(&swap_lock); + } else + spin_unlock(&swap_lock); if (retval) { swap_free(entry); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-shorten-softirq-thread-names.patch�������������������������������������������������������0000664�0000764�0000764�00000002242�11041657731�020423� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From peterz@infradead.org Thu May 15 09:55:55 2008 Date: Fri, 04 Apr 2008 17:42:53 +0200 From: Peter Zijlstra <peterz@infradead.org> To: Steven Rostedt <rostedt@goodmis.org>, Clark Williams <williams@redhat.com> Cc: linux-rt-users <linux-rt-users@vger.kernel.org>, linux-kernel <linux-kernel@vger.kernel.org> Subject: [PATCH -rt] shorten softirq kernel thread names Subject: -rt: shorten softirq kernel thread names Shorten the softirq kernel thread names because they often overflow the limited comm length. 
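
For the record: task->comm is TASK_COMM_LEN (16) bytes including the terminating NUL. A name such as "softirq-hrtimer/1" needs 17 characters, so everything from the "/" on is truncated away and the per-CPU threads of one softirq all end up with the identical comm "softirq-hrtimer". With the shorter prefix, "sirq-hrtimer/15" is 15 characters and still fits, CPU number included.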
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- --- kernel/softirq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -923,7 +923,7 @@ static int __cpuinit cpu_callback(struct for (i = 0; i < MAX_SOFTIRQ; i++) { p = kthread_create(ksoftirqd, &per_cpu(ksoftirqd, hotcpu)[i], - "softirq-%s/%d", softirq_names[i], + "sirq-%s/%d", softirq_names[i], hotcpu); if (IS_ERR(p)) { printk("ksoftirqd %d for %i failed\n", i, ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/time-gcc-linker-error.patch�����������������������������������������������������������������0000664�0000764�0000764�00000002674�11041657732�016376� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From tglx@linutronix.de Thu May 15 09:57:11 2008 Date: Wed, 9 Apr 2008 17:45:01 +0200 (CEST) From: Thomas Gleixner <tglx@linutronix.de> To: Steven Rostedt <rostedt@goodmis.org> Subject: GCC 4.3 kernel linker error (fwd) There is a fix in mainline for that as well. Pick the mainline one, though both are ugly as hell. Thanks, tglx ---------- Forwarded message ---------- Date: Tue, 01 Apr 2008 01:10:06 +0200 From: Carsten Emde <c.emde@osadl.org> To: Thomas Gleixner <tglx@linutronix.de> Subject: GCC 4.3 kernel linker error Thomas, GCC 4.3 still complains with strange linker errors, ("__udivd3 undefined") when compiling 2.6.24.4-rt4. I reported this problem about two months ago. Wouldn't you mind to apply the below patch to the next rt release? The patch applies to 2.6.24.4-rt4. Carsten. 
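
For reference, the helper in question is essentially

        static inline void timespec_add_ns(struct timespec *a, u64 ns)
        {
                ns += a->tv_nsec;
                while (unlikely(ns >= NSEC_PER_SEC)) {
                        ns -= NSEC_PER_SEC;
                        a->tv_sec++;
                }
                a->tv_nsec = ns;
        }

GCC 4.3 replaces the subtract-in-a-loop with a 64-bit division/modulo, which on 32-bit targets is emitted as a call to libgcc's __udivdi3 (the "__udivd3" quoted above is that symbol); the kernel does not link the 64-bit libgcc division helpers, hence the linker error. Making the parameter volatile merely stops the compiler from applying that transformation, which is why both this and the mainline version are "ugly as hell" workarounds rather than real fixes.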
[ Part 2: "" ] --- include/linux/time.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/time.h =================================================================== --- linux-2.6.24.7.orig/include/linux/time.h +++ linux-2.6.24.7/include/linux/time.h @@ -169,7 +169,7 @@ extern struct timeval ns_to_timeval(cons * @a: pointer to timespec to be incremented * @ns: unsigned nanoseconds value to be added */ -static inline void timespec_add_ns(struct timespec *a, u64 ns) +static inline void timespec_add_ns(struct timespec *a, volatile u64 ns) { ns += a->tv_nsec; while(unlikely(ns >= NSEC_PER_SEC)) { ��������������������������������������������������������������������patches/trace-fix-hist-name-spellings.patch���������������������������������������������������������0000664�0000764�0000764�00000003045�11041657734�020033� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From c.emde@osadl.org Thu May 15 09:56:37 2008 Date: Tue, 01 Apr 2008 00:34:59 +0200 From: Carsten Emde <c.emde@osadl.org> To: Thomas Gleixner <tglx@linutronix.de> Subject: Label missplaced in kernel/trace/Kconfig Resent-Date: Wed, 9 Apr 2008 17:43:50 +0200 (CEST) Resent-From: tglx@linutronix.de Resent-To: Steven Rostedt <rostedt@goodmis.org> Resent-Subject: Label missplaced in kernel/trace/Kconfig Hi Thomas, there is a strange problem in kernel/trace/Kconfig: Here config INTERRUPT_OFF_HIST bool "Interrupts off critical timings histogram" and here config WAKEUP_LATENCY_HIST bool "Interrupts off critical timings histogram" the same label is used, but at the second occurrence the label should most probably read config WAKEUP_LATENCY_HIST bool "Wakeup latencies histogram" or similar. Patch see below. I called it a "strange" problem, because it is so obvious and everybody who was editing this kernel configuration, must have seen it. Any idea what happened here? Carsten. --- kernel/trace/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/trace/Kconfig =================================================================== --- linux-2.6.24.7.orig/kernel/trace/Kconfig +++ linux-2.6.24.7/kernel/trace/Kconfig @@ -150,7 +150,7 @@ config PREEMPT_OFF_HIST preemption off timings to create a histogram of latencies. 
config WAKEUP_LATENCY_HIST - bool "Interrupts off critical timings histogram" + bool "Wakeup latencies histogram" select TRACING select MARKERS help �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cache_pci_find_capability.patch�������������������������������������������������������������0000664�0000764�0000764�00000016544�11041657734�017377� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Arnaldo Carvalho de Melo <acme@redhat.com> Subject: [PATCH] Cache calls to pci_find_capability The problem here is that everytime do_irqd masks/unmasks MSI interrupts it would ask if the device has the MSI capability, and that involves multiple calls to pci_conf1_read, that can take as high as 44us in in/out calls, so I just cached results in struct pci_dev. With this patch the highest latency still is in masking/unmasking MSI interrupts, but its down to 159us in a kernel with ftrace, using the preemptirqsoff tracer: I showed it to Jesse Barnes, the PCI maintainer and he said its 2.6.27 material, and Rostedt said tglx is OK with it, so please add it to the -50 kernel-rt release. 
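
With the cache in place a caller on the irq-thread path looks the same as before, only the cost changes: the first call per capability still walks the capability list through config space, every later call is answered from struct pci_dev. A usage sketch (the function name is made up; the flag handling mirrors msi_set_enable() in the diff below):

#include <linux/pci.h>

static void example_mask_msi(struct pci_dev *pdev)
{
        u16 ctrl;
        int pos = pci_find_capability_cached(pdev, PCI_CAP_ID_MSI);

        if (!pos)                       /* device has no MSI capability */
                return;
        pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &ctrl);
        ctrl &= ~PCI_MSI_FLAGS_ENABLE;
        pci_write_config_word(pdev, pos + PCI_MSI_FLAGS, ctrl);
}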
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> --- drivers/pci/msi.c | 12 ++++++------ drivers/pci/pci.c | 36 ++++++++++++++++++++++++++++++++++++ drivers/pci/probe.c | 4 ++++ include/linux/pci.h | 3 +++ include/linux/pci_regs.h | 1 + 5 files changed, 50 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/drivers/pci/msi.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/msi.c +++ linux-2.6.24.7/drivers/pci/msi.c @@ -30,7 +30,7 @@ static void msi_set_enable(struct pci_de int pos; u16 control; - pos = pci_find_capability(dev, PCI_CAP_ID_MSI); + pos = pci_find_capability_cached(dev, PCI_CAP_ID_MSI); if (pos) { pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &control); control &= ~PCI_MSI_FLAGS_ENABLE; @@ -45,7 +45,7 @@ static void msix_set_enable(struct pci_d int pos; u16 control; - pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + pos = pci_find_capability_cached(dev, PCI_CAP_ID_MSIX); if (pos) { pci_read_config_word(dev, pos + PCI_MSIX_FLAGS, &control); control &= ~PCI_MSIX_FLAGS_ENABLE; @@ -311,7 +311,7 @@ static int msi_capability_init(struct pc msi_set_enable(dev, 0); /* Ensure msi is disabled as I set it up */ - pos = pci_find_capability(dev, PCI_CAP_ID_MSI); + pos = pci_find_capability_cached(dev, PCI_CAP_ID_MSI); pci_read_config_word(dev, msi_control_reg(pos), &control); /* MSI Entry Initialization */ entry = alloc_msi_entry(); @@ -384,7 +384,7 @@ static int msix_capability_init(struct p msix_set_enable(dev, 0);/* Ensure msix is disabled as I set it up */ - pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + pos = pci_find_capability_cached(dev, PCI_CAP_ID_MSIX); /* Request & Map MSI-X table region */ pci_read_config_word(dev, msi_control_reg(pos), &control); nr_entries = multi_msix_capable(control); @@ -491,7 +491,7 @@ static int pci_msi_check_device(struct p if (ret) return ret; - if (!pci_find_capability(dev, type)) + if (!pci_find_capability_cached(dev, type)) return -EINVAL; return 0; @@ -610,7 +610,7 @@ int pci_enable_msix(struct pci_dev* dev, if (status) return status; - pos = pci_find_capability(dev, PCI_CAP_ID_MSIX); + pos = pci_find_capability_cached(dev, PCI_CAP_ID_MSIX); pci_read_config_word(dev, msi_control_reg(pos), &control); nr_entries = multi_msix_capable(control); if (nvec > nr_entries) Index: linux-2.6.24.7/drivers/pci/pci.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/pci.c +++ linux-2.6.24.7/drivers/pci/pci.c @@ -170,6 +170,42 @@ int pci_find_capability(struct pci_dev * } /** + * pci_find_capability_cached - query for devices' capabilities, cached version + * @dev: PCI device to query + * @cap: capability code + * + * Tell if a device supports a given PCI capability. + * Returns the address of the requested capability structure within the + * device's PCI configuration space or 0 in case the device does not + * support it. 
Possible values for @cap: + * + * %PCI_CAP_ID_PM Power Management + * %PCI_CAP_ID_AGP Accelerated Graphics Port + * %PCI_CAP_ID_VPD Vital Product Data + * %PCI_CAP_ID_SLOTID Slot Identification + * %PCI_CAP_ID_MSI Message Signalled Interrupts + * %PCI_CAP_ID_CHSWP CompactPCI HotSwap + * %PCI_CAP_ID_PCIX PCI-X + * %PCI_CAP_ID_EXP PCI Express + */ +int pci_find_capability_cached(struct pci_dev *dev, int cap) +{ + int pos = 0; + + WARN_ON_ONCE(cap <= 0 || cap > PCI_CAP_LIST_NR_ENTRIES); + + if (cap <= PCI_CAP_LIST_NR_ENTRIES) { + const int i = cap - 1; + if (dev->cached_capabilities[i] == -1) + dev->cached_capabilities[i] = pci_find_capability(dev, cap); + + pos = dev->cached_capabilities[i]; + } + + return pos; +} + +/** * pci_bus_find_capability - query for devices' capabilities * @bus: the PCI bus to query * @devfn: PCI device to query Index: linux-2.6.24.7/drivers/pci/probe.c =================================================================== --- linux-2.6.24.7.orig/drivers/pci/probe.c +++ linux-2.6.24.7/drivers/pci/probe.c @@ -854,6 +854,7 @@ static void pci_release_bus_bridge_dev(s struct pci_dev *alloc_pci_dev(void) { + int i; struct pci_dev *dev; dev = kzalloc(sizeof(struct pci_dev), GFP_KERNEL); @@ -863,6 +864,9 @@ struct pci_dev *alloc_pci_dev(void) INIT_LIST_HEAD(&dev->global_list); INIT_LIST_HEAD(&dev->bus_list); + for (i = 0; i < ARRAY_SIZE(dev->cached_capabilities); ++i) + dev->cached_capabilities[i] = -1; + pci_msi_init_pci_dev(dev); return dev; Index: linux-2.6.24.7/include/linux/pci.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pci.h +++ linux-2.6.24.7/include/linux/pci.h @@ -193,6 +193,7 @@ struct pci_dev { unsigned int msix_enabled:1; unsigned int is_managed:1; unsigned int is_pcie:1; + int cached_capabilities[PCI_CAP_LIST_NR_ENTRIES]; /* See pci_find_capability_cached */ pci_dev_flags_t dev_flags; atomic_t enable_cnt; /* pci_enable_device has been called */ @@ -494,6 +495,7 @@ struct pci_dev __deprecated *pci_find_sl #endif /* CONFIG_PCI_LEGACY */ int pci_find_capability (struct pci_dev *dev, int cap); +int pci_find_capability_cached(struct pci_dev *dev, int cap); int pci_find_next_capability (struct pci_dev *dev, u8 pos, int cap); int pci_find_ext_capability (struct pci_dev *dev, int cap); int pci_find_ht_capability (struct pci_dev *dev, int ht_cap); @@ -760,6 +762,7 @@ static inline int __pci_register_driver( static inline int pci_register_driver(struct pci_driver *drv) { return 0;} static inline void pci_unregister_driver(struct pci_driver *drv) { } static inline int pci_find_capability (struct pci_dev *dev, int cap) {return 0; } +static inline int pci_find_capability_cached(struct pci_dev *dev, int cap) {return 0; } static inline int pci_find_next_capability (struct pci_dev *dev, u8 post, int cap) { return 0; } static inline int pci_find_ext_capability (struct pci_dev *dev, int cap) {return 0; } Index: linux-2.6.24.7/include/linux/pci_regs.h =================================================================== --- linux-2.6.24.7.orig/include/linux/pci_regs.h +++ linux-2.6.24.7/include/linux/pci_regs.h @@ -210,6 +210,7 @@ #define PCI_CAP_ID_AGP3 0x0E /* AGP Target PCI-PCI bridge */ #define PCI_CAP_ID_EXP 0x10 /* PCI Express */ #define PCI_CAP_ID_MSIX 0x11 /* MSI-X */ +#define PCI_CAP_LIST_NR_ENTRIES PCI_CAP_ID_MSIX #define PCI_CAP_LIST_NEXT 1 /* Next capability in the list */ #define PCI_CAP_FLAGS 2 /* Capability defined flags (16 bits) */ #define PCI_CAP_SIZEOF 4 
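
The patch above boils down to memoizing one slow lookup per device and per
capability. The following is a minimal user-space sketch of that
cache-on-first-use pattern, not the kernel code itself: struct fake_dev,
slow_find_capability() and the returned offset are invented stand-ins for
struct pci_dev, pci_find_capability() and a real config-space position.

/*
 * Illustrative user-space model of the caching idea above, not kernel code.
 * A slow lookup (standing in for pci_find_capability() walking config space)
 * runs once per capability and its result is memoized in a per-device array,
 * mirroring dev->cached_capabilities[] in the patch.
 */
#include <stdio.h>

#define CAP_LIST_NR_ENTRIES 0x11		/* models PCI_CAP_LIST_NR_ENTRIES */

struct fake_dev {
	int cached_caps[CAP_LIST_NR_ENTRIES];	/* -1 = not looked up yet */
};

/* stands in for the expensive config-space walk */
static int slow_find_capability(struct fake_dev *dev, int cap)
{
	(void)dev;
	printf("slow lookup for cap %#x\n", cap);
	return cap == 0x05 ? 0x50 : 0;		/* pretend only MSI (0x05) exists */
}

static int find_capability_cached(struct fake_dev *dev, int cap)
{
	int i = cap - 1;

	if (cap <= 0 || cap > CAP_LIST_NR_ENTRIES)
		return 0;
	if (dev->cached_caps[i] == -1)
		dev->cached_caps[i] = slow_find_capability(dev, cap);
	return dev->cached_caps[i];
}

int main(void)
{
	struct fake_dev dev;
	int i;

	for (i = 0; i < CAP_LIST_NR_ENTRIES; i++)
		dev.cached_caps[i] = -1;	/* as alloc_pci_dev() does in the patch */

	/* only the first call per capability pays the lookup cost */
	printf("MSI pos %#x\n", find_capability_cached(&dev, 0x05));
	printf("MSI pos %#x\n", find_capability_cached(&dev, 0x05));
	return 0;
}

The -1 sentinel matters: a capability that genuinely is not present (lookup
returns 0) is cached as 0 and never probed again, which is exactly the case
the MSI mask/unmask path keeps hitting.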
������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-move-update-wall-time-back-to-do-timer.patch���������������������������������������������0000664�0000764�0000764�00000002365�11041657733�022071� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt: move update_wall_time back to do timer From: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 30 Apr 2008 15:01:10 +0200 Heavy networking or high load rt tasks can starve the timer softirq. This can result in long loops in update_wall_time() once the timer softirq gets hold of the CPU again. This code runs with interrupts disabled and xtime lock write locked, so it can introduce pretty long latencies. Move update_wall_time() back into do_timer, so we avoid the looping and have a small but constant length irq off section every tick. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/timer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -1027,7 +1027,6 @@ static inline void update_times(void) ticks = jiffies - last_tick; if (ticks) { last_tick += ticks; - update_wall_time(); calc_load(ticks); } write_sequnlock_irqrestore(&xtime_lock, flags); @@ -1057,6 +1056,7 @@ static void run_timer_softirq(struct sof void do_timer(unsigned long ticks) { jiffies_64 += ticks; + update_wall_time(); } #ifdef __ARCH_WANT_SYS_ALARM ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rtmutex-lateral-steal.patch�����������������������������������������������������������������0000664�0000764�0000764�00000011226�11041657731�016525� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������allow rt-mutex lock-stealing to include lateral priority From: Gregory Haskins <ghaskins@novell.com> The current logic only allows lock stealing to occur if the current task is of higher priority than the pending owner. We can gain signficant throughput improvements (200%+) by allowing the lock-stealing code to include tasks of equal priority. The theory is that the system will make faster progress by allowing the task already on the CPU to take the lock rather than waiting for the system to wake-up a different task. This does add a degree of unfairness, yes. But also note that the users of these locks under non -rt environments have already been using unfair raw spinlocks anyway so the tradeoff is probably worth it. The way I like to think of this is that higher priority tasks should clearly preempt, and lower priority tasks should clearly block. 
However, if tasks have an identical priority value, then we can think of the scheduler decisions as the tie-breaking parameter. (e.g. tasks that the scheduler picked to run first have a logically higher priority amoung tasks of the same prio). This helps to keep the system "primed" with tasks doing useful work, and the end result is higher throughput. Thanks to Steven Rostedt for pointing out that RT tasks should be excluded to prevent the introduction of an unnatural unbounded latency. [ Steven Rostedt - removed config option to disable ] Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 17 +++++++++++------ kernel/rtmutex_common.h | 19 +++++++++++++++++++ 2 files changed, 30 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -318,7 +318,7 @@ static int rt_mutex_adjust_prio_chain(st * assigned pending owner [which might not have taken the * lock yet]: */ -static inline int try_to_steal_lock(struct rt_mutex *lock) +static inline int try_to_steal_lock(struct rt_mutex *lock, int mode) { struct task_struct *pendowner = rt_mutex_owner(lock); struct rt_mutex_waiter *next; @@ -330,7 +330,7 @@ static inline int try_to_steal_lock(stru return 1; spin_lock(&pendowner->pi_lock); - if (current->prio >= pendowner->prio) { + if (!lock_is_stealable(pendowner, mode)) { spin_unlock(&pendowner->pi_lock); return 0; } @@ -383,7 +383,7 @@ static inline int try_to_steal_lock(stru * * Must be called with lock->wait_lock held. */ -static int try_to_take_rt_mutex(struct rt_mutex *lock) +static int do_try_to_take_rt_mutex(struct rt_mutex *lock, int mode) { /* * We have to be careful here if the atomic speedups are @@ -406,7 +406,7 @@ static int try_to_take_rt_mutex(struct r */ mark_rt_mutex_waiters(lock); - if (rt_mutex_owner(lock) && !try_to_steal_lock(lock)) + if (rt_mutex_owner(lock) && !try_to_steal_lock(lock, mode)) return 0; /* We got the lock. */ @@ -419,6 +419,11 @@ static int try_to_take_rt_mutex(struct r return 1; } +static inline int try_to_take_rt_mutex(struct rt_mutex *lock) +{ + return do_try_to_take_rt_mutex(lock, STEAL_NORMAL); +} + /* * Task blocks on lock. 
* @@ -684,7 +689,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l init_lists(lock); /* Try to acquire the lock again: */ - if (try_to_take_rt_mutex(lock)) { + if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) { spin_unlock_irqrestore(&lock->wait_lock, flags); return; } @@ -707,7 +712,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l int saved_lock_depth = current->lock_depth; /* Try to acquire the lock */ - if (try_to_take_rt_mutex(lock)) + if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) break; /* * waiter.task is NULL the first time we come here and Index: linux-2.6.24.7/kernel/rtmutex_common.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex_common.h +++ linux-2.6.24.7/kernel/rtmutex_common.h @@ -121,6 +121,25 @@ extern void rt_mutex_init_proxy_locked(s extern void rt_mutex_proxy_unlock(struct rt_mutex *lock, struct task_struct *proxy_owner); + +#define STEAL_LATERAL 1 +#define STEAL_NORMAL 0 + +/* + * Note that RT tasks are excluded from lateral-steals to prevent the + * introduction of an unbounded latency + */ +static inline int lock_is_stealable(struct task_struct *pendowner, int mode) +{ + if (mode == STEAL_NORMAL || rt_task(current)) { + if (current->prio >= pendowner->prio) + return 0; + } else if (current->prio > pendowner->prio) + return 0; + + return 1; +} + #ifdef CONFIG_DEBUG_RT_MUTEXES # include "rtmutex-debug.h" #else ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rtmutex-rearrange.patch���������������������������������������������������������������������0000664�0000764�0000764�00000004257�11041657730�015746� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������rearrange rt_spin_lock_slowlock sleeping code From: Gregory Haskins <ghaskins@novell.com> The current logic makes rather coarse adjustments to current->state since it is planning on sleeping anyway. We want to eventually move to an adaptive (e.g. optional sleep) algorithm, so we tighten the scope of the adjustments to bracket the schedule(). This should yield correct behavior with or without the adaptive features that are added later in the series. We add it here as a separate patch for greater review clarity on smaller changes. 
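
As an illustration of the save/restore dance this patch tightens, here is a
small user-space model under simplified assumptions: a single atomic int
stands in for current->state, the actual sleep is elided, and the helper
mirrors the update_current() function the diff below introduces. It is a
sketch of the idea, not the kernel code.

#include <stdatomic.h>
#include <stdio.h>

enum { TASK_RUNNING, TASK_UNINTERRUPTIBLE };	/* simplified state values */

static _Atomic int task_state = TASK_RUNNING;	/* stands in for current->state */

/* models the update_current() helper added by this patch */
static void update_current(int new_state, int *saved_state)
{
	int prev = atomic_exchange(&task_state, new_state);

	/* a real wakeup raced with us; remember it so it is not lost */
	if (prev == TASK_RUNNING)
		*saved_state = TASK_RUNNING;
}

int main(void)
{
	int saved_state = atomic_load(&task_state);

	/* lock acquisition attempts run here with the state left untouched */

	update_current(TASK_UNINTERRUPTIBLE, &saved_state);
	/* schedule() would be called here, right after the state change */

	/* on the way out, restore whatever state we entered with */
	atomic_exchange(&task_state, saved_state);
	printf("state restored to %d\n", atomic_load(&task_state));
	return 0;
}

The point of the exchange is visible in update_current(): if a concurrent
wakeup already set the state to TASK_RUNNING, that fact is folded into
saved_state instead of being overwritten, so narrowing the bracket around
schedule() does not lose wakeups.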
Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/rtmutex.c | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -666,6 +666,14 @@ rt_spin_lock_fastunlock(struct rt_mutex slowfn(lock); } +static inline void +update_current(unsigned long new_state, unsigned long *saved_state) +{ + unsigned long state = xchg(¤t->state, new_state); + if (unlikely(state == TASK_RUNNING)) + *saved_state = TASK_RUNNING; +} + /* * Slow path lock function spin_lock style: this variant is very * careful not to miss any non-lock wakeups. @@ -705,7 +713,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l * saved_state accordingly. If we did not get a real wakeup * then we return with the saved state. */ - saved_state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + saved_state = current->state; for (;;) { unsigned long saved_flags; @@ -737,14 +745,15 @@ rt_spin_lock_slowlock(struct rt_mutex *l debug_rt_mutex_print_deadlock(&waiter); - schedule_rt_mutex(lock); + update_current(TASK_UNINTERRUPTIBLE, &saved_state); + if (waiter.task) + schedule_rt_mutex(lock); + else + update_current(TASK_RUNNING_MUTEX, &saved_state); spin_lock_irqsave(&lock->wait_lock, flags); current->flags |= saved_flags; current->lock_depth = saved_lock_depth; - state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); - if (unlikely(state == TASK_RUNNING)) - saved_state = TASK_RUNNING; } state = xchg(¤t->state, saved_state); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rtmutex-remove-xchg.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001412�11041657731�016213� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rtmutex - remove double xchg No reason to update current if we are running. We'll do that when we exit the loop. 
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/rtmutex.c | 2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.24.7/kernel/rtmutex.c
===================================================================
--- linux-2.6.24.7.orig/kernel/rtmutex.c
+++ linux-2.6.24.7/kernel/rtmutex.c
@@ -748,8 +748,6 @@ rt_spin_lock_slowlock(struct rt_mutex *l
 		update_current(TASK_UNINTERRUPTIBLE, &saved_state);
 		if (waiter.task)
 			schedule_rt_mutex(lock);
-		else
-			update_current(TASK_RUNNING_MUTEX, &saved_state);
 
 		spin_lock_irqsave(&lock->wait_lock, flags);
 		current->flags |= saved_flags;

patches/adaptive-spinlock-lite-v2.patch

From: Steven Rostedt <srostedt@redhat.com>
Subject: adaptive spinlocks lite

After talking with Gregory Haskins about how they implemented his version of
adaptive spinlocks, and before I actually looked at their code, I was
thinking about it while lying in bed. I always thought that adaptive
spinlocks were to spin for a short period of time based on some heuristic and
then sleep. This idea is totally bogus. No heuristic can account for a bunch
of different activities. But Gregory mentioned something to me that made a
hell of a lot of sense: only spin while the owner is running. If the owner is
running, it would seem quicker to spin than to take the scheduling hit.

While lying awake in bed, it dawned on me that we could simply spin in the
fast lock and never touch the "has waiters" flag, which would keep the owner
from going into the slow path. Also, the task itself is preemptible while
spinning, so this would not affect latencies.

The only trick was to not have the owner get freed between the time you saw
the owner and the time you checked its run queue. This was easily solved by
simply grabbing the RCU read lock, because freeing of a task must happen
after a grace period.

I first tried to stay only in the fast path. This works fine until you want
to guarantee that the highest prio task gets the lock next. I tried all sorts
of hackeries and found that there were too many cases where we could miss. I
finally concurred with Gregory, and decided that going into the slow path was
the way to go.

I then started looking into what the guys over at Novell did. They had the
basic idea correct, but went way overboard in the implementation, making it
far more complex than it needed to be. I rewrote their work using the ideas
from my original patch, and simplified it quite a bit.

This is the patch that they wanted to do ;-)

Special thanks go out to Gregory Haskins, Sven Dietrich and Peter Morreale,
for proving that adaptive spin locks certainly *can* make a difference.
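
To make the described policy concrete before the diff, here is a small
user-space model of "spin only while the owner is running". It is an
illustration only: model_lock, lock_free and owner_running are invented
stand-ins (owner_running plays the role of task_is_current(), lock_free the
role of waiter->task going NULL), and the RCU protection discussed above has
no equivalent here.

/*
 * User-space sketch of the adaptive idea, not the kernel implementation:
 * keep spinning only while the lock owner is still on a CPU, and report
 * that sleeping is the better choice as soon as it is not.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

struct model_lock {
	_Atomic int lock_free;		/* set by the releasing owner */
	_Atomic int owner_running;	/* plays the role of task_is_current(owner) */
};

/* return 1 if the waiter should really sleep, 0 if spinning got the lock */
static int adaptive_wait(struct model_lock *l)
{
	for (;;) {
		if (atomic_load(&l->lock_free))
			return 0;	/* lock released while we spun */
		if (!atomic_load(&l->owner_running))
			return 1;	/* owner went to bed, so should we */
		/* otherwise keep spinning; the kernel uses cpu_relax() here */
	}
}

static void *owner_thread(void *arg)
{
	struct model_lock *l = arg;

	usleep(1000);				/* "hold" the lock while running */
	atomic_store(&l->lock_free, 1);		/* release before leaving the CPU */
	atomic_store(&l->owner_running, 0);
	return NULL;
}

int main(void)
{
	struct model_lock l = { 0, 1 };
	pthread_t owner;

	pthread_create(&owner, NULL, owner_thread, &l);
	puts(adaptive_wait(&l) ? "would sleep" : "took the lock after spinning");
	pthread_join(owner, NULL);
	return 0;
}

In the actual patch the decision lives in adaptive_wait(), and
rt_spin_lock_slowlock() only falls through to schedule_rt_mutex() when that
function reports that the owner left the CPU.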
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/sched.h | 2 + kernel/rtmutex.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++--- kernel/sched.c | 5 +++ 3 files changed, 68 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -2217,6 +2217,8 @@ static inline void migration_init(void) } #endif +extern int task_is_current(struct task_struct *task); + #define TASK_STATE_TO_CHAR_STR "RMSDTtZX" #endif /* __KERNEL__ */ Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -8,6 +8,12 @@ * Copyright (C) 2005 Kihon Technologies Inc., Steven Rostedt * Copyright (C) 2006 Esben Nielsen * + * Adaptive Spinlocks: + * Copyright (C) 2008 Novell, Inc., Gregory Haskins, Sven Dietrich, + * and Peter Morreale, + * Adaptive Spinlocks simplification: + * Copyright (C) 2008 Red Hat, Inc., Steven Rostedt <srostedt@redhat.com> + * * See Documentation/rt-mutex-design.txt for details. */ #include <linux/spinlock.h> @@ -674,6 +680,54 @@ update_current(unsigned long new_state, *saved_state = TASK_RUNNING; } +#ifdef CONFIG_SMP +static int adaptive_wait(struct rt_mutex_waiter *waiter, + struct task_struct *orig_owner) +{ + int sleep = 0; + + for (;;) { + + /* we are the owner? */ + if (!waiter->task) + break; + + /* + * We need to read the owner of the lock and then check + * its state. But we can't let the owner task be freed + * while we read the state. We grab the rcu_lock and + * this makes sure that the owner task wont disappear + * between testing that it still has the lock, and checking + * its state. + */ + rcu_read_lock(); + /* Owner changed? Then lets update the original */ + if (orig_owner != rt_mutex_owner(waiter->lock)) { + rcu_read_unlock(); + break; + } + + /* Owner went to bed, so should we */ + if (!task_is_current(orig_owner)) { + sleep = 1; + rcu_read_unlock(); + break; + } + rcu_read_unlock(); + + cpu_relax(); + } + + return sleep; +} +#else +static int adaptive_wait(struct rt_mutex_waiter *waiter, + struct task_struct *orig_owner) +{ + return 1; +} +#endif + /* * Slow path lock function spin_lock style: this variant is very * careful not to miss any non-lock wakeups. 
@@ -689,6 +743,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l
 {
 	struct rt_mutex_waiter waiter;
 	unsigned long saved_state, state, flags;
+	struct task_struct *orig_owner;
 
 	debug_rt_mutex_init_waiter(&waiter);
 	waiter.task = NULL;
@@ -741,13 +796,16 @@ rt_spin_lock_slowlock(struct rt_mutex *l
 		saved_flags = current->flags & PF_NOSCHED;
 		current->lock_depth = -1;
 		current->flags &= ~PF_NOSCHED;
+		orig_owner = rt_mutex_owner(lock);
 		spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
-		update_current(TASK_UNINTERRUPTIBLE, &saved_state);
-		if (waiter.task)
-			schedule_rt_mutex(lock);
+		if (adaptive_wait(&waiter, orig_owner)) {
+			update_current(TASK_UNINTERRUPTIBLE, &saved_state);
+			if (waiter.task)
+				schedule_rt_mutex(lock);
+		}
 
 		spin_lock_irqsave(&lock->wait_lock, flags);
 		current->flags |= saved_flags;
Index: linux-2.6.24.7/kernel/sched.c
===================================================================
--- linux-2.6.24.7.orig/kernel/sched.c
+++ linux-2.6.24.7/kernel/sched.c
@@ -573,6 +573,11 @@ int runqueue_is_locked(void)
 	return ret;
 }
 
+int task_is_current(struct task_struct *task)
+{
+	return task_rq(task)->curr == task;
+}
+
 /*
  * Debugging: various feature bits
  */

patches/rwsems-mulitple-readers.patch

From: Steven Rostedt <srostedt@redhat.com>
Subject: add framework for multi readers on rwsems

Add the framework for multiple readers and implement the code for rwsems
first. A new structure called rw_mutex is created. This is used by PREEMPT_RT
rwsems and will later be incorporated with rwlocks. The rw_mutex lock
encapsulates the rt_mutex for use with rwsems (and later rwlocks).

This patch is just the groundwork. It simply allows multiple readers to grab
the lock. This disables PI for readers. That is, when a writer is blocked on
a rwsem with readers, it will not boost the readers. That work will be done
later in the patch series.
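
Before the diff, here is a condensed picture of the read-side fast path this
patch introduces (rt_read_fastlock() below): claim the owner word with a
compare-and-swap, bump the reader count, then re-check the owner because it
may have been cleared in between. The code is a stand-alone user-space model
under simplified assumptions, not the kernel implementation; rw_mutex_model
and "self" are invented stand-ins for struct rw_mutex and current, and the
slow path is reduced to a failure return.

/*
 * User-space sketch of the reader fast path: owner claimed by cmpxchg,
 * reader count bumped, owner re-checked to close the race the patch's own
 * comment describes ("owner was zeroed before we incremented count").
 */
#include <stdatomic.h>
#include <stdio.h>

struct rw_mutex_model {
	_Atomic(void *) owner;		/* NULL, a writer, or a reader marker */
	atomic_int count;		/* number of times held for read */
};

static int down_read_fast(struct rw_mutex_model *rwm, void *self)
{
	for (;;) {
		void *expected = NULL;

		if (!atomic_compare_exchange_strong(&rwm->owner, &expected, self))
			return 0;	/* contended: would fall into the slow path */

		atomic_fetch_add(&rwm->count, 1);
		/* owner may have been cleared before the count went up: retry */
		if (atomic_load(&rwm->owner) == self)
			return 1;
		atomic_fetch_sub(&rwm->count, 1);
	}
}

int main(void)
{
	struct rw_mutex_model rwm = { NULL, 0 };
	int me;

	printf("read lock taken: %d, readers: %d\n",
	       down_read_fast(&rwm, &me), atomic_load(&rwm.count));
	return 0;
}

The retry after the failed owner re-check mirrors the retry in
rt_read_fastlock(); everything beyond that (reader markers, pending owners,
PI interaction) is handled in the slow-path functions added by the diff.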
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/lockdep.h | 13 include/linux/rt_lock.h | 13 kernel/rt.c | 64 ---- kernel/rtmutex.c | 706 +++++++++++++++++++++++++++++++++++++++++++++++- kernel/rtmutex_common.h | 57 +++ 5 files changed, 795 insertions(+), 58 deletions(-) Index: linux-2.6.24.7/include/linux/lockdep.h =================================================================== --- linux-2.6.24.7.orig/include/linux/lockdep.h +++ linux-2.6.24.7/include/linux/lockdep.h @@ -383,6 +383,16 @@ do { \ ret; \ }) +#define LOCK_CONTENDED_RT_RW(_lock, f_try, f_lock) \ +do { \ + if (!f_try(&(_lock)->owners)) { \ + lock_contended(&(_lock)->dep_map, _RET_IP_); \ + f_lock(&(_lock)->owners); \ + } \ + lock_acquired(&(_lock)->dep_map); \ +} while (0) + + #else /* CONFIG_LOCK_STAT */ #define lock_contended(lockdep_map, ip) do {} while (0) @@ -397,6 +407,9 @@ do { \ #define LOCK_CONTENDED_RT_RET(_lock, f_try, f_lock) \ f_lock(&(_lock)->lock) +#define LOCK_CONTENDED_RT_RW(_lock, f_try, f_lock) \ + f_lock(&(_lock)->owners) + #endif /* CONFIG_LOCK_STAT */ #if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_GENERIC_HARDIRQS) Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -60,6 +60,12 @@ typedef raw_spinlock_t spinlock_t; #ifdef CONFIG_PREEMPT_RT +struct rw_mutex { + struct task_struct *owner; + struct rt_mutex mutex; + atomic_t count; /* number of times held for read */ +}; + /* * RW-semaphores are a spinlock plus a reader-depth count. * @@ -71,8 +77,7 @@ typedef raw_spinlock_t spinlock_t; * fair and makes it simpler as well: */ struct rw_semaphore { - struct rt_mutex lock; - int read_depth; + struct rw_mutex owners; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif @@ -189,7 +194,7 @@ extern int __bad_func_type(void); */ #define __RWSEM_INITIALIZER(name) \ - { .lock = __RT_MUTEX_INITIALIZER(name.lock), \ + { .owners.mutex = __RT_MUTEX_INITIALIZER(name.owners.mutex), \ RW_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(lockname) \ @@ -222,7 +227,7 @@ extern void fastcall rt_up_read(struct r extern void fastcall rt_up_write(struct rw_semaphore *rwsem); extern void fastcall rt_downgrade_write(struct rw_semaphore *rwsem); -# define rt_rwsem_is_locked(rws) (rt_mutex_is_locked(&(rws)->lock)) +# define rt_rwsem_is_locked(rws) ((rws)->owners.owner != NULL) #define PICK_RWSEM_OP(...) PICK_FUNCTION(struct compat_rw_semaphore *, \ struct rw_semaphore *, ##__VA_ARGS__) Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -301,26 +301,14 @@ EXPORT_SYMBOL(__rt_rwlock_init); void fastcall rt_up_write(struct rw_semaphore *rwsem) { rwsem_release(&rwsem->dep_map, 1, _RET_IP_); - rt_mutex_unlock(&rwsem->lock); + rt_mutex_up_write(&rwsem->owners); } EXPORT_SYMBOL(rt_up_write); void fastcall rt_up_read(struct rw_semaphore *rwsem) { - unsigned long flags; - rwsem_release(&rwsem->dep_map, 1, _RET_IP_); - /* - * Read locks within the self-held write lock succeed. 
- */ - spin_lock_irqsave(&rwsem->lock.wait_lock, flags); - if (rt_mutex_real_owner(&rwsem->lock) == current && rwsem->read_depth) { - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rwsem->read_depth--; - return; - } - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rt_mutex_unlock(&rwsem->lock); + rt_mutex_up_read(&rwsem->owners); } EXPORT_SYMBOL(rt_up_read); @@ -336,7 +324,7 @@ EXPORT_SYMBOL(rt_downgrade_write); int fastcall rt_down_write_trylock(struct rw_semaphore *rwsem) { - int ret = rt_mutex_trylock(&rwsem->lock); + int ret = rt_mutex_down_write_trylock(&rwsem->owners); if (ret) rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); @@ -344,38 +332,29 @@ int fastcall rt_down_write_trylock(struc } EXPORT_SYMBOL(rt_down_write_trylock); +static void __rt_down_write(struct rw_semaphore *rwsem, int subclass) +{ + rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_); + LOCK_CONTENDED_RT_RW(rwsem, rt_mutex_down_write_trylock, rt_mutex_down_write); +} + void fastcall rt_down_write(struct rw_semaphore *rwsem) { - rwsem_acquire(&rwsem->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); + __rt_down_write(rwsem, 0); } EXPORT_SYMBOL(rt_down_write); void fastcall rt_down_write_nested(struct rw_semaphore *rwsem, int subclass) { - rwsem_acquire(&rwsem->dep_map, subclass, 0, _RET_IP_); - LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); + __rt_down_write(rwsem, subclass); } EXPORT_SYMBOL(rt_down_write_nested); int fastcall rt_down_read_trylock(struct rw_semaphore *rwsem) { - unsigned long flags; int ret; - /* - * Read locks within the self-held write lock succeed. - */ - spin_lock_irqsave(&rwsem->lock.wait_lock, flags); - if (rt_mutex_real_owner(&rwsem->lock) == current) { - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rwsem_acquire_read(&rwsem->dep_map, 0, 1, _RET_IP_); - rwsem->read_depth++; - return 1; - } - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - - ret = rt_mutex_trylock(&rwsem->lock); + ret = rt_mutex_down_read_trylock(&rwsem->owners); if (ret) rwsem_acquire(&rwsem->dep_map, 0, 1, _RET_IP_); return ret; @@ -384,22 +363,8 @@ EXPORT_SYMBOL(rt_down_read_trylock); static void __rt_down_read(struct rw_semaphore *rwsem, int subclass) { - unsigned long flags; - rwsem_acquire_read(&rwsem->dep_map, subclass, 0, _RET_IP_); - - /* - * Read locks within the write lock succeed. 
- */ - spin_lock_irqsave(&rwsem->lock.wait_lock, flags); - - if (rt_mutex_real_owner(&rwsem->lock) == current) { - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - rwsem->read_depth++; - return; - } - spin_unlock_irqrestore(&rwsem->lock.wait_lock, flags); - LOCK_CONTENDED_RT(rwsem, rt_mutex_trylock, rt_mutex_lock); + LOCK_CONTENDED_RT_RW(rwsem, rt_mutex_down_read_trylock, rt_mutex_down_read); } void fastcall rt_down_read(struct rw_semaphore *rwsem) @@ -424,8 +389,7 @@ void fastcall __rt_rwsem_init(struct rw_ debug_check_no_locks_freed((void *)rwsem, sizeof(*rwsem)); lockdep_init_map(&rwsem->dep_map, name, key, 0); #endif - __rt_mutex_init(&rwsem->lock, name); - rwsem->read_depth = 0; + rt_mutex_rwsem_init(&rwsem->owners, name); } EXPORT_SYMBOL(__rt_rwsem_init); Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -87,6 +87,7 @@ static void fixup_rt_mutex_waiters(struc */ #if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_RT_MUTEXES) # define rt_mutex_cmpxchg(l,c,n) (cmpxchg(&l->owner, c, n) == c) +# define rt_rwlock_cmpxchg(rwm,c,n) (cmpxchg(&(rwm)->owner, c, n) == c) static inline void mark_rt_mutex_waiters(struct rt_mutex *lock) { unsigned long owner, *p = (unsigned long *) &lock->owner; @@ -95,13 +96,31 @@ static inline void mark_rt_mutex_waiters owner = *p; } while (cmpxchg(p, owner, owner | RT_MUTEX_HAS_WAITERS) != owner); } +#ifdef CONFIG_PREEMPT_RT +static inline void mark_rt_rwlock_check(struct rw_mutex *rwm) +{ + unsigned long owner, *p = (unsigned long *) &rwm->owner; + + do { + owner = *p; + } while (cmpxchg(p, owner, owner | RT_RWLOCK_CHECK) != owner); +} +#endif /* CONFIG_PREEMPT_RT */ #else # define rt_mutex_cmpxchg(l,c,n) (0) +# define rt_rwlock_cmpxchg(l,c,n) ({ (void)c; (void)n; 0; }) static inline void mark_rt_mutex_waiters(struct rt_mutex *lock) { lock->owner = (struct task_struct *) ((unsigned long)lock->owner | RT_MUTEX_HAS_WAITERS); } +#ifdef CONFIG_PREEMPT_RT +static inline void mark_rt_rwlock_check(struct rw_mutex *rwm) +{ + rwm->owner = (struct task_struct *) + ((unsigned long)rwm->owner | RT_RWLOCK_CHECK); +} +#endif /* CONFIG_PREEMPT_RT */ #endif int pi_initialized; @@ -282,6 +301,13 @@ static int rt_mutex_adjust_prio_chain(st /* Grab the next task */ task = rt_mutex_owner(lock); + + /* Writers do not boost their readers. 
*/ + if (task == RT_RW_READER) { + spin_unlock_irqrestore(&lock->wait_lock, flags); + goto out; + } + get_task_struct(task); spin_lock(&task->pi_lock); @@ -315,7 +341,7 @@ static int rt_mutex_adjust_prio_chain(st spin_unlock_irqrestore(&task->pi_lock, flags); out_put_task: put_task_struct(task); - + out: return ret; } @@ -335,6 +361,8 @@ static inline int try_to_steal_lock(stru if (pendowner == current) return 1; + WARN_ON(rt_mutex_owner(lock) == RT_RW_READER); + spin_lock(&pendowner->pi_lock); if (!lock_is_stealable(pendowner, mode)) { spin_unlock(&pendowner->pi_lock); @@ -462,6 +490,10 @@ static int task_blocks_on_rt_mutex(struc spin_unlock(¤t->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { + /* readers are not handled */ + if (owner == RT_RW_READER) + return 0; + spin_lock(&owner->pi_lock); plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters); plist_add(&waiter->pi_list_entry, &owner->pi_waiters); @@ -474,7 +506,7 @@ static int task_blocks_on_rt_mutex(struc else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) chain_walk = 1; - if (!chain_walk) + if (!chain_walk || owner == RT_RW_READER) return 0; /* @@ -574,7 +606,7 @@ static void remove_waiter(struct rt_mute current->pi_blocked_on = NULL; spin_unlock(¤t->pi_lock); - if (first && owner != current) { + if (first && owner != current && owner != RT_RW_READER) { spin_lock(&owner->pi_lock); @@ -747,6 +779,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; + waiter.write_lock = 0; spin_lock_irqsave(&lock->wait_lock, flags); init_lists(lock); @@ -964,7 +997,671 @@ __rt_spin_lock_init(spinlock_t *lock, ch } EXPORT_SYMBOL(__rt_spin_lock_init); -#endif +static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags); +static inline void rt_reacquire_bkl(int saved_lock_depth); + +static inline void +rt_rwlock_set_owner(struct rw_mutex *rwm, struct task_struct *owner, + unsigned long mask) +{ + unsigned long val = (unsigned long)owner | mask; + + rwm->owner = (struct task_struct *)val; +} + +/* + * The fast paths of the rw locks do not set up owners to + * the mutex. When blocking on an rwlock we must make sure + * there exists an owner. + */ +static void +update_rw_mutex_owner(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct task_struct *mtxowner; + + mtxowner = rt_mutex_owner(mutex); + if (mtxowner) + return; + + mtxowner = rt_rwlock_owner(rwm); + WARN_ON(!mtxowner); + if (rt_rwlock_writer(rwm)) + WARN_ON(mtxowner == RT_RW_READER); + else + mtxowner = RT_RW_READER; + rt_mutex_set_owner(mutex, mtxowner, 0); +} + +static int try_to_take_rw_read(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct rt_mutex_waiter *waiter; + struct task_struct *mtxowner; + + assert_spin_locked(&mutex->wait_lock); + + /* mark the lock to force the owner to check on release */ + mark_rt_rwlock_check(rwm); + + /* is the owner a writer? */ + if (unlikely(rt_rwlock_writer(rwm))) + return 0; + + /* A writer is not the owner, but is a writer waiting */ + mtxowner = rt_mutex_owner(mutex); + + /* if the owner released it before we marked it then take it */ + if (!mtxowner && !rt_rwlock_owner(rwm)) { + WARN_ON(atomic_read(&rwm->count)); + rt_rwlock_set_owner(rwm, current, 0); + goto taken; + } + + if (mtxowner && mtxowner != RT_RW_READER) { + if (!try_to_steal_lock(mutex)) { + /* + * readers don't own the mutex, and rwm shows that a + * writer doesn't have it either. If we enter this + * condition, then we must be pending. 
+ */ + WARN_ON(!rt_mutex_owner_pending(mutex)); + /* + * Even though we didn't steal the lock, if the owner + * is a reader, and we are of higher priority than + * any waiting writer, we might still be able to continue. + */ + if (rt_rwlock_pending_writer(rwm)) + return 0; + if (rt_mutex_has_waiters(mutex)) { + /* readers don't do PI */ + waiter = rt_mutex_top_waiter(mutex); + if (current->prio >= waiter->task->prio) + return 0; + /* + * The pending reader has PI waiters, + * but we are taking the lock. + * Remove the waiters from the pending owner. + */ + spin_lock(&mtxowner->pi_lock); + plist_del(&waiter->pi_list_entry, &mtxowner->pi_waiters); + spin_unlock(&mtxowner->pi_lock); + } + } else if (rt_mutex_has_waiters(mutex)) { + /* Readers don't do PI */ + waiter = rt_mutex_top_waiter(mutex); + spin_lock(¤t->pi_lock); + plist_del(&waiter->pi_list_entry, ¤t->pi_waiters); + spin_unlock(¤t->pi_lock); + } + /* Readers never own the mutex */ + rt_mutex_set_owner(mutex, RT_RW_READER, 0); + } + + /* RT_RW_READER forces slow paths */ + rt_rwlock_set_owner(rwm, RT_RW_READER, 0); + taken: + rt_mutex_deadlock_account_lock(mutex, current); + atomic_inc(&rwm->count); + return 1; +} + +static int +try_to_take_rw_write(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct task_struct *own; + + /* mark the lock to force the owner to check on release */ + mark_rt_rwlock_check(rwm); + + own = rt_rwlock_owner(rwm); + + /* readers or writers? */ + if ((own && !rt_rwlock_pending(rwm))) + return 0; + + WARN_ON(atomic_read(&rwm->count)); + + /* + * RT_RW_PENDING means that the lock is free, but there are + * pending owners on the mutex + */ + WARN_ON(own && !rt_mutex_owner_pending(mutex)); + + if (!try_to_take_rt_mutex(mutex)) + return 0; + + /* + * We stole the lock. Add both WRITER and CHECK flags + * since we must release the mutex. + */ + rt_rwlock_set_owner(rwm, current, RT_RWLOCK_WRITER | RT_RWLOCK_CHECK); + + return 1; +} + +static void +rt_read_slowlock(struct rw_mutex *rwm) +{ + struct rt_mutex_waiter waiter; + struct rt_mutex *mutex = &rwm->mutex; + int saved_lock_depth = -1; + unsigned long flags; + + spin_lock_irqsave(&mutex->wait_lock, flags); + init_lists(mutex); + + if (try_to_take_rw_read(rwm)) { + spin_unlock_irqrestore(&mutex->wait_lock, flags); + return; + } + update_rw_mutex_owner(rwm); + + /* Owner is a writer (or a blocked writer). Block on the lock */ + + debug_rt_mutex_init_waiter(&waiter); + waiter.task = NULL; + waiter.write_lock = 0; + + init_lists(mutex); + + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. + */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(mutex, flags); + set_current_state(TASK_UNINTERRUPTIBLE); + + for (;;) { + unsigned long saved_flags; + + /* Try to acquire the lock: */ + if (try_to_take_rw_read(rwm)) + break; + update_rw_mutex_owner(rwm); + + /* + * waiter.task is NULL the first time we come here and + * when we have been woken up by the previous owner + * but the lock got stolen by a higher prio task. + */ + if (!waiter.task) { + task_blocks_on_rt_mutex(mutex, &waiter, 0, flags); + /* Wakeup during boost ? 
*/ + if (unlikely(!waiter.task)) + continue; + } + saved_flags = current->flags & PF_NOSCHED; + current->flags &= ~PF_NOSCHED; + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + debug_rt_mutex_print_deadlock(&waiter); + + if (waiter.task) + schedule_rt_mutex(mutex); + + spin_lock_irqsave(&mutex->wait_lock, flags); + + current->flags |= saved_flags; + set_current_state(TASK_UNINTERRUPTIBLE); + } + + set_current_state(TASK_RUNNING); + + if (unlikely(waiter.task)) + remove_waiter(mutex, &waiter, flags); + + WARN_ON(rt_mutex_owner(mutex) && + rt_mutex_owner(mutex) != current && + rt_mutex_owner(mutex) != RT_RW_READER && + !rt_mutex_owner_pending(mutex)); + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + /* Must we reaquire the BKL? */ + if (unlikely(saved_lock_depth >= 0)) + rt_reacquire_bkl(saved_lock_depth); + + debug_rt_mutex_free_waiter(&waiter); +} + +static inline void +rt_read_fastlock(struct rw_mutex *rwm, + void fastcall (*slowfn)(struct rw_mutex *rwm)) +{ +retry: + if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { + rt_mutex_deadlock_account_lock(&rwm->mutex, current); + atomic_inc(&rwm->count); + /* + * It is possible that the owner was zeroed + * before we incremented count. If owner is not + * current, then retry again + */ + if (unlikely(rwm->owner != current)) { + atomic_dec(&rwm->count); + goto retry; + } + } else + slowfn(rwm); +} + +void fastcall rt_mutex_down_read(struct rw_mutex *rwm) +{ + rt_read_fastlock(rwm, rt_read_slowlock); +} + + +static inline int +rt_read_slowtrylock(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&mutex->wait_lock, flags); + init_lists(mutex); + + if (try_to_take_rw_read(rwm)) + ret = 1; + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + return ret; +} + +static inline int +rt_read_fasttrylock(struct rw_mutex *rwm, + int fastcall (*slowfn)(struct rw_mutex *rwm)) +{ +retry: + if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { + rt_mutex_deadlock_account_lock(&rwm->mutex, current); + atomic_inc(&rwm->count); + /* + * It is possible that the owner was zeroed + * before we incremented count. If owner is not + * current, then retry again + */ + if (unlikely(rwm->owner != current)) { + atomic_dec(&rwm->count); + goto retry; + } + return 1; + } else + return slowfn(rwm); +} + +int __sched rt_mutex_down_read_trylock(struct rw_mutex *rwm) +{ + return rt_read_fasttrylock(rwm, rt_read_slowtrylock); +} + +static void +rt_write_slowlock(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct rt_mutex_waiter waiter; + int saved_lock_depth = -1; + unsigned long flags; + + debug_rt_mutex_init_waiter(&waiter); + waiter.task = NULL; + + /* we do PI different for writers that are blocked */ + waiter.write_lock = 1; + + spin_lock_irqsave(&mutex->wait_lock, flags); + init_lists(mutex); + + if (try_to_take_rw_write(rwm)) { + spin_unlock_irqrestore(&mutex->wait_lock, flags); + return; + } + update_rw_mutex_owner(rwm); + + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. 
+ */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(mutex, flags); + set_current_state(TASK_UNINTERRUPTIBLE); + + for (;;) { + unsigned long saved_flags; + + /* Try to acquire the lock: */ + if (try_to_take_rw_write(rwm)) + break; + update_rw_mutex_owner(rwm); + + /* + * waiter.task is NULL the first time we come here and + * when we have been woken up by the previous owner + * but the lock got stolen by a higher prio task. + */ + if (!waiter.task) { + task_blocks_on_rt_mutex(mutex, &waiter, 0, flags); + /* Wakeup during boost ? */ + if (unlikely(!waiter.task)) + continue; + } + saved_flags = current->flags & PF_NOSCHED; + current->flags &= ~PF_NOSCHED; + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + debug_rt_mutex_print_deadlock(&waiter); + + if (waiter.task) + schedule_rt_mutex(mutex); + + spin_lock_irqsave(&mutex->wait_lock, flags); + + current->flags |= saved_flags; + set_current_state(TASK_UNINTERRUPTIBLE); + } + + set_current_state(TASK_RUNNING); + + if (unlikely(waiter.task)) + remove_waiter(mutex, &waiter, flags); + + /* check on unlock if we have any waiters. */ + if (rt_mutex_has_waiters(mutex)) + mark_rt_rwlock_check(rwm); + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + /* Must we reaquire the BKL? */ + if (unlikely(saved_lock_depth >= 0)) + rt_reacquire_bkl(saved_lock_depth); + + WARN_ON(atomic_read(&rwm->count)); + + debug_rt_mutex_free_waiter(&waiter); + +} + +static inline void +rt_write_fastlock(struct rw_mutex *rwm, + void fastcall (*slowfn)(struct rw_mutex *rwm)) +{ + unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + + if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { + rt_mutex_deadlock_account_lock(&rwm->mutex, current); + WARN_ON(atomic_read(&rwm->count)); + } else + slowfn(rwm); +} + +void fastcall rt_mutex_down_write(struct rw_mutex *rwm) +{ + rt_write_fastlock(rwm, rt_write_slowlock); +} + +static int +rt_write_slowtrylock(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&mutex->wait_lock, flags); + init_lists(mutex); + + if (try_to_take_rw_write(rwm)) + ret = 1; + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + return ret; +} + +static inline int +rt_write_fasttrylock(struct rw_mutex *rwm, + int fastcall (*slowfn)(struct rw_mutex *rwm)) +{ + unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + + if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { + rt_mutex_deadlock_account_lock(&rwm->mutex, current); + WARN_ON(atomic_read(&rwm->count)); + return 1; + } else + return slowfn(rwm); +} + +int fastcall rt_mutex_down_write_trylock(struct rw_mutex *rwm) +{ + return rt_write_fasttrylock(rwm, rt_write_slowtrylock); +} + +static void fastcall noinline __sched +rt_read_slowunlock(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + unsigned long flags; + struct rt_mutex_waiter *waiter; + + spin_lock_irqsave(&mutex->wait_lock, flags); + + rt_mutex_deadlock_account_unlock(current); + + /* + * To prevent multiple readers from zeroing out the owner + * when the count goes to zero and then have another task + * grab the task. We mark the lock. This makes all tasks + * go to the slow path. Then we can check the owner without + * worry that it changed. + */ + mark_rt_rwlock_check(rwm); + + /* + * If there are more readers, let the last one do any wakeups. 
+ * Also check to make sure the owner wasn't cleared when two + * readers released the lock at the same time, and the count + * went to zero before grabbing the wait_lock. + */ + if (atomic_read(&rwm->count) || + (rt_rwlock_owner(rwm) != current && + rt_rwlock_owner(rwm) != RT_RW_READER)) { + spin_unlock_irqrestore(&mutex->wait_lock, flags); + return; + } + + /* If no one is blocked, then clear all ownership */ + if (!rt_mutex_has_waiters(mutex)) { + /* We could still have a pending reader waiting */ + if (rt_mutex_owner_pending(mutex)) { + /* set the rwm back to pending */ + rwm->owner = RT_RW_PENDING_READ; + } else { + rwm->owner = NULL; + mutex->owner = NULL; + } + goto out; + } + + /* We are the last reader with pending waiters. */ + waiter = rt_mutex_top_waiter(mutex); + if (waiter->write_lock) + rwm->owner = RT_RW_PENDING_WRITE; + else + rwm->owner = RT_RW_PENDING_READ; + + /* + * It is possible to have a reader waiting. We still only + * wake one up in that case. A way we can have a reader waiting + * is because a writer woke up, a higher prio reader came + * and stole the lock from the writer. But the writer now + * is no longer waiting on the lock and needs to retake + * the lock. We simply wake up the reader and let the + * reader have the lock. If the writer comes by, it + * will steal the lock from the reader. This is the + * only time we can have a reader pending on a lock. + */ + wakeup_next_waiter(mutex, 0); + + out: + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + /* Undo pi boosting.when necessary */ + rt_mutex_adjust_prio(current); +} + +static inline void +rt_read_fastunlock(struct rw_mutex *rwm, + void fastcall (*slowfn)(struct rw_mutex *rwm)) +{ + WARN_ON(!atomic_read(&rwm->count)); + WARN_ON(!rwm->owner); + atomic_dec(&rwm->count); + if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) + rt_mutex_deadlock_account_unlock(current); + else + slowfn(rwm); +} + +void fastcall rt_mutex_up_read(struct rw_mutex *rwm) +{ + rt_read_fastunlock(rwm, rt_read_slowunlock); +} + +static void fastcall noinline __sched +rt_write_slowunlock(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct rt_mutex_waiter *waiter; + struct task_struct *pendowner; + unsigned long flags; + + spin_lock_irqsave(&mutex->wait_lock, flags); + + rt_mutex_deadlock_account_unlock(current); + + if (!rt_mutex_has_waiters(mutex)) { + rwm->owner = NULL; + mutex->owner = NULL; + spin_unlock_irqrestore(&mutex->wait_lock, flags); + return; + } + + debug_rt_mutex_unlock(mutex); + + /* + * This is where it gets a bit tricky. + * We can have both readers and writers waiting below us. + * They are ordered by priority. For each reader we wake + * up, we check to see if there's another reader waiting. + * If that is the case, we continue to wake up the readers + * until we hit a writer. Once we hit a writer, then we + * stop (and don't wake it up). + * + * If the next waiter is a writer, than we just wake up + * the writer and we are done. + */ + + waiter = rt_mutex_top_waiter(mutex); + pendowner = waiter->task; + wakeup_next_waiter(mutex, 0); + + /* another writer is next? */ + if (waiter->write_lock) { + rwm->owner = RT_RW_PENDING_WRITE; + goto out; + } + + rwm->owner = RT_RW_PENDING_READ; + + if (!rt_mutex_has_waiters(mutex)) + goto out; + + spin_lock(&pendowner->pi_lock); + /* + * Wake up all readers. + * This gets a bit more complex. More than one reader can't + * own the mutex. 
We give it to the first (highest prio) + * reader, and then wake up the rest of the readers until + * we wake up all readers or come to a writer. The woken + * up readers that don't own the lock will try to take it + * when they schedule. Doing this lets a high prio writer + * come along and steal the lock. + */ + waiter = rt_mutex_top_waiter(mutex); + while (waiter && !waiter->write_lock) { + struct task_struct *reader = waiter->task; + + plist_del(&waiter->list_entry, &mutex->wait_list); + + /* nop if not on a list */ + plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters); + + waiter->task = NULL; + reader->pi_blocked_on = NULL; + + wake_up_process(reader); + + if (rt_mutex_has_waiters(mutex)) + waiter = rt_mutex_top_waiter(mutex); + else + waiter = NULL; + } + + /* If a writer is still pending, then update its plist. */ + if (rt_mutex_has_waiters(mutex)) { + struct rt_mutex_waiter *next; + + next = rt_mutex_top_waiter(mutex); + /* delete incase we didn't go through the loop */ + plist_del(&next->pi_list_entry, &pendowner->pi_waiters); + /* add back in as top waiter */ + plist_add(&next->pi_list_entry, &pendowner->pi_waiters); + } + spin_unlock(&pendowner->pi_lock); + + out: + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + /* Undo pi boosting.when necessary */ + rt_mutex_adjust_prio(current); +} + +static inline void +rt_write_fastunlock(struct rw_mutex *rwm, + void fastcall (*slowfn)(struct rw_mutex *rwm)) +{ + unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + + WARN_ON(rt_rwlock_owner(rwm) != current); + if (likely(rt_rwlock_cmpxchg(rwm, (struct task_struct *)val, NULL))) + rt_mutex_deadlock_account_unlock(current); + else + slowfn(rwm); +} + +void fastcall rt_mutex_up_write(struct rw_mutex *rwm) +{ + rt_write_fastunlock(rwm, rt_write_slowunlock); +} + +void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name) +{ + struct rt_mutex *mutex = &rwm->mutex; + + rwm->owner = NULL; + atomic_set(&rwm->count, 0); + + __rt_mutex_init(mutex, name); +} + +#endif /* CONFIG_PREEMPT_RT */ #ifdef CONFIG_PREEMPT_BKL @@ -1012,6 +1709,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; + waiter.write_lock = 0; spin_lock_irqsave(&lock->wait_lock, flags); init_lists(lock); Index: linux-2.6.24.7/kernel/rtmutex_common.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex_common.h +++ linux-2.6.24.7/kernel/rtmutex_common.h @@ -13,6 +13,7 @@ #define __KERNEL_RTMUTEX_COMMON_H #include <linux/rtmutex.h> +#include <linux/rt_lock.h> /* * The rtmutex in kernel tester is independent of rtmutex debugging. 
We @@ -43,12 +44,14 @@ extern void schedule_rt_mutex_test(struc * @list_entry: pi node to enqueue into the mutex waiters list * @pi_list_entry: pi node to enqueue into the mutex owner waiters list * @task: task reference to the blocked task + * @write_lock: true if blocked as writer */ struct rt_mutex_waiter { struct plist_node list_entry; struct plist_node pi_list_entry; struct task_struct *task; struct rt_mutex *lock; + int write_lock; #ifdef CONFIG_DEBUG_RT_MUTEXES unsigned long ip; pid_t deadlock_task_pid; @@ -112,6 +115,60 @@ static inline unsigned long rt_mutex_own return (unsigned long)lock->owner & RT_MUTEX_OWNER_PENDING; } +#ifdef CONFIG_PREEMPT_RT +/* + * rw_mutex->owner state tracking + */ +#define RT_RWLOCK_CHECK 1UL +#define RT_RWLOCK_WRITER 2UL +#define RT_RWLOCK_MASKALL 3UL + +/* used as reader owner of the mutex */ +#define RT_RW_READER (struct task_struct *)0x100 + +/* used when a writer releases the lock with waiters */ +/* pending owner is a reader */ +#define RT_RW_PENDING_READ (struct task_struct *)0x200 +/* pending owner is a writer */ +#define RT_RW_PENDING_WRITE (struct task_struct *)0x400 +/* Either of the above is true */ +#define RT_RW_PENDING_MASK (0x600 | RT_RWLOCK_MASKALL) + +/* Return true if lock is not owned but has pending owners */ +static inline int rt_rwlock_pending(struct rw_mutex *rwm) +{ + unsigned long owner = (unsigned long)rwm->owner; + return (owner & RT_RW_PENDING_MASK) == owner; +} + +static inline int rt_rwlock_pending_writer(struct rw_mutex *rwm) +{ + unsigned long owner = (unsigned long)rwm->owner; + return rt_rwlock_pending(rwm) && + (owner & (unsigned long)RT_RW_PENDING_WRITE); +} + +static inline struct task_struct *rt_rwlock_owner(struct rw_mutex *rwm) +{ + return (struct task_struct *) + ((unsigned long)rwm->owner & ~RT_RWLOCK_MASKALL); +} + +static inline unsigned long rt_rwlock_writer(struct rw_mutex *rwm) +{ + return (unsigned long)rwm->owner & RT_RWLOCK_WRITER; +} + +extern void rt_mutex_up_write(struct rw_mutex *rwm); +extern void rt_mutex_up_read(struct rw_mutex *rwm); +extern int rt_mutex_down_write_trylock(struct rw_mutex *rwm); +extern void rt_mutex_down_write(struct rw_mutex *rwm); +extern int rt_mutex_down_read_trylock(struct rw_mutex *rwm); +extern void rt_mutex_down_read(struct rw_mutex *rwm); +extern void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name); + +#endif /* CONFIG_PREEMPT_RT */ + /* * PI-futex support (proxy locking functions, etc.): */ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlocks-lateral-steal.patch�����������������������������������������������������������������0000664�0000764�0000764�00000015410�11041657735�016504� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Added lateral steal for rwlocks. 
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 58 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 31 insertions(+), 27 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1033,7 +1033,7 @@ update_rw_mutex_owner(struct rw_mutex *r rt_mutex_set_owner(mutex, mtxowner, 0); } -static int try_to_take_rw_read(struct rw_mutex *rwm) +static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; @@ -1059,7 +1059,9 @@ static int try_to_take_rw_read(struct rw } if (mtxowner && mtxowner != RT_RW_READER) { - if (!try_to_steal_lock(mutex)) { + int mode = mtx ? STEAL_NORMAL : STEAL_LATERAL; + + if (!try_to_steal_lock(mutex, mode)) { /* * readers don't own the mutex, and rwm shows that a * writer doesn't have it either. If we enter this @@ -1076,7 +1078,7 @@ static int try_to_take_rw_read(struct rw if (rt_mutex_has_waiters(mutex)) { /* readers don't do PI */ waiter = rt_mutex_top_waiter(mutex); - if (current->prio >= waiter->task->prio) + if (!lock_is_stealable(waiter->task, mode)) return 0; /* * The pending reader has PI waiters, @@ -1107,7 +1109,7 @@ static int try_to_take_rw_read(struct rw } static int -try_to_take_rw_write(struct rw_mutex *rwm) +try_to_take_rw_write(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; struct task_struct *own; @@ -1129,7 +1131,7 @@ try_to_take_rw_write(struct rw_mutex *rw */ WARN_ON(own && !rt_mutex_owner_pending(mutex)); - if (!try_to_take_rt_mutex(mutex)) + if (!do_try_to_take_rt_mutex(mutex, mtx ? STEAL_NORMAL : STEAL_LATERAL)) return 0; /* @@ -1142,7 +1144,7 @@ try_to_take_rw_write(struct rw_mutex *rw } static void -rt_read_slowlock(struct rw_mutex *rwm) +rt_read_slowlock(struct rw_mutex *rwm, int mtx) { struct rt_mutex_waiter waiter; struct rt_mutex *mutex = &rwm->mutex; @@ -1152,7 +1154,7 @@ rt_read_slowlock(struct rw_mutex *rwm) spin_lock_irqsave(&mutex->wait_lock, flags); init_lists(mutex); - if (try_to_take_rw_read(rwm)) { + if (try_to_take_rw_read(rwm, mtx)) { spin_unlock_irqrestore(&mutex->wait_lock, flags); return; } @@ -1178,7 +1180,7 @@ rt_read_slowlock(struct rw_mutex *rwm) unsigned long saved_flags; /* Try to acquire the lock: */ - if (try_to_take_rw_read(rwm)) + if (try_to_take_rw_read(rwm, mtx)) break; update_rw_mutex_owner(rwm); @@ -1230,7 +1232,8 @@ rt_read_slowlock(struct rw_mutex *rwm) static inline void rt_read_fastlock(struct rw_mutex *rwm, - void fastcall (*slowfn)(struct rw_mutex *rwm)) + void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), + int mtx) { retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { @@ -1246,17 +1249,17 @@ retry: goto retry; } } else - slowfn(rwm); + slowfn(rwm, mtx); } void fastcall rt_mutex_down_read(struct rw_mutex *rwm) { - rt_read_fastlock(rwm, rt_read_slowlock); + rt_read_fastlock(rwm, rt_read_slowlock, 1); } static inline int -rt_read_slowtrylock(struct rw_mutex *rwm) +rt_read_slowtrylock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; unsigned long flags; @@ -1265,7 +1268,7 @@ rt_read_slowtrylock(struct rw_mutex *rwm spin_lock_irqsave(&mutex->wait_lock, flags); init_lists(mutex); - if (try_to_take_rw_read(rwm)) + if (try_to_take_rw_read(rwm, mtx)) ret = 1; spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1275,7 +1278,7 @@ rt_read_slowtrylock(struct rw_mutex *rwm static inline int 
rt_read_fasttrylock(struct rw_mutex *rwm, - int fastcall (*slowfn)(struct rw_mutex *rwm)) + int fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { @@ -1292,16 +1295,16 @@ retry: } return 1; } else - return slowfn(rwm); + return slowfn(rwm, mtx); } int __sched rt_mutex_down_read_trylock(struct rw_mutex *rwm) { - return rt_read_fasttrylock(rwm, rt_read_slowtrylock); + return rt_read_fasttrylock(rwm, rt_read_slowtrylock, 1); } static void -rt_write_slowlock(struct rw_mutex *rwm) +rt_write_slowlock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter waiter; @@ -1317,7 +1320,7 @@ rt_write_slowlock(struct rw_mutex *rwm) spin_lock_irqsave(&mutex->wait_lock, flags); init_lists(mutex); - if (try_to_take_rw_write(rwm)) { + if (try_to_take_rw_write(rwm, mtx)) { spin_unlock_irqrestore(&mutex->wait_lock, flags); return; } @@ -1335,7 +1338,7 @@ rt_write_slowlock(struct rw_mutex *rwm) unsigned long saved_flags; /* Try to acquire the lock: */ - if (try_to_take_rw_write(rwm)) + if (try_to_take_rw_write(rwm, mtx)) break; update_rw_mutex_owner(rwm); @@ -1389,7 +1392,8 @@ rt_write_slowlock(struct rw_mutex *rwm) static inline void rt_write_fastlock(struct rw_mutex *rwm, - void fastcall (*slowfn)(struct rw_mutex *rwm)) + void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), + int mtx) { unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; @@ -1397,16 +1401,16 @@ rt_write_fastlock(struct rw_mutex *rwm, rt_mutex_deadlock_account_lock(&rwm->mutex, current); WARN_ON(atomic_read(&rwm->count)); } else - slowfn(rwm); + slowfn(rwm, mtx); } void fastcall rt_mutex_down_write(struct rw_mutex *rwm) { - rt_write_fastlock(rwm, rt_write_slowlock); + rt_write_fastlock(rwm, rt_write_slowlock, 1); } static int -rt_write_slowtrylock(struct rw_mutex *rwm) +rt_write_slowtrylock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; unsigned long flags; @@ -1415,7 +1419,7 @@ rt_write_slowtrylock(struct rw_mutex *rw spin_lock_irqsave(&mutex->wait_lock, flags); init_lists(mutex); - if (try_to_take_rw_write(rwm)) + if (try_to_take_rw_write(rwm, mtx)) ret = 1; spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1425,7 +1429,7 @@ rt_write_slowtrylock(struct rw_mutex *rw static inline int rt_write_fasttrylock(struct rw_mutex *rwm, - int fastcall (*slowfn)(struct rw_mutex *rwm)) + int fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; @@ -1434,12 +1438,12 @@ rt_write_fasttrylock(struct rw_mutex *rw WARN_ON(atomic_read(&rwm->count)); return 1; } else - return slowfn(rwm); + return slowfn(rwm, mtx); } int fastcall rt_mutex_down_write_trylock(struct rw_mutex *rwm) { - return rt_write_fasttrylock(rwm, rt_write_slowtrylock); + return rt_write_fasttrylock(rwm, rt_write_slowtrylock, 1); } static void fastcall noinline __sched ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlocks-multiple-readers.patch��������������������������������������������������������������0000664�0000764�0000764�00000035563�11041657730�017236� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: implement rwlocks management This patch adds the managment for rwlocks to have multiple readers. Like the rwsems, it does not do PI boosting on readers when a writer is blocked. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/rt_lock.h | 5 - include/linux/spinlock.h | 2 kernel/rt.c | 56 ++---------------- kernel/rtmutex.c | 140 +++++++++++++++++++++++++++++++++++------------ kernel/rtmutex_common.h | 4 + 5 files changed, 119 insertions(+), 88 deletions(-) Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -87,8 +87,7 @@ struct rw_semaphore { * rwlocks - an RW semaphore plus lock-break field: */ typedef struct { - struct rt_mutex lock; - int read_depth; + struct rw_mutex owners; unsigned int break_lock; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; @@ -96,7 +95,7 @@ typedef struct { } rwlock_t; #define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ - { .lock = __RT_SPIN_INITIALIZER(name), \ + { .owners.mutex = __RT_SPIN_INITIALIZER(name.owners.mutex), \ RW_DEP_MAP_INIT(name) } #else /* !PREEMPT_RT */ Index: linux-2.6.24.7/include/linux/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/spinlock.h +++ linux-2.6.24.7/include/linux/spinlock.h @@ -266,7 +266,7 @@ do { \ #ifdef CONFIG_PREEMPT_RT # define rt_read_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) -# define rt_write_can_lock(rwl) (!rt_mutex_is_locked(&(rwl)->lock)) +# define rt_write_can_lock(rwl) ((rwl)->owners.owner == NULL) #else extern int rt_rwlock_can_lock_never_call_on_non_rt(rwlock_t *rwlock); # define rt_read_can_lock(rwl) rt_rwlock_can_lock_never_call_on_non_rt(rwl) Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -165,7 +165,7 @@ EXPORT_SYMBOL(_mutex_unlock); */ int __lockfunc rt_write_trylock(rwlock_t *rwlock) { - int ret = rt_mutex_trylock(&rwlock->lock); + int ret = rt_mutex_down_write_trylock(&rwlock->owners); if (ret) rwlock_acquire(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -183,23 +183,9 @@ EXPORT_SYMBOL(rt_write_trylock_irqsave); int __lockfunc rt_read_trylock(rwlock_t *rwlock) { - struct rt_mutex *lock = &rwlock->lock; - unsigned long flags; int ret; - /* - * Read locks within the self-held write lock succeed. 
- */ - spin_lock_irqsave(&lock->wait_lock, flags); - if (rt_mutex_real_owner(lock) == current) { - spin_unlock_irqrestore(&lock->wait_lock, flags); - rwlock->read_depth++; - rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); - return 1; - } - spin_unlock_irqrestore(&lock->wait_lock, flags); - - ret = rt_mutex_trylock(lock); + ret = rt_mutex_down_read_trylock(&rwlock->owners); if (ret) rwlock_acquire_read(&rwlock->dep_map, 0, 1, _RET_IP_); @@ -210,27 +196,14 @@ EXPORT_SYMBOL(rt_read_trylock); void __lockfunc rt_write_lock(rwlock_t *rwlock) { rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); - LOCK_CONTENDED_RT(rwlock, rt_mutex_trylock, __rt_spin_lock); + LOCK_CONTENDED_RT_RW(rwlock, rt_mutex_down_write_trylock, rt_rwlock_write_lock); } EXPORT_SYMBOL(rt_write_lock); void __lockfunc rt_read_lock(rwlock_t *rwlock) { - unsigned long flags; - struct rt_mutex *lock = &rwlock->lock; - rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_); - /* - * Read locks within the write lock succeed. - */ - spin_lock_irqsave(&lock->wait_lock, flags); - if (rt_mutex_real_owner(lock) == current) { - spin_unlock_irqrestore(&lock->wait_lock, flags); - rwlock->read_depth++; - return; - } - spin_unlock_irqrestore(&lock->wait_lock, flags); - LOCK_CONTENDED_RT(rwlock, rt_mutex_trylock, __rt_spin_lock); + LOCK_CONTENDED_RT_RW(rwlock, rt_mutex_down_read_trylock, rt_rwlock_read_lock); } EXPORT_SYMBOL(rt_read_lock); @@ -239,28 +212,14 @@ void __lockfunc rt_write_unlock(rwlock_t { /* NOTE: we always pass in '1' for nested, for simplicity */ rwlock_release(&rwlock->dep_map, 1, _RET_IP_); - __rt_spin_unlock(&rwlock->lock); + rt_rwlock_write_unlock(&rwlock->owners); } EXPORT_SYMBOL(rt_write_unlock); void __lockfunc rt_read_unlock(rwlock_t *rwlock) { - struct rt_mutex *lock = &rwlock->lock; - unsigned long flags; - rwlock_release(&rwlock->dep_map, 1, _RET_IP_); - // TRACE_WARN_ON(lock->save_state != 1); - /* - * Read locks within the self-held write lock succeed. - */ - spin_lock_irqsave(&lock->wait_lock, flags); - if (rt_mutex_real_owner(lock) == current && rwlock->read_depth) { - spin_unlock_irqrestore(&lock->wait_lock, flags); - rwlock->read_depth--; - return; - } - spin_unlock_irqrestore(&lock->wait_lock, flags); - __rt_spin_unlock(&rwlock->lock); + rt_rwlock_read_unlock(&rwlock->owners); } EXPORT_SYMBOL(rt_read_unlock); @@ -289,8 +248,7 @@ void __rt_rwlock_init(rwlock_t *rwlock, debug_check_no_locks_freed((void *)rwlock, sizeof(*rwlock)); lockdep_init_map(&rwlock->dep_map, name, key, 0); #endif - __rt_mutex_init(&rwlock->lock, name); - rwlock->read_depth = 0; + rt_mutex_rwsem_init(&rwlock->owners, name); } EXPORT_SYMBOL(__rt_rwlock_init); Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1149,7 +1149,7 @@ rt_read_slowlock(struct rw_mutex *rwm, i struct rt_mutex_waiter waiter; struct rt_mutex *mutex = &rwm->mutex; int saved_lock_depth = -1; - unsigned long flags; + unsigned long saved_state = -1, state, flags; spin_lock_irqsave(&mutex->wait_lock, flags); init_lists(mutex); @@ -1168,13 +1168,19 @@ rt_read_slowlock(struct rw_mutex *rwm, i init_lists(mutex); - /* - * We drop the BKL here before we go into the wait loop to avoid a - * possible deadlock in the scheduler. 
- */ - if (unlikely(current->lock_depth >= 0)) - saved_lock_depth = rt_release_bkl(mutex, flags); - set_current_state(TASK_UNINTERRUPTIBLE); + if (mtx) { + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. + */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(mutex, flags); + set_current_state(TASK_UNINTERRUPTIBLE); + } else { + /* Spin lock must preserve BKL */ + saved_state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + saved_lock_depth = current->lock_depth; + } for (;;) { unsigned long saved_flags; @@ -1197,21 +1203,36 @@ rt_read_slowlock(struct rw_mutex *rwm, i } saved_flags = current->flags & PF_NOSCHED; current->flags &= ~PF_NOSCHED; + if (!mtx) + current->lock_depth = -1; spin_unlock_irqrestore(&mutex->wait_lock, flags); debug_rt_mutex_print_deadlock(&waiter); - if (waiter.task) + if (!mtx || waiter.task) schedule_rt_mutex(mutex); spin_lock_irqsave(&mutex->wait_lock, flags); current->flags |= saved_flags; - set_current_state(TASK_UNINTERRUPTIBLE); + if (mtx) + set_current_state(TASK_UNINTERRUPTIBLE); + else { + current->lock_depth = saved_lock_depth; + state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + if (unlikely(state == TASK_RUNNING)) + saved_state = TASK_RUNNING; + } } - set_current_state(TASK_RUNNING); + if (mtx) + set_current_state(TASK_RUNNING); + else { + state = xchg(¤t->state, saved_state); + if (unlikely(state == TASK_RUNNING)) + current->state = TASK_RUNNING; + } if (unlikely(waiter.task)) remove_waiter(mutex, &waiter, flags); @@ -1224,7 +1245,7 @@ rt_read_slowlock(struct rw_mutex *rwm, i spin_unlock_irqrestore(&mutex->wait_lock, flags); /* Must we reaquire the BKL? */ - if (unlikely(saved_lock_depth >= 0)) + if (mtx && unlikely(saved_lock_depth >= 0)) rt_reacquire_bkl(saved_lock_depth); debug_rt_mutex_free_waiter(&waiter); @@ -1257,6 +1278,11 @@ void fastcall rt_mutex_down_read(struct rt_read_fastlock(rwm, rt_read_slowlock, 1); } +void fastcall rt_rwlock_read_lock(struct rw_mutex *rwm) +{ + rt_read_fastlock(rwm, rt_read_slowlock, 0); +} + static inline int rt_read_slowtrylock(struct rw_mutex *rwm, int mtx) @@ -1309,7 +1335,7 @@ rt_write_slowlock(struct rw_mutex *rwm, struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter waiter; int saved_lock_depth = -1; - unsigned long flags; + unsigned long flags, saved_state = -1, state; debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; @@ -1326,13 +1352,19 @@ rt_write_slowlock(struct rw_mutex *rwm, } update_rw_mutex_owner(rwm); - /* - * We drop the BKL here before we go into the wait loop to avoid a - * possible deadlock in the scheduler. - */ - if (unlikely(current->lock_depth >= 0)) - saved_lock_depth = rt_release_bkl(mutex, flags); - set_current_state(TASK_UNINTERRUPTIBLE); + if (mtx) { + /* + * We drop the BKL here before we go into the wait loop to avoid a + * possible deadlock in the scheduler. 
+ */ + if (unlikely(current->lock_depth >= 0)) + saved_lock_depth = rt_release_bkl(mutex, flags); + set_current_state(TASK_UNINTERRUPTIBLE); + } else { + /* Spin locks must preserve the BKL */ + saved_lock_depth = current->lock_depth; + saved_state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + } for (;;) { unsigned long saved_flags; @@ -1355,21 +1387,36 @@ rt_write_slowlock(struct rw_mutex *rwm, } saved_flags = current->flags & PF_NOSCHED; current->flags &= ~PF_NOSCHED; + if (!mtx) + current->lock_depth = -1; spin_unlock_irqrestore(&mutex->wait_lock, flags); debug_rt_mutex_print_deadlock(&waiter); - if (waiter.task) + if (!mtx || waiter.task) schedule_rt_mutex(mutex); spin_lock_irqsave(&mutex->wait_lock, flags); current->flags |= saved_flags; - set_current_state(TASK_UNINTERRUPTIBLE); + if (mtx) + set_current_state(TASK_UNINTERRUPTIBLE); + else { + current->lock_depth = saved_lock_depth; + state = xchg(¤t->state, TASK_UNINTERRUPTIBLE); + if (unlikely(state == TASK_RUNNING)) + saved_state = TASK_RUNNING; + } } - set_current_state(TASK_RUNNING); + if (mtx) + set_current_state(TASK_RUNNING); + else { + state = xchg(¤t->state, saved_state); + if (unlikely(state == TASK_RUNNING)) + current->state = TASK_RUNNING; + } if (unlikely(waiter.task)) remove_waiter(mutex, &waiter, flags); @@ -1381,7 +1428,7 @@ rt_write_slowlock(struct rw_mutex *rwm, spin_unlock_irqrestore(&mutex->wait_lock, flags); /* Must we reaquire the BKL? */ - if (unlikely(saved_lock_depth >= 0)) + if (mtx && unlikely(saved_lock_depth >= 0)) rt_reacquire_bkl(saved_lock_depth); WARN_ON(atomic_read(&rwm->count)); @@ -1409,6 +1456,11 @@ void fastcall rt_mutex_down_write(struct rt_write_fastlock(rwm, rt_write_slowlock, 1); } +void fastcall rt_rwlock_write_lock(struct rw_mutex *rwm) +{ + rt_write_fastlock(rwm, rt_write_slowlock, 0); +} + static int rt_write_slowtrylock(struct rw_mutex *rwm, int mtx) { @@ -1447,10 +1499,11 @@ int fastcall rt_mutex_down_write_trylock } static void fastcall noinline __sched -rt_read_slowunlock(struct rw_mutex *rwm) +rt_read_slowunlock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; unsigned long flags; + int savestate = !mtx; struct rt_mutex_waiter *waiter; spin_lock_irqsave(&mutex->wait_lock, flags); @@ -1510,7 +1563,7 @@ rt_read_slowunlock(struct rw_mutex *rwm) * will steal the lock from the reader. This is the * only time we can have a reader pending on a lock. 
*/ - wakeup_next_waiter(mutex, 0); + wakeup_next_waiter(mutex, savestate); out: spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1521,7 +1574,8 @@ rt_read_slowunlock(struct rw_mutex *rwm) static inline void rt_read_fastunlock(struct rw_mutex *rwm, - void fastcall (*slowfn)(struct rw_mutex *rwm)) + void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), + int mtx) { WARN_ON(!atomic_read(&rwm->count)); WARN_ON(!rwm->owner); @@ -1529,20 +1583,26 @@ rt_read_fastunlock(struct rw_mutex *rwm, if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) rt_mutex_deadlock_account_unlock(current); else - slowfn(rwm); + slowfn(rwm, mtx); } void fastcall rt_mutex_up_read(struct rw_mutex *rwm) { - rt_read_fastunlock(rwm, rt_read_slowunlock); + rt_read_fastunlock(rwm, rt_read_slowunlock, 1); +} + +void fastcall rt_rwlock_read_unlock(struct rw_mutex *rwm) +{ + rt_read_fastunlock(rwm, rt_read_slowunlock, 0); } static void fastcall noinline __sched -rt_write_slowunlock(struct rw_mutex *rwm) +rt_write_slowunlock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; struct task_struct *pendowner; + int savestate = !mtx; unsigned long flags; spin_lock_irqsave(&mutex->wait_lock, flags); @@ -1573,7 +1633,7 @@ rt_write_slowunlock(struct rw_mutex *rwm waiter = rt_mutex_top_waiter(mutex); pendowner = waiter->task; - wakeup_next_waiter(mutex, 0); + wakeup_next_waiter(mutex, savestate); /* another writer is next? */ if (waiter->write_lock) { @@ -1609,7 +1669,10 @@ rt_write_slowunlock(struct rw_mutex *rwm waiter->task = NULL; reader->pi_blocked_on = NULL; - wake_up_process(reader); + if (savestate) + wake_up_process_mutex(reader); + else + wake_up_process(reader); if (rt_mutex_has_waiters(mutex)) waiter = rt_mutex_top_waiter(mutex); @@ -1639,7 +1702,9 @@ rt_write_slowunlock(struct rw_mutex *rwm static inline void rt_write_fastunlock(struct rw_mutex *rwm, - void fastcall (*slowfn)(struct rw_mutex *rwm)) + void fastcall (*slowfn)(struct rw_mutex *rwm, + int mtx), + int mtx) { unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; @@ -1647,12 +1712,17 @@ rt_write_fastunlock(struct rw_mutex *rwm if (likely(rt_rwlock_cmpxchg(rwm, (struct task_struct *)val, NULL))) rt_mutex_deadlock_account_unlock(current); else - slowfn(rwm); + slowfn(rwm, mtx); } void fastcall rt_mutex_up_write(struct rw_mutex *rwm) { - rt_write_fastunlock(rwm, rt_write_slowunlock); + rt_write_fastunlock(rwm, rt_write_slowunlock, 1); +} + +void fastcall rt_rwlock_write_unlock(struct rw_mutex *rwm) +{ + rt_write_fastunlock(rwm, rt_write_slowunlock, 0); } void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name) Index: linux-2.6.24.7/kernel/rtmutex_common.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex_common.h +++ linux-2.6.24.7/kernel/rtmutex_common.h @@ -166,6 +166,10 @@ extern void rt_mutex_down_write(struct r extern int rt_mutex_down_read_trylock(struct rw_mutex *rwm); extern void rt_mutex_down_read(struct rw_mutex *rwm); extern void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name); +extern void rt_rwlock_write_lock(struct rw_mutex *rwm); +extern void rt_rwlock_read_lock(struct rw_mutex *rwm); +extern void rt_rwlock_write_unlock(struct rw_mutex *rwm); +extern void rt_rwlock_read_unlock(struct rw_mutex *rwm); #endif /* CONFIG_PREEMPT_RT */ 
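Note on the mtx argument threaded through the slow paths above: mtx == 1 is the
rwsem-style path (drop the BKL, plain TASK_UNINTERRUPTIBLE sleep), while mtx == 0
is the spinlock-style rwlock path, which preserves the BKL depth
(saved_lock_depth is stashed and restored around the schedule) and preserves
whatever task state the caller was in.  A condensed sketch of that state
handling, pulled out of the hunks above for readability (simplified;
try_to_take() stands in for try_to_take_rw_read()/try_to_take_rw_write()):

	unsigned long saved_state, state;

	/* block like a spinlock: remember the caller's state */
	saved_state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
	while (!try_to_take(rwm)) {
		/* ... enqueue the waiter, drop mutex->wait_lock ... */
		schedule_rt_mutex(mutex);
		/* don't lose a wakeup that raced with re-blocking */
		state = xchg(&current->state, TASK_UNINTERRUPTIBLE);
		if (unlikely(state == TASK_RUNNING))
			saved_state = TASK_RUNNING;
	}
	/* restore the pre-block state, keeping TASK_RUNNING if woken */
	state = xchg(&current->state, saved_state);
	if (unlikely(state == TASK_RUNNING))
		current->state = TASK_RUNNING;

On the unlock side the same distinction shows up as the savestate flag:
wakeup_next_waiter(mutex, savestate), and wake_up_process_mutex() instead of
wake_up_process() for the readers woken in rt_write_slowunlock() above.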
���������������������������������������������������������������������������������������������������������������������������������������������patches/multi-reader-account.patch������������������������������������������������������������������0000664�0000764�0000764�00000014210�11041657732�016306� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: map tasks to reader locks held This patch keeps track of all reader locks that are held for a task. The max depth is currently set to 5. A task may own the same lock multiple times for read without affecting this limit. It is bad programming practice to hold more than 5 different locks for read at the same time anyway so this should not be a problem. The 5 lock limit should be way more than enough. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/sched.h | 14 ++++++++++ kernel/fork.c | 4 +++ kernel/rtmutex.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++---- 3 files changed, 80 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1005,6 +1005,14 @@ struct sched_entity { #endif }; +#ifdef CONFIG_PREEMPT_RT +struct rw_mutex; +struct reader_lock_struct { + struct rw_mutex *lock; + int count; +}; + +#endif struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ void *stack; @@ -1226,6 +1234,12 @@ struct task_struct { #endif #define MAX_PREEMPT_TRACE 25 +#define MAX_RWLOCK_DEPTH 5 + +#ifdef CONFIG_PREEMPT_RT + int reader_lock_count; + struct reader_lock_struct owned_read_locks[MAX_RWLOCK_DEPTH]; +#endif #ifdef CONFIG_PREEMPT_TRACE unsigned long preempt_trace_eip[MAX_PREEMPT_TRACE]; Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1206,6 +1206,10 @@ static struct task_struct *copy_process( p->lock_count = 0; #endif +#ifdef CONFIG_PREEMPT_RT + p->reader_lock_count = 0; +#endif + if (pid != &init_struct_pid) { retval = -ENOMEM; pid = alloc_pid(task_active_pid_ns(p)); Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1038,6 +1038,8 @@ static int try_to_take_rw_read(struct rw struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; struct task_struct *mtxowner; + int reader_count, i; + int incr = 1; assert_spin_locked(&mutex->wait_lock); @@ -1048,6 +1050,16 @@ static int try_to_take_rw_read(struct rw if (unlikely(rt_rwlock_writer(rwm))) return 0; + /* check to see if we don't already own this lock */ + for (i = current->reader_lock_count - 1; i >= 0; i--) { + if (current->owned_read_locks[i].lock == rwm) { + rt_rwlock_set_owner(rwm, RT_RW_READER, 0); + current->owned_read_locks[i].count++; + incr = 0; + goto taken; + } + } + /* A writer is not the owner, but is a writer waiting */ mtxowner = rt_mutex_owner(mutex); @@ -1103,6 +1115,14 @@ static int try_to_take_rw_read(struct rw /* RT_RW_READER forces 
slow paths */ rt_rwlock_set_owner(rwm, RT_RW_READER, 0); taken: + if (incr) { + reader_count = current->reader_lock_count++; + if (likely(reader_count < MAX_RWLOCK_DEPTH)) { + current->owned_read_locks[reader_count].lock = rwm; + current->owned_read_locks[reader_count].count = 1; + } else + WARN_ON_ONCE(1); + } rt_mutex_deadlock_account_lock(mutex, current); atomic_inc(&rwm->count); return 1; @@ -1256,10 +1276,13 @@ rt_read_fastlock(struct rw_mutex *rwm, void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { -retry: + retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { + int reader_count; + rt_mutex_deadlock_account_lock(&rwm->mutex, current); atomic_inc(&rwm->count); + smp_mb(); /* * It is possible that the owner was zeroed * before we incremented count. If owner is not @@ -1269,6 +1292,13 @@ retry: atomic_dec(&rwm->count); goto retry; } + + reader_count = current->reader_lock_count++; + if (likely(reader_count < MAX_RWLOCK_DEPTH)) { + current->owned_read_locks[reader_count].lock = rwm; + current->owned_read_locks[reader_count].count = 1; + } else + WARN_ON_ONCE(1); } else slowfn(rwm, mtx); } @@ -1308,6 +1338,8 @@ rt_read_fasttrylock(struct rw_mutex *rwm { retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { + int reader_count; + rt_mutex_deadlock_account_lock(&rwm->mutex, current); atomic_inc(&rwm->count); /* @@ -1319,6 +1351,13 @@ retry: atomic_dec(&rwm->count); goto retry; } + + reader_count = current->reader_lock_count++; + if (likely(reader_count < MAX_RWLOCK_DEPTH)) { + current->owned_read_locks[reader_count].lock = rwm; + current->owned_read_locks[reader_count].count = 1; + } else + WARN_ON_ONCE(1); return 1; } else return slowfn(rwm, mtx); @@ -1502,9 +1541,10 @@ static void fastcall noinline __sched rt_read_slowunlock(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; + struct rt_mutex_waiter *waiter; unsigned long flags; int savestate = !mtx; - struct rt_mutex_waiter *waiter; + int i; spin_lock_irqsave(&mutex->wait_lock, flags); @@ -1519,6 +1559,18 @@ rt_read_slowunlock(struct rw_mutex *rwm, */ mark_rt_rwlock_check(rwm); + for (i = current->reader_lock_count - 1; i >= 0; i--) { + if (current->owned_read_locks[i].lock == rwm) { + current->owned_read_locks[i].count--; + if (!current->owned_read_locks[i].count) { + current->reader_lock_count--; + WARN_ON_ONCE(i != current->reader_lock_count); + } + break; + } + } + WARN_ON_ONCE(i < 0); + /* * If there are more readers, let the last one do any wakeups. 
* Also check to make sure the owner wasn't cleared when two @@ -1580,9 +1632,15 @@ rt_read_fastunlock(struct rw_mutex *rwm, WARN_ON(!atomic_read(&rwm->count)); WARN_ON(!rwm->owner); atomic_dec(&rwm->count); - if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) + if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) { + int reader_count = --current->reader_lock_count; rt_mutex_deadlock_account_unlock(current); - else + if (unlikely(reader_count < 0)) { + reader_count = 0; + WARN_ON_ONCE(1); + } + WARN_ON_ONCE(current->owned_read_locks[reader_count].lock != rwm); + } else slowfn(rwm, mtx); } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/multi-reader-limit.patch��������������������������������������������������������������������0000664�0000764�0000764�00000020123�11041657732�015770� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: implement reader limit on read write locks This patch allows for limiting the number of readers a lock may have. The limit is default to "no limit". The read write locks now keep track of, not only the number of times a lock is held by read, but also the number of tasks that have a reader. i.e. If 2 tasks hold the same read/write lock, and one task holds the lock twice, the count for the read/write lock would be 3 and the owner count is 2. The limit of readers is controlled by /proc/sys/kernel/rwlock_reader_limit If this is set to zero or negative, than there is no limit. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/rt_lock.h | 1 kernel/rtmutex.c | 89 +++++++++++++++++++++++++++++++++++------------- kernel/sysctl.c | 14 +++++++ 3 files changed, 80 insertions(+), 24 deletions(-) Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -64,6 +64,7 @@ struct rw_mutex { struct task_struct *owner; struct rt_mutex mutex; atomic_t count; /* number of times held for read */ + atomic_t owners; /* number of owners as readers */ }; /* Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -997,6 +997,8 @@ __rt_spin_lock_init(spinlock_t *lock, ch } EXPORT_SYMBOL(__rt_spin_lock_init); +int rt_rwlock_limit; + static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags); static inline void rt_reacquire_bkl(int saved_lock_depth); @@ -1070,6 +1072,10 @@ static int try_to_take_rw_read(struct rw goto taken; } + /* Check for rwlock limits */ + if (rt_rwlock_limit && atomic_read(&rwm->owners) >= rt_rwlock_limit) + return 0; + if (mtxowner && mtxowner != RT_RW_READER) { int mode = mtx ? 
STEAL_NORMAL : STEAL_LATERAL; @@ -1116,6 +1122,7 @@ static int try_to_take_rw_read(struct rw rt_rwlock_set_owner(rwm, RT_RW_READER, 0); taken: if (incr) { + atomic_inc(&rwm->owners); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; @@ -1293,6 +1300,7 @@ rt_read_fastlock(struct rw_mutex *rwm, goto retry; } + atomic_inc(&rwm->owners); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; @@ -1352,6 +1360,7 @@ retry: goto retry; } + atomic_inc(&rwm->owners); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; @@ -1543,6 +1552,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; unsigned long flags; + unsigned int reader_count; int savestate = !mtx; int i; @@ -1565,6 +1575,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, if (!current->owned_read_locks[i].count) { current->reader_lock_count--; WARN_ON_ONCE(i != current->reader_lock_count); + atomic_dec(&rwm->owners); } break; } @@ -1572,20 +1583,34 @@ rt_read_slowunlock(struct rw_mutex *rwm, WARN_ON_ONCE(i < 0); /* - * If there are more readers, let the last one do any wakeups. - * Also check to make sure the owner wasn't cleared when two - * readers released the lock at the same time, and the count - * went to zero before grabbing the wait_lock. + * If the last two (or more) readers unlocked at the same + * time, the owner could be cleared since the count went to + * zero. If this has happened, the rwm owner will not + * be set to current or readers. This means that another reader + * already reset the lock, so there is nothing left to do. */ - if (atomic_read(&rwm->count) || - (rt_rwlock_owner(rwm) != current && - rt_rwlock_owner(rwm) != RT_RW_READER)) { - spin_unlock_irqrestore(&mutex->wait_lock, flags); - return; - } + if ((rt_rwlock_owner(rwm) != current && + rt_rwlock_owner(rwm) != RT_RW_READER)) + goto out; + + /* + * If there are more readers and we are under the limit + * let the last reader do the wakeups. + */ + reader_count = atomic_read(&rwm->count); + if (reader_count && + (!rt_rwlock_limit || atomic_read(&rwm->owners) >= rt_rwlock_limit)) + goto out; /* If no one is blocked, then clear all ownership */ if (!rt_mutex_has_waiters(mutex)) { + /* + * If count is not zero, we are under the limit with + * no other readers. + */ + if (reader_count) + goto out; + /* We could still have a pending reader waiting */ if (rt_mutex_owner_pending(mutex)) { /* set the rwm back to pending */ @@ -1597,24 +1622,32 @@ rt_read_slowunlock(struct rw_mutex *rwm, goto out; } - /* We are the last reader with pending waiters. */ + /* + * If the next waiter is a reader, this can be because of + * two things. One is that we hit the reader limit, or + * Two, there is a pending writer. + * We still only wake up one reader at a time (even if + * we could wake up more). This is because we dont + * have any idea if a writer is pending. + */ waiter = rt_mutex_top_waiter(mutex); - if (waiter->write_lock) + if (waiter->write_lock) { + /* only wake up if there are no readers */ + if (reader_count) + goto out; rwm->owner = RT_RW_PENDING_WRITE; - else + } else { + /* + * It is also possible that the reader limit decreased. + * If the limit did decrease, we may not be able to + * wake up the reader if we are currently above the limit. 
+ */ + if (rt_rwlock_limit && + unlikely(atomic_read(&rwm->owners) >= rt_rwlock_limit)) + goto out; rwm->owner = RT_RW_PENDING_READ; + } - /* - * It is possible to have a reader waiting. We still only - * wake one up in that case. A way we can have a reader waiting - * is because a writer woke up, a higher prio reader came - * and stole the lock from the writer. But the writer now - * is no longer waiting on the lock and needs to retake - * the lock. We simply wake up the reader and let the - * reader have the lock. If the writer comes by, it - * will steal the lock from the reader. This is the - * only time we can have a reader pending on a lock. - */ wakeup_next_waiter(mutex, savestate); out: @@ -1630,15 +1663,22 @@ rt_read_fastunlock(struct rw_mutex *rwm, int mtx) { WARN_ON(!atomic_read(&rwm->count)); + WARN_ON(!atomic_read(&rwm->owners)); WARN_ON(!rwm->owner); atomic_dec(&rwm->count); if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) { int reader_count = --current->reader_lock_count; + int owners; rt_mutex_deadlock_account_unlock(current); if (unlikely(reader_count < 0)) { reader_count = 0; WARN_ON_ONCE(1); } + owners = atomic_dec_return(&rwm->owners); + if (unlikely(owners < 0)) { + atomic_set(&rwm->owners, 0); + WARN_ON_ONCE(1); + } WARN_ON_ONCE(current->owned_read_locks[reader_count].lock != rwm); } else slowfn(rwm, mtx); @@ -1789,6 +1829,7 @@ void rt_mutex_rwsem_init(struct rw_mutex rwm->owner = NULL; atomic_set(&rwm->count, 0); + atomic_set(&rwm->owners, 0); __rt_mutex_init(mutex, name); } Index: linux-2.6.24.7/kernel/sysctl.c =================================================================== --- linux-2.6.24.7.orig/kernel/sysctl.c +++ linux-2.6.24.7/kernel/sysctl.c @@ -150,6 +150,10 @@ static int parse_table(int __user *, int void __user *, size_t, struct ctl_table *); #endif +#ifdef CONFIG_PREEMPT_RT +extern int rt_rwlock_limit; +#endif + #ifdef CONFIG_PROC_SYSCTL static int proc_do_cad_pid(struct ctl_table *table, int write, struct file *filp, @@ -399,6 +403,16 @@ static struct ctl_table kern_table[] = { .proc_handler = &proc_dointvec, }, #endif +#ifdef CONFIG_PREEMPT_RT + { + .ctl_name = CTL_UNNUMBERED, + .procname = "rwlock_reader_limit", + .data = &rt_rwlock_limit, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, +#endif { .ctl_name = KERN_PANIC, .procname = "panic", ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/multi-reader-lock-account.patch�������������������������������������������������������������0000664�0000764�0000764�00000033043�11041657735�017244� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: map read/write locks back to their readers This patch adds a mapping from the read/write lock back to the owners that are readers. This is a link list of tasks that own the lock for read. 
The link list is protected by the read/write lock's mutex wait_lock. To prevent grabbing this spinlock on the fast path, the list in not updated when there is only one reader. The reader task is pointed to by the owner field of the rw_mutex. When the second reader grabs the read lock it will add the first owner to the list under the wait_lock. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/rt_lock.h | 3 include/linux/sched.h | 2 kernel/fork.c | 8 ++ kernel/rtmutex.c | 187 ++++++++++++++++++++++++++++++++++-------------- 4 files changed, 146 insertions(+), 54 deletions(-) Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -65,6 +65,7 @@ struct rw_mutex { struct rt_mutex mutex; atomic_t count; /* number of times held for read */ atomic_t owners; /* number of owners as readers */ + struct list_head readers; }; /* @@ -194,7 +195,7 @@ extern int __bad_func_type(void); */ #define __RWSEM_INITIALIZER(name) \ - { .owners.mutex = __RT_MUTEX_INITIALIZER(name.owners.mutex), \ + { .owners.mutex = __RT_MUTEX_INITIALIZER(name.owners.mutex), \ RW_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(lockname) \ Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1009,6 +1009,8 @@ struct sched_entity { struct rw_mutex; struct reader_lock_struct { struct rw_mutex *lock; + struct list_head list; + struct task_struct *task; int count; }; Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1208,6 +1208,14 @@ static struct task_struct *copy_process( #ifdef CONFIG_PREEMPT_RT p->reader_lock_count = 0; + { + int i; + for (i = 0; i < MAX_RWLOCK_DEPTH; i++) { + INIT_LIST_HEAD(&p->owned_read_locks[i].list); + p->owned_read_locks[i].count = 0; + p->owned_read_locks[i].lock = NULL; + } + } #endif if (pid != &init_struct_pid) { Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1011,6 +1011,14 @@ rt_rwlock_set_owner(struct rw_mutex *rwm rwm->owner = (struct task_struct *)val; } +static inline void init_rw_lists(struct rw_mutex *rwm) +{ + if (unlikely(!rwm->readers.prev)) { + init_lists(&rwm->mutex); + INIT_LIST_HEAD(&rwm->readers); + } +} + /* * The fast paths of the rw locks do not set up owners to * the mutex. When blocking on an rwlock we must make sure @@ -1035,11 +1043,59 @@ update_rw_mutex_owner(struct rw_mutex *r rt_mutex_set_owner(mutex, mtxowner, 0); } +/* + * The fast path does not add itself to the reader list to keep + * from needing to grab the spinlock. We need to add the owner + * itself. This may seem racy, but in practice, it is fine. + * The link list is protected by mutex->wait_lock. But to find + * the lock on the owner we need to read the owners reader counter. + * That counter is modified only by the owner. We are OK with that + * because to remove the lock that we are looking for, the owner + * must first grab the mutex->wait_lock. The lock will not disappear + * from the owner now, and we don't care if we see other locks + * held or not held. 
+ */ + +static inline void +rt_rwlock_update_owner(struct rw_mutex *rwm, unsigned owners) +{ + struct reader_lock_struct *rls; + struct task_struct *own; + int i; + + if (!owners || rt_rwlock_pending(rwm)) + return; + + own = rt_rwlock_owner(rwm); + if (own == RT_RW_READER) + return; + + for (i = own->reader_lock_count - 1; i >= 0; i--) { + if (own->owned_read_locks[i].lock == rwm) + break; + } + /* It is possible the owner didn't add it yet */ + if (i < 0) + return; + + rls = &own->owned_read_locks[i]; + /* It is also possible that the owner added it already */ + if (rls->list.prev && !list_empty(&rls->list)) + return; + + list_add(&rls->list, &rwm->readers); + + /* change to reader, so no one else updates too */ + rt_rwlock_set_owner(rwm, RT_RW_READER, RT_RWLOCK_CHECK); +} + static int try_to_take_rw_read(struct rw_mutex *rwm, int mtx) { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; + struct reader_lock_struct *rls; struct task_struct *mtxowner; + int owners; int reader_count, i; int incr = 1; @@ -1055,8 +1111,15 @@ static int try_to_take_rw_read(struct rw /* check to see if we don't already own this lock */ for (i = current->reader_lock_count - 1; i >= 0; i--) { if (current->owned_read_locks[i].lock == rwm) { + rls = ¤t->owned_read_locks[i]; + /* + * If this was taken via the fast path, then + * it hasn't been added to the link list yet. + */ + if (!rls->list.prev || list_empty(&rls->list)) + list_add(&rls->list, &rwm->readers); rt_rwlock_set_owner(rwm, RT_RW_READER, 0); - current->owned_read_locks[i].count++; + rls->count++; incr = 0; goto taken; } @@ -1067,13 +1130,16 @@ static int try_to_take_rw_read(struct rw /* if the owner released it before we marked it then take it */ if (!mtxowner && !rt_rwlock_owner(rwm)) { - WARN_ON(atomic_read(&rwm->count)); - rt_rwlock_set_owner(rwm, current, 0); + /* Still unlock with the slow path (for PI handling) */ + rt_rwlock_set_owner(rwm, RT_RW_READER, 0); goto taken; } + owners = atomic_read(&rwm->owners); + rt_rwlock_update_owner(rwm, owners); + /* Check for rwlock limits */ - if (rt_rwlock_limit && atomic_read(&rwm->owners) >= rt_rwlock_limit) + if (rt_rwlock_limit && owners >= rt_rwlock_limit) return 0; if (mtxowner && mtxowner != RT_RW_READER) { @@ -1125,8 +1191,11 @@ static int try_to_take_rw_read(struct rw atomic_inc(&rwm->owners); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { - current->owned_read_locks[reader_count].lock = rwm; - current->owned_read_locks[reader_count].count = 1; + rls = ¤t->owned_read_locks[reader_count]; + rls->lock = rwm; + rls->count = 1; + WARN_ON(rls->list.prev && !list_empty(&rls->list)); + list_add(&rls->list, &rwm->readers); } else WARN_ON_ONCE(1); } @@ -1146,12 +1215,13 @@ try_to_take_rw_write(struct rw_mutex *rw own = rt_rwlock_owner(rwm); + /* owners must be zero for writer */ + rt_rwlock_update_owner(rwm, atomic_read(&rwm->owners)); + /* readers or writers? 
*/ if ((own && !rt_rwlock_pending(rwm))) return 0; - WARN_ON(atomic_read(&rwm->count)); - /* * RT_RW_PENDING means that the lock is free, but there are * pending owners on the mutex @@ -1179,7 +1249,7 @@ rt_read_slowlock(struct rw_mutex *rwm, i unsigned long saved_state = -1, state, flags; spin_lock_irqsave(&mutex->wait_lock, flags); - init_lists(mutex); + init_rw_lists(rwm); if (try_to_take_rw_read(rwm, mtx)) { spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1193,8 +1263,6 @@ rt_read_slowlock(struct rw_mutex *rwm, i waiter.task = NULL; waiter.write_lock = 0; - init_lists(mutex); - if (mtx) { /* * We drop the BKL here before we go into the wait loop to avoid a @@ -1278,10 +1346,8 @@ rt_read_slowlock(struct rw_mutex *rwm, i debug_rt_mutex_free_waiter(&waiter); } -static inline void -rt_read_fastlock(struct rw_mutex *rwm, - void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), - int mtx) +static inline int +__rt_read_fasttrylock(struct rw_mutex *rwm) { retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { @@ -1301,13 +1367,41 @@ rt_read_fastlock(struct rw_mutex *rwm, } atomic_inc(&rwm->owners); - reader_count = current->reader_lock_count++; + reader_count = current->reader_lock_count; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; current->owned_read_locks[reader_count].count = 1; } else WARN_ON_ONCE(1); - } else + /* + * If this task is no longer the sole owner of the lock + * or someone is blocking, then we need to add the task + * to the lock. + */ + smp_mb(); + current->reader_lock_count++; + if (unlikely(rwm->owner != current)) { + struct rt_mutex *mutex = &rwm->mutex; + struct reader_lock_struct *rls; + unsigned long flags; + + spin_lock_irqsave(&mutex->wait_lock, flags); + rls = ¤t->owned_read_locks[reader_count]; + if (!rls->list.prev || list_empty(&rls->list)) + list_add(&rls->list, &rwm->readers); + spin_unlock_irqrestore(&mutex->wait_lock, flags); + } + return 1; + } + return 0; +} + +static inline void +rt_read_fastlock(struct rw_mutex *rwm, + void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), + int mtx) +{ + if (unlikely(!__rt_read_fasttrylock(rwm))) slowfn(rwm, mtx); } @@ -1330,7 +1424,7 @@ rt_read_slowtrylock(struct rw_mutex *rwm int ret = 0; spin_lock_irqsave(&mutex->wait_lock, flags); - init_lists(mutex); + init_rw_lists(rwm); if (try_to_take_rw_read(rwm, mtx)) ret = 1; @@ -1344,31 +1438,9 @@ static inline int rt_read_fasttrylock(struct rw_mutex *rwm, int fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { -retry: - if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { - int reader_count; - - rt_mutex_deadlock_account_lock(&rwm->mutex, current); - atomic_inc(&rwm->count); - /* - * It is possible that the owner was zeroed - * before we incremented count. 
If owner is not - * current, then retry again - */ - if (unlikely(rwm->owner != current)) { - atomic_dec(&rwm->count); - goto retry; - } - - atomic_inc(&rwm->owners); - reader_count = current->reader_lock_count++; - if (likely(reader_count < MAX_RWLOCK_DEPTH)) { - current->owned_read_locks[reader_count].lock = rwm; - current->owned_read_locks[reader_count].count = 1; - } else - WARN_ON_ONCE(1); + if (likely(__rt_read_fasttrylock(rwm))) return 1; - } else + else return slowfn(rwm, mtx); } @@ -1392,7 +1464,7 @@ rt_write_slowlock(struct rw_mutex *rwm, waiter.write_lock = 1; spin_lock_irqsave(&mutex->wait_lock, flags); - init_lists(mutex); + init_rw_lists(rwm); if (try_to_take_rw_write(rwm, mtx)) { spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1479,8 +1551,6 @@ rt_write_slowlock(struct rw_mutex *rwm, if (mtx && unlikely(saved_lock_depth >= 0)) rt_reacquire_bkl(saved_lock_depth); - WARN_ON(atomic_read(&rwm->count)); - debug_rt_mutex_free_waiter(&waiter); } @@ -1492,10 +1562,9 @@ rt_write_fastlock(struct rw_mutex *rwm, { unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; - if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { + if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) rt_mutex_deadlock_account_lock(&rwm->mutex, current); - WARN_ON(atomic_read(&rwm->count)); - } else + else slowfn(rwm, mtx); } @@ -1517,7 +1586,7 @@ rt_write_slowtrylock(struct rw_mutex *rw int ret = 0; spin_lock_irqsave(&mutex->wait_lock, flags); - init_lists(mutex); + init_rw_lists(rwm); if (try_to_take_rw_write(rwm, mtx)) ret = 1; @@ -1535,7 +1604,6 @@ rt_write_fasttrylock(struct rw_mutex *rw if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { rt_mutex_deadlock_account_lock(&rwm->mutex, current); - WARN_ON(atomic_read(&rwm->count)); return 1; } else return slowfn(rwm, mtx); @@ -1551,6 +1619,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; + struct reader_lock_struct *rls; unsigned long flags; unsigned int reader_count; int savestate = !mtx; @@ -1576,6 +1645,10 @@ rt_read_slowunlock(struct rw_mutex *rwm, current->reader_lock_count--; WARN_ON_ONCE(i != current->reader_lock_count); atomic_dec(&rwm->owners); + rls = ¤t->owned_read_locks[i]; + WARN_ON(!rls->list.prev || list_empty(&rls->list)); + list_del_init(&rls->list); + rls->lock = NULL; } break; } @@ -1589,9 +1662,12 @@ rt_read_slowunlock(struct rw_mutex *rwm, * be set to current or readers. This means that another reader * already reset the lock, so there is nothing left to do. 
*/ - if ((rt_rwlock_owner(rwm) != current && - rt_rwlock_owner(rwm) != RT_RW_READER)) + if (unlikely(rt_rwlock_owner(rwm) != current && + rt_rwlock_owner(rwm) != RT_RW_READER)) { + /* Update the owner if necessary */ + rt_rwlock_update_owner(rwm, atomic_read(&rwm->owners)); goto out; + } /* * If there are more readers and we are under the limit @@ -1667,6 +1743,7 @@ rt_read_fastunlock(struct rw_mutex *rwm, WARN_ON(!rwm->owner); atomic_dec(&rwm->count); if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) { + struct reader_lock_struct *rls; int reader_count = --current->reader_lock_count; int owners; rt_mutex_deadlock_account_unlock(current); @@ -1679,7 +1756,10 @@ rt_read_fastunlock(struct rw_mutex *rwm, atomic_set(&rwm->owners, 0); WARN_ON_ONCE(1); } - WARN_ON_ONCE(current->owned_read_locks[reader_count].lock != rwm); + rls = ¤t->owned_read_locks[reader_count]; + WARN_ON_ONCE(rls->lock != rwm); + WARN_ON(rls->list.prev && !list_empty(&rls->list)); + rls->lock = NULL; } else slowfn(rwm, mtx); } @@ -1830,6 +1910,7 @@ void rt_mutex_rwsem_init(struct rw_mutex rwm->owner = NULL; atomic_set(&rwm->count, 0); atomic_set(&rwm->owners, 0); + INIT_LIST_HEAD(&rwm->readers); __rt_mutex_init(mutex, name); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/multi-reader-pi.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000021722�11041657734�015272� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: read lock Priority Inheritance implementation This patch adds the priority inheritance (PI) to the read / write locks. When a task is blocked on the lock that eventually is owned by a reader in the PI chain, it will boost all the readers if they are of lower priority than the blocked task. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/init_task.h | 8 +++ include/linux/rt_lock.h | 4 + kernel/fork.c | 1 kernel/rtmutex.c | 115 ++++++++++++++++++++++++++++++++++++++++++---- 4 files changed, 118 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/include/linux/init_task.h =================================================================== --- linux-2.6.24.7.orig/include/linux/init_task.h +++ linux-2.6.24.7/include/linux/init_task.h @@ -99,6 +99,13 @@ extern struct nsproxy init_nsproxy; #define INIT_PREEMPT_RCU_BOOST(tsk) #endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ +#ifdef CONFIG_PREEMPT_RT +# define INIT_RW_OWNERS(tsk) .owned_read_locks = { \ + [0 ... 
(MAX_RWLOCK_DEPTH - 1) ] = { .task = &tsk } }, +#else +# define INIT_RW_OWNERS(tsk) +#endif + extern struct group_info init_groups; #define INIT_STRUCT_PID { \ @@ -189,6 +196,7 @@ extern struct group_info init_groups; INIT_TRACE_IRQFLAGS \ INIT_LOCKDEP \ INIT_PREEMPT_RCU_BOOST(tsk) \ + INIT_RW_OWNERS(tsk) \ } Index: linux-2.6.24.7/include/linux/rt_lock.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rt_lock.h +++ linux-2.6.24.7/include/linux/rt_lock.h @@ -13,6 +13,7 @@ #include <linux/rtmutex.h> #include <asm/atomic.h> #include <linux/spinlock_types.h> +#include <linux/sched_prio.h> #ifdef CONFIG_PREEMPT_RT /* @@ -66,6 +67,7 @@ struct rw_mutex { atomic_t count; /* number of times held for read */ atomic_t owners; /* number of owners as readers */ struct list_head readers; + int prio; }; /* @@ -98,6 +100,7 @@ typedef struct { #define __RW_LOCK_UNLOCKED(name) (rwlock_t) \ { .owners.mutex = __RT_SPIN_INITIALIZER(name.owners.mutex), \ + .owners.prio = MAX_PRIO, \ RW_DEP_MAP_INIT(name) } #else /* !PREEMPT_RT */ @@ -196,6 +199,7 @@ extern int __bad_func_type(void); #define __RWSEM_INITIALIZER(name) \ { .owners.mutex = __RT_MUTEX_INITIALIZER(name.owners.mutex), \ + .owners.prio = MAX_PRIO, \ RW_DEP_MAP_INIT(name) } #define DECLARE_RWSEM(lockname) \ Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1214,6 +1214,7 @@ static struct task_struct *copy_process( INIT_LIST_HEAD(&p->owned_read_locks[i].list); p->owned_read_locks[i].count = 0; p->owned_read_locks[i].lock = NULL; + p->owned_read_locks[i].task = p; } } #endif Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -139,6 +139,8 @@ static inline void init_lists(struct rt_ } } +static int rt_mutex_get_readers_prio(struct task_struct *task, int prio); + /* * Calculate task priority from the waiter list priority * @@ -149,6 +151,8 @@ int rt_mutex_getprio(struct task_struct { int prio = min(task->normal_prio, get_rcu_prio(task)); + prio = rt_mutex_get_readers_prio(task, prio); + if (likely(!task_has_pi_waiters(task))) return prio; @@ -191,6 +195,11 @@ static void rt_mutex_adjust_prio(struct */ int max_lock_depth = 1024; +static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock, + struct rt_mutex_waiter *orig_waiter, + struct task_struct *top_task, + struct rt_mutex *lock, + int recursion_depth); /* * Adjust the priority chain. Also used for deadlock detection. * Decreases task's usage by one - may thus free the task. @@ -200,7 +209,8 @@ static int rt_mutex_adjust_prio_chain(st int deadlock_detect, struct rt_mutex *orig_lock, struct rt_mutex_waiter *orig_waiter, - struct task_struct *top_task) + struct task_struct *top_task, + int recursion_depth) { struct rt_mutex *lock; struct rt_mutex_waiter *waiter, *top_waiter = orig_waiter; @@ -302,8 +312,13 @@ static int rt_mutex_adjust_prio_chain(st /* Grab the next task */ task = rt_mutex_owner(lock); - /* Writers do not boost their readers. */ + /* + * Readers are special. We may need to boost more than one owner. 
+ */ if (task == RT_RW_READER) { + ret = rt_mutex_adjust_readers(orig_lock, orig_waiter, + top_task, lock, + recursion_depth); spin_unlock_irqrestore(&lock->wait_lock, flags); goto out; } @@ -490,9 +505,12 @@ static int task_blocks_on_rt_mutex(struc spin_unlock(¤t->pi_lock); if (waiter == rt_mutex_top_waiter(lock)) { - /* readers are not handled */ - if (owner == RT_RW_READER) - return 0; + /* readers are handled differently */ + if (owner == RT_RW_READER) { + res = rt_mutex_adjust_readers(lock, waiter, + current, lock, 0); + return res; + } spin_lock(&owner->pi_lock); plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters); @@ -519,7 +537,7 @@ static int task_blocks_on_rt_mutex(struc spin_unlock_irqrestore(&lock->wait_lock, flags); res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter, - current); + current, 0); spin_lock_irq(&lock->wait_lock); @@ -636,7 +654,7 @@ static void remove_waiter(struct rt_mute spin_unlock_irqrestore(&lock->wait_lock, flags); - rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current); + rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current, 0); spin_lock_irq(&lock->wait_lock); } @@ -663,7 +681,7 @@ void rt_mutex_adjust_pi(struct task_stru get_task_struct(task); spin_unlock_irqrestore(&task->pi_lock, flags); - rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task); + rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task, 0); } /* @@ -1160,7 +1178,6 @@ static int try_to_take_rw_read(struct rw if (rt_rwlock_pending_writer(rwm)) return 0; if (rt_mutex_has_waiters(mutex)) { - /* readers don't do PI */ waiter = rt_mutex_top_waiter(mutex); if (!lock_is_stealable(waiter->task, mode)) return 0; @@ -1174,7 +1191,7 @@ static int try_to_take_rw_read(struct rw spin_unlock(&mtxowner->pi_lock); } } else if (rt_mutex_has_waiters(mutex)) { - /* Readers don't do PI */ + /* Readers do things differently with respect to PI */ waiter = rt_mutex_top_waiter(mutex); spin_lock(¤t->pi_lock); plist_del(&waiter->pi_list_entry, ¤t->pi_waiters); @@ -1680,6 +1697,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, /* If no one is blocked, then clear all ownership */ if (!rt_mutex_has_waiters(mutex)) { + rwm->prio = MAX_PRIO; /* * If count is not zero, we are under the limit with * no other readers. @@ -1910,11 +1928,88 @@ void rt_mutex_rwsem_init(struct rw_mutex rwm->owner = NULL; atomic_set(&rwm->count, 0); atomic_set(&rwm->owners, 0); + rwm->prio = MAX_PRIO; INIT_LIST_HEAD(&rwm->readers); __rt_mutex_init(mutex, name); } +static int rt_mutex_get_readers_prio(struct task_struct *task, int prio) +{ + struct reader_lock_struct *rls; + struct rw_mutex *rwm; + int lock_prio; + int i; + + for (i = 0; i < task->reader_lock_count; i++) { + rls = &task->owned_read_locks[i]; + rwm = rls->lock; + if (rwm) { + lock_prio = rwm->prio; + if (prio > lock_prio) + prio = lock_prio; + } + } + + return prio; +} + +static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock, + struct rt_mutex_waiter *orig_waiter, + struct task_struct *top_task, + struct rt_mutex *lock, + int recursion_depth) +{ + struct reader_lock_struct *rls; + struct rt_mutex_waiter *waiter; + struct task_struct *task; + struct rw_mutex *rwm = container_of(lock, struct rw_mutex, mutex); + + if (rt_mutex_has_waiters(lock)) { + waiter = rt_mutex_top_waiter(lock); + /* + * Do we need to grab the task->pi_lock? + * Really, we are only reading it. If it + * changes, then that should follow this chain + * too. 
+ */ + rwm->prio = waiter->task->prio; + } else + rwm->prio = MAX_PRIO; + + if (recursion_depth >= MAX_RWLOCK_DEPTH) { + WARN_ON(1); + return 1; + } + + list_for_each_entry(rls, &rwm->readers, list) { + task = rls->task; + get_task_struct(task); + /* + * rt_mutex_adjust_prio_chain will do + * the put_task_struct + */ + rt_mutex_adjust_prio_chain(task, 0, orig_lock, + orig_waiter, top_task, + recursion_depth+1); + } + + return 0; +} +#else +static int rt_mutex_adjust_readers(struct rt_mutex *orig_lock, + struct rt_mutex_waiter *orig_waiter, + struct task_struct *top_task, + struct rt_mutex *lock, + int recursion_depth) +{ + return 0; +} + +static int rt_mutex_get_readers_prio(struct task_struct *task, int prio) +{ + return prio; +} #endif /* CONFIG_PREEMPT_RT */ #ifdef CONFIG_PREEMPT_BKL ����������������������������������������������patches/rwlocks-default-nr-readers-nr-cpus.patch����������������������������������������������������0000664�0000764�0000764�00000001320�11041657734�021015� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Limit the number of readers to number of CPUS by default. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1015,7 +1015,7 @@ __rt_spin_lock_init(spinlock_t *lock, ch } EXPORT_SYMBOL(__rt_spin_lock_init); -int rt_rwlock_limit; +int rt_rwlock_limit = NR_CPUS; static inline int rt_release_bkl(struct rt_mutex *lock, unsigned long flags); static inline void rt_reacquire_bkl(int saved_lock_depth); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-typecast-cmpxchg.patch���������������������������������������������������������������0000664�0000764�0000764�00000002644�11041657734�017056� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rtmutex.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1577,7 +1577,7 @@ rt_write_fastlock(struct rw_mutex *rwm, void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { - unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) rt_mutex_deadlock_account_lock(&rwm->mutex, current); @@ -1617,7 +1617,7 @@ static inline int 
rt_write_fasttrylock(struct rw_mutex *rwm, int fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { - unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { rt_mutex_deadlock_account_lock(&rwm->mutex, current); @@ -1902,7 +1902,7 @@ rt_write_fastunlock(struct rw_mutex *rwm int mtx), int mtx) { - unsigned long val = (unsigned long)current | RT_RWLOCK_WRITER; + struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); WARN_ON(rt_rwlock_owner(rwm) != current); if (likely(rt_rwlock_cmpxchg(rwm, (struct task_struct *)val, NULL))) ��������������������������������������������������������������������������������������������patches/rwlock-implement-downgrade-write.patch������������������������������������������������������0000664�0000764�0000764�00000011016�11041657735�020657� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlocks multi downgrade write This patch implements the rwsem downgrade_write for the RT rwlocks multiple readers code. The code needs to be careful to set up the lock the same way a reader would. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rt.c | 3 - kernel/rtmutex.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/rtmutex_common.h | 1 3 files changed, 101 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rt.c +++ linux-2.6.24.7/kernel/rt.c @@ -272,11 +272,10 @@ EXPORT_SYMBOL(rt_up_read); /* * downgrade a write lock into a read lock - * - just wake up any readers at the front of the queue */ void fastcall rt_downgrade_write(struct rw_semaphore *rwsem) { - BUG(); + rt_mutex_downgrade_write(&rwsem->owners); } EXPORT_SYMBOL(rt_downgrade_write); Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1921,6 +1921,105 @@ void fastcall rt_rwlock_write_unlock(str rt_write_fastunlock(rwm, rt_write_slowunlock, 0); } +/* + * We own the lock for write, and we want to convert it to a read, + * so we simply take the lock as read, and wake up all other readers. 
+ */ +void fastcall __sched +rt_mutex_downgrade_write(struct rw_mutex *rwm) +{ + struct rt_mutex *mutex = &rwm->mutex; + struct reader_lock_struct *rls; + struct rt_mutex_waiter *waiter; + unsigned long flags; + int reader_count; + + spin_lock_irqsave(&mutex->wait_lock, flags); + init_rw_lists(rwm); + + /* we have the lock and are sole owner, then update the accounting */ + atomic_inc(&rwm->count); + atomic_inc(&rwm->owners); + reader_count = current->reader_lock_count++; + rls = ¤t->owned_read_locks[reader_count]; + if (likely(reader_count < MAX_RWLOCK_DEPTH)) { + rls->lock = rwm; + rls->count = 1; + } else + WARN_ON_ONCE(1); + + if (!rt_mutex_has_waiters(mutex)) { + /* We are sole owner, we are done */ + rwm->owner = current; + rwm->prio = MAX_PRIO; + mutex->owner = NULL; + spin_unlock_irqrestore(&mutex->wait_lock, flags); + return; + } + + /* Set us up for multiple readers or conflicts */ + + list_add(&rls->list, &rwm->readers); + rwm->owner = RT_RW_READER; + + /* + * This is like the write unlock, but we already own the + * reader. We still want to wake up other readers that are + * waiting, until we hit the reader limit, or a writer. + */ + + spin_lock(¤t->pi_lock); + waiter = rt_mutex_top_waiter(mutex); + while (waiter && !waiter->write_lock) { + struct task_struct *reader = waiter->task; + + plist_del(&waiter->list_entry, &mutex->wait_list); + + /* nop if not on a list */ + plist_del(&waiter->pi_list_entry, ¤t->pi_waiters); + + waiter->task = NULL; + reader->pi_blocked_on = NULL; + + /* downgrade is only for mutexes */ + wake_up_process(reader); + + if (rt_mutex_has_waiters(mutex)) + waiter = rt_mutex_top_waiter(mutex); + else + waiter = NULL; + } + + /* If a writer is still pending, then update its plist. */ + if (rt_mutex_has_waiters(mutex)) { + struct rt_mutex_waiter *next; + + next = rt_mutex_top_waiter(mutex); + + /* setup this mutex prio for read */ + rwm->prio = next->task->prio; + + /* delete incase we didn't go through the loop */ + plist_del(&next->pi_list_entry, ¤t->pi_waiters); + /* add back in as top waiter */ + plist_add(&next->pi_list_entry, ¤t->pi_waiters); + } else + rwm->prio = MAX_PRIO; + + spin_unlock(¤t->pi_lock); + + rt_mutex_set_owner(mutex, RT_RW_READER, 0); + + spin_unlock_irqrestore(&mutex->wait_lock, flags); + + /* + * Undo pi boosting when necessary. + * If one of the awoken readers boosted us, we don't want to keep + * that priority. 
+ */ + rt_mutex_adjust_prio(current); +} + void rt_mutex_rwsem_init(struct rw_mutex *rwm, const char *name) { struct rt_mutex *mutex = &rwm->mutex; Index: linux-2.6.24.7/kernel/rtmutex_common.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex_common.h +++ linux-2.6.24.7/kernel/rtmutex_common.h @@ -170,6 +170,7 @@ extern void rt_rwlock_write_lock(struct extern void rt_rwlock_read_lock(struct rw_mutex *rwm); extern void rt_rwlock_write_unlock(struct rw_mutex *rwm); extern void rt_rwlock_read_unlock(struct rw_mutex *rwm); +extern void rt_mutex_downgrade_write(struct rw_mutex *rwm); #endif /* CONFIG_PREEMPT_RT */ ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-nr-migrate-lower-default-preempt-rt.patch���������������������������������������������0000664�0000764�0000764�00000001721�11041657730�022261� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> With the added boost of SCHED_OTHER, the balancing load starts to stain latencies. Bring it back down again. NOTE: This is a workaround, we need to fix this because this work around will once again hurt SCHED_OTHER performance. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/sched.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -602,7 +602,11 @@ const_debug unsigned int sysctl_sched_fe * Number of tasks to iterate in a single balance run. * Limited because this is done with IRQs disabled. 
*/ +#ifdef CONFIG_PREEMPT_RT +const_debug unsigned int sysctl_sched_nr_migrate = 8; +#else const_debug unsigned int sysctl_sched_nr_migrate = 32; +#endif /* * For kernel-internal use: high-speed (but slightly incorrect) per-cpu �����������������������������������������������patches/arm-fix-compile-error-trace-exit-idle.patch�������������������������������������������������0000664�0000764�0000764�00000001714�11041657731�021364� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������This patch fixes this compile error in 2.6.24.7-rt8: CC arch/arm/kernel/process.o arch/arm/kernel/process.c: In function 'cpu_idle': arch/arm/kernel/process.c:175: error: implicit declaration of function 'trace_preempt_exit_idle' arch/arm/kernel/process.c:180: error: implicit declaration of function 'trace_preempt_enter_idle' Signed-off-by: Remy Bohmer <linux@bohmer.net> --- arch/arm/kernel/process.c | 2 -- 1 file changed, 2 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -172,12 +172,10 @@ void cpu_idle(void) idle(); leds_event(led_idle_end); local_irq_disable(); - trace_preempt_exit_idle(); tick_nohz_restart_sched_tick(); __preempt_enable_no_resched(); __schedule(); preempt_disable(); - trace_preempt_enter_idle(); local_irq_enable(); } } ����������������������������������������������������patches/sched-wake_up_idle_cpu-rt.patch�������������������������������������������������������������0000664�0000764�0000764�00000000657�11041657732�017302� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -887,7 +887,7 @@ void wake_up_idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - if (cpu == smp_processor_id()) + if (cpu == raw_smp_processor_id()) return; /* ���������������������������������������������������������������������������������patches/sched_load_balance_flags.patch��������������������������������������������������������������0000664�0000764�0000764�00000016374�11041673116�017177� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: load_balance flags From: Peter Zijlstra <a.p.zijlstra@chello.nl> Change all_pinned into a flags field that is passed around the load balance routines. This is done because we need more state in the following patches. 
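A minimal standalone sketch of the pattern this patch introduces, assuming nothing from the kernel sources (try_migrate() and balance() are hypothetical stand-ins, not the real can_migrate_task()/balance_tasks()): a single boolean out-parameter is widened into a bit-flags word so later patches can thread extra state through the same argument.

#define LB_ALL_PINNED	0x01	/* no candidate task could be migrated */

/* hypothetical stand-in for can_migrate_task() */
static int try_migrate(int movable, int *lb_flags)
{
	if (movable)
		*lb_flags &= ~LB_ALL_PINNED;	/* clears a bit instead of "*all_pinned = 0" */
	return movable;
}

/* hypothetical stand-in for the balance loop */
static int balance(int movable, int *lb_flags)
{
	*lb_flags |= LB_ALL_PINNED;	/* assume the worst until a task proves movable */
	return try_migrate(movable, lb_flags);
}

The caller then tests lb_flags & LB_ALL_PINNED where it used to test a dedicated all_pinned variable, and further bits can be added later without touching any prototype again.
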
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 39 +++++++++++++++++++++------------------ kernel/sched_fair.c | 4 ++-- kernel/sched_idletask.c | 2 +- kernel/sched_rt.c | 2 +- 4 files changed, 25 insertions(+), 22 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1043,7 +1043,7 @@ struct rq_iterator { static unsigned long balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, - enum cpu_idle_type idle, int *all_pinned, + enum cpu_idle_type idle, int *lb_flags, int *this_best_prio, struct rq_iterator *iterator); static int @@ -2269,6 +2269,8 @@ static void update_cpu_load(struct rq *t #ifdef CONFIG_SMP +#define LB_ALL_PINNED 0x01 + /* * double_rq_lock - safely lock two runqueues * @@ -2411,7 +2413,7 @@ static void pull_task(struct rq *src_rq, static int can_migrate_task(struct task_struct *p, struct rq *rq, int this_cpu, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned) + int *lb_flags) { /* * We do not migrate tasks that are: @@ -2423,7 +2425,7 @@ int can_migrate_task(struct task_struct schedstat_inc(p, se.nr_failed_migrations_affine); return 0; } - *all_pinned = 0; + *lb_flags &= ~LB_ALL_PINNED; if (task_running(rq, p)) { schedstat_inc(p, se.nr_failed_migrations_running); @@ -2457,7 +2459,7 @@ int can_migrate_task(struct task_struct static unsigned long balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, - enum cpu_idle_type idle, int *all_pinned, + enum cpu_idle_type idle, int *lb_flags, int *this_best_prio, struct rq_iterator *iterator) { int loops = 0, pulled = 0, pinned = 0, skip_for_load; @@ -2467,12 +2469,12 @@ balance_tasks(struct rq *this_rq, int th if (max_load_move == 0) goto out; - pinned = 1; - /* * Start the load-balancing iterator: */ p = iterator->start(iterator->arg); + if (p) + pinned = 1; next: if (!p || loops++ > sysctl_sched_nr_migrate) goto out; @@ -2510,8 +2512,8 @@ out: */ schedstat_add(sd, lb_gained[idle], pulled); - if (all_pinned) - *all_pinned = pinned; + if (pinned) + *lb_flags |= LB_ALL_PINNED; return max_load_move - rem_load_move; } @@ -2526,7 +2528,7 @@ out: static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned) + int *lb_flags) { const struct sched_class *class = sched_class_highest; unsigned long total_load_moved = 0; @@ -2536,7 +2538,7 @@ static int move_tasks(struct rq *this_rq total_load_moved += class->load_balance(this_rq, this_cpu, busiest, max_load_move - total_load_moved, - sd, idle, all_pinned, &this_best_prio); + sd, idle, lb_flags, &this_best_prio); class = class->next; } while (class && max_load_move > total_load_moved); @@ -2938,7 +2940,7 @@ static int load_balance(int this_cpu, st struct sched_domain *sd, enum cpu_idle_type idle, int *balance) { - int ld_moved, all_pinned = 0, active_balance = 0, sd_idle = 0; + int ld_moved, lb_flags = 0, active_balance = 0, sd_idle = 0; struct sched_group *group; unsigned long imbalance; struct rq *busiest; @@ -2990,7 +2992,7 @@ redo: local_irq_save(flags); double_rq_lock(this_rq, busiest); ld_moved = move_tasks(this_rq, this_cpu, busiest, - imbalance, sd, idle, &all_pinned); + imbalance, sd, idle, &lb_flags); double_rq_unlock(this_rq, busiest); 
local_irq_restore(flags); @@ -3001,7 +3003,7 @@ redo: resched_cpu(this_cpu); /* All tasks on this runqueue were pinned by CPU affinity */ - if (unlikely(all_pinned)) { + if (unlikely(lb_flags & LB_ALL_PINNED)) { cpu_clear(cpu_of(busiest), cpus); if (!cpus_empty(cpus)) goto redo; @@ -3022,7 +3024,7 @@ redo: */ if (!cpu_isset(this_cpu, busiest->curr->cpus_allowed)) { spin_unlock_irqrestore(&busiest->lock, flags); - all_pinned = 1; + lb_flags |= LB_ALL_PINNED; goto out_one_pinned; } @@ -3070,7 +3072,8 @@ out_balanced: out_one_pinned: /* tune up the balancing interval */ - if ((all_pinned && sd->balance_interval < MAX_PINNED_INTERVAL) || + if (((lb_flags & LB_ALL_PINNED) && + sd->balance_interval < MAX_PINNED_INTERVAL) || (sd->balance_interval < sd->max_interval)) sd->balance_interval *= 2; @@ -3095,7 +3098,7 @@ load_balance_newidle(int this_cpu, struc unsigned long imbalance; int ld_moved = 0; int sd_idle = 0; - int all_pinned = 0; + int lb_flags = 0; cpumask_t cpus = CPU_MASK_ALL; /* @@ -3136,10 +3139,10 @@ redo: update_rq_clock(busiest); ld_moved = move_tasks(this_rq, this_cpu, busiest, imbalance, sd, CPU_NEWLY_IDLE, - &all_pinned); + &lb_flags); spin_unlock(&busiest->lock); - if (unlikely(all_pinned)) { + if (unlikely(lb_flags & LB_ALL_PINNED)) { cpu_clear(cpu_of(busiest), cpus); if (!cpus_empty(cpus)) goto redo; Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -1115,7 +1115,7 @@ static unsigned long load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, int *this_best_prio) + int *lb_flags, int *this_best_prio) { struct cfs_rq *busy_cfs_rq; long rem_load_move = max_load_move; @@ -1151,7 +1151,7 @@ load_balance_fair(struct rq *this_rq, in */ cfs_rq_iterator.arg = busy_cfs_rq; rem_load_move -= balance_tasks(this_rq, this_cpu, busiest, - maxload, sd, idle, all_pinned, + maxload, sd, idle, lb_flags, this_best_prio, &cfs_rq_iterator); Index: linux-2.6.24.7/kernel/sched_idletask.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_idletask.c +++ linux-2.6.24.7/kernel/sched_idletask.c @@ -48,7 +48,7 @@ static unsigned long load_balance_idle(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, int *this_best_prio) + int *lb_flags, int *this_best_prio) { return 0; } Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -718,7 +718,7 @@ static unsigned long load_balance_rt(struct rq *this_rq, int this_cpu, struct rq *busiest, unsigned long max_load_move, struct sched_domain *sd, enum cpu_idle_type idle, - int *all_pinned, int *this_best_prio) + int *lb_flags, int *this_best_prio) { /* don't touch RT tasks */ return 0; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched_load_balance_lockbreak.patch����������������������������������������������������������0000664�0000764�0000764�00000011175�11041657732�020040� 
0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: lock-break the load balance path From: Peter Zijlstra <a.p.zijlstra@chello.nl> move_tasks() can do a lot of work, and it holds two runqueue locks and has IRQs disabled. We already introduced sysctl_sched_nr_migrate to limit the number of task iterations it can do. This patch takes it one step further and drops the locks once we break out of the iteration due to sysctl_sched_nr_migrate and re-enables IRQs. Then it re-acquires everything and continues. Dropping the locks is safe because: - load_balance() doesn't rely on it - load_balance_newidle() uses double_lock_balance() which can already drop the locks. Enabling IRQs should be safe since we already dropped all locks. We add the LB_COMPLETE state to detect the truncated iteration due to sysctl_sched_nr_migrate. For now we must break out of the restart when load_moved is 0, because each iteration will test the same tasks - hence we can live-lock here. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 42 +++++++++++++++++++++++++++++++++++++----- kernel/sched_debug.c | 28 ++++++++++++++++++++++------ 2 files changed, 59 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -461,6 +461,8 @@ struct rq { unsigned long rto_wakeup; unsigned long rto_pulled; unsigned long rto_pushed; + + unsigned long lb_breaks; #endif struct lock_class_key rq_lock_key; }; @@ -587,6 +589,7 @@ enum { SCHED_FEAT_START_DEBIT = 4, SCHED_FEAT_TREE_AVG = 8, SCHED_FEAT_APPROX_AVG = 16, + SCHED_FEAT_LB_BREAK = 32, }; const_debug unsigned int sysctl_sched_features = @@ -594,7 +597,8 @@ const_debug unsigned int sysctl_sched_fe SCHED_FEAT_WAKEUP_PREEMPT * 1 | SCHED_FEAT_START_DEBIT * 1 | SCHED_FEAT_TREE_AVG * 0 | - SCHED_FEAT_APPROX_AVG * 0; + SCHED_FEAT_APPROX_AVG * 0 | + SCHED_FEAT_LB_BREAK * 1; #define sched_feat(x) (sysctl_sched_features & SCHED_FEAT_##x) @@ -2270,6 +2274,7 @@ static void update_cpu_load(struct rq *t #ifdef CONFIG_SMP #define LB_ALL_PINNED 0x01 +#define LB_COMPLETE 0x02 /* * double_rq_lock - safely lock two runqueues @@ -2476,8 +2481,13 @@ balance_tasks(struct rq *this_rq, int th if (p) pinned = 1; next: - if (!p || loops++ > sysctl_sched_nr_migrate) + if (!p) + goto out; + + if (loops++ > sysctl_sched_nr_migrate) { + *lb_flags &= ~LB_COMPLETE; goto out; + } /* * To help distribute high priority tasks across CPUs we don't * skip a task if it will be the highest priority task (i.e. 
smallest @@ -2535,11 +2545,30 @@ static int move_tasks(struct rq *this_rq int this_best_prio = this_rq->curr->prio; do { - total_load_moved += - class->load_balance(this_rq, this_cpu, busiest, + unsigned long load_moved; + + *lb_flags |= LB_COMPLETE; + + load_moved = class->load_balance(this_rq, this_cpu, busiest, max_load_move - total_load_moved, sd, idle, lb_flags, &this_best_prio); - class = class->next; + + total_load_moved += load_moved; + + if (!load_moved || *lb_flags & LB_COMPLETE) { + class = class->next; + } else if (sched_feat(LB_BREAK)) { + schedstat_inc(this_rq, lb_breaks); + + double_rq_unlock(this_rq, busiest); + local_irq_enable(); + + if (!in_atomic()) + cond_resched(); + + local_irq_disable(); + double_rq_lock(this_rq, busiest); + } } while (class && max_load_move > total_load_moved); return total_load_moved > 0; @@ -2983,6 +3012,9 @@ redo: ld_moved = 0; if (busiest->nr_running > 1) { + + WARN_ON(irqs_disabled()); + /* * Attempt to move tasks. If find_busiest_group has found * an imbalance but busiest->nr_running <= 1, the group is Index: linux-2.6.24.7/kernel/sched_debug.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_debug.c +++ linux-2.6.24.7/kernel/sched_debug.c @@ -186,17 +186,33 @@ static void print_cpu(struct seq_file *m P(cpu_load[2]); P(cpu_load[3]); P(cpu_load[4]); -#ifdef CONFIG_PREEMPT_RT - /* Print rt related rq stats */ - P(rt.rt_nr_running); - P(rt.rt_nr_uninterruptible); -# ifdef CONFIG_SCHEDSTATS +#ifdef CONFIG_SCHEDSTATS + P(yld_exp_empty); + P(yld_act_empty); + P(yld_both_empty); + P(yld_count); + + P(sched_switch); + P(sched_count); + P(sched_goidle); + + P(ttwu_count); + P(ttwu_local); + + P(bkl_count); + P(rto_schedule); P(rto_schedule_tail); P(rto_wakeup); P(rto_pulled); P(rto_pushed); -# endif + + P(lb_breaks); +#endif +#ifdef CONFIG_PREEMPT_RT + /* Print rt related rq stats */ + P(rt.rt_nr_running); + P(rt.rt_nr_uninterruptible); #endif #undef P ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-load_balance-iterator.patch�����������������������������������������������������������0000664�0000764�0000764�00000004171�11041657733�017567� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: fixup the load balancer iterator From: Peter Zijlstra <a.p.zijlstra@chello.nl> Solve the live-lock from the previous patch by not restarting the load balance iterator on each go. 
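A compact sketch of the resume-instead-of-restart idea, in plain C with hypothetical names (pos_iter, next_candidate); it is not the cfs_rq iterator itself. Restarting from start() after every lock break would keep presenting the same unmovable tasks, which is the live-lock; remembering the position and continuing with next() avoids it.

#include <stddef.h>

struct pos_iter {
	int pos, len;
	const int *items;
};

static const int *iter_start(struct pos_iter *it)
{
	it->pos = 0;
	return it->pos < it->len ? &it->items[it->pos] : NULL;
}

static const int *iter_next(struct pos_iter *it)
{
	return ++it->pos < it->len ? &it->items[it->pos] : NULL;
}

/* a fresh pass rewinds; a pass resumed after a lock break does not */
static const int *next_candidate(struct pos_iter *it, int fresh_pass)
{
	return fresh_pass ? iter_start(it) : iter_next(it);
}
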
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 13 +++++++++++-- kernel/sched_fair.c | 3 +++ 2 files changed, 14 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -2275,6 +2275,7 @@ static void update_cpu_load(struct rq *t #define LB_ALL_PINNED 0x01 #define LB_COMPLETE 0x02 +#define LB_START 0x03 /* * double_rq_lock - safely lock two runqueues @@ -2477,7 +2478,11 @@ balance_tasks(struct rq *this_rq, int th /* * Start the load-balancing iterator: */ - p = iterator->start(iterator->arg); + if (*lb_flags & LB_START) + p = iterator->start(iterator->arg); + else + p = iterator->next(iterator->arg); + if (p) pinned = 1; next: @@ -2544,6 +2549,8 @@ static int move_tasks(struct rq *this_rq unsigned long total_load_moved = 0; int this_best_prio = this_rq->curr->prio; + *lb_flags |= LB_START; + do { unsigned long load_moved; @@ -2555,9 +2562,11 @@ static int move_tasks(struct rq *this_rq total_load_moved += load_moved; - if (!load_moved || *lb_flags & LB_COMPLETE) { + if (*lb_flags & LB_COMPLETE) { class = class->next; + *lb_flags |= LB_START; } else if (sched_feat(LB_BREAK)) { + *lb_flags &= ~LB_START; schedstat_inc(this_rq, lb_breaks); double_rq_unlock(this_rq, busiest); Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -185,6 +185,9 @@ static void __dequeue_entity(struct cfs_ if (cfs_rq->rb_leftmost == &se->run_node) cfs_rq->rb_leftmost = rb_next(&se->run_node); + if (cfs_rq->rb_load_balance_curr == &se->run_node) + cfs_rq->rb_load_balance_curr = rb_next(&se->run_node); + rb_erase(&se->run_node, &cfs_rq->tasks_timeline); } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-load_balance-stop.patch���������������������������������������������������������������0000664�0000764�0000764�00000001453�11041657733�016723� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: one is enough to not be idle From: Peter Zijlstra <a.p.zijlstra@chello.nl> Cap the load balancing on new_idle - you only need one task to run, no need to bring the whole machine in balance again - let the softirq balancer do that. 
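A sketch of the same policy in isolation, with hypothetical helpers (pull_one_batch(), work_left()) standing in for the sched-class iteration; it only illustrates the early break, assuming the periodic balancer runs later anyway.

enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

static unsigned long pull_one_batch(void) { return 1; }	/* hypothetical */
static int work_left(void) { return 0; }		/* hypothetical */

static unsigned long balance_pass(enum cpu_idle_type idle)
{
	unsigned long moved = 0;

	do {
		moved += pull_one_batch();
		/* a newly idle CPU only needs one runnable task */
		if (idle == CPU_NEWLY_IDLE && moved)
			break;
	} while (work_left());

	return moved;
}
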
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -2569,6 +2569,9 @@ static int move_tasks(struct rq *this_rq *lb_flags &= ~LB_START; schedstat_inc(this_rq, lb_breaks); + if (idle == CPU_NEWLY_IDLE && total_load_moved) + break; + double_rq_unlock(this_rq, busiest); local_irq_enable(); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-load_balance-is_runnable.patch��������������������������������������������������������0000664�0000764�0000764�00000010773�11041673115�020234� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: even weaker newidle balancing From: Peter Zijlstra <a.p.zijlstra@chello.nl> On each round see if any of the classes became runnable - if so, stop balancing and run the thing. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/sched.h | 2 ++ kernel/sched.c | 28 ++++++++++++++++++++-------- kernel/sched_fair.c | 7 +++++++ kernel/sched_idletask.c | 7 +++++++ kernel/sched_rt.c | 6 ++++++ 5 files changed, 42 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -922,6 +922,8 @@ struct sched_class { void (*pre_schedule) (struct rq *this_rq, struct task_struct *task); void (*post_schedule) (struct rq *this_rq); void (*task_wake_up) (struct rq *this_rq, struct task_struct *task); + + int (*is_runnable) (struct rq *this_rq); #endif void (*set_curr_task) (struct rq *rq); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -2533,6 +2533,21 @@ out: return max_load_move - rem_load_move; } +static int is_runnable(struct rq *this_rq, const struct sched_class *target_class) +{ + const struct sched_class *class = sched_class_highest; + + for (; class; class = class->next) { + if (class->is_runnable(this_rq)) + return 1; + + if (class == target_class) + break; + } + + return 0; +} + /* * move_tasks tries to move up to max_load_move weighted load from busiest to * this_rq, as part of a balancing operation within domain "sd". 
@@ -2552,15 +2567,15 @@ static int move_tasks(struct rq *this_rq *lb_flags |= LB_START; do { - unsigned long load_moved; - *lb_flags |= LB_COMPLETE; - load_moved = class->load_balance(this_rq, this_cpu, busiest, - max_load_move - total_load_moved, + total_load_moved += class->load_balance(this_rq, this_cpu, + busiest, max_load_move - total_load_moved, sd, idle, lb_flags, &this_best_prio); - total_load_moved += load_moved; + if (idle == CPU_NEWLY_IDLE && + is_runnable(this_rq, class)) + return 1; if (*lb_flags & LB_COMPLETE) { class = class->next; @@ -2569,9 +2584,6 @@ static int move_tasks(struct rq *this_rq *lb_flags &= ~LB_START; schedstat_inc(this_rq, lb_breaks); - if (idle == CPU_NEWLY_IDLE && total_load_moved) - break; - double_rq_unlock(this_rq, busiest); local_irq_enable(); Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -1188,6 +1188,12 @@ move_one_task_fair(struct rq *this_rq, i return 0; } + +static int +is_runnable_fair(struct rq *this_rq) +{ + return !!this_rq->cfs.nr_running; +} #endif /* @@ -1307,6 +1313,7 @@ static const struct sched_class fair_sch #ifdef CONFIG_SMP .load_balance = load_balance_fair, .move_one_task = move_one_task_fair, + .is_runnable = is_runnable_fair, #endif .set_curr_task = set_curr_task_fair, Index: linux-2.6.24.7/kernel/sched_idletask.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_idletask.c +++ linux-2.6.24.7/kernel/sched_idletask.c @@ -59,6 +59,12 @@ move_one_task_idle(struct rq *this_rq, i { return 0; } + +static int +is_runnable_idle(struct rq *this_rq) +{ + return 1; +} #endif static void task_tick_idle(struct rq *rq, struct task_struct *curr) @@ -117,6 +123,7 @@ const struct sched_class idle_sched_clas #ifdef CONFIG_SMP .load_balance = load_balance_idle, .move_one_task = move_one_task_idle, + .is_runnable = is_runnable_idle, #endif .set_curr_task = set_curr_task_idle, Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -792,6 +792,11 @@ static void switched_from_rt(struct rq * if (!rq->rt.rt_nr_running) pull_rt_task(rq); } + +static int is_runnable_rt(struct rq *rq) +{ + return !!rq->rt.rt_nr_running; +} #endif /* CONFIG_SMP */ /* @@ -920,6 +925,7 @@ const struct sched_class rt_sched_class .post_schedule = post_schedule_rt, .task_wake_up = task_wake_up_rt, .switched_from = switched_from_rt, + .is_runnable = is_runnable_rt, #endif .set_curr_task = set_curr_task_rt, �����patches/ftrace-trace-sched.patch��������������������������������������������������������������������0000664�0000764�0000764�00000001462�11041657735�015716� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: trace sched.c The clock code has been removed to its own file "sched_clock", and that shouldn't be traced. But we still want to trace the scheduler code. 
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -11,7 +11,7 @@ obj-y = sched.o fork.o exec_domain.o hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \ utsname.o notifier.o -CFLAGS_REMOVE_sched.o = -pg -mno-spe +CFLAGS_REMOVE_sched.o = -mno-spe ifdef CONFIG_FTRACE # Do not trace debug files and internal ftrace files ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockdep-avoid-fork-waring.patch�������������������������������������������������������������0000664�0000764�0000764�00000001504�11041657733�017232� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: fix if define to prove locking The preprocessor condition in fork.c incorrectly checks against LOCKDEP, when it should check against PROVE_LOCKING. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1051,7 +1051,7 @@ static struct task_struct *copy_process( rt_mutex_init_task(p); -#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_LOCKDEP) +#if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_PROVE_LOCKING) DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled); DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled); #endif ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-dont-trace-markers.patch�������������������������������������������������������������0000664�0000764�0000764�00000001222�11041657735�017230� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: dont trace markers It is redundant to trace the marker code. 
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/Makefile | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -21,6 +21,7 @@ CFLAGS_REMOVE_mutex-debug.o = -pg CFLAGS_REMOVE_rtmutex-debug.o = -pg CFLAGS_REMOVE_cgroup-debug.o = -pg CFLAGS_REMOVE_sched_clock.o = -pg +CFLAGS_REMOVE_marker.o = -pg endif obj-$(CONFIG_SYSCTL) += sysctl_check.o ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-record-comm-on-ctrl.patch������������������������������������������������������������0000664�0000764�0000764�00000001552�11041657731�017313� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: record comm on function ctrl change On stress tests, it is possible for the comm of that disables the ftracer to be lost. Record it on turning on or off the function tracer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -1197,10 +1197,12 @@ static struct ftrace_ops trace_ops __rea void tracing_start_function_trace(void) { register_ftrace_function(&trace_ops); + tracing_record_cmdline(current); } void tracing_stop_function_trace(void) { + tracing_record_cmdline(current); unregister_ftrace_function(&trace_ops); } #endif ������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-print-missing-cmdline.patch����������������������������������������������������������0000664�0000764�0000764�00000002546�11041657734�017753� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: fix the command line printing Only half of the command line recording was implemented. The reverse map back from command line array to pid to verify that the command line did indeed belong to the pid, was missing. 
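A userspace sketch of the two-way mapping this fix completes, with made-up sizes and names (SLOTS, save_comm(), find_comm()); only the shape of the check matters. The forward map takes a pid to a slot in the bounded comm array, the reverse map records which pid a slot currently belongs to, and the lookup refuses to return a comm whose slot has since been recycled for another pid.

#include <string.h>

#define SLOTS	128
#define MAX_PID	32768

static int pid_to_slot[MAX_PID];	/* 0 means "not cached" */
static int slot_to_pid[SLOTS];
static char saved_comm[SLOTS][16];
static int next_slot;

static void save_comm(int pid, const char *comm)
{
	int slot;

	if (pid <= 0 || pid >= MAX_PID)
		return;

	slot = pid_to_slot[pid];
	if (slot == 0) {
		/* recycle the next slot and detach its previous owner */
		slot = (next_slot++ % (SLOTS - 1)) + 1;
		pid_to_slot[slot_to_pid[slot]] = 0;
		slot_to_pid[slot] = pid;
		pid_to_slot[pid] = slot;
	}
	strncpy(saved_comm[slot], comm, sizeof(saved_comm[slot]) - 1);
}

static const char *find_comm(int pid)
{
	int slot = (pid > 0 && pid < MAX_PID) ? pid_to_slot[pid] : 0;

	/* the reverse-map check: the slot must still belong to this pid */
	if (slot == 0 || slot_to_pid[slot] != pid)
		return "<...>";
	return saved_comm[slot];
}
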
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace.c | 8 ++++++++ 1 file changed, 8 insertions(+) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -685,14 +685,19 @@ static void trace_save_cmdline(struct ta if (!spin_trylock(&trace_cmdline_lock)) return; + /* from the pid, find the index of the cmdline array */ idx = map_pid_to_cmdline[tsk->pid]; + if (idx >= SAVED_CMDLINES) { + /* this is new */ idx = (cmdline_idx + 1) % SAVED_CMDLINES; + /* check the reverse map and reset it if needed */ map = map_cmdline_to_pid[idx]; if (map <= PID_MAX_DEFAULT) map_pid_to_cmdline[map] = (unsigned)-1; + map_cmdline_to_pid[idx] = tsk->pid; map_pid_to_cmdline[tsk->pid] = idx; cmdline_idx = idx; @@ -718,6 +723,9 @@ static char *trace_find_cmdline(int pid) if (map >= SAVED_CMDLINES) goto out; + if (map_cmdline_to_pid[map] != pid) + goto out; + cmdline = saved_cmdlines[map]; out: ����������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockstat-fix-contention-points.patch��������������������������������������������������������0000664�0000764�0000764�00000001351�11041657731�020363� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: lockstat: fix contention points From: Peter Zijlstra <a.p.zijlstra@chello.nl> blatantly stupid bug.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/lockdep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/lockdep.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep.c +++ linux-2.6.24.7/kernel/lockdep.c @@ -2889,7 +2889,7 @@ found_it: stats = get_lock_stats(hlock->class); if (point < ARRAY_SIZE(stats->contention_point)) - stats->contention_point[i]++; + stats->contention_point[point]++; if (lock->cpu != smp_processor_id()) stats->bounces[bounce_contended + !!hlock->read]++; put_lock_stats(stats); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/lockstat-output.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000001610�11041657732�015444� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: lockstat: warn about disabled lock debugging From: Peter Zijlstra <a.p.zijlstra@chello.nl> Avoid confusion and clearly state lock debugging got disabled. 
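For completeness, a trivial userspace rendition of the idea, assuming a debug_locks stand-in rather than the real lockdep global: once lock debugging has shut itself off, the report header says so explicitly instead of silently presenting stale numbers.

#include <stdio.h>

static int debug_locks;	/* stand-in for the real global, cleared after a lockdep splat */

static void print_stats_header(FILE *m)
{
	fprintf(m, "lock_stat version 0.2\n");
	if (!debug_locks)
		fprintf(m, "*WARNING* lock debugging disabled!! - possibly due to a lockdep warning\n");
}
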
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/lockdep_proc.c | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/kernel/lockdep_proc.c =================================================================== --- linux-2.6.24.7.orig/kernel/lockdep_proc.c +++ linux-2.6.24.7/kernel/lockdep_proc.c @@ -516,6 +516,10 @@ static void seq_stats(struct seq_file *m static void seq_header(struct seq_file *m) { seq_printf(m, "lock_stat version 0.2\n"); + + if (unlikely(!debug_locks)) + seq_printf(m, "*WARNING* lock debugging disabled!! - possibly due to a lockdep warning\n"); + seq_line(m, '-', 0, 40 + 1 + 10 * (14 + 1)); seq_printf(m, "%40s %14s %14s %14s %14s %14s %14s %14s %14s " "%14s %14s\n", ������������������������������������������������������������������������������������������������������������������������patches/fix_vdso_gtod_vsyscall64_2.patch������������������������������������������������������������0000664�0000764�0000764�00000006115�11041673113�017427� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������This patch fixes VDSO GTOD enhancements for the case where kernel.vsyscall64=2. In this case VDSO GTOD presents a cheap way to access gettimeofday() with no need to issue a real system call. This fix enforces the 1ms resolution VDSO GTOD should present when kernel.vsyscall64=2. This patch offers this resolution for the clocksources that have a vread() function, such as tsc and hpet. Otherwise it keeps uses the read() function offered by the clocksource, that may be costly. Also, a few other tweaks are performed. Signed-off-by: Luis Claudio R. 
Goncalves <lgoncalv@redhat.com> --- arch/x86/kernel/vsyscall_64.c | 35 +++++++++++++++++++++++++++++++---- 1 file changed, 31 insertions(+), 4 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/vsyscall_64.c +++ linux-2.6.24.7/arch/x86/kernel/vsyscall_64.c @@ -74,14 +74,40 @@ void update_vsyscall(struct timespec *wa unsigned long flags; write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags); + + if (likely(vsyscall_gtod_data.sysctl_enabled == 2)) { + struct timespec tmp = *(wall_time); + cycle_t (*vread)(void); + cycle_t now; + + vread = vsyscall_gtod_data.clock.vread; + if (likely(vread)) + now = vread(); + else + now = clock->read(); + + /* calculate interval: */ + now = (now - clock->cycle_last) & clock->mask; + /* convert to nsecs: */ + tmp.tv_nsec += ( now * clock->mult) >> clock->shift; + + while (tmp.tv_nsec >= NSEC_PER_SEC) { + tmp.tv_sec += 1; + tmp.tv_nsec -= NSEC_PER_SEC; + } + + vsyscall_gtod_data.wall_time_sec = tmp.tv_sec; + vsyscall_gtod_data.wall_time_nsec = tmp.tv_nsec; + } else { + vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; + vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; + } /* copy vsyscall data */ vsyscall_gtod_data.clock.vread = clock->vread; vsyscall_gtod_data.clock.cycle_last = clock->cycle_last; vsyscall_gtod_data.clock.mask = clock->mask; vsyscall_gtod_data.clock.mult = clock->mult; vsyscall_gtod_data.clock.shift = clock->shift; - vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec; - vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec; vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic; write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags); } @@ -134,7 +160,8 @@ static __always_inline void do_vgettimeo } while (tmp.tv_usec != tv->tv_usec || tmp.tv_sec != tv->tv_sec); - tv->tv_usec /= NSEC_PER_USEC; + tv->tv_usec /= NSEC_PER_MSEC; + tv->tv_usec *= USEC_PER_MSEC; return; } @@ -146,7 +173,6 @@ static __always_inline void do_vgettimeo gettimeofday(tv,NULL); return; } - now = vread(); base = __vsyscall_gtod_data.clock.cycle_last; mask = __vsyscall_gtod_data.clock.mask; mult = __vsyscall_gtod_data.clock.mult; @@ -156,6 +182,7 @@ static __always_inline void do_vgettimeo nsec = __vsyscall_gtod_data.wall_time_nsec; } while (read_seqretry(&__vsyscall_gtod_data.lock, seq)); + now = vread(); /* calculate interval: */ cycle_delta = (now - base) & mask; /* convert to nsecs: */ ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlocks-fix-no-preempt-rt.patch�������������������������������������������������������������0000664�0000764�0000764�00000004261�11041657732�017246� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: fix non PREEMPT_RT case Seems that the addition of 
RT_RW_READER broke the non PREEMPT_RT case. This patch fixes it. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -123,6 +123,12 @@ static inline void mark_rt_rwlock_check( #endif /* CONFIG_PREEMPT_RT */ #endif +#ifdef CONFIG_PREEMPT_RT +#define task_is_reader(task) ((task) == RT_RW_READER) +#else +#define task_is_reader(task) (0) +#endif + int pi_initialized; /* @@ -315,7 +321,7 @@ static int rt_mutex_adjust_prio_chain(st /* * Readers are special. We may need to boost more than one owner. */ - if (task == RT_RW_READER) { + if (task_is_reader(task)) { ret = rt_mutex_adjust_readers(orig_lock, orig_waiter, top_task, lock, recursion_depth); @@ -376,7 +382,7 @@ static inline int try_to_steal_lock(stru if (pendowner == current) return 1; - WARN_ON(rt_mutex_owner(lock) == RT_RW_READER); + WARN_ON(task_is_reader(rt_mutex_owner(lock))); spin_lock(&pendowner->pi_lock); if (!lock_is_stealable(pendowner, mode)) { @@ -506,7 +512,7 @@ static int task_blocks_on_rt_mutex(struc if (waiter == rt_mutex_top_waiter(lock)) { /* readers are handled differently */ - if (owner == RT_RW_READER) { + if (task_is_reader(owner)) { res = rt_mutex_adjust_readers(lock, waiter, current, lock, 0); return res; @@ -524,7 +530,7 @@ static int task_blocks_on_rt_mutex(struc else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock)) chain_walk = 1; - if (!chain_walk || owner == RT_RW_READER) + if (!chain_walk || task_is_reader(owner)) return 0; /* @@ -624,7 +630,7 @@ static void remove_waiter(struct rt_mute current->pi_blocked_on = NULL; spin_unlock(¤t->pi_lock); - if (first && owner != current && owner != RT_RW_READER) { + if (first && owner != current && !task_is_reader(owner)) { spin_lock(&owner->pi_lock); �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/git-ignore-module-markers.patch�������������������������������������������������������������0000664�0000764�0000764�00000001456�11041657735�017266� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Fri May 23 15:46:28 2008 Date: Fri, 23 May 2008 12:37:40 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt] gitignore Module.markers From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- .gitignore | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/.gitignore =================================================================== --- linux-2.6.24.7.orig/.gitignore +++ linux-2.6.24.7/.gitignore @@ -27,6 +27,7 @@ 
vmlinux* !vmlinux.lds.S System.map Module.symvers +Module.markers !.gitignore # ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/git-ignore-script-lpp.patch�����������������������������������������������������������������0000664�0000764�0000764�00000001443�11041657734�016427� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Fri May 23 15:46:46 2008 Date: Fri, 23 May 2008 12:37:45 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt] gitignore scripts/testlpp From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- scripts/.gitignore | 1 + 1 file changed, 1 insertion(+) Index: linux-2.6.24.7/scripts/.gitignore =================================================================== --- linux-2.6.24.7.orig/scripts/.gitignore +++ linux-2.6.24.7/scripts/.gitignore @@ -6,3 +6,4 @@ kallsyms pnmtologo bin2c unifdef +testlpp �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/adaptive-optimize-rt-lock-wakeup.patch������������������������������������������������������0000664�0000764�0000764�00000006302�11041657733�020564� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Fri May 23 23:32:28 2008 Date: Tue, 20 May 2008 10:49:15 -0400 From: Gregory Haskins <ghaskins@novell.com> To: mingo@elte.hu, tglx@linutronix.de, rostedt@goodmis.org, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, sdietrich@novell.com, pmorreale@novell.com, mkohari@novell.com, ghaskins@novell.com Subject: [PATCH 1/5] optimize rt lock wakeup [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] It is redundant to wake the grantee task if it is already running, and the call to wake_up_process is relatively expensive. If we can safely skip it we can measurably improve the performance of the adaptive-locks. Credit goes to Peter Morreale for the general idea. 
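A standalone C11 sketch of the ordering argument behind this optimization; the types and the wake()/sleep_fn() callbacks are assumptions, not the kernel's wake_up_process_mutex()/schedule(), and the sketch further assumes kernel-style semantics where a wakeup delivered before the waiter actually blocks still prevents it from blocking.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct waiter {
	_Atomic(void *) task;	/* non-NULL while still waiting for the lock */
	atomic_bool running;
};

/* owner side:  A) clear waiter->task   B) wake only if the waiter is not running */
static void grant(struct waiter *w, void (*wake)(struct waiter *))
{
	atomic_store(&w->task, NULL);
	atomic_thread_fence(memory_order_seq_cst);	/* order A before B */
	if (!atomic_load(&w->running))
		wake(w);	/* skip the expensive wakeup if the waiter still runs */
}

/* waiter side:  1) mark ourselves not running   2) sleep only if still waiting */
static void wait_for_grant(struct waiter *w, void *me, void (*sleep_fn)(void))
{
	atomic_store(&w->running, false);
	atomic_thread_fence(memory_order_seq_cst);	/* order 1 before 2 */
	if (atomic_load(&w->task) == me)
		sleep_fn();
	atomic_store(&w->running, true);
}

With both fences in place at least one side observes the other's store, so either the waiter sees task == NULL and never sleeps, or the owner sees running == false and issues the wakeup; the lost-wakeup case would require both sides to read stale values, which the barriers rule out.
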
Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Peter Morreale <pmorreale@novell.com> --- kernel/rtmutex.c | 45 ++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -578,6 +578,41 @@ static void wakeup_next_waiter(struct rt pendowner = waiter->task; waiter->task = NULL; + /* + * Do the wakeup before the ownership change to give any spinning + * waiter grantees a headstart over the other threads that will + * trigger once owner changes. + */ + if (!savestate) + wake_up_process(pendowner); + else { + /* + * We can skip the actual (expensive) wakeup if the + * waiter is already running, but we have to be careful + * of race conditions because they may be about to sleep. + * + * The waiter-side protocol has the following pattern: + * 1: Set state != RUNNING + * 2: Conditionally sleep if waiter->task != NULL; + * + * And the owner-side has the following: + * A: Set waiter->task = NULL + * B: Conditionally wake if the state != RUNNING + * + * As long as we ensure 1->2 order, and A->B order, we + * will never miss a wakeup. + * + * Therefore, this barrier ensures that waiter->task = NULL + * is visible before we test the pendowner->state. The + * corresponding barrier is in the sleep logic. + */ + smp_mb(); + + /* If !RUNNING && !RUNNING_MUTEX */ + if (pendowner->state & ~TASK_RUNNING_MUTEX) + wake_up_process_mutex(pendowner); + } + rt_mutex_set_owner(lock, pendowner, RT_MUTEX_OWNER_PENDING); spin_unlock(¤t->pi_lock); @@ -604,11 +639,6 @@ static void wakeup_next_waiter(struct rt plist_add(&next->pi_list_entry, &pendowner->pi_waiters); } spin_unlock(&pendowner->pi_lock); - - if (savestate) - wake_up_process_mutex(pendowner); - else - wake_up_process(pendowner); } /* @@ -860,6 +890,11 @@ rt_spin_lock_slowlock(struct rt_mutex *l if (adaptive_wait(&waiter, orig_owner)) { update_current(TASK_UNINTERRUPTIBLE, &saved_state); + /* + * The xchg() in update_current() is an implicit + * barrier which we rely upon to ensure current->state + * is visible before we test waiter.task. 
+ */ if (waiter.task) schedule_rt_mutex(lock); } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/adaptive-task-oncpu.patch�������������������������������������������������������������������0000664�0000764�0000764�00000007775�11041657732�016163� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Fri May 23 23:32:44 2008 Date: Tue, 20 May 2008 10:49:20 -0400 From: Gregory Haskins <ghaskins@novell.com> To: mingo@elte.hu, tglx@linutronix.de, rostedt@goodmis.org, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, sdietrich@novell.com, pmorreale@novell.com, mkohari@novell.com, ghaskins@novell.com Subject: [PATCH 2/5] sched: make task->oncpu available in all configurations [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] We will use this later in the series to eliminate the need for a function call. [ Steven Rostedt: added task_is_current function ] Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- include/linux/sched.h | 9 ++++++--- kernel/sched.c | 37 ++++++++++++++++++++++++++----------- 2 files changed, 32 insertions(+), 14 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1027,10 +1027,8 @@ struct task_struct { int lock_depth; /* BKL lock depth */ #ifdef CONFIG_SMP -#ifdef __ARCH_WANT_UNLOCKED_CTXSW int oncpu; #endif -#endif int prio, static_prio, normal_prio; #ifdef CONFIG_PREEMPT_RCU_BOOST @@ -2235,7 +2233,12 @@ static inline void migration_init(void) } #endif -extern int task_is_current(struct task_struct *task); +#ifdef CONFIG_SMP +static inline int task_is_current(struct task_struct *task) +{ + return task->oncpu; +} +#endif #define TASK_STATE_TO_CHAR_STR "RMSDTtZX" Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -575,10 +575,12 @@ int runqueue_is_locked(void) return ret; } +#ifndef CONFIG_SMP int task_is_current(struct task_struct *task) { return task_rq(task)->curr == task; } +#endif /* * Debugging: various feature bits @@ -661,18 +663,39 @@ static inline int task_current(struct rq return rq->curr == p; } -#ifndef __ARCH_WANT_UNLOCKED_CTXSW static inline int task_running(struct rq *rq, struct task_struct *p) { +#ifdef CONFIG_SMP + return p->oncpu; +#else return task_current(rq, p); +#endif } +#ifndef __ARCH_WANT_UNLOCKED_CTXSW static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next) { +#ifdef CONFIG_SMP + /* + * We can optimise this out completely for !SMP, because the + * SMP rebalancing from interrupt is the only thing that cares + * here. 
+ */ + next->oncpu = 1; +#endif } static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev) { +#ifdef CONFIG_SMP + /* + * After ->oncpu is cleared, the task can be moved to a different CPU. + * We must ensure this doesn't happen until the switch is completely + * finished. + */ + smp_wmb(); + prev->oncpu = 0; +#endif #ifdef CONFIG_DEBUG_SPINLOCK /* this is a valid case when another task releases the spinlock */ rq->lock.owner = current; @@ -688,14 +711,6 @@ static inline void finish_lock_switch(st } #else /* __ARCH_WANT_UNLOCKED_CTXSW */ -static inline int task_running(struct rq *rq, struct task_struct *p) -{ -#ifdef CONFIG_SMP - return p->oncpu; -#else - return task_current(rq, p); -#endif -} static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next) { @@ -1863,7 +1878,7 @@ void sched_fork(struct task_struct *p, i if (likely(sched_info_on())) memset(&p->sched_info, 0, sizeof(p->sched_info)); #endif -#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) +#if defined(CONFIG_SMP) p->oncpu = 0; #endif #ifdef CONFIG_PREEMPT @@ -5507,7 +5522,7 @@ void __cpuinit init_idle(struct task_str spin_lock_irqsave(&rq->lock, flags); rq->curr = rq->idle = idle; -#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) +#if defined(CONFIG_SMP) idle->oncpu = 1; #endif spin_unlock_irqrestore(&rq->lock, flags); ���patches/adaptive-adjust-pi-wakeup.patch�������������������������������������������������������������0000664�0000764�0000764�00000004165�11041657734�017261� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Fri May 23 23:34:24 2008 Date: Tue, 20 May 2008 10:49:31 -0400 From: Gregory Haskins <ghaskins@novell.com> To: mingo@elte.hu, tglx@linutronix.de, rostedt@goodmis.org, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, sdietrich@novell.com, pmorreale@novell.com, mkohari@novell.com, ghaskins@novell.com Subject: [PATCH 4/5] adjust pi_lock usage in wakeup [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] From: Peter W.Morreale <pmorreale@novell.com> In wakeup_next_waiter(), we take the pi_lock, and then find out whether we have another waiter to add to the pending owner. We can reduce contention on the pi_lock for the pending owner if we first obtain the pointer to the next waiter outside of the pi_lock. Signed-off-by: Peter W. Morreale <pmorreale@novell.com> Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/rtmutex.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -562,6 +562,7 @@ static void wakeup_next_waiter(struct rt { struct rt_mutex_waiter *waiter; struct task_struct *pendowner; + struct rt_mutex_waiter *next; spin_lock(¤t->pi_lock); @@ -624,6 +625,12 @@ static void wakeup_next_waiter(struct rt * waiter with higher priority than pending-owner->normal_prio * is blocked on the unboosted (pending) owner. 
*/ + + if (rt_mutex_has_waiters(lock)) + next = rt_mutex_top_waiter(lock); + else + next = NULL; + spin_lock(&pendowner->pi_lock); WARN_ON(!pendowner->pi_blocked_on); @@ -632,12 +639,9 @@ static void wakeup_next_waiter(struct rt pendowner->pi_blocked_on = NULL; - if (rt_mutex_has_waiters(lock)) { - struct rt_mutex_waiter *next; - - next = rt_mutex_top_waiter(lock); + if (next) plist_add(&next->pi_list_entry, &pendowner->pi_waiters); - } + spin_unlock(&pendowner->pi_lock); } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/adapt-remove-extra-try-to-lock.patch��������������������������������������������������������0000664�0000764�0000764�00000003122�11041657734�020153� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Sat May 24 00:14:29 2008 Date: Tue, 20 May 2008 10:49:36 -0400 From: Gregory Haskins <ghaskins@novell.com> To: mingo@elte.hu, tglx@linutronix.de, rostedt@goodmis.org, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, sdietrich@novell.com, pmorreale@novell.com, mkohari@novell.com, ghaskins@novell.com Subject: [PATCH 5/5] remove the extra call to try_to_take_lock [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] From: Peter W. Morreale <pmorreale@novell.com> Remove the redundant attempt to get the lock. While it is true that the exit path with this patch adds an un-necessary xchg (in the event the lock is granted without further traversal in the loop) experimentation shows that we almost never encounter this situation. Signed-off-by: Peter W. 
Morreale <pmorreale@novell.com> Signed-off-by: Gregory Haskins <ghaskins@novell.com> --- kernel/rtmutex.c | 6 ------ 1 file changed, 6 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -842,12 +842,6 @@ rt_spin_lock_slowlock(struct rt_mutex *l spin_lock_irqsave(&lock->wait_lock, flags); init_lists(lock); - /* Try to acquire the lock again: */ - if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) { - spin_unlock_irqrestore(&lock->wait_lock, flags); - return; - } - BUG_ON(rt_mutex_owner(lock) == current); /* ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/adaptive-earlybreak-on-steal.patch����������������������������������������������������������0000664�0000764�0000764�00000002754�11041657735�017733� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rtmutex: break out early on first run Lock stealing and non cmpxchg will always go into the slow path. This patch detects the fact that we didn't go through the work of blocking and will exit early. 
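The shape of the early exit is easier to see in isolation. The sketch below is illustrative only; try_acquire(), block_on(), fixup_waiters() and release_wait_lock() are invented stand-ins for do_try_to_take_rt_mutex(), the schedule loop, the waiter fixups and the wait_lock release in the real slow path.

struct lock;
int  try_acquire(struct lock *l);	/* stand-in: do_try_to_take_rt_mutex() */
void block_on(struct lock *l);		/* stand-in: the schedule() loop       */
void fixup_waiters(struct lock *l);	/* stand-in: fixup_rt_mutex_waiters()  */
void release_wait_lock(struct lock *l);	/* stand-in: wait_lock unlock          */

void slowlock_shape(struct lock *l)
{
	int missed = 0;			/* have we ever blocked? */

	for (;;) {
		if (try_acquire(l)) {
			if (!missed)		/* got the lock on the very first try: */
				goto unlock;	/* no waiter state to clean up         */
			break;			/* blocked at least once: run fixups   */
		}
		missed = 1;
		block_on(l);
	}
	fixup_waiters(l);
 unlock:
	release_wait_lock(l);
}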
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -834,6 +834,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l struct rt_mutex_waiter waiter; unsigned long saved_state, state, flags; struct task_struct *orig_owner; + int missed = 0; debug_rt_mutex_init_waiter(&waiter); waiter.task = NULL; @@ -860,8 +861,14 @@ rt_spin_lock_slowlock(struct rt_mutex *l int saved_lock_depth = current->lock_depth; /* Try to acquire the lock */ - if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) + if (do_try_to_take_rt_mutex(lock, STEAL_LATERAL)) { + /* If we never blocked break out now */ + if (!missed) + goto unlock; break; + } + missed = 1; + /* * waiter.task is NULL the first time we come here and * when we have been woken up by the previous owner @@ -920,6 +927,7 @@ rt_spin_lock_slowlock(struct rt_mutex *l */ fixup_rt_mutex_waiters(lock); + unlock: spin_unlock_irqrestore(&lock->wait_lock, flags); debug_rt_mutex_free_waiter(&waiter); ��������������������patches/x86-disable-spinlock-preempt.patch����������������������������������������������������������0000664�0000764�0000764�00000002417�11041657733�017611� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������x86: disable spinlock preempt feature since we have ticketlocks From: Gregory Haskins <ghaskins@novell.com> The spinlock preempt feature utilizes spin_trylock() to implement preemptible waiters. However, doing so circumvents the benefit of using a FIFO/ticket lock, so we disable the feature when ticketlocks are enabled. 
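The pattern being disabled looks roughly like this (illustrative only; trylock(), preempt_on()/preempt_off() and relax() are stand-ins for __raw_spin_trylock(), preempt_enable()/preempt_disable() and cpu_relax()). A trylock never takes a ticket, so a waiter spinning this way sits outside the FIFO queue the following patch introduces, and whichever CPU's trylock lands first after an unlock wins regardless of arrival order.

struct raw_lock;
int  trylock(struct raw_lock *l);	/* stand-in: __raw_spin_trylock()   */
void preempt_on(void);			/* stand-in: preempt_enable()       */
void preempt_off(void);			/* stand-in: preempt_disable()      */
void relax(void);			/* stand-in: cpu_relax()            */

void preemptible_spin(struct raw_lock *l)
{
	while (!trylock(l)) {		/* no ticket held across iterations */
		preempt_on();		/* waiter may be preempted here     */
		relax();
		preempt_off();
	}
}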
Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Nick Piggin <npiggin@suse.de> --- kernel/spinlock.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/spinlock.c =================================================================== --- linux-2.6.24.7.orig/kernel/spinlock.c +++ linux-2.6.24.7/kernel/spinlock.c @@ -115,9 +115,12 @@ EXPORT_SYMBOL(__write_trylock_irqsave); * If lockdep is enabled then we use the non-preemption spin-ops * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are * not re-enabled during lock-acquire (which the preempt-spin-ops do): + * + * We also disable them on x86 because we now have ticket/fifo locks, + * which are defeated using a preemptible spinlock */ #if !defined(CONFIG_PREEMPT) || !defined(CONFIG_SMP) || \ - defined(CONFIG_DEBUG_LOCK_ALLOC) + defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_X86) void __lockfunc __read_lock(raw_rwlock_t *lock) { �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86-fifo-ticket-spinlocks.patch�������������������������������������������������������������0000664�0000764�0000764�00000043563�11041657734�017133� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������x86: FIFO ticket spinlocks From: Nick Piggin <npiggin@suse.de> Introduce ticket lock spinlocks for x86 which are FIFO. The implementation is described in the comments. The straight-line lock/unlock instruction sequence is slightly slower than the dec based locks on modern x86 CPUs, however the difference is quite small on Core2 and Opteron when working out of cache, and becomes almost insignificant even on P4 when the lock misses cache. trylock is more significantly slower, but they are relatively rare. On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times. After this patch, all threads appear to finish at exactly the same time. The memory ordering of the lock does conform to x86 standards, and the implementation has been reviewed by Intel and AMD engineers. The algorithm also tells us how many CPUs are contending the lock, so lockbreak becomes trivial and we no longer have to waste 4 bytes per spinlock for it. After this, we can no longer spin on any locks with preempt enabled and cannot reenable interrupts when spinning on an irq safe lock, because at that point we have already taken a ticket and the would deadlock if the same CPU tries to take the lock again. These are questionable anyway: if the lock happens to be called under a preempt or interrupt disabled section, then it will just have the same latency problems. The real fix is to keep critical sections short, and ensure locks are reasonably fair (which this patch does). 
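The head/tail scheme described above is easy to restate in plain C11 for illustration (invented names; the patch itself does the equivalent with a single 16-bit locked xadd and a byte compare in inline assembly):

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint head;	/* ticket currently being served */
	atomic_uint tail;	/* next ticket to hand out       */
};

static void ticket_lock(struct ticket_lock *l)
{
	/* take a ticket: note the current tail and atomically advance it */
	unsigned int me = atomic_fetch_add_explicit(&l->tail, 1,
						    memory_order_relaxed);

	/* spin until the head reaches our ticket -- strict FIFO order */
	while (atomic_load_explicit(&l->head, memory_order_acquire) != me)
		;	/* cpu_relax() / rep;nop would go here */
}

static void ticket_unlock(struct ticket_lock *l)
{
	/* serve the next waiter in line */
	atomic_fetch_add_explicit(&l->head, 1, memory_order_release);
}

The number of waiters is simply tail - head, which is why the changelog can note that contention detection ("lockbreak") falls out of the design for free.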
Signed-off-by: Nick Piggin <npiggin@suse.de> --- include/asm-x86/spinlock.h | 225 ++++++++++++++++++++++++++++++++++++++- include/asm-x86/spinlock_32.h | 221 -------------------------------------- include/asm-x86/spinlock_64.h | 167 ---------------------------- include/asm-x86/spinlock_types.h | 2 4 files changed, 224 insertions(+), 391 deletions(-) Index: linux-2.6.24.7/include/asm-x86/spinlock.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock.h +++ linux-2.6.24.7/include/asm-x86/spinlock.h @@ -1,5 +1,226 @@ +#ifndef _X86_SPINLOCK_H_ +#define _X86_SPINLOCK_H_ + +#include <asm/atomic.h> +#include <asm/rwlock.h> +#include <asm/page.h> +#include <asm/processor.h> +#include <linux/compiler.h> + +/* + * Your basic SMP spinlocks, allowing only a single CPU anywhere + * + * Simple spin lock operations. There are two variants, one clears IRQ's + * on the local processor, one does not. + * + * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. + * + * (the type definitions are in asm/spinlock_types.h) + */ + #ifdef CONFIG_X86_32 -# include "spinlock_32.h" +typedef char _slock_t; +# define LOCK_INS_DEC "decb" +# define LOCK_INS_XCH "xchgb" +# define LOCK_INS_MOV "movb" +# define LOCK_INS_CMP "cmpb" +# define LOCK_PTR_REG "a" #else -# include "spinlock_64.h" +typedef int _slock_t; +# define LOCK_INS_DEC "decl" +# define LOCK_INS_XCH "xchgl" +# define LOCK_INS_MOV "movl" +# define LOCK_INS_CMP "cmpl" +# define LOCK_PTR_REG "D" +#endif + +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + +static inline int __raw_spin_is_locked(__raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); +} + +static inline int __raw_spin_is_contended(__raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; +} + +static inline void __raw_spin_lock(__raw_spinlock_t *lock) +{ + short inc = 0x0100; + + /* + * Ticket locks are conceptually two bytes, one indicating the current + * head of the queue, and the other indicating the current tail. The + * lock is acquired by atomically noting the tail and incrementing it + * by one (thus adding ourself to the queue and noting our position), + * then waiting until the head becomes equal to the the initial value + * of the tail. + * + * This uses a 16-bit xadd to increment the tail and also load the + * position of the head, which takes care of memory ordering issues + * and should be optimal for the uncontended case. Note the tail must + * be in the high byte, otherwise the 16-bit wide increment of the low + * byte would carry up and contaminate the high byte. 
+ */ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t" + "cmpb %h0, %b0\n\t" + "je 2f\n\t" + "rep ; nop\n\t" + "movb %1, %b0\n\t" + /* don't need lfence here, because loads are in-order */ + "jmp 1b\n" + "2:" + :"+Q" (inc), "+m" (lock->slock) + : + :"memory", "cc"); +} + +#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) + +static inline int __raw_spin_trylock(__raw_spinlock_t *lock) +{ + int tmp; + short new; + + asm volatile( + "movw %2,%w0\n\t" + "cmpb %h0,%b0\n\t" + "jne 1f\n\t" + "movw %w0,%w1\n\t" + "incb %h1\n\t" + "lock ; cmpxchgw %w1,%2\n\t" + "1:" + "sete %b1\n\t" + "movzbl %b1,%0\n\t" + :"=&a" (tmp), "=Q" (new), "+m" (lock->slock) + : + : "memory", "cc"); + + return tmp; +} + +#if defined(CONFIG_X86_32) && \ + (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)) +/* + * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock + * (PPro errata 66, 92) + */ +# define UNLOCK_LOCK_PREFIX LOCK_PREFIX +#else +# define UNLOCK_LOCK_PREFIX +#endif + +static inline void __raw_spin_unlock(__raw_spinlock_t *lock) +{ + __asm__ __volatile__( + UNLOCK_LOCK_PREFIX "incb %0" + :"+m" (lock->slock) + : + :"memory", "cc"); +} + +static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) +{ + while (__raw_spin_is_locked(lock)) + cpu_relax(); +} + +/* + * Read-write spinlocks, allowing multiple readers + * but only one writer. + * + * NOTE! it is quite common to have readers in interrupts + * but no interrupt writers. For those circumstances we + * can "mix" irq-safe locks - any writer needs to get a + * irq-safe write-lock, but readers can get non-irqsafe + * read-locks. + * + * On x86, we implement read-write locks as a 32-bit counter + * with the high bit (sign) being the "contended" bit. + */ + +/** + * read_can_lock - would read_trylock() succeed? + * @lock: the rwlock in question. + */ +static inline int __raw_read_can_lock(__raw_rwlock_t *lock) +{ + return (int)(lock)->lock > 0; +} + +/** + * write_can_lock - would write_trylock() succeed? + * @lock: the rwlock in question. 
+ */ +static inline int __raw_write_can_lock(__raw_rwlock_t *lock) +{ + return (lock)->lock == RW_LOCK_BIAS; +} + +static inline void __raw_read_lock(__raw_rwlock_t *rw) +{ + asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" + "jns 1f\n" + "call __read_lock_failed\n\t" + "1:\n" + ::LOCK_PTR_REG (rw) : "memory"); +} + +static inline void __raw_write_lock(__raw_rwlock_t *rw) +{ + asm volatile(LOCK_PREFIX " subl %1,(%0)\n\t" + "jz 1f\n" + "call __write_lock_failed\n\t" + "1:\n" + ::LOCK_PTR_REG (rw), "i" (RW_LOCK_BIAS) : "memory"); +} + +static inline int __raw_read_trylock(__raw_rwlock_t *lock) +{ + atomic_t *count = (atomic_t *)lock; + + atomic_dec(count); + if (atomic_read(count) >= 0) + return 1; + atomic_inc(count); + return 0; +} + +static inline int __raw_write_trylock(__raw_rwlock_t *lock) +{ + atomic_t *count = (atomic_t *)lock; + + if (atomic_sub_and_test(RW_LOCK_BIAS, count)) + return 1; + atomic_add(RW_LOCK_BIAS, count); + return 0; +} + +static inline void __raw_read_unlock(__raw_rwlock_t *rw) +{ + asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory"); +} + +static inline void __raw_write_unlock(__raw_rwlock_t *rw) +{ + asm volatile(LOCK_PREFIX "addl %1, %0" + : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory"); +} + +#define _raw_spin_relax(lock) cpu_relax() +#define _raw_read_relax(lock) cpu_relax() +#define _raw_write_relax(lock) cpu_relax() + #endif Index: linux-2.6.24.7/include/asm-x86/spinlock_32.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_32.h +++ /dev/null @@ -1,221 +0,0 @@ -#ifndef __ASM_SPINLOCK_H -#define __ASM_SPINLOCK_H - -#include <asm/atomic.h> -#include <asm/rwlock.h> -#include <asm/page.h> -#include <asm/processor.h> -#include <linux/compiler.h> - -#ifdef CONFIG_PARAVIRT -#include <asm/paravirt.h> -#else -#define CLI_STRING "cli" -#define STI_STRING "sti" -#define CLI_STI_CLOBBERS -#define CLI_STI_INPUT_ARGS -#endif /* CONFIG_PARAVIRT */ - -/* - * Your basic SMP spinlocks, allowing only a single CPU anywhere - * - * Simple spin lock operations. There are two variants, one clears IRQ's - * on the local processor, one does not. - * - * We make no fairness assumptions. They have a cost. - * - * (the type definitions are in asm/spinlock_types.h) - */ - -static inline int __raw_spin_is_locked(__raw_spinlock_t *x) -{ - return *(volatile signed char *)(&(x)->slock) <= 0; -} - -static inline void __raw_spin_lock(__raw_spinlock_t *lock) -{ - asm volatile("\n1:\t" - LOCK_PREFIX " ; decb %0\n\t" - "jns 3f\n" - "2:\t" - "rep;nop\n\t" - "cmpb $0,%0\n\t" - "jle 2b\n\t" - "jmp 1b\n" - "3:\n\t" - : "+m" (lock->slock) : : "memory"); -} - -/* - * It is easier for the lock validator if interrupts are not re-enabled - * in the middle of a lock-acquire. This is a performance feature anyway - * so we turn it off: - * - * NOTE: there's an irqs-on section here, which normally would have to be - * irq-traced, but on CONFIG_TRACE_IRQFLAGS we never use this variant. 
- */ -#ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) -{ - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decb %[slock]\n\t" - "jns 5f\n" - "2:\t" - "testl $0x200, %[flags]\n\t" - "jz 4f\n\t" - STI_STRING "\n" - "3:\t" - "rep;nop\n\t" - "cmpb $0, %[slock]\n\t" - "jle 3b\n\t" - CLI_STRING "\n\t" - "jmp 1b\n" - "4:\t" - "rep;nop\n\t" - "cmpb $0, %[slock]\n\t" - "jg 1b\n\t" - "jmp 4b\n" - "5:\n\t" - : [slock] "+m" (lock->slock) - : [flags] "r" (flags) - CLI_STI_INPUT_ARGS - : "memory" CLI_STI_CLOBBERS); -} -#endif - -static inline int __raw_spin_trylock(__raw_spinlock_t *lock) -{ - char oldval; - asm volatile( - "xchgb %b0,%1" - :"=q" (oldval), "+m" (lock->slock) - :"0" (0) : "memory"); - return oldval > 0; -} - -/* - * __raw_spin_unlock based on writing $1 to the low byte. - * This method works. Despite all the confusion. - * (except on PPro SMP or if we are using OOSTORE, so we use xchgb there) - * (PPro errata 66, 92) - */ - -#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) - -static inline void __raw_spin_unlock(__raw_spinlock_t *lock) -{ - asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory"); -} - -#else - -static inline void __raw_spin_unlock(__raw_spinlock_t *lock) -{ - char oldval = 1; - - asm volatile("xchgb %b0, %1" - : "=q" (oldval), "+m" (lock->slock) - : "0" (oldval) : "memory"); -} - -#endif - -static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) -{ - while (__raw_spin_is_locked(lock)) - cpu_relax(); -} - -/* - * Read-write spinlocks, allowing multiple readers - * but only one writer. - * - * NOTE! it is quite common to have readers in interrupts - * but no interrupt writers. For those circumstances we - * can "mix" irq-safe locks - any writer needs to get a - * irq-safe write-lock, but readers can get non-irqsafe - * read-locks. - * - * On x86, we implement read-write locks as a 32-bit counter - * with the high bit (sign) being the "contended" bit. - * - * The inline assembly is non-obvious. Think about it. - * - * Changed to use the same technique as rw semaphores. See - * semaphore.h for details. -ben - * - * the helpers are in arch/i386/kernel/semaphore.c - */ - -/** - * read_can_lock - would read_trylock() succeed? - * @lock: the rwlock in question. - */ -static inline int __raw_read_can_lock(__raw_rwlock_t *x) -{ - return (int)(x)->lock > 0; -} - -/** - * write_can_lock - would write_trylock() succeed? - * @lock: the rwlock in question. 
- */ -static inline int __raw_write_can_lock(__raw_rwlock_t *x) -{ - return (x)->lock == RW_LOCK_BIAS; -} - -static inline void __raw_read_lock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" - "jns 1f\n" - "call __read_lock_failed\n\t" - "1:\n" - ::"a" (rw) : "memory"); -} - -static inline void __raw_write_lock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" - "jz 1f\n" - "call __write_lock_failed\n\t" - "1:\n" - ::"a" (rw) : "memory"); -} - -static inline int __raw_read_trylock(__raw_rwlock_t *lock) -{ - atomic_t *count = (atomic_t *)lock; - atomic_dec(count); - if (atomic_read(count) >= 0) - return 1; - atomic_inc(count); - return 0; -} - -static inline int __raw_write_trylock(__raw_rwlock_t *lock) -{ - atomic_t *count = (atomic_t *)lock; - if (atomic_sub_and_test(RW_LOCK_BIAS, count)) - return 1; - atomic_add(RW_LOCK_BIAS, count); - return 0; -} - -static inline void __raw_read_unlock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX "incl %0" :"+m" (rw->lock) : : "memory"); -} - -static inline void __raw_write_unlock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0" - : "+m" (rw->lock) : : "memory"); -} - -#define __raw_spin_relax(lock) cpu_relax() -#define __raw_read_relax(lock) cpu_relax() -#define __raw_write_relax(lock) cpu_relax() - -#endif /* __ASM_SPINLOCK_H */ Index: linux-2.6.24.7/include/asm-x86/spinlock_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_64.h +++ /dev/null @@ -1,167 +0,0 @@ -#ifndef __ASM_SPINLOCK_H -#define __ASM_SPINLOCK_H - -#include <asm/atomic.h> -#include <asm/rwlock.h> -#include <asm/page.h> -#include <asm/processor.h> - -/* - * Your basic SMP spinlocks, allowing only a single CPU anywhere - * - * Simple spin lock operations. There are two variants, one clears IRQ's - * on the local processor, one does not. - * - * We make no fairness assumptions. They have a cost. - * - * (the type definitions are in asm/spinlock_types.h) - */ - -static inline int __raw_spin_is_locked(__raw_spinlock_t *lock) -{ - return *(volatile signed int *)(&(lock)->slock) <= 0; -} - -static inline void __raw_spin_lock(__raw_spinlock_t *lock) -{ - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 2f\n" - "3:\n" - "rep;nop\n\t" - "cmpl $0,%0\n\t" - "jle 3b\n\t" - "jmp 1b\n" - "2:\t" : "=m" (lock->slock) : : "memory"); -} - -/* - * Same as __raw_spin_lock, but reenable interrupts during spinning. - */ -#ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(__raw_spinlock_t *lock, unsigned long flags) -{ - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 5f\n" - "testl $0x200, %1\n\t" /* interrupts were disabled? 
*/ - "jz 4f\n\t" - "sti\n" - "3:\t" - "rep;nop\n\t" - "cmpl $0, %0\n\t" - "jle 3b\n\t" - "cli\n\t" - "jmp 1b\n" - "4:\t" - "rep;nop\n\t" - "cmpl $0, %0\n\t" - "jg 1b\n\t" - "jmp 4b\n" - "5:\n\t" - : "+m" (lock->slock) : "r" ((unsigned)flags) : "memory"); -} -#endif - -static inline int __raw_spin_trylock(__raw_spinlock_t *lock) -{ - int oldval; - - asm volatile( - "xchgl %0,%1" - :"=q" (oldval), "=m" (lock->slock) - :"0" (0) : "memory"); - - return oldval > 0; -} - -static inline void __raw_spin_unlock(__raw_spinlock_t *lock) -{ - asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory"); -} - -static inline void __raw_spin_unlock_wait(__raw_spinlock_t *lock) -{ - while (__raw_spin_is_locked(lock)) - cpu_relax(); -} - -/* - * Read-write spinlocks, allowing multiple readers - * but only one writer. - * - * NOTE! it is quite common to have readers in interrupts - * but no interrupt writers. For those circumstances we - * can "mix" irq-safe locks - any writer needs to get a - * irq-safe write-lock, but readers can get non-irqsafe - * read-locks. - * - * On x86, we implement read-write locks as a 32-bit counter - * with the high bit (sign) being the "contended" bit. - */ - -static inline int __raw_read_can_lock(__raw_rwlock_t *lock) -{ - return (int)(lock)->lock > 0; -} - -static inline int __raw_write_can_lock(__raw_rwlock_t *lock) -{ - return (lock)->lock == RW_LOCK_BIAS; -} - -static inline void __raw_read_lock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t" - "jns 1f\n" - "call __read_lock_failed\n" - "1:\n" - ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); -} - -static inline void __raw_write_lock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t" - "jz 1f\n" - "\tcall __write_lock_failed\n\t" - "1:\n" - ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); -} - -static inline int __raw_read_trylock(__raw_rwlock_t *lock) -{ - atomic_t *count = (atomic_t *)lock; - atomic_dec(count); - if (atomic_read(count) >= 0) - return 1; - atomic_inc(count); - return 0; -} - -static inline int __raw_write_trylock(__raw_rwlock_t *lock) -{ - atomic_t *count = (atomic_t *)lock; - if (atomic_sub_and_test(RW_LOCK_BIAS, count)) - return 1; - atomic_add(RW_LOCK_BIAS, count); - return 0; -} - -static inline void __raw_read_unlock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX " ; incl %0" :"=m" (rw->lock) : : "memory"); -} - -static inline void __raw_write_unlock(__raw_rwlock_t *rw) -{ - asm volatile(LOCK_PREFIX " ; addl $" RW_LOCK_BIAS_STR ",%0" - : "=m" (rw->lock) : : "memory"); -} - -#define __raw_spin_relax(lock) cpu_relax() -#define __raw_read_relax(lock) cpu_relax() -#define __raw_write_relax(lock) cpu_relax() - -#endif /* __ASM_SPINLOCK_H */ Index: linux-2.6.24.7/include/asm-x86/spinlock_types.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/spinlock_types.h +++ linux-2.6.24.7/include/asm-x86/spinlock_types.h @@ -9,7 +9,7 @@ typedef struct { unsigned int slock; } __raw_spinlock_t; -#define __RAW_SPIN_LOCK_UNLOCKED { 1 } +#define __RAW_SPIN_LOCK_UNLOCKED { 0 } typedef struct { unsigned int lock; ���������������������������������������������������������������������������������������������������������������������������������������������patches/realtime-preempt-warn-about-tracing.patch���������������������������������������������������0000664�0000764�0000764�00000004426�11041657735�021253� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- init/main.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/init/main.c =================================================================== --- linux-2.6.24.7.orig/init/main.c +++ linux-2.6.24.7/init/main.c @@ -878,7 +878,7 @@ static int __init kernel_init(void * unu WARN_ON(irqs_disabled()); #endif -#define DEBUG_COUNT (defined(CONFIG_DEBUG_RT_MUTEXES) + defined(CONFIG_CRITICAL_PREEMPT_TIMING) + defined(CONFIG_CRITICAL_IRQSOFF_TIMING) + defined(CONFIG_FUNCTION_TRACE) + defined(CONFIG_DEBUG_SLAB) + defined(CONFIG_DEBUG_PAGEALLOC) + defined(CONFIG_LOCKDEP)) +#define DEBUG_COUNT (defined(CONFIG_DEBUG_RT_MUTEXES) + defined(CONFIG_IRQSOFF_TRACER) + defined(CONFIG_PREEMPT_TRACER) + defined(CONFIG_FTRACE) + defined(CONFIG_WAKEUP_LATENCY_HIST) + defined(CONFIG_DEBUG_SLAB) + defined(CONFIG_DEBUG_PAGEALLOC) + defined(CONFIG_LOCKDEP)) #if DEBUG_COUNT > 0 printk(KERN_ERR "*****************************************************************************\n"); @@ -892,14 +892,17 @@ static int __init kernel_init(void * unu #ifdef CONFIG_DEBUG_RT_MUTEXES printk(KERN_ERR "* CONFIG_DEBUG_RT_MUTEXES *\n"); #endif -#ifdef CONFIG_CRITICAL_PREEMPT_TIMING - printk(KERN_ERR "* CONFIG_CRITICAL_PREEMPT_TIMING *\n"); +#ifdef CONFIG_IRQSOFF_TRACER + printk(KERN_ERR "* CONFIG_IRQSOFF_TRACER *\n"); #endif -#ifdef CONFIG_CRITICAL_IRQSOFF_TIMING - printk(KERN_ERR "* CONFIG_CRITICAL_IRQSOFF_TIMING *\n"); +#ifdef CONFIG_PREEMPT_TRACER + printk(KERN_ERR "* CONFIG_PREEMPT_TRACER *\n"); #endif -#ifdef CONFIG_FUNCTION_TRACE - printk(KERN_ERR "* CONFIG_FUNCTION_TRACE *\n"); +#ifdef CONFIG_FTRACE + printk(KERN_ERR "* CONFIG_FTRACE *\n"); +#endif +#ifdef CONFIG_WAKEUP_LATENCY_HIST + printk(KERN_ERR "* CONFIG_WAKEUP_LATENCY_HIST *\n"); #endif #ifdef CONFIG_DEBUG_SLAB printk(KERN_ERR "* CONFIG_DEBUG_SLAB *\n"); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86-delay-enable-preempt-tglx.patch���������������������������������������������������������0000664�0000764�0000764�00000010550�11041657735�017663� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: x86: enable preemption in delay The RT team has been searching for a nasty latency. This latency shows up out of the blue and has been seen to be as big as 5ms! Using ftrace I found the cause of the latency. pcscd-2995 3dNh1 52360300us : irq_exit (smp_apic_timer_interrupt) pcscd-2995 3dN.2 52360301us : idle_cpu (irq_exit) pcscd-2995 3dN.2 52360301us : rcu_irq_exit (irq_exit) pcscd-2995 3dN.1 52360771us : smp_apic_timer_interrupt (apic_timer_interrupt ) pcscd-2995 3dN.1 52360771us : exit_idle (smp_apic_timer_interrupt) Here's an example of a 400 us latency. 
pcscd took a timer interrupt and returned with "need resched" enabled, but did not reschedule until after the next interrupt came in at 52360771us 400us later! At first I thought we somehow missed a preemption check in entry.S. But I also noticed that this always seemed to happen during a __delay call. pcscd-2995 3dN.2 52360836us : rcu_irq_exit (irq_exit) pcscd-2995 3.N.. 52361265us : preempt_schedule (__delay) Looking at the x86 delay, I found my problem. In git commit 35d5d08a085c56f153458c3f5d8ce24123617faf, Andrew Morton placed preempt_disable around the entire delay due to TSC's not working nicely on SMP. Unfortunately for those that care about latencies this is devastating! Especially when we have callers to mdelay(8). Here I enable preemption during the loop and account for anytime the task migrates to a new CPU. The delay asked for may be extended a bit by the migration, but delay only guarantees that it will delay for that minimum time. Delaying longer should not be an issue. [ Thanks to Thomas Gleixner for spotting that cpu wasn't updated, and to place the rep_nop between preempt_enabled/disable. ] Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- arch/x86/lib/delay_32.c | 31 +++++++++++++++++++++++++++---- arch/x86/lib/delay_64.c | 30 ++++++++++++++++++++++++++---- 2 files changed, 53 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/arch/x86/lib/delay_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/delay_32.c +++ linux-2.6.24.7/arch/x86/lib/delay_32.c @@ -42,13 +42,36 @@ static void delay_loop(unsigned long loo static void delay_tsc(unsigned long loops) { unsigned long bclock, now; + int cpu; - preempt_disable(); /* TSC's are per-cpu */ + preempt_disable(); + cpu = smp_processor_id(); rdtscl(bclock); - do { - rep_nop(); + for (;;) { rdtscl(now); - } while ((now-bclock) < loops); + if ((now - bclock) >= loops) + break; + + /* Allow RT tasks to run */ + preempt_enable(); + rep_nop(); + preempt_disable(); + + /* + * It is possible that we moved to another CPU, and + * since TSC's are per-cpu we need to calculate + * that. The delay must guarantee that we wait "at + * least" the amount of time. Being moved to another + * CPU could make the wait longer but we just need to + * make sure we waited long enough. Rebalance the + * counter for this CPU. + */ + if (unlikely(cpu != smp_processor_id())) { + loops -= (now - bclock); + cpu = smp_processor_id(); + rdtscl(bclock); + } + } preempt_enable(); } Index: linux-2.6.24.7/arch/x86/lib/delay_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/lib/delay_64.c +++ linux-2.6.24.7/arch/x86/lib/delay_64.c @@ -29,14 +29,36 @@ int read_current_timer(unsigned long *ti void __delay(unsigned long loops) { unsigned bclock, now; + int cpu; - preempt_disable(); /* TSC's are pre-cpu */ + preempt_disable(); + cpu = smp_processor_id(); rdtscl(bclock); - do { - rep_nop(); + for (;;) { rdtscl(now); + if ((now - bclock) >= loops) + break; + + /* Allow RT tasks to run */ + preempt_enable(); + rep_nop(); + preempt_disable(); + + /* + * It is possible that we moved to another CPU, and + * since TSC's are per-cpu we need to calculate + * that. The delay must guarantee that we wait "at + * least" the amount of time. Being moved to another + * CPU could make the wait longer but we just need to + * make sure we waited long enough. Rebalance the + * counter for this CPU. 
+ */ + if (unlikely(cpu != smp_processor_id())) { + loops -= (now - bclock); + cpu = smp_processor_id(); + rdtscl(bclock); + } } - while ((now-bclock) < loops); preempt_enable(); } EXPORT_SYMBOL(__delay); ��������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-compile-fixes.patch������������������������������������������������������������������0000664�0000764�0000764�00000001427�11041657734�016300� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rt: remove call to stop tracer Remove user_trace_stop that was more of a hack to debug xrun latencies. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- sound/core/pcm_lib.c | 1 - 1 file changed, 1 deletion(-) Index: linux-2.6.24.7/sound/core/pcm_lib.c =================================================================== --- linux-2.6.24.7.orig/sound/core/pcm_lib.c +++ linux-2.6.24.7/sound/core/pcm_lib.c @@ -131,7 +131,6 @@ static void xrun(struct snd_pcm_substrea snd_pcm_stop(substream, SNDRV_PCM_STATE_XRUN); #ifdef CONFIG_SND_PCM_XRUN_DEBUG if (substream->pstr->xrun_debug) { - user_trace_stop(); snd_printd(KERN_DEBUG "XRUN: pcmC%dD%d%c\n", substream->pcm->card->number, substream->pcm->device, �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-fix-header.patch���������������������������������������������������������������������0000664�0000764�0000764�00000003454�11041657731�015547� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From clark.williams@gmail.com Sat May 24 20:47:58 2008 Date: Sat, 24 May 2008 14:49:40 -0500 From: Clark Williams <clark.williams@gmail.com> To: Steven Rostedt <rostedt@goodmis.org> Cc: Steven Rostedt <srostedt@redhat.com>, LKML <linux-kernel@vger.kernel.org>, RT <linux-rt-users@vger.kernel.org> Subject: [PATCH -rt] fix for compiling 2.6.24.7-rt10 without CONFIG_FTRACE [ The following text is in the "UTF-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Steven, If you build a debugging kernel and don't have CONFIG_FTRACE turned on, -rt10 dies when compiling arch/x86/kernel/x8664_ksyms_64.c, because ktime_t isn't defined in the prototypes at the bottom of include/linux/ftrace.h. Patch to fix attached. 
Clark -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iEYEARECAAYFAkg4cVQACgkQqA4JVb61b9d0gQCffgzXgm2qaftlj5Q3fjjtyolD J2MAnAoy4j9s2AUhZjwagT6OXzJ3Plgq =9Ypr -----END PGP SIGNATURE----- [ Part 2: "Attached Text" ] fix to handle compiling debugging without CONFIG_FTRACE From: Clark Williams <williams@redhat.com> Signed-off-by: Clark Williams <williams@redhat.com> --- include/linux/ftrace.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -1,10 +1,11 @@ #ifndef _LINUX_FTRACE_H #define _LINUX_FTRACE_H +#include <linux/ktime.h> + #ifdef CONFIG_FTRACE #include <linux/linkage.h> -#include <linux/ktime.h> #include <linux/fs.h> extern int ftrace_enabled; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcupreempt-trace-marker-update.patch��������������������������������������������������������0000664�0000764�0000764�00000006616�11041657731�020315� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rcupreempt trace update to new markers Update the rcupreempt tracing with the new markers. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/rcupreempt_trace.h | 8 ++++---- kernel/rcupreempt_trace.c | 21 +++++++++------------ 2 files changed, 13 insertions(+), 16 deletions(-) Index: linux-2.6.24.7/include/linux/rcupreempt_trace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt_trace.h +++ linux-2.6.24.7/include/linux/rcupreempt_trace.h @@ -76,8 +76,8 @@ struct rcupreempt_probe_data { }; #define DEFINE_RCUPREEMPT_MARKER_HANDLER(rcupreempt_trace_worker) \ -void rcupreempt_trace_worker##_callback(const struct marker *mdata, \ - void *private_data, const char *format, ...) \ +void rcupreempt_trace_worker##_callback(void *private_data, void *call_data, \ + const char *format, va_list *args) \ { \ struct rcupreempt_trace *trace; \ trace = (&per_cpu(trace_data, smp_processor_id())); \ @@ -113,8 +113,8 @@ struct preempt_rcu_boost_trace { }; #define DEFINE_PREEMPT_RCU_BOOST_MARKER_HANDLER(preempt_rcu_boost_var) \ -void preempt_rcu_boost_var##_callback(const struct marker *mdata, \ - void *private_data, const char *format, ...) 
\ +void preempt_rcu_boost_var##_callback(void *private_data, void *call_data, \ + const char *format, va_list *args) \ { \ struct preempt_rcu_boost_trace *boost_trace; \ boost_trace = (&per_cpu(boost_trace_data, smp_processor_id())); \ Index: linux-2.6.24.7/kernel/rcupreempt_trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt_trace.c +++ linux-2.6.24.7/kernel/rcupreempt_trace.c @@ -536,10 +536,6 @@ static int __init rcupreempt_trace_init( if (ret) printk(KERN_INFO "Unable to register rcupreempt \ probe %s\n", rcupreempt_probe_array[i].name); - ret = marker_arm(p->name); - if (ret) - printk(KERN_INFO "Unable to arm rcupreempt probe %s\n", - p->name); } printk(KERN_INFO "RCU Preempt markers registered\n"); @@ -552,10 +548,6 @@ static int __init rcupreempt_trace_init( if (ret) printk(KERN_INFO "Unable to register Preempt RCU Boost \ probe %s\n", preempt_rcu_boost_probe_array[i].name); - ret = marker_arm(p->name); - if (ret) - printk(KERN_INFO "Unable to arm Preempt RCU Boost \ - markers %s\n", p->name); } #endif /* CONFIG_PREEMPT_RCU_BOOST */ @@ -573,14 +565,19 @@ static void __exit rcupreempt_trace_clea { int i; - for (i = 0; i < ARRAY_SIZE(rcupreempt_probe_array); i++) - marker_probe_unregister(rcupreempt_probe_array[i].name); + for (i = 0; i < ARRAY_SIZE(rcupreempt_probe_array); i++) { + struct rcupreempt_probe_data *p = &rcupreempt_probe_array[i]; + marker_probe_unregister(p->name, p->probe_func, p); + } printk(KERN_INFO "RCU Preempt markers unregistered\n"); #ifdef CONFIG_PREEMPT_RCU_BOOST rcu_trace_boost_destroy(); - for (i = 0; i < ARRAY_SIZE(preempt_rcu_boost_probe_array); i++) - marker_probe_unregister(preempt_rcu_boost_probe_array[i].name); + for (i = 0; i < ARRAY_SIZE(preempt_rcu_boost_probe_array); i++) { + struct preempt_rcu_boost_probe *p = \ + &preempt_rcu_boost_probe_array[i]; + marker_probe_unregister(p->name, p->probe_func, p); + } printk(KERN_INFO "Preempt RCU Boost markers unregistered\n"); #endif /* CONFIG_PREEMPT_RCU_BOOST */ debugfs_remove(statdir); ������������������������������������������������������������������������������������������������������������������patches/marker-upstream-example.patch���������������������������������������������������������������0000664�0000764�0000764�00000004262�11041657731�017037� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: markers: update samples to markers upstream. Update the marker sample code to match upstream. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- samples/markers/probe-example.c | 24 +++++++++--------------- 1 file changed, 9 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/samples/markers/probe-example.c =================================================================== --- linux-2.6.24.7.orig/samples/markers/probe-example.c +++ linux-2.6.24.7/samples/markers/probe-example.c @@ -20,31 +20,27 @@ struct probe_data { marker_probe_func *probe_func; }; -void probe_subsystem_event(const struct marker *mdata, void *private, - const char *format, ...) 
+void probe_subsystem_event(void *private, void *calldata, + const char *format, va_list *args) { - va_list ap; /* Declare args */ unsigned int value; const char *mystr; /* Assign args */ - va_start(ap, format); - value = va_arg(ap, typeof(value)); - mystr = va_arg(ap, typeof(mystr)); + value = va_arg(*args, typeof(value)); + mystr = va_arg(*args, typeof(mystr)); /* Call printk */ printk(KERN_DEBUG "Value %u, string %s\n", value, mystr); /* or count, check rights, serialize data in a buffer */ - - va_end(ap); } atomic_t eventb_count = ATOMIC_INIT(0); -void probe_subsystem_eventb(const struct marker *mdata, void *private, - const char *format, ...) +void probe_subsystem_eventb(void *private, void *calldata, + const char *format, va_list *args) { /* Increment counter */ atomic_inc(&eventb_count); @@ -72,10 +68,6 @@ static int __init probe_init(void) if (result) printk(KERN_INFO "Unable to register probe %s\n", probe_array[i].name); - result = marker_arm(probe_array[i].name); - if (result) - printk(KERN_INFO "Unable to arm probe %s\n", - probe_array[i].name); } return 0; } @@ -85,7 +77,9 @@ static void __exit probe_fini(void) int i; for (i = 0; i < ARRAY_SIZE(probe_array); i++) - marker_probe_unregister(probe_array[i].name); + marker_probe_unregister(probe_array[i].name, + probe_array[i].probe_func, &probe_array[i]); + printk(KERN_INFO "Number of event b : %u\n", atomic_read(&eventb_count)); } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nmi-show-regs-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000011321�11041657732�015545� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From h-shimamoto@ct.jp.nec.com Tue May 27 19:37:24 2008 Date: Tue, 27 May 2008 15:45:00 -0700 From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> To: Steven Rostedt <rostedt@goodmis.org>, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org Subject: [PATCH -rt] fix sysrq+l when nmi_watchdog disabled From: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> In nmi_show_all_regs(), set nmi_show_regs for all cpus but NMI never come to itself when nmi_watchdog is disabled. It means the kernel hangs up when sysrq+l is issued. Call irq_show_regs_callback() itself before calling smp_send_nmi_allbutself(). Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> --- Steven, this is a fix for what you pointed. 
http://lkml.org/lkml/2008/4/28/455 arch/x86/kernel/nmi_32.c | 51 ++++++++++++++++++++++++++++------------------- arch/x86/kernel/nmi_64.c | 49 +++++++++++++++++++++++++++++---------------- 2 files changed, 63 insertions(+), 37 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -326,21 +326,49 @@ extern void die_nmi(struct pt_regs *, co int nmi_show_regs[NR_CPUS]; +static DEFINE_RAW_SPINLOCK(nmi_print_lock); + +notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return 0; + + spin_lock(&nmi_print_lock); + printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); + printk(KERN_WARNING "apic_timer_irqs: %d\n", + per_cpu(irq_stat, cpu).apic_timer_irqs); + show_regs(regs); + spin_unlock(&nmi_print_lock); + nmi_show_regs[cpu] = 0; + return 1; +} + void nmi_show_all_regs(void) { - int i; + struct pt_regs *regs; + int i, cpu; if (system_state == SYSTEM_BOOTING) return; - printk(KERN_WARNING "nmi_show_all_regs(): start on CPU#%d.\n", - raw_smp_processor_id()); + preempt_disable(); + + regs = get_irq_regs(); + cpu = smp_processor_id(); + + printk(KERN_WARNING "nmi_show_all_regs(): start on CPU#%d.\n", cpu); dump_stack(); for_each_online_cpu(i) nmi_show_regs[i] = 1; + if (regs) + irq_show_regs_callback(cpu, regs); + else + nmi_show_regs[cpu] = 0; + smp_send_nmi_allbutself(); + preempt_enable(); for_each_online_cpu(i) { while (nmi_show_regs[i] == 1) @@ -348,23 +376,6 @@ void nmi_show_all_regs(void) } } -static DEFINE_RAW_SPINLOCK(nmi_print_lock); - -notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) -{ - if (!nmi_show_regs[cpu]) - return 0; - - spin_lock(&nmi_print_lock); - printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); - printk(KERN_WARNING "apic_timer_irqs: %d\n", - per_cpu(irq_stat, cpu).apic_timer_irqs); - show_regs(regs); - spin_unlock(&nmi_print_lock); - nmi_show_regs[cpu] = 0; - return 1; -} - notrace __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -320,17 +320,48 @@ void touch_nmi_watchdog(void) int nmi_show_regs[NR_CPUS]; +static DEFINE_RAW_SPINLOCK(nmi_print_lock); + +notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) +{ + if (!nmi_show_regs[cpu]) + return 0; + + spin_lock(&nmi_print_lock); + printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); + printk(KERN_WARNING "apic_timer_irqs: %d\n", read_pda(apic_timer_irqs)); + show_regs(regs); + spin_unlock(&nmi_print_lock); + nmi_show_regs[cpu] = 0; + return 1; +} + void nmi_show_all_regs(void) { - int i; + struct pt_regs *regs; + int i, cpu; if (system_state == SYSTEM_BOOTING) return; + preempt_disable(); + + regs = get_irq_regs(); + cpu = smp_processor_id(); + + printk(KERN_WARNING "nmi_show_all_regs(): start on CPU#%d.\n", cpu); + dump_stack(); + for_each_online_cpu(i) nmi_show_regs[i] = 1; + if (regs) + irq_show_regs_callback(cpu, regs); + else + nmi_show_regs[cpu] = 0; + smp_send_nmi_allbutself(); + preempt_enable(); for_each_online_cpu(i) { while (nmi_show_regs[i] == 1) @@ -338,22 +369,6 @@ void nmi_show_all_regs(void) } } -static DEFINE_RAW_SPINLOCK(nmi_print_lock); - -notrace int irq_show_regs_callback(int cpu, struct pt_regs *regs) -{ - if 
(!nmi_show_regs[cpu]) - return 0; - - spin_lock(&nmi_print_lock); - printk(KERN_WARNING "NMI show regs on CPU#%d:\n", cpu); - printk(KERN_WARNING "apic_timer_irqs: %d\n", read_pda(apic_timer_irqs)); - show_regs(regs); - spin_unlock(&nmi_print_lock); - nmi_show_regs[cpu] = 0; - return 1; -} - notrace int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) { ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-fix-rt-task-wakeup.patch��������������������������������������������������������������0000664�0000764�0000764�00000005316�11041657731�017017� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Tue May 27 21:24:29 2008 Date: Tue, 27 May 2008 18:59:34 -0600 From: Gregory Haskins <ghaskins@novell.com> To: Steven Rostedt <rostedt@goodmis.org>, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, Gregory Haskins <ghaskins@novell.com> Subject: [PATCH 1/3] sched: fix RT task-wakeup logic [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] Dmitry Adamushko pointed out a logic error in task_wake_up_rt() where we will always evaluate to "true". You can find the thread here: http://lkml.org/lkml/2008/4/22/296 In reality, we only want to try to push tasks away when a wake up request is not going to preempt the current task. So lets fix it. Note: We introduce test_tsk_need_resched() instead of open-coding the flag check so that the merge-conflict with -rt should help remind us that we may need to support NEEDS_RESCHED_DELAYED in the future, too. 
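In condensed form, the net change is the check below (the schedstat accounting in the actual hunks is left out); see the full diff that follows.

	/* include/linux/sched.h */
	static inline int test_tsk_need_resched(struct task_struct *tsk)
	{
		return unlikely(test_tsk_thread_flag(tsk, TIF_NEED_RESCHED));
	}

	/* kernel/sched_rt.c */
	static void task_wake_up_rt(struct rq *rq, struct task_struct *p)
	{
		/*
		 * Push only when the wakeup is not going to preempt the
		 * current task anyway, i.e. curr has no resched pending.
		 */
		if (!task_running(rq, p) &&
		    !test_tsk_need_resched(rq->curr) &&
		    rq->rt.overloaded)
			push_rt_tasks(rq);
	}
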
Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Dmitry Adamushko <dmitry.adamushko@gmail.com> CC: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 7 ++++++- kernel/sched_rt.c | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -2017,6 +2017,11 @@ static inline void clear_tsk_need_resche clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); } +static inline int test_tsk_need_resched(struct task_struct *tsk) +{ + return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); +} + static inline int signal_pending(struct task_struct *p) { return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING)); @@ -2024,7 +2029,7 @@ static inline int signal_pending(struct static inline int _need_resched(void) { - return unlikely(test_thread_flag(TIF_NEED_RESCHED)); + return unlikely(test_tsk_need_resched(current)); } static inline int need_resched(void) Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -707,7 +707,7 @@ static void post_schedule_rt(struct rq * static void task_wake_up_rt(struct rq *rq, struct task_struct *p) { if (!task_running(rq, p) && - (p->prio >= rq->rt.highest_prio) && + !test_tsk_need_resched(rq->curr) && rq->rt.overloaded) { push_rt_tasks(rq); schedstat_inc(rq, rto_wakeup); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-fix-sched-fair-wakeup.patch�����������������������������������������������������������0000664�0000764�0000764�00000003621�11041673107�017427� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Tue May 27 21:24:41 2008 Date: Tue, 27 May 2008 18:59:39 -0600 From: Gregory Haskins <ghaskins@novell.com> To: Steven Rostedt <rostedt@goodmis.org>, linux-rt-users@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutronix.de>, Gregory Haskins <ghaskins@novell.com> Subject: [PATCH 2/3] sched: fix SCHED_FAIR wake-idle logic error [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] We currently use an optimization to skip the overhead of wake-idle processing if more than one task is assigned to a run-queue. The assumption is that the system must already be load-balanced or we wouldnt be overloaded to begin with. The problem is that we are looking at rq->nr_running, which may include RT tasks in addition to CFS tasks. Since the presence of RT tasks really has no bearing on the balance status of CFS tasks, this throws the calculation off. 
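As a concrete (hypothetical) illustration: a runqueue holding one CFS task plus two queued RT tasks has rq->nr_running == 3 but cfs.nr_running == 1, so the old test skips the wake-idle search even though only a single CFS task is actually competing for the CPU.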
This patch changes the logic to only consider the number of CFS tasks when making the decision to optimze the wake-idle. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -858,7 +858,7 @@ static int wake_idle(int cpu, struct tas * sibling runqueue info. This will avoid the checks and cache miss * penalities associated with that. */ - if (idle_cpu(cpu) || cpu_rq(cpu)->nr_running > 1) + if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1) return cpu; for_each_domain(cpu, sd) { ���������������������������������������������������������������������������������������������������������������patches/trace_hist-latediv.patch��������������������������������������������������������������������0000664�0000764�0000764�00000004535�11041657733�016047� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/trace/trace_hist.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) Index: linux-2.6.24.7/kernel/trace/trace_hist.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_hist.c +++ linux-2.6.24.7/kernel/trace/trace_hist.c @@ -37,7 +37,6 @@ enum { struct hist_data { atomic_t hist_mode; /* 0 log, 1 don't log */ unsigned long min_lat; - unsigned long avg_lat; unsigned long max_lat; unsigned long long beyond_hist_bound_samples; unsigned long long accumulate_lat; @@ -70,7 +69,6 @@ static char *wakeup_latency_hist_dir = " void notrace latency_hist(int latency_type, int cpu, unsigned long latency) { struct hist_data *my_hist; - unsigned long long total_samples; if ((cpu < 0) || (cpu >= NR_CPUS) || (latency_type < INTERRUPT_LATENCY) || (latency_type > WAKEUP_LATENCY) || (latency < 0)) @@ -117,11 +115,8 @@ void notrace latency_hist(int latency_ty else if (latency > my_hist->max_lat) my_hist->max_lat = latency; - total_samples = my_hist->total_samples++; + my_hist->total_samples++; my_hist->accumulate_lat += latency; - if (likely(total_samples)) - my_hist->avg_lat = (unsigned long) - div64_64(my_hist->accumulate_lat, total_samples); return; } @@ -135,16 +130,26 @@ static void *l_start(struct seq_file *m, return NULL; if (index == 0) { + char avgstr[32]; + atomic_dec(&my_hist->hist_mode); + if (likely(my_hist->total_samples)) { + unsigned long avg = (unsigned long) + div64_64(my_hist->accumulate_lat, + my_hist->total_samples); + sprintf(avgstr, "%lu", avg); + } else + strcpy(avgstr, "<undef>"); + seq_printf(m, "#Minimum latency: %lu microseconds.\n" - "#Average latency: %lu microseconds.\n" + "#Average latency: %s microseconds.\n" "#Maximum latency: %lu microseconds.\n" "#Total samples: %llu\n" "#There are %llu samples greater or equal" " than %d microseconds\n" "#usecs\t%16s\n" , my_hist->min_lat - , my_hist->avg_lat + , avgstr , my_hist->max_lat , my_hist->total_samples , my_hist->beyond_hist_bound_samples @@ -220,7 +225,6 @@ static void hist_reset(struct hist_data 
hist->max_lat = 0UL; hist->total_samples = 0ULL; hist->accumulate_lat = 0ULL; - hist->avg_lat = 0UL; atomic_inc(&hist->hist_mode); } �������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-prio-fix.patch�����������������������������������������������������������������������0000664�0000764�0000764�00000002207�11041657732�015321� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: reset prio on unlocks and wakeups The unlocking of an rwlock that woke up processes did not update the rwm prio. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1791,6 +1791,12 @@ rt_read_slowunlock(struct rw_mutex *rwm, wakeup_next_waiter(mutex, savestate); + if (rt_mutex_has_waiters(mutex)) { + waiter = rt_mutex_top_waiter(mutex); + rwm->prio = waiter->task->prio; + } else + rwm->prio = MAX_PRIO; + out: spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -1932,7 +1938,11 @@ rt_write_slowunlock(struct rw_mutex *rwm plist_del(&next->pi_list_entry, &pendowner->pi_waiters); /* add back in as top waiter */ plist_add(&next->pi_list_entry, &pendowner->pi_waiters); - } + + rwm->prio = next->task->prio; + } else + rwm->prio = MAX_PRIO; + spin_unlock(&pendowner->pi_lock); out: �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-fixes.patch��������������������������������������������������������������������������0000664�0000764�0000764�00000013052�11041657732�014702� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: fix pi_list race conditions Found a few pi_list problems, this patch fixes. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 56 ++++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 41 insertions(+), 15 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1108,6 +1108,23 @@ update_rw_mutex_owner(struct rw_mutex *r rt_mutex_set_owner(mutex, mtxowner, 0); } +#ifdef CONFIG_DEBUG_RT_MUTEXES +/* + * A rw lock is about to be added or has already been + * removed from current. 
Make sure it doesn't exist still. + */ +static void rw_check_held(struct rw_mutex *rwm) +{ + int reader_count = current->reader_lock_count; + int i; + + for (i = 0; i < reader_count; i++) + WARN_ON_ONCE(current->owned_read_locks[i].lock == rwm); +} +#else +# define rw_check_held(rwm) do { } while (0) +#endif + /* * The fast path does not add itself to the reader list to keep * from needing to grab the spinlock. We need to add the owner @@ -1122,16 +1139,14 @@ update_rw_mutex_owner(struct rw_mutex *r */ static inline void -rt_rwlock_update_owner(struct rw_mutex *rwm, unsigned owners) +rt_rwlock_update_owner(struct rw_mutex *rwm, struct task_struct *own) { struct reader_lock_struct *rls; - struct task_struct *own; int i; - if (!owners || rt_rwlock_pending(rwm)) + if (!own || rt_rwlock_pending(rwm)) return; - own = rt_rwlock_owner(rwm); if (own == RT_RW_READER) return; @@ -1201,7 +1216,7 @@ static int try_to_take_rw_read(struct rw } owners = atomic_read(&rwm->owners); - rt_rwlock_update_owner(rwm, owners); + rt_rwlock_update_owner(rwm, rt_rwlock_owner(rwm)); /* Check for rwlock limits */ if (rt_rwlock_limit && owners >= rt_rwlock_limit) @@ -1253,6 +1268,7 @@ static int try_to_take_rw_read(struct rw taken: if (incr) { atomic_inc(&rwm->owners); + rw_check_held(rwm); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { rls = ¤t->owned_read_locks[reader_count]; @@ -1280,11 +1296,12 @@ try_to_take_rw_write(struct rw_mutex *rw own = rt_rwlock_owner(rwm); /* owners must be zero for writer */ - rt_rwlock_update_owner(rwm, atomic_read(&rwm->owners)); + if (own) { + rt_rwlock_update_owner(rwm, own); - /* readers or writers? */ - if ((own && !rt_rwlock_pending(rwm))) - return 0; + if (!rt_rwlock_pending(rwm)) + return 0; + } /* * RT_RW_PENDING means that the lock is free, but there are @@ -1431,6 +1448,7 @@ __rt_read_fasttrylock(struct rw_mutex *r } atomic_inc(&rwm->owners); + rw_check_held(rwm); reader_count = current->reader_lock_count; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; @@ -1713,6 +1731,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, WARN_ON(!rls->list.prev || list_empty(&rls->list)); list_del_init(&rls->list); rls->lock = NULL; + rw_check_held(rwm); } break; } @@ -1729,7 +1748,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, if (unlikely(rt_rwlock_owner(rwm) != current && rt_rwlock_owner(rwm) != RT_RW_READER)) { /* Update the owner if necessary */ - rt_rwlock_update_owner(rwm, atomic_read(&rwm->owners)); + rt_rwlock_update_owner(rwm, rt_rwlock_owner(rwm)); goto out; } @@ -1786,7 +1805,8 @@ rt_read_slowunlock(struct rw_mutex *rwm, if (rt_rwlock_limit && unlikely(atomic_read(&rwm->owners) >= rt_rwlock_limit)) goto out; - rwm->owner = RT_RW_PENDING_READ; + if (!reader_count) + rwm->owner = RT_RW_PENDING_READ; } wakeup_next_waiter(mutex, savestate); @@ -1812,6 +1832,7 @@ rt_read_fastunlock(struct rw_mutex *rwm, WARN_ON(!atomic_read(&rwm->count)); WARN_ON(!atomic_read(&rwm->owners)); WARN_ON(!rwm->owner); + smp_mb(); atomic_dec(&rwm->count); if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) { struct reader_lock_struct *rls; @@ -1830,7 +1851,9 @@ rt_read_fastunlock(struct rw_mutex *rwm, rls = ¤t->owned_read_locks[reader_count]; WARN_ON_ONCE(rls->lock != rwm); WARN_ON(rls->list.prev && !list_empty(&rls->list)); + WARN_ON(rls->count != 1); rls->lock = NULL; + rw_check_held(rwm); } else slowfn(rwm, mtx); } @@ -1936,8 +1959,11 @@ rt_write_slowunlock(struct rw_mutex *rwm next = rt_mutex_top_waiter(mutex); /* 
delete incase we didn't go through the loop */ plist_del(&next->pi_list_entry, &pendowner->pi_waiters); - /* add back in as top waiter */ - plist_add(&next->pi_list_entry, &pendowner->pi_waiters); + + /* This could also be a reader (if reader_limit is set) */ + if (next->write_lock) + /* add back in as top waiter */ + plist_add(&next->pi_list_entry, &pendowner->pi_waiters); rwm->prio = next->task->prio; } else @@ -1997,6 +2023,7 @@ rt_mutex_downgrade_write(struct rw_mutex /* we have the lock and are sole owner, then update the accounting */ atomic_inc(&rwm->count); atomic_inc(&rwm->owners); + rw_check_held(rwm); reader_count = current->reader_lock_count++; rls = ¤t->owned_read_locks[reader_count]; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { @@ -2058,8 +2085,7 @@ rt_mutex_downgrade_write(struct rw_mutex /* delete incase we didn't go through the loop */ plist_del(&next->pi_list_entry, ¤t->pi_waiters); - /* add back in as top waiter */ - plist_add(&next->pi_list_entry, ¤t->pi_waiters); + /* No need to add back since readers don't have PI waiters */ } else rwm->prio = MAX_PRIO; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/event-trace-hrtimer-trace.patch�������������������������������������������������������������0000664�0000764�0000764�00000015734�11041657730�017255� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: event-tracer: add clockevent trace The old latency tracer recorded clockevent programming of the timer. This patch adds that back in to the event tracer. 
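The hook is an ordinary marker ("ftrace_event_timer", "%p %p", carrying &expires and &delta), so besides the built-in event tracer any marker probe can consume it. A minimal, hypothetical consumer in the style of the marker probe example earlier in this series (probe name and printk text are made up here) might look like:

	void probe_timer_program(void *probe_private, void *call_private,
				 const char *format, va_list *args)
	{
		/* the two arguments are the pointers handed to trace_mark() */
		ktime_t *expires = va_arg(*args, ktime_t *);
		int64_t *delta = va_arg(*args, int64_t *);

		printk(KERN_DEBUG "clockevent: expires=%lld ns, delta=%lld ns\n",
		       (long long)ktime_to_ns(*expires), (long long)*delta);
	}

Such a probe would be registered against the "ftrace_event_timer" marker with marker_probe_register(), much as the event tracer itself does through event_register_marker() in the diff below.
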
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/ftrace.h | 7 +++++++ kernel/time/clockevents.c | 3 +++ kernel/trace/trace.c | 26 ++++++++++++++++++++++++++ kernel/trace/trace.h | 13 +++++++++++++ kernel/trace/trace_events.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 93 insertions(+) Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -176,6 +176,12 @@ static inline void ftrace_event_task_dea { trace_mark(ftrace_event_task_deactivate, "%p %d", p, cpu); } + +static inline void ftrace_event_program_event(ktime_t *expires, int64_t *delta) +{ + trace_mark(ftrace_event_timer, "%p %p", expires, delta); +} + #else # define ftrace_event_irq(irq, user, ip) do { } while (0) # define ftrace_event_fault(ip, error, addr) do { } while (0) @@ -184,6 +190,7 @@ static inline void ftrace_event_task_dea # define ftrace_event_timestamp(now) do { } while (0) # define ftrace_event_task_activate(p, cpu) do { } while (0) # define ftrace_event_task_deactivate(p, cpu) do { } while (0) +# define ftrace_event_program_event(p, d) do { } while (0) #endif /* CONFIG_TRACE_EVENTS */ #endif /* _LINUX_FTRACE_H */ Index: linux-2.6.24.7/kernel/time/clockevents.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/clockevents.c +++ linux-2.6.24.7/kernel/time/clockevents.c @@ -18,6 +18,7 @@ #include <linux/notifier.h> #include <linux/smp.h> #include <linux/sysdev.h> +#include <linux/ftrace.h> /* The registered clock event devices */ static LIST_HEAD(clockevent_devices); @@ -85,6 +86,8 @@ int clockevents_program_event(struct clo delta = ktime_to_ns(ktime_sub(expires, now)); + ftrace_event_program_event(&expires, &delta); + if (delta <= 0) return -ETIME; Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -1069,6 +1069,22 @@ void tracing_event_timer_set(struct trac entry->timer.timer = timer; } +void tracing_event_program_event(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expires, int64_t *delta) +{ + struct trace_entry *entry; + + entry = tracing_get_trace_entry(tr, data); + tracing_generic_entry_update(entry, flags); + entry->type = TRACE_PROGRAM_EVENT; + entry->program.ip = ip; + entry->program.expire = *expires; + entry->program.delta = *delta; +} + void tracing_event_timer_triggered(struct trace_array *tr, struct trace_array_cpu *data, unsigned long flags, @@ -1722,6 +1738,11 @@ print_lat_fmt(struct trace_iterator *ite trace_seq_printf(s, " (%Ld)\n", entry->timestamp.now.tv64); break; + case TRACE_PROGRAM_EVENT: + seq_print_ip_sym(s, entry->program.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%Ld)\n", + entry->program.expire, entry->program.delta); + break; case TRACE_TASK_ACT: seq_print_ip_sym(s, entry->task.ip, sym_flags); comm = trace_find_cmdline(entry->task.pid); @@ -1897,6 +1918,11 @@ static int print_trace_fmt(struct trace_ trace_seq_printf(s, " (%Ld)\n", entry->timestamp.now.tv64); break; + case TRACE_PROGRAM_EVENT: + seq_print_ip_sym(s, entry->program.ip, sym_flags); + trace_seq_printf(s, " (%Ld) (%Ld)\n", + entry->program.expire, entry->program.delta); + break; case TRACE_TASK_ACT: seq_print_ip_sym(s, entry->task.ip, sym_flags); comm = 
trace_find_cmdline(entry->task.pid); Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -22,6 +22,7 @@ enum trace_type { TRACE_TIMER_SET, TRACE_TIMER_TRIG, TRACE_TIMESTAMP, + TRACE_PROGRAM_EVENT, TRACE_TASK_ACT, TRACE_TASK_DEACT, TRACE_SYSCALL, @@ -79,6 +80,12 @@ struct timer_entry { void *timer; }; +struct program_entry { + unsigned long ip; + ktime_t expire; + int64_t delta; +}; + struct timestamp_entry { unsigned long ip; ktime_t now; @@ -145,6 +152,7 @@ struct trace_entry { struct fault_entry fault; struct timer_entry timer; struct timestamp_entry timestamp; + struct program_entry program; struct task_entry task; struct wakeup_entry wakeup; struct syscall_entry syscall; @@ -331,6 +339,11 @@ void tracing_event_task_deactivate(struc unsigned long ip, struct task_struct *p, int cpu); +void tracing_event_program_event(struct trace_array *tr, + struct trace_array_cpu *data, + unsigned long flags, + unsigned long ip, + ktime_t *expires, int64_t *delta); void tracing_event_wakeup(struct trace_array *tr, struct trace_array_cpu *data, unsigned long flags, Index: linux-2.6.24.7/kernel/trace/trace_events.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_events.c +++ linux-2.6.24.7/kernel/trace/trace_events.c @@ -291,6 +291,40 @@ event_hrtimer_callback(void *probe_data, } static void +event_program_event_callback(void *probe_data, void *call_data, + const char *format, va_list *args) +{ + struct trace_array *tr = probe_data; + struct trace_array_cpu *data; + unsigned long flags; + ktime_t *expires; + int64_t *delta; + long disable; + int cpu; + + if (!tracer_enabled) + return; + + getarg(expires, *args); + getarg(delta, *args); + + /* interrupts should be off, we are in an interrupt */ + cpu = smp_processor_id(); + data = tr->data[cpu]; + + disable = atomic_inc_return(&data->disabled); + if (disable != 1) + goto out; + + local_save_flags(flags); + tracing_event_program_event(tr, data, flags, CALLER_ADDR1, expires, delta); + + out: + atomic_dec(&data->disabled); +} + + +static void event_task_activate_callback(void *probe_data, void *call_data, const char *format, va_list *args) { @@ -511,8 +545,16 @@ static void event_tracer_register(struct if (ret) goto out9; + ret = event_register_marker("ftrace_event_timer", "%p %p", + event_program_event_callback, tr); + if (ret) + goto out10; + return; + out10: + marker_probe_unregister("kernel_sched_schedule", + event_ctx_callback, tr); out9: marker_probe_unregister("kernel_sched_wakeup_new", event_wakeup_callback, tr); @@ -544,6 +586,8 @@ static void event_tracer_register(struct static void event_tracer_unregister(struct trace_array *tr) { + marker_probe_unregister("ftrace_event_timer", + event_program_event_callback, tr); marker_probe_unregister("kernel_sched_schedule", event_ctx_callback, tr); marker_probe_unregister("kernel_sched_wakeup_new", ������������������������������������patches/rwlock-torture.patch������������������������������������������������������������������������0000664�0000764�0000764�00000047011�11041657735�015275� 0����������������������������������������������������������������������������������������������������ustar 
�tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: rwlock torture test This patch adds an rwlock torture test that can be used to test rwlocks. This may only be loaded as a module. Running this will severally bring the system performance to an halt. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/Makefile | 1 kernel/rwlock_torture.c | 781 ++++++++++++++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 11 3 files changed, 793 insertions(+) Index: linux-2.6.24.7/kernel/Makefile =================================================================== --- linux-2.6.24.7.orig/kernel/Makefile +++ linux-2.6.24.7/kernel/Makefile @@ -68,6 +68,7 @@ obj-$(CONFIG_SYSFS) += ksysfs.o obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o +obj-$(CONFIG_RWLOCK_TORTURE_TEST) += rwlock_torture.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o obj-$(CONFIG_PREEMPT_RCU) += rcuclassic.o rcupreempt.o Index: linux-2.6.24.7/kernel/rwlock_torture.c =================================================================== --- /dev/null +++ linux-2.6.24.7/kernel/rwlock_torture.c @@ -0,0 +1,781 @@ +#include <linux/module.h> +#include <linux/spinlock.h> +#include <linux/rwsem.h> +#include <linux/smp_lock.h> +#include <linux/kthread.h> +#include <linux/time.h> +#include <linux/interrupt.h> +#include <linux/delay.h> +#include <linux/random.h> +#include <linux/kallsyms.h> + +#include "rtmutex_common.h" + +#ifdef CONFIG_LOGDEV +#include <linux/logdev.h> +#else +#define lfcnprint(x...) 
do { } while (0) +#define lmark() do { } while(0) +#define logdev_dump() do { } while (0) +#endif + +static DEFINE_RWLOCK(lock1); +static DEFINE_RWLOCK(lock2); +static DEFINE_RWLOCK(lock3); + +static DECLARE_RWSEM(sem1); +static DECLARE_RWSEM(sem2); +static DECLARE_RWSEM(sem3); + +static DEFINE_MUTEX(mutex1); +static DEFINE_MUTEX(mutex2); +static DEFINE_MUTEX(mutex3); + +struct locks { + union { + struct rw_semaphore *sem; + rwlock_t *lock; + struct mutex *mutex; + }; + int type; + char *name; + int read_cnt; + int write_cnt; + int downgrade; + int taken; + int retaken; +}; + +enum { LOCK_TYPE_LOCK = 0, + LOCK_TYPE_SEM = 1, + LOCK_TYPE_MUTEX = 2 +}; + +static struct locks test_lock1 = { + { + .lock = &lock1, + }, + .type = LOCK_TYPE_LOCK, + .name = "lock1", +}; + +static struct locks test_lock2 = { + { + .lock = &lock2, + }, + .type = LOCK_TYPE_LOCK, + .name = "lock2", +}; + +static struct locks test_lock3 = { + { + .lock = &lock3, + }, + .type = LOCK_TYPE_LOCK, + .name = "lock3", +}; + +static struct locks test_sem1 = { + { + .sem = &sem1, + }, + .type = LOCK_TYPE_SEM, + .name = "sem1", +}; + +static struct locks test_sem2 = { + { + .sem = &sem2, + }, + .type = LOCK_TYPE_SEM, + .name = "sem2", +}; + +static struct locks test_sem3 = { + { + .sem = &sem3, + }, + .type = LOCK_TYPE_SEM, + .name = "sem3", +}; + +static struct locks test_mutex1 = { + { + .mutex = &mutex1, + }, + .type = LOCK_TYPE_MUTEX, + .name = "mutex1", +}; + +static struct locks test_mutex2 = { + { + .mutex = &mutex2, + }, + .type = LOCK_TYPE_MUTEX, + .name = "mutex2", +}; + +static struct locks test_mutex3 = { + { + .mutex = &mutex3, + }, + .type = LOCK_TYPE_MUTEX, + .name = "mutex3", +}; + +static int test_done; + +#define TIME_MAX 20000 + +#define DEFAULT_NR_THREADS 300 +#define DEFAULT_NR_RT_THREADS 10 + +/* times in usecs */ +#define DEFAULT_SCHED_OTHER_TIME_US 1000 +#define DEFAULT_SCHED_FIFO_TIME_US 200 + +/* this is in millisecs */ +#define DEFAULT_SCHED_FIFO_SLEEP_TIME 2 +#define DEFAULT_SCHED_OTHER_SLEEP_TIME 1 + +#define DEFAULT_RT_THREAD_PRIO 40 + +#define NR_TESTS 3 +static unsigned long sched_other_time_usecs = DEFAULT_SCHED_OTHER_TIME_US; +static unsigned long sched_fifo_time_usecs = DEFAULT_SCHED_FIFO_TIME_US; +static unsigned int sched_other_sleep_ms = DEFAULT_SCHED_OTHER_SLEEP_TIME; +static unsigned int sched_fifo_sleep_ms = DEFAULT_SCHED_FIFO_SLEEP_TIME; + +static unsigned long rt_thread_prio = DEFAULT_RT_THREAD_PRIO; +static unsigned int thread_count = DEFAULT_NR_THREADS; +static unsigned int rt_thread_count = DEFAULT_NR_RT_THREADS; +static int test_time = 30; +static struct task_struct **tsks; + +static int perform_downgrade_write = 0; + +enum { + LOCK_READ = 0, + LOCK_WRITE = 1, + SEM_READ = 2, + SEM_WRITE = 3, + MUTEX = 5 /* must be odd */ +}; + +#ifdef CONFIG_PREEMPT_RT +static void show_rtm_owner(char *str, struct rt_mutex *rtm) +{ + struct task_struct *owner; + unsigned long val; + char *name; + + rcu_read_lock(); + val = (unsigned long)rtm->owner; + owner = (struct task_struct *)(val & ~3UL); + name = "NULL"; + if (owner) { + if (owner == (struct task_struct *)0x100) + name = "READER"; + else + name = owner->comm; + } + printk("%s val: %lx owner: %s\n", str, val, name); + + rcu_read_unlock(); +} + +static void show_mutex_owner(char *str, struct mutex *mutex) +{ + show_rtm_owner(str, &mutex->lock); +} + +static void show_rwm_owner(char *str, struct rw_mutex *rwm) +{ + struct reader_lock_struct *rls; + struct task_struct *owner; + unsigned long val; + char *name; + + rcu_read_lock(); + val = 
(unsigned long)rwm->owner; + owner = (struct task_struct *)(val & ~3UL); + name = "NULL"; + if (owner) { + switch ((unsigned long)owner) { + case 0x100: + name = "READER"; + break; + case 0x200: + name = "PENDING READER"; + break; + case 0x400: + name = "PENDING WRITER"; + break; + default: + name = owner->comm; + } + } + printk("%s val: %lx owner: %s count %d owners %d ", str, val, name, + atomic_read(&rwm->count), + atomic_read(&rwm->owners)); + show_rtm_owner(" mutex: ", &rwm->mutex); + list_for_each_entry(rls, &rwm->readers, list) { + if (!rls->task) + printk("NULL TASK!!!\n"); + else + printk(" owned by: %s:%d\n", + rls->task->comm, rls->task->pid); + } + rcu_read_unlock(); +} + +static void show_rwlock_owner(char *str, rwlock_t *lock) +{ + show_rwm_owner(str, &lock->owners); +} + +static void show_sem_owner(char *str, struct rw_semaphore *sem) +{ + show_rwm_owner(str, &sem->owners); +} + +void print_owned_read_locks(struct task_struct *tsk) +{ + int i; + + if (!tsk->reader_lock_count) + return; + + oops_in_progress++; + printk(" %s:%d owns:\n", tsk->comm, tsk->pid); + for (i = 0; i < tsk->reader_lock_count; i++) { + printk(" %p\n", tsk->owned_read_locks[i].lock); + } + oops_in_progress--; +} + +#else +# define show_sem_owner(x...) do { } while (0) +# define show_rwlock_owner(x...) do { } while (0) +# define show_mutex_owner(x...) do { } while (0) +#endif + +static int do_read(int read) +{ + unsigned long x; + int ret; + + x = random32(); + + /* rwlock can not schedule */ + if (!(read & ~1)) { + ret = LOCK_READ; + goto out; + } + + /* every other time pick a mutex */ + if (x & 0x1000) + return MUTEX; /* do mutex */ + + /* alternate between locks and semaphores */ + if (x & 0x10) + ret = LOCK_READ; + else + ret = SEM_READ; + + out: + /* Do write 1 in 16 times */ + return ret | !(x & 0xf); +} + +static struct locks * +pick_lock(struct locks *lock, struct locks *sem, struct locks *mutex, int read) +{ + switch (read) { + case LOCK_READ: + case LOCK_WRITE: + return lock; + case SEM_READ: + case SEM_WRITE: + return sem; + case MUTEX: + return mutex; + } + return NULL; +} + +static void do_lock(struct locks *lock, int read) +{ + switch (read) { + case LOCK_READ: + if (unlikely(lock->type != LOCK_TYPE_LOCK)) { + printk("FAILED expected lock but got %d\n", + lock->type); + return; + } + lfcnprint("read lock %s %p count=%d owners=%d", + lock->name, lock, atomic_read(&lock->lock->owners.count), + atomic_read(&lock->lock->owners.owners)); + read_lock(lock->lock); + break; + case LOCK_WRITE: + if (unlikely(lock->type != LOCK_TYPE_LOCK)) { + printk("FAILED expected lock but got %d\n", + lock->type); + return; + } + lfcnprint("write lock %s %p", lock->name, lock); + write_lock(lock->lock); + break; + case SEM_READ: + if (unlikely(lock->type != LOCK_TYPE_SEM)) { + printk("FAILED expected sem but got %d\n", + lock->type); + return; + } + lfcnprint("read sem %s %p count=%d owners=%d", + lock->name, lock, + atomic_read(&lock->sem->owners.count), + atomic_read(&lock->sem->owners.owners)); + down_read(lock->sem); + break; + case SEM_WRITE: + if (unlikely(lock->type != LOCK_TYPE_SEM)) { + printk("FAILED expected sem but got %d\n", + lock->type); + return; + } + lfcnprint("write sem %s %p", lock->name, lock); + down_write(lock->sem); + break; + case MUTEX: + if (unlikely(lock->type != LOCK_TYPE_MUTEX)) { + printk("FAILED expected mutex but got %d\n", + lock->type); + return; + } + lfcnprint("mutex %s %p", lock->name, lock); + mutex_lock(lock->mutex); + break; + default: + printk("bad lock value %d!!!\n", 
read); + } + lfcnprint("taken %s %p", lock->name, lock); +} + +static void do_unlock(struct locks *lock, int read) +{ + switch (read) { + case LOCK_READ: + if (unlikely(lock->type != LOCK_TYPE_LOCK)) { + printk("FAILED expected lock but got %d\n", + lock->type); + return; + } + lfcnprint("read lock %s %p count=%d owners=%d", + lock->name, lock, atomic_read(&lock->lock->owners.count), + atomic_read(&lock->lock->owners.owners)); + read_unlock(lock->lock); + break; + case LOCK_WRITE: + if (unlikely(lock->type != LOCK_TYPE_LOCK)) { + printk("FAILED expected lock but got %d\n", + lock->type); + return; + } + lfcnprint("write lock %s %p", lock->name, lock); + write_unlock(lock->lock); + break; + case SEM_READ: + if (unlikely(lock->type != LOCK_TYPE_SEM)) { + printk("FAILED expected sem but got %d\n", + lock->type); + return; + } + lfcnprint("read sem %s, %p count=%d owners=%d", + lock->name, lock, atomic_read(&lock->sem->owners.count), + atomic_read(&lock->sem->owners.owners)); + up_read(lock->sem); + break; + case SEM_WRITE: + if (unlikely(lock->type != LOCK_TYPE_SEM)) { + printk("FAILED expected sem but got %d\n", + lock->type); + return; + } + lfcnprint("write sem %s %p", lock->name, lock); + up_write(lock->sem); + break; + case MUTEX: + if (unlikely(lock->type != LOCK_TYPE_MUTEX)) { + printk("FAILED expected mutex but got %d\n", + lock->type); + return; + } + lfcnprint("mutex %s %p", lock->name, lock); + mutex_unlock(lock->mutex); + break; + default: + printk("bad lock value %d!!!\n", read); + } + lfcnprint("unlocked"); +} + +static void do_something(unsigned long time, int ignore) +{ + lmark(); + if (test_done) + return; + if (time > TIME_MAX) + time = TIME_MAX; + udelay(time); +} + +static void do_downgrade(unsigned long time, struct locks *lock, int *read) +{ + struct rw_semaphore *sem = lock->sem; + unsigned long x; + + if (!perform_downgrade_write) + return; + + if (test_done) + return; + + if (*read == SEM_WRITE) { + x = random32(); + + /* Do downgrade write 1 in 16 times of a write */ + if (!(x & 0xf)) { + lfcnprint("downgrade %p", sem); + lock->downgrade++; + downgrade_write(sem); + do_something(time, 0); + /* need to do unlock read */ + *read = SEM_READ; + } + } +} + +static void update_stats(int read, struct locks *lock) +{ + switch (read) { + case LOCK_READ: + case SEM_READ: + lock->read_cnt++; + break; + case LOCK_WRITE: + case SEM_WRITE: + lock->write_cnt++; + break; + } + lock->taken++; +} + +#define MAX_DEPTH 10 + +static void run_lock(void (*func)(unsigned long time, int read), + struct locks *lock, unsigned long time, int read, int depth); + +static void do_again(void (*func)(unsigned long time, int read), + struct locks *lock, unsigned long time, int read, int depth) +{ + unsigned long x; + + if (test_done) + return; + + /* If this was grabbed for read via rwlock, do again */ + if (likely(read != LOCK_READ) || depth >= MAX_DEPTH) + return; + + x = random32(); + if (x & 1) { + lfcnprint("read lock again"); + run_lock(func, lock, time, read, depth+1); + } +} + +static void run_lock(void (*func)(unsigned long time, int read), + struct locks *lock, unsigned long time, int read, int depth) +{ + if (test_done) + return; + + update_stats(read, lock); + if (depth) + lock->retaken++; + do_lock(lock, read); + if (!test_done) { + func(time, do_read(read)); + do_again(func, lock, time, read, depth); + } + do_downgrade(time, lock, &read); + do_unlock(lock, read); + +} + +static void run_one_lock(unsigned long time, int read) +{ + struct locks *lock; + + lmark(); + lock = 
pick_lock(&test_lock1, &test_sem1, &test_mutex1, read); + run_lock(do_something, lock, time, read, 0); +} + +static void run_two_locks(unsigned long time, int read) +{ + struct locks *lock; + + lmark(); + lock = pick_lock(&test_lock2, &test_sem2, &test_mutex2, read); + run_lock(run_one_lock, lock, time, read, 0); +} + +static void run_three_locks(unsigned long time, int read) +{ + struct locks *lock; + + lmark(); + lock = pick_lock(&test_lock3, &test_sem3, &test_mutex3, read); + run_lock(run_two_locks, lock, time, read, 0); +} + +static int run_test(unsigned long time) +{ + unsigned long long start; + int read; + int ret; + + if (test_done) + return 0; + + start = random32(); + + read = do_read(MUTEX); + + switch (ret = (start & 3)) { + case 0: + run_one_lock(time, read); + break; + case 1: + run_two_locks(time, read); + break; + case 2: + run_three_locks(time, read); + break; + default: + ret = 1; + run_two_locks(time, read); + } + + WARN_ON_ONCE(current->reader_lock_count); + + return ret; +} + +static int rwlock_thread(void *arg) +{ + long prio = (long)arg; + unsigned long time; + unsigned long run; + struct sched_param param; + + time = sched_fifo_time_usecs; + if (prio) { + param.sched_priority = prio; + sched_setscheduler(current, SCHED_FIFO, ¶m); + time = sched_fifo_time_usecs; + } + + while (!kthread_should_stop()) { + run = run_test(time); + + if (prio) + msleep(sched_fifo_sleep_ms); + else + msleep(sched_other_sleep_ms); + } + + return 0; +} + +static void print_lock_stat(struct locks *lock) +{ + switch (lock->type) { + case LOCK_TYPE_LOCK: + case LOCK_TYPE_SEM: + printk("%8s taken for read: %9d\n", lock->name, lock->read_cnt); + printk("%8s taken for write: %8d\n", lock->name, lock->write_cnt); + if (lock->type == LOCK_TYPE_LOCK) { + printk("%8s retaken: %9d\n", + lock->name, lock->retaken); + } else if (perform_downgrade_write) { + printk("%8s downgraded: %9d\n", + lock->name, lock->downgrade); + } + } + printk("%8s taken: %8d\n\n", lock->name, lock->taken); +} + +static int __init mutex_stress_init(void) +{ + long i; + + tsks = kmalloc(sizeof(*tsks) * (thread_count + rt_thread_count), GFP_KERNEL); + if (!tsks) { + printk("failed to allocate tasks\n"); + return -1; + } + + printk("create threads and run for %d seconds\n", test_time); + + for (i=0; i < thread_count; i++) + tsks[i] = kthread_run(rwlock_thread, NULL, "mtest%d", i); + for (i=0; i < rt_thread_count; i++) { + long prio = rt_thread_prio + i; + tsks[thread_count + i] = + kthread_run(rwlock_thread, (void*)prio, + "mtest%d", thread_count + i); + } + + + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(test_time * HZ); + + printk("kill threads\n"); + test_done = 1; + + set_current_state(TASK_INTERRUPTIBLE); + /* sleep some to allow all tasks to finish */ + schedule_timeout(3 * HZ); + + lfcnprint("Done"); + + show_rwlock_owner("lock1: ", &lock1); + show_rwlock_owner("lock2: ", &lock2); + show_rwlock_owner("lock3: ", &lock3); + + show_sem_owner("sem1: ", &sem1); + show_sem_owner("sem2: ", &sem2); + show_sem_owner("sem3: ", &sem3); + + show_mutex_owner("mutex1: ", &mutex1); + show_mutex_owner("mutex2: ", &mutex2); + show_mutex_owner("mutex3: ", &mutex3); + + oops_in_progress++; +// logdev_dump(); + oops_in_progress--; + +#ifdef CONFIG_PREEMPT_RT + for (i=0; i < (thread_count + rt_thread_count); i++) { + if (tsks[i]) { + struct rt_mutex *mtx; + unsigned long own; + struct rt_mutex_waiter *w; + + spin_lock_irq(&tsks[i]->pi_lock); + + print_owned_read_locks(tsks[i]); + + if (tsks[i]->pi_blocked_on) { + w = (void 
*)tsks[i]->pi_blocked_on; + mtx = w->lock; + spin_unlock_irq(&tsks[i]->pi_lock); + spin_lock_irq(&mtx->wait_lock); + spin_lock(&tsks[i]->pi_lock); + own = (unsigned long)mtx->owner & ~3UL; + oops_in_progress++; + printk("%s:%d is blocked on ", + tsks[i]->comm, tsks[i]->pid); + __print_symbol("%s", (unsigned long)mtx); + if (own == 0x100) + printk(" owner is READER\n"); + else if (!(own & ~300)) + printk(" owner is ILLEGAL!!\n"); + else if (!own) + printk(" has no owner!\n"); + else { + struct task_struct *owner = (void*)own; + + printk(" owner is %s:%d\n", + owner->comm, owner->pid); + } + oops_in_progress--; + + spin_unlock(&tsks[i]->pi_lock); + spin_unlock_irq(&mtx->wait_lock); + } else { + print_owned_read_locks(tsks[i]); + spin_unlock_irq(&tsks[i]->pi_lock); + } + } + } +#endif + for (i=0; i < (thread_count + rt_thread_count); i++) { + if (tsks[i]) + kthread_stop(tsks[i]); + } + + print_lock_stat(&test_lock1); + print_lock_stat(&test_lock2); + print_lock_stat(&test_lock3); + print_lock_stat(&test_sem1); + print_lock_stat(&test_sem2); + print_lock_stat(&test_sem3); + print_lock_stat(&test_mutex1); + print_lock_stat(&test_mutex2); + print_lock_stat(&test_mutex3); + + if (!perform_downgrade_write) { + printk("No downgrade writes performed.\n" + " To enable it, pass in perform_downgrade_write=1 to the module\n"); + } + + return 0; +} + +static void mutex_stress_exit(void) +{ +} + +module_init(mutex_stress_init); +module_exit(mutex_stress_exit); + +module_param(perform_downgrade_write, int, 0644); +MODULE_PARM_DESC(perform_downgrade_write, + "Perform downgrade_write in the test"); + +module_param(sched_other_time_usecs, ulong, 0644); +MODULE_PARM_DESC(sched_other_time_usecs, + "Number of usecs to \"do something\""); + +module_param(sched_fifo_time_usecs, ulong, 0644); +MODULE_PARM_DESC(sched_fifo_time_usecs, + "Number of usecs for rt tasks to \"do something\""); + +module_param(sched_other_sleep_ms, uint, 0644); +MODULE_PARM_DESC(sched_other_sleep_ms, + "Number of usecs for tasks to sleep"); + +module_param(sched_fifo_sleep_ms, uint, 0644); +MODULE_PARM_DESC(sched_fifo_sleep_ms, + "Number of usecs for rt tasks to sleep"); + +module_param(rt_thread_prio, long, 0644); +MODULE_PARM_DESC(rt_thread_prio, "priority if FIFO tasks"); + +module_param(thread_count, uint, 0644); +MODULE_PARM_DESC(thread_count, "Number of threads to run"); + +module_param(rt_thread_count, uint, 0644); +MODULE_PARM_DESC(rt_thread_count, "Number of RT threads to run"); + +module_param(test_time, uint, 0644); +MODULE_PARM_DESC(test_time, "Number of seconds to run the test"); + +MODULE_AUTHOR("Steven Rostedt"); +MODULE_DESCRIPTION("Mutex Stress"); +MODULE_LICENSE("GPL"); Index: linux-2.6.24.7/lib/Kconfig.debug =================================================================== --- linux-2.6.24.7.orig/lib/Kconfig.debug +++ linux-2.6.24.7/lib/Kconfig.debug @@ -229,6 +229,17 @@ config DEBUG_SEMAPHORE verbose debugging messages. If you suspect a semaphore problem or a kernel hacker asks for this option then say Y. Otherwise say N. +config RWLOCK_TORTURE_TEST + tristate "torture tests for Priority Inheritance RW locks" + depends on DEBUG_KERNEL + depends on m + default n + help + This option provides a kernel modules that runs a torture test + of several threads that try to grab mutexes, rwlocks and rwsems. + + Say N if you are unsure. 
+ config DEBUG_LOCK_ALLOC bool "Lock debugging: detect incorrect freeing of live locks" depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-wakeup-rawspinlock.patch�������������������������������������������������������������0000664�0000764�0000764�00000006061�11041657733�017360� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: user raw spin lock for wakeup function trace Lockdep gets hung up by function traces that call spinlocks. Change wakeup spin lock used in the wake up function tracing to raw. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace_sched_wakeup.c | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) Index: linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_sched_wakeup.c +++ linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c @@ -26,7 +26,8 @@ static struct task_struct *wakeup_task; static int wakeup_cpu; static unsigned wakeup_prio = -1; -static DEFINE_RAW_SPINLOCK(wakeup_lock); +static __raw_spinlock_t wakeup_lock = + (__raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED; static void __wakeup_reset(struct trace_array *tr); @@ -56,7 +57,8 @@ wakeup_tracer_call(unsigned long ip, uns if (unlikely(disabled != 1)) goto out; - spin_lock_irqsave(&wakeup_lock, flags); + raw_local_irq_save(flags); + __raw_spin_lock(&wakeup_lock); if (unlikely(!wakeup_task)) goto unlock; @@ -71,7 +73,8 @@ wakeup_tracer_call(unsigned long ip, uns trace_function(tr, data, ip, parent_ip, flags); unlock: - spin_unlock_irqrestore(&wakeup_lock, flags); + __raw_spin_unlock(&wakeup_lock); + raw_local_irq_restore(flags); out: atomic_dec(&data->disabled); @@ -145,7 +148,8 @@ wakeup_sched_switch(void *private, void if (likely(disabled != 1)) goto out; - spin_lock_irqsave(&wakeup_lock, flags); + local_irq_save(flags); + __raw_spin_lock(&wakeup_lock); /* We could race with grabbing wakeup_lock */ if (unlikely(!tracer_enabled || next != wakeup_task)) @@ -174,7 +178,8 @@ wakeup_sched_switch(void *private, void out_unlock: __wakeup_reset(tr); - spin_unlock_irqrestore(&wakeup_lock, flags); + __raw_spin_unlock(&wakeup_lock); + local_irq_restore(flags); out: atomic_dec(&tr->data[cpu]->disabled); } @@ -209,8 +214,6 @@ static void __wakeup_reset(struct trace_ struct trace_array_cpu *data; int cpu; - assert_spin_locked(&wakeup_lock); - for_each_possible_cpu(cpu) { data = tr->data[cpu]; tracing_reset(data); @@ -229,9 +232,11 @@ static void wakeup_reset(struct trace_ar { unsigned long flags; - spin_lock_irqsave(&wakeup_lock, flags); + local_irq_save(flags); + 
__raw_spin_lock(&wakeup_lock); __wakeup_reset(tr); - spin_unlock_irqrestore(&wakeup_lock, flags); + __raw_spin_unlock(&wakeup_lock); + local_irq_restore(flags); } static void @@ -252,7 +257,7 @@ wakeup_check_start(struct trace_array *t goto out; /* interrupts should be off from try_to_wake_up */ - spin_lock(&wakeup_lock); + __raw_spin_lock(&wakeup_lock); /* check for races. */ if (!tracer_enabled || p->prio >= wakeup_prio) @@ -274,7 +279,7 @@ wakeup_check_start(struct trace_array *t CALLER_ADDR1, CALLER_ADDR2, flags); out_locked: - spin_unlock(&wakeup_lock); + __raw_spin_unlock(&wakeup_lock); out: atomic_dec(&tr->data[cpu]->disabled); } �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/radix-tree-lockdep-plus1.patch��������������������������������������������������������������0000664�0000764�0000764�00000001767�11041657733�017025� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: lockdep: add +1 to radix tree array The radix tree array was off by 1 used for lockdep annotation. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- lib/radix-tree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/lib/radix-tree.c =================================================================== --- linux-2.6.24.7.orig/lib/radix-tree.c +++ linux-2.6.24.7/lib/radix-tree.c @@ -77,10 +77,10 @@ struct radix_tree_path { static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly; #ifdef CONFIG_RADIX_TREE_CONCURRENT -static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH]; +static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH + 1]; #endif #ifdef CONFIG_DEBUG_LOCK_ALLOC -static const char *radix_node_key_string[RADIX_TREE_MAX_PATH] = { +static const char *radix_node_key_string[RADIX_TREE_MAX_PATH + 1] = { "radix-node-00", "radix-node-01", "radix-node-02", ���������patches/sched-cpupri-hotplug-support.patch����������������������������������������������������������0000664�0000764�0000764�00000020650�11041657733�020050� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Thu Jun 5 09:51:42 2008 Date: Wed, 04 Jun 2008 15:04:05 -0400 From: Gregory Haskins <ghaskins@novell.com> To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutonix.de>, Steven Rostedt <rostedt@goodmis.org>, Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com>, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, Peter Zijlstra 
<peterz@infradead.org> Subject: [PATCH 1/2] sched: fix cpupri hotplug support [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] The RT folks over at RedHat found an issue w.r.t. hotplug support which was traced to problems with the cpupri infrastructure in the scheduler: https://bugzilla.redhat.com/show_bug.cgi?id=449676 This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel. This patch applies to 25.4-rt4, though it should trivially apply to most cpupri enabled kernels mentioned above. It turned out that the issue was that offline cpus could get inadvertently registered with cpupri so that they were erroneously selected during migration decisions. The end result would be an OOPS as the offline cpu had tasks routed to it. This patch generalizes the old join/leave domain interface into an online/offline interface, and adjusts the root-domain/hotplug code to utilize it. I was able to easily reproduce the issue prior to this patch, and am no longer able to reproduce it after this patch. I can offline cpus indefinately and everything seems to be in working order. Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point me in the right direction. Also thank you to Peter for reviewing the early iterations of this patch. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutonix.de> CC: Steven Rostedt <rostedt@goodmis.org> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Peter Zijlstra <peterz@infradead.org> --- include/linux/sched.h | 4 +-- kernel/sched.c | 54 +++++++++++++++++++++++++++++++++++++------------- kernel/sched_rt.c | 26 +++++++++++++++++------- 3 files changed, 61 insertions(+), 23 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -931,8 +931,8 @@ struct sched_class { void (*task_new) (struct rq *rq, struct task_struct *p); void (*set_cpus_allowed)(struct task_struct *p, cpumask_t *newmask); - void (*join_domain)(struct rq *rq); - void (*leave_domain)(struct rq *rq); + void (*rq_online)(struct rq *rq); + void (*rq_offline)(struct rq *rq); void (*switched_from) (struct rq *this_rq, struct task_struct *task, int running); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -428,6 +428,7 @@ struct rq { int push_cpu; /* cpu of this runqueue: */ int cpu; + int online; struct task_struct *migration_thread; struct list_head migration_queue; @@ -1093,6 +1094,8 @@ static int task_hot(struct task_struct * #endif #define sched_class_highest (&rt_sched_class) +#define for_each_class(class) \ + for (class = sched_class_highest; class; class = class->next) /* * Update delta_exec, delta_fair fields for rq. 
@@ -6109,6 +6112,36 @@ static void unregister_sched_domain_sysc } #endif +static void set_rq_online(struct rq *rq) +{ + if (!rq->online) { + const struct sched_class *class; + + cpu_set(rq->cpu, rq->rd->online); + rq->online = 1; + + for_each_class(class) { + if (class->rq_online) + class->rq_online(rq); + } + } +} + +static void set_rq_offline(struct rq *rq) +{ + if (rq->online) { + const struct sched_class *class; + + for_each_class(class) { + if (class->rq_offline) + class->rq_offline(rq); + } + + cpu_clear(rq->cpu, rq->rd->online); + rq->online = 0; + } +} + /* * migration_call - callback that gets triggered when a CPU is added. * Here we can start up the necessary migration thread for the new CPU. @@ -6149,7 +6182,8 @@ migration_call(struct notifier_block *nf spin_lock_irqsave(&rq->lock, flags); if (rq->rd) { BUG_ON(!cpu_isset(cpu, rq->rd->span)); - cpu_set(cpu, rq->rd->online); + + set_rq_online(rq); } spin_unlock_irqrestore(&rq->lock, flags); break; @@ -6209,7 +6243,7 @@ migration_call(struct notifier_block *nf spin_lock_irqsave(&rq->lock, flags); if (rq->rd) { BUG_ON(!cpu_isset(cpu, rq->rd->span)); - cpu_clear(cpu, rq->rd->online); + set_rq_offline(rq); } spin_unlock_irqrestore(&rq->lock, flags); break; @@ -6407,7 +6441,6 @@ sd_parent_degenerate(struct sched_domain static void rq_attach_root(struct rq *rq, struct root_domain *rd) { unsigned long flags; - const struct sched_class *class; struct root_domain *reap = NULL; spin_lock_irqsave(&rq->lock, flags); @@ -6415,13 +6448,10 @@ static void rq_attach_root(struct rq *rq if (rq->rd) { struct root_domain *old_rd = rq->rd; - for (class = sched_class_highest; class; class = class->next) { - if (class->leave_domain) - class->leave_domain(rq); - } + if (cpu_isset(rq->cpu, old_rd->online)) + set_rq_offline(rq); cpu_clear(rq->cpu, old_rd->span); - cpu_clear(rq->cpu, old_rd->online); if (atomic_dec_and_test(&old_rd->refcount)) reap = old_rd; @@ -6432,12 +6462,7 @@ static void rq_attach_root(struct rq *rq cpu_set(rq->cpu, rd->span); if (cpu_isset(rq->cpu, cpu_online_map)) - cpu_set(rq->cpu, rd->online); - - for (class = sched_class_highest; class; class = class->next) { - if (class->join_domain) - class->join_domain(rq); - } + set_rq_online(rq); spin_unlock_irqrestore(&rq->lock, flags); @@ -7452,6 +7477,7 @@ void __init sched_init(void) rq->next_balance = jiffies; rq->push_cpu = 0; rq->cpu = i; + rq->online = 0; rq->migration_thread = NULL; INIT_LIST_HEAD(&rq->migration_queue); rq->rt.highest_prio = MAX_RT_PRIO; Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -12,6 +12,9 @@ static inline int rt_overloaded(struct r static inline void rt_set_overload(struct rq *rq) { + if (!rq->online) + return; + cpu_set(rq->cpu, rq->rd->rto_mask); /* * Make sure the mask is visible before we set @@ -26,6 +29,9 @@ static inline void rt_set_overload(struc static inline void rt_clear_overload(struct rq *rq) { + if (!rq->online) + return; + /* the order here really doesn't matter */ atomic_dec(&rq->rd->rto_count); cpu_clear(rq->cpu, rq->rd->rto_mask); @@ -78,7 +84,10 @@ static inline void inc_rt_tasks(struct t #ifdef CONFIG_SMP if (p->prio < rq->rt.highest_prio) { rq->rt.highest_prio = p->prio; - cpupri_set(&rq->rd->cpupri, rq->cpu, p->prio); + + if (rq->online) + cpupri_set(&rq->rd->cpupri, rq->cpu, + p->prio); } if (p->nr_cpus_allowed > 1) rq->rt.rt_nr_migratory++; @@ -113,8 +122,11 @@ static inline void 
dec_rt_tasks(struct t rq->rt.rt_nr_migratory--; } - if (rq->rt.highest_prio != highest_prio) - cpupri_set(&rq->rd->cpupri, rq->cpu, rq->rt.highest_prio); + if (rq->rt.highest_prio != highest_prio) { + if (rq->online) + cpupri_set(&rq->rd->cpupri, rq->cpu, + rq->rt.highest_prio); + } update_rt_migration(rq); #endif /* CONFIG_SMP */ @@ -758,7 +770,7 @@ static void set_cpus_allowed_rt(struct t p->nr_cpus_allowed = weight; } /* Assumes rq->lock is held */ -static void join_domain_rt(struct rq *rq) +static void rq_online_rt(struct rq *rq) { if (rq->rt.overloaded) rt_set_overload(rq); @@ -767,7 +779,7 @@ static void join_domain_rt(struct rq *rq } /* Assumes rq->lock is held */ -static void leave_domain_rt(struct rq *rq) +static void rq_offline_rt(struct rq *rq) { if (rq->rt.overloaded) rt_clear_overload(rq); @@ -919,8 +931,8 @@ const struct sched_class rt_sched_class .load_balance = load_balance_rt, .move_one_task = move_one_task_rt, .set_cpus_allowed = set_cpus_allowed_rt, - .join_domain = join_domain_rt, - .leave_domain = leave_domain_rt, + .rq_online = rq_online_rt, + .rq_offline = rq_offline_rt, .pre_schedule = pre_schedule_rt, .post_schedule = post_schedule_rt, .task_wake_up = task_wake_up_rt, ����������������������������������������������������������������������������������������patches/sched-cpupri-priocount.patch����������������������������������������������������������������0000664�0000764�0000764�00000002677�11041657733�016707� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From ghaskins@novell.com Thu Jun 5 09:51:54 2008 Date: Wed, 04 Jun 2008 15:04:10 -0400 From: Gregory Haskins <ghaskins@novell.com> To: Ingo Molnar <mingo@elte.hu>, Thomas Gleixner <tglx@linutonix.de>, Steven Rostedt <rostedt@goodmis.org>, Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com>, linux-kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, Peter Zijlstra <peterz@infradead.org> Subject: [PATCH 2/2] sched: fix cpupri priocount [ The following text is in the "utf-8" character set. ] [ Your display is set for the "iso-8859-1" character set. ] [ Some characters may be displayed incorrectly. ] A rounding error was pointed out by Peter Zijlstra which would result in the structure holding priorities to be off by one. 
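As a rough stand-alone illustration of the arithmetic (not the kernel code itself; it assumes MAX_RT_PRIO is 100 and BITS_PER_LONG is 64, and parenthesizes the priority count for clarity): plain integer division truncates and leaves too few words to cover every priority bit, while biasing the numerator as the patch does rounds the result up.

#include <stdio.h>

#define BITS_PER_LONG          64                  /* assumed: 64-bit build */
#define MAX_RT_PRIO            100                 /* assumed: stock kernel value */
#define CPUPRI_NR_PRIORITIES   (2 + MAX_RT_PRIO)   /* 102 priority levels */

/* old form: truncating division */
#define WORDS_OLD   (CPUPRI_NR_PRIORITIES / BITS_PER_LONG)
/* patched form: bias the numerator before dividing */
#define WORDS_NEW   ((CPUPRI_NR_PRIORITIES + BITS_PER_LONG/2) / BITS_PER_LONG)

int main(void)
{
	/* 102 bits need two 64-bit words; the old macro yields only 1 */
	printf("priorities=%d old_words=%d new_words=%d\n",
	       CPUPRI_NR_PRIORITIES, WORDS_OLD, WORDS_NEW);
	return 0;
}

With the assumed values this prints old_words=1 against the two words actually required, which is the off-by-one the changelog refers to.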
Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Peter Zijlstra <peterz@infradead.org> --- kernel/sched_cpupri.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched_cpupri.h =================================================================== --- linux-2.6.24.7.orig/kernel/sched_cpupri.h +++ linux-2.6.24.7/kernel/sched_cpupri.h @@ -4,7 +4,7 @@ #include <linux/sched.h> #define CPUPRI_NR_PRIORITIES 2+MAX_RT_PRIO -#define CPUPRI_NR_PRI_WORDS CPUPRI_NR_PRIORITIES/BITS_PER_LONG +#define CPUPRI_NR_PRI_WORDS (CPUPRI_NR_PRIORITIES + BITS_PER_LONG/2)/BITS_PER_LONG #define CPUPRI_INVALID -1 #define CPUPRI_IDLE 0 �����������������������������������������������������������������patches/ftrace-hotplug-fix.patch��������������������������������������������������������������������0000664�0000764�0000764�00000010513�11041657733�015775� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: ftrace: cpu hotplug fix Peter Zijlstra found that taking down and bringing up a new CPU caused ftrace to crash the kernel. This was due to some arch calls that were being traced by the function tracer before the smp_processor_id was set up. Since the function tracer uses smp_processor_id it caused a triple fault. Instead of adding notrace all over the architecture code to prevent this problem, it is easier to simply disable the function tracer when bringing up a new CPU. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/ftrace.h | 11 ++++++++--- kernel/cpu.c | 11 ++++++++++- kernel/trace/ftrace.c | 24 ++++++++++++++++++++++++ kernel/trace/trace_irqsoff.c | 3 +++ kernel/trace/trace_sched_wakeup.c | 2 +- 5 files changed, 46 insertions(+), 5 deletions(-) Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -35,10 +35,15 @@ void clear_ftrace_function(void); extern void ftrace_stub(unsigned long a0, unsigned long a1); extern void mcount(void); +void ftrace_enable(void); +void ftrace_disable(void); + #else /* !CONFIG_FTRACE */ -# define register_ftrace_function(ops) do { } while (0) -# define unregister_ftrace_function(ops) do { } while (0) -# define clear_ftrace_function(ops) do { } while (0) +# define register_ftrace_function(ops) do { } while (0) +# define unregister_ftrace_function(ops) do { } while (0) +# define clear_ftrace_function(ops) do { } while (0) +# define ftrace_enable() do { } while (0) +# define ftrace_disable() do { } while (0) #endif /* CONFIG_FTRACE */ #ifdef CONFIG_DYNAMIC_FTRACE Index: linux-2.6.24.7/kernel/cpu.c =================================================================== --- linux-2.6.24.7.orig/kernel/cpu.c +++ linux-2.6.24.7/kernel/cpu.c @@ -14,6 +14,7 @@ #include <linux/kthread.h> #include <linux/stop_machine.h> #include <linux/mutex.h> +#include <linux/ftrace.h> /* This protects CPUs going up and down... */ static DEFINE_MUTEX(cpu_add_remove_lock); @@ -244,9 +245,17 @@ static int __cpuinit _cpu_up(unsigned in goto out_notify; } - /* Arch-specific enabling code. 
*/ mutex_lock(&cpu_bitmask_lock); + /* + * Disable function tracing while bringing up a new CPU. + * We don't want to trace functions that can not handle a + * smp_processor_id() call. + */ + ftrace_disable(); + + /* Arch-specific enabling code. */ ret = __cpu_up(cpu); + ftrace_enable(); mutex_unlock(&cpu_bitmask_lock); if (ret != 0) goto out_notify; Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -148,6 +148,30 @@ static int __unregister_ftrace_function( return ret; } +static int ftrace_disabled_count; +static int save_ftrace_enabled; + +void ftrace_disable(void) +{ + mutex_lock(&ftrace_sysctl_lock); + + save_ftrace_enabled = ftrace_enabled; + ftrace_enabled = 0; +} + +void ftrace_enable(void) +{ + /* ftrace_enable must be paired with ftrace_disable */ + if (!mutex_is_locked(&ftrace_sysctl_lock)) { + WARN_ON(1); + return; + } + + ftrace_enabled = save_ftrace_enabled; + + mutex_unlock(&ftrace_sysctl_lock); +} + #ifdef CONFIG_DYNAMIC_FTRACE static struct task_struct *ftraced_task; Index: linux-2.6.24.7/kernel/trace/trace_irqsoff.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_irqsoff.c +++ linux-2.6.24.7/kernel/trace/trace_irqsoff.c @@ -77,6 +77,9 @@ irqsoff_tracer_call(unsigned long ip, un long disabled; int cpu; + if (unlikely(!ftrace_enabled)) + return; + /* * Does not matter if we preempt. We test the flags * afterward, to see if irqs are disabled or not. Index: linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_sched_wakeup.c +++ linux-2.6.24.7/kernel/trace/trace_sched_wakeup.c @@ -45,7 +45,7 @@ wakeup_tracer_call(unsigned long ip, uns int resched; int cpu; - if (likely(!wakeup_task)) + if (likely(!wakeup_task) || !ftrace_enabled) return; resched = need_resched(); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-pi-lock-reader.patch�����������������������������������������������������������������0000664�0000764�0000764�00000006323�11041657734�016367� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: pi_lock fixes When waking up multiple readers we need to hold the pi_lock to modify the pending lists. This patch also localizes the locks a bit more. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1918,7 +1918,6 @@ rt_write_slowunlock(struct rw_mutex *rwm if (!rt_mutex_has_waiters(mutex)) goto out; - spin_lock(&pendowner->pi_lock); /* * Wake up all readers. * This gets a bit more complex. 
More than one reader can't @@ -1933,13 +1932,17 @@ rt_write_slowunlock(struct rw_mutex *rwm while (waiter && !waiter->write_lock) { struct task_struct *reader = waiter->task; + spin_lock(&pendowner->pi_lock); plist_del(&waiter->list_entry, &mutex->wait_list); /* nop if not on a list */ plist_del(&waiter->pi_list_entry, &pendowner->pi_waiters); + spin_unlock(&pendowner->pi_lock); + spin_lock(&reader->pi_lock); waiter->task = NULL; reader->pi_blocked_on = NULL; + spin_unlock(&reader->pi_lock); if (savestate) wake_up_process_mutex(reader); @@ -1957,6 +1960,8 @@ rt_write_slowunlock(struct rw_mutex *rwm struct rt_mutex_waiter *next; next = rt_mutex_top_waiter(mutex); + + spin_lock(&pendowner->pi_lock); /* delete incase we didn't go through the loop */ plist_del(&next->pi_list_entry, &pendowner->pi_waiters); @@ -1964,13 +1969,12 @@ rt_write_slowunlock(struct rw_mutex *rwm if (next->write_lock) /* add back in as top waiter */ plist_add(&next->pi_list_entry, &pendowner->pi_waiters); + spin_unlock(&pendowner->pi_lock); rwm->prio = next->task->prio; } else rwm->prio = MAX_PRIO; - spin_unlock(&pendowner->pi_lock); - out: spin_unlock_irqrestore(&mutex->wait_lock, flags); @@ -2052,18 +2056,21 @@ rt_mutex_downgrade_write(struct rw_mutex * waiting, until we hit the reader limit, or a writer. */ - spin_lock(¤t->pi_lock); waiter = rt_mutex_top_waiter(mutex); while (waiter && !waiter->write_lock) { struct task_struct *reader = waiter->task; + spin_lock(¤t->pi_lock); plist_del(&waiter->list_entry, &mutex->wait_list); /* nop if not on a list */ plist_del(&waiter->pi_list_entry, ¤t->pi_waiters); + spin_unlock(¤t->pi_lock); + spin_lock(&reader->pi_lock); waiter->task = NULL; reader->pi_blocked_on = NULL; + spin_unlock(&reader->pi_lock); /* downgrade is only for mutexes */ wake_up_process(reader); @@ -2083,14 +2090,14 @@ rt_mutex_downgrade_write(struct rw_mutex /* setup this mutex prio for read */ rwm->prio = next->task->prio; + spin_lock(¤t->pi_lock); /* delete incase we didn't go through the loop */ plist_del(&next->pi_list_entry, ¤t->pi_waiters); + spin_unlock(¤t->pi_lock); /* No need to add back since readers don't have PI waiters */ } else rwm->prio = MAX_PRIO; - spin_unlock(¤t->pi_lock); - rt_mutex_set_owner(mutex, RT_RW_READER, 0); spin_unlock_irqrestore(&mutex->wait_lock, flags); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-adaptive-hack.patch���������������������������������������������������������������������0000664�0000764�0000764�00000004454�11041657730�015556� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: fix-adaptive-hack.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Tue, 17 Jun 2008 18:01:12 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/rtmutex.c | 34 +++++++++------------------------- 1 file changed, 9 insertions(+), 25 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ 
linux-2.6.24.7/kernel/rtmutex.c @@ -774,41 +774,22 @@ update_current(unsigned long new_state, static int adaptive_wait(struct rt_mutex_waiter *waiter, struct task_struct *orig_owner) { - int sleep = 0; - for (;;) { /* we are the owner? */ if (!waiter->task) - break; + return 0; - /* - * We need to read the owner of the lock and then check - * its state. But we can't let the owner task be freed - * while we read the state. We grab the rcu_lock and - * this makes sure that the owner task wont disappear - * between testing that it still has the lock, and checking - * its state. - */ - rcu_read_lock(); /* Owner changed? Then lets update the original */ - if (orig_owner != rt_mutex_owner(waiter->lock)) { - rcu_read_unlock(); - break; - } + if (orig_owner != rt_mutex_owner(waiter->lock)) + return 0; /* Owner went to bed, so should we */ - if (!task_is_current(orig_owner)) { - sleep = 1; - rcu_read_unlock(); - break; - } - rcu_read_unlock(); + if (!task_is_current(orig_owner)) + return 1; cpu_relax(); } - - return sleep; } #else static int adaptive_wait(struct rt_mutex_waiter *waiter, @@ -889,11 +870,13 @@ rt_spin_lock_slowlock(struct rt_mutex *l current->lock_depth = -1; current->flags &= ~PF_NOSCHED; orig_owner = rt_mutex_owner(lock); + get_task_struct(orig_owner); spin_unlock_irqrestore(&lock->wait_lock, flags); debug_rt_mutex_print_deadlock(&waiter); if (adaptive_wait(&waiter, orig_owner)) { + put_task_struct(orig_owner); update_current(TASK_UNINTERRUPTIBLE, &saved_state); /* * The xchg() in update_current() is an implicit @@ -902,7 +885,8 @@ rt_spin_lock_slowlock(struct rt_mutex *l */ if (waiter.task) schedule_rt_mutex(lock); - } + } else + put_task_struct(orig_owner); spin_lock_irqsave(&lock->wait_lock, flags); current->flags |= saved_flags; ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-slowunlock-mutex-fix.patch�����������������������������������������������������������0000664�0000764�0000764�00000002002�11041657731�017700� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: reset mutex on multilpe readers in unlock Reset the rwlock mutex owner to readers if the lock is currently held by other readers. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 10 ++++++++++ 1 file changed, 10 insertions(+) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1795,6 +1795,16 @@ rt_read_slowunlock(struct rw_mutex *rwm, wakeup_next_waiter(mutex, savestate); + /* + * If we woke up a reader but the lock is already held by readers + * we need to set the mutex owner to RT_RW_READER, since the + * wakeup_next_waiter set it to the pending reader. 
+ */ + if (reader_count) { + WARN_ON(waiter->write_lock); + rt_mutex_set_owner(mutex, RT_RW_READER, 0); + } + if (rt_mutex_has_waiters(mutex)) { waiter = rt_mutex_top_waiter(mutex); rwm->prio = waiter->task->prio; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-slowunlock-mutex-fix2.patch����������������������������������������������������������0000664�0000764�0000764�00000003021�11041657734�017767� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: remove waiter from pi_list If the pending owner was a reader and we have multiple readers than we need to remove it from the pi list. [ Thomas Gleixner added the grabing of the pi_locks to the removal ] Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rtmutex.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1685,6 +1685,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, { struct rt_mutex *mutex = &rwm->mutex; struct rt_mutex_waiter *waiter; + struct task_struct *pendowner; struct reader_lock_struct *rls; unsigned long flags; unsigned int reader_count; @@ -1793,6 +1794,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, rwm->owner = RT_RW_PENDING_READ; } + pendowner = waiter->task; wakeup_next_waiter(mutex, savestate); /* @@ -1808,6 +1810,17 @@ rt_read_slowunlock(struct rw_mutex *rwm, if (rt_mutex_has_waiters(mutex)) { waiter = rt_mutex_top_waiter(mutex); rwm->prio = waiter->task->prio; + /* + * If readers still own this lock, then we need + * to update the pi_list too. Readers have a separate + * path in the PI chain. 
+ */ + if (reader_count) { + spin_lock(&pendowner->pi_lock); + plist_del(&waiter->pi_list_entry, + &pendowner->pi_waiters); + spin_unlock(&pendowner->pi_lock); + } } else rwm->prio = MAX_PRIO; ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-use-inline.patch�������������������������������������������������������������������0000664�0000764�0000764�00000001501�11041657732�016114� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt-mutex-cleanup.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Fri, 20 Jun 2008 12:20:09 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/rtmutex.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -124,9 +124,12 @@ static inline void mark_rt_rwlock_check( #endif #ifdef CONFIG_PREEMPT_RT -#define task_is_reader(task) ((task) == RT_RW_READER) +static inline int task_is_reader(struct task_struct *task) +{ + return task == RT_RW_READER; +} #else -#define task_is_reader(task) (0) +static inline int task_is_reader(struct task_struct *task) { return 0; } #endif int pi_initialized; �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-mutex-namespace.patch��������������������������������������������������������������������0000664�0000764�0000764�00000011717�11041657734�016014� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rt-mutex-namespace.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Fri, 20 Jun 2008 12:22:52 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/rtmutex.c | 21 ++++++++++++--------- kernel/rtmutex_common.h | 18 ++++++++++-------- 2 files changed, 22 insertions(+), 17 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1291,7 +1291,7 @@ try_to_take_rw_write(struct rw_mutex *rw } /* - * RT_RW_PENDING means that the lock is free, but there are + * RT_RWLOCK_PENDING means that the lock is free, but there are * pending owners on the mutex */ WARN_ON(own && !rt_mutex_owner_pending(mutex)); @@ -1629,7 +1629,8 @@ rt_write_fastlock(struct rw_mutex *rwm, 
void fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { - struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); + struct task_struct *val = (void *)((unsigned long)current | + RT_RWLOCK_WRITER); if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) rt_mutex_deadlock_account_lock(&rwm->mutex, current); @@ -1669,7 +1670,8 @@ static inline int rt_write_fasttrylock(struct rw_mutex *rwm, int fastcall (*slowfn)(struct rw_mutex *rwm, int mtx), int mtx) { - struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); + struct task_struct *val = (void *)((unsigned long)current | + RT_RWLOCK_WRITER); if (likely(rt_rwlock_cmpxchg(rwm, NULL, val))) { rt_mutex_deadlock_account_lock(&rwm->mutex, current); @@ -1762,7 +1764,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, /* We could still have a pending reader waiting */ if (rt_mutex_owner_pending(mutex)) { /* set the rwm back to pending */ - rwm->owner = RT_RW_PENDING_READ; + rwm->owner = RT_RWLOCK_PENDING_READ; } else { rwm->owner = NULL; mutex->owner = NULL; @@ -1783,7 +1785,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, /* only wake up if there are no readers */ if (reader_count) goto out; - rwm->owner = RT_RW_PENDING_WRITE; + rwm->owner = RT_RWLOCK_PENDING_WRITE; } else { /* * It is also possible that the reader limit decreased. @@ -1794,7 +1796,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, unlikely(atomic_read(&rwm->owners) >= rt_rwlock_limit)) goto out; if (!reader_count) - rwm->owner = RT_RW_PENDING_READ; + rwm->owner = RT_RWLOCK_PENDING_READ; } pendowner = waiter->task; @@ -1919,11 +1921,11 @@ rt_write_slowunlock(struct rw_mutex *rwm /* another writer is next? */ if (waiter->write_lock) { - rwm->owner = RT_RW_PENDING_WRITE; + rwm->owner = RT_RWLOCK_PENDING_WRITE; goto out; } - rwm->owner = RT_RW_PENDING_READ; + rwm->owner = RT_RWLOCK_PENDING_READ; if (!rt_mutex_has_waiters(mutex)) goto out; @@ -1999,7 +2001,8 @@ rt_write_fastunlock(struct rw_mutex *rwm int mtx), int mtx) { - struct task_struct *val = (void *)((unsigned long)current | RT_RWLOCK_WRITER); + struct task_struct *val = (void *)((unsigned long)current | + RT_RWLOCK_WRITER); WARN_ON(rt_rwlock_owner(rwm) != current); if (likely(rt_rwlock_cmpxchg(rwm, (struct task_struct *)val, NULL))) Index: linux-2.6.24.7/kernel/rtmutex_common.h =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex_common.h +++ linux-2.6.24.7/kernel/rtmutex_common.h @@ -123,29 +123,31 @@ static inline unsigned long rt_mutex_own #define RT_RWLOCK_WRITER 2UL #define RT_RWLOCK_MASKALL 3UL -/* used as reader owner of the mutex */ -#define RT_RW_READER (struct task_struct *)0x100 - /* used when a writer releases the lock with waiters */ /* pending owner is a reader */ -#define RT_RW_PENDING_READ (struct task_struct *)0x200 +#define RT_RWLOCK_PENDING_READ ((struct task_struct *)0x200) /* pending owner is a writer */ -#define RT_RW_PENDING_WRITE (struct task_struct *)0x400 +#define RT_RWLOCK_PENDING_WRITE ((struct task_struct *)0x400) /* Either of the above is true */ -#define RT_RW_PENDING_MASK (0x600 | RT_RWLOCK_MASKALL) +#define RT_RWLOCK_PENDING_MASK \ + ((unsigned long) RT_RWLOCK_PENDING_READ | \ + (unsigned long) RT_RWLOCK_PENDING_WRITE | RT_RWLOCK_MASKALL) + +/* used as reader owner of the rt_mutex inside of the rw_mutex */ +#define RT_RW_READER (struct task_struct *)0x100 /* Return true if lock is not owned but has pending owners */ static inline int rt_rwlock_pending(struct rw_mutex *rwm) { unsigned long owner = (unsigned 
long)rwm->owner; - return (owner & RT_RW_PENDING_MASK) == owner; + return (owner & RT_RWLOCK_PENDING_MASK) == owner; } static inline int rt_rwlock_pending_writer(struct rw_mutex *rwm) { unsigned long owner = (unsigned long)rwm->owner; return rt_rwlock_pending(rwm) && - (owner & (unsigned long)RT_RW_PENDING_WRITE); + (owner & (unsigned long)RT_RWLOCK_PENDING_WRITE); } static inline struct task_struct *rt_rwlock_owner(struct rw_mutex *rwm) �������������������������������������������������patches/rtmutex-debug-fix.patch���������������������������������������������������������������������0000664�0000764�0000764�00000001260�11041657732�015643� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rtmutex-debug-fix.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Fri, 20 Jun 2008 12:27:50 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/rtmutex-debug.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.24.7/kernel/rtmutex-debug.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex-debug.c +++ linux-2.6.24.7/kernel/rtmutex-debug.c @@ -59,6 +59,9 @@ void rt_mutex_debug_task_free(struct tas { DEBUG_LOCKS_WARN_ON(!plist_head_empty(&task->pi_waiters)); DEBUG_LOCKS_WARN_ON(task->pi_blocked_on); +#ifdef CONFIG_PREEMPT_RT + WARN_ON(task->reader_lock_count); +#endif } /* ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-protect-reader_lock_count.patch������������������������������������������������������0000664�0000764�0000764�00000010452�11041657732�020725� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rtmutex.c | 35 ++++++++++++++++++++++++++++------- 1 file changed, 28 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/kernel/rtmutex.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex.c +++ linux-2.6.24.7/kernel/rtmutex.c @@ -1137,6 +1137,12 @@ rt_rwlock_update_owner(struct rw_mutex * if (own == RT_RW_READER) return; + /* + * We don't need to grab the pi_lock to look at the reader list + * since we hold the rwm wait_lock. We only care about the pointer + * to this lock, and we own the wait_lock, so that pointer + * can't be changed. 
+ */ for (i = own->reader_lock_count - 1; i >= 0; i--) { if (own->owned_read_locks[i].lock == rwm) break; @@ -1256,6 +1262,7 @@ static int try_to_take_rw_read(struct rw if (incr) { atomic_inc(&rwm->owners); rw_check_held(rwm); + spin_lock(¤t->pi_lock); reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { rls = ¤t->owned_read_locks[reader_count]; @@ -1265,6 +1272,7 @@ static int try_to_take_rw_read(struct rw list_add(&rls->list, &rwm->readers); } else WARN_ON_ONCE(1); + spin_unlock(¤t->pi_lock); } rt_mutex_deadlock_account_lock(mutex, current); atomic_inc(&rwm->count); @@ -1420,6 +1428,7 @@ __rt_read_fasttrylock(struct rw_mutex *r retry: if (likely(rt_rwlock_cmpxchg(rwm, NULL, current))) { int reader_count; + unsigned long flags; rt_mutex_deadlock_account_lock(&rwm->mutex, current); atomic_inc(&rwm->count); @@ -1436,30 +1445,31 @@ __rt_read_fasttrylock(struct rw_mutex *r atomic_inc(&rwm->owners); rw_check_held(rwm); - reader_count = current->reader_lock_count; + local_irq_save(flags); + spin_lock(¤t->pi_lock); + reader_count = current->reader_lock_count++; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { current->owned_read_locks[reader_count].lock = rwm; current->owned_read_locks[reader_count].count = 1; } else WARN_ON_ONCE(1); + spin_unlock(¤t->pi_lock); /* * If this task is no longer the sole owner of the lock * or someone is blocking, then we need to add the task * to the lock. */ - smp_mb(); - current->reader_lock_count++; if (unlikely(rwm->owner != current)) { struct rt_mutex *mutex = &rwm->mutex; struct reader_lock_struct *rls; - unsigned long flags; - spin_lock_irqsave(&mutex->wait_lock, flags); + spin_lock(&mutex->wait_lock); rls = ¤t->owned_read_locks[reader_count]; if (!rls->list.prev || list_empty(&rls->list)) list_add(&rls->list, &rwm->readers); - spin_unlock_irqrestore(&mutex->wait_lock, flags); + spin_unlock(&mutex->wait_lock); } + local_irq_restore(flags); return 1; } return 0; @@ -1712,6 +1722,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, for (i = current->reader_lock_count - 1; i >= 0; i--) { if (current->owned_read_locks[i].lock == rwm) { + spin_lock(¤t->pi_lock); current->owned_read_locks[i].count--; if (!current->owned_read_locks[i].count) { current->reader_lock_count--; @@ -1723,6 +1734,7 @@ rt_read_slowunlock(struct rw_mutex *rwm, rls->lock = NULL; rw_check_held(rwm); } + spin_unlock(¤t->pi_lock); break; } } @@ -1848,8 +1860,14 @@ rt_read_fastunlock(struct rw_mutex *rwm, atomic_dec(&rwm->count); if (likely(rt_rwlock_cmpxchg(rwm, current, NULL))) { struct reader_lock_struct *rls; - int reader_count = --current->reader_lock_count; + unsigned long flags; + int reader_count; int owners; + + spin_lock_irqsave(¤t->pi_lock, flags); + reader_count = --current->reader_lock_count; + spin_unlock_irqrestore(¤t->pi_lock, flags); + rt_mutex_deadlock_account_unlock(current); if (unlikely(reader_count < 0)) { reader_count = 0; @@ -2041,6 +2059,8 @@ rt_mutex_downgrade_write(struct rw_mutex atomic_inc(&rwm->count); atomic_inc(&rwm->owners); rw_check_held(rwm); + + spin_lock(¤t->pi_lock); reader_count = current->reader_lock_count++; rls = ¤t->owned_read_locks[reader_count]; if (likely(reader_count < MAX_RWLOCK_DEPTH)) { @@ -2048,6 +2068,7 @@ rt_mutex_downgrade_write(struct rw_mutex rls->count = 1; } else WARN_ON_ONCE(1); + spin_unlock(¤t->pi_lock); if (!rt_mutex_has_waiters(mutex)) { /* We are sole owner, we are done */ 
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ftrace-stop-trace-on-crash.patch������������������������������������������������������������0000664�0000764�0000764�00000011144�11041657735�017323� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: fix-tracer-wreckage-wtf-is-this-code-all-features.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Thu, 19 Jun 2008 19:24:14 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/traps_32.c | 2 ++ arch/x86/kernel/traps_64.c | 3 +++ include/linux/ftrace.h | 11 ++++++++--- kernel/trace/ftrace.c | 17 +++++++++++++++++ kernel/trace/trace.c | 31 +++++++++++++++++++++++++++++++ 5 files changed, 61 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/traps_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_32.c +++ linux-2.6.24.7/arch/x86/kernel/traps_32.c @@ -385,6 +385,8 @@ void die(const char * str, struct pt_reg static int die_counter; unsigned long flags; + ftrace_stop(); + oops_enter(); if (die.lock_owner != raw_smp_processor_id()) { Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -524,6 +524,9 @@ void __kprobes oops_end(unsigned long fl void __kprobes __die(const char * str, struct pt_regs * regs, long err) { static int die_counter; + + ftrace_stop(); + printk(KERN_EMERG "%s: %04lx [%u] ", str, err & 0xffff,++die_counter); #ifdef CONFIG_PREEMPT printk("PREEMPT "); Index: linux-2.6.24.7/include/linux/ftrace.h =================================================================== --- linux-2.6.24.7.orig/include/linux/ftrace.h +++ linux-2.6.24.7/include/linux/ftrace.h @@ -38,12 +38,18 @@ extern void mcount(void); void ftrace_enable(void); void ftrace_disable(void); +/* totally disable ftrace - can not re-enable after this */ +void ftrace_kill(void); +void __ftrace_kill(void); + #else /* !CONFIG_FTRACE */ # define register_ftrace_function(ops) do { } while (0) # define unregister_ftrace_function(ops) do { } while (0) # define clear_ftrace_function(ops) do { } while (0) # define ftrace_enable() do { } while (0) # define ftrace_disable() do { } while (0) +# define ftrace_kill() do { } while (0) +# define __ftrace_kill() do { } while (0) #endif /* CONFIG_FTRACE */ #ifdef CONFIG_DYNAMIC_FTRACE @@ -90,9 +96,6 @@ void ftrace_enable_daemon(void); # define ftrace_enable_daemon() do { } while (0) #endif -/* totally disable ftrace - can not re-enable after this */ -void ftrace_kill(void); - static inline void tracer_disable(void) { #ifdef CONFIG_FTRACE @@ -138,9 +141,11 @@ static inline void tracer_disable(void) #ifdef CONFIG_TRACING extern void ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3); +void ftrace_stop(void); #else static inline void ftrace_special(unsigned long arg1, unsigned long arg2, unsigned long arg3) { } +static inline void ftrace_stop(void) { } #endif #ifdef 
CONFIG_EVENT_TRACER Index: linux-2.6.24.7/kernel/trace/ftrace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/ftrace.c +++ linux-2.6.24.7/kernel/trace/ftrace.c @@ -1490,6 +1490,23 @@ void ftrace_kill(void) } /** + * __ftrace_kill - shutdown ftrace in a mean fashion + * + * In case of system failure we want to stop ftrace as soon as + * possible. This is like ftrace_kill but does not grab the + * mutexes nor does it call the kstop machine. + * + * This one is save to use in atomic. + */ +void __ftrace_kill(void) +{ + ftrace_disabled = 1; + ftrace_enabled = 0; + + clear_ftrace_function(); +} + +/** * register_ftrace_function - register a function for profiling * @ops - ops structure that holds the function for profiling. * Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -3243,6 +3243,37 @@ static __init void tracer_init_debugfs(v #endif } +/** + * ftrace_stop - called when we need to drastically disable the tracer. + */ +void ftrace_stop(void) +{ + struct tracer *saved_tracer = current_trace; + struct trace_array *tr = &global_trace; + struct trace_array_cpu *data; + int i; + + __ftrace_kill(); + for_each_tracing_cpu(i) { + data = tr->data[i]; + atomic_inc(&data->disabled); + } + tracer_enabled = 0; + + /* + * TODO: make a safe method to ctrl_update. + * ctrl_update may schedule, but currently only + * does when ftrace is enabled. + */ + if (tr->ctrl) { + tr->ctrl = 0; + if (saved_tracer && saved_tracer->ctrl_update) + saved_tracer->ctrl_update; + } + + +} + static int trace_alloc_page(void) { struct trace_array_cpu *data; ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rwlock-torture-no-rt.patch������������������������������������������������������������������0000664�0000764�0000764�00000001153�11041657731�016323� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From: Steven Rostedt <srostedt@redhat.com> Subject: rwlock: fix torture test to handle non-rt Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/rwlock_torture.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6.24.7/kernel/rwlock_torture.c =================================================================== --- linux-2.6.24.7.orig/kernel/rwlock_torture.c +++ linux-2.6.24.7/kernel/rwlock_torture.c @@ -570,7 +570,9 @@ static int run_test(unsigned long time) run_two_locks(time, read); } +#ifdef CONFIG_PREEMPT_RT WARN_ON_ONCE(current->reader_lock_count); +#endif return ret; } 
���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-config-debug-rt-mutex-lock-underflow-warnings.patch�������������������������������������0000664�0000764�0000764�00000010244�11041657731�023743� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Fix CONFIG_DEBUG_RT_MUTEX lock underflow warnings From: john stultz <johnstul@us.ibm.com> Date: Wed, 02 Jul 2008 17:57:32 -0700 All, So if I enable CONFIG_DEBUG_RT_MUTEXES with 2.6.24.7-rt14, I tend to quickly see a number of BUG warnings when running Java tests: BUG: jxeinajar/3383: lock count underflow! Pid: 3383, comm: jxeinajar Not tainted 2.6.24-ibmrt2.5john #3 Call Trace: [<ffffffff8107208d>] rt_mutex_deadlock_account_unlock+0x5d/0x70 [<ffffffff817d6aa5>] rt_read_slowunlock+0x35/0x550 [<ffffffff8107173d>] rt_mutex_up_read+0x3d/0xc0 [<ffffffff81072a99>] rt_up_read+0x29/0x30 [<ffffffff8106e34e>] do_futex+0x32e/0xd40 [<ffffffff8107173d>] ? rt_mutex_up_read+0x3d/0xc0 [<ffffffff81072a99>] ? rt_up_read+0x29/0x30 [<ffffffff8106f370>] compat_sys_futex+0xa0/0x110 [<ffffffff81010a36>] ? syscall_trace_enter+0x86/0xb0 [<ffffffff8102ff04>] cstar_do_call+0x1b/0x65 INFO: lockdep is turned off. --------------------------- | preempt count: 00000001 ] | 1-level deep critical section nesting: ---------------------------------------- ... [<ffffffff817d8e42>] .... __spin_lock_irqsave+0x22/0x60 ......[<ffffffff817d6a93>] .. ( <= rt_read_slowunlock+0x23/0x550) After some debugging and with Steven's help, we realized that with rwlocks, rt_mutex_deadlock_account_lock can be called multiple times in parallel (where as in most cases the mutex must be held by the caller to to call the function). This can cause integer lock_count value being used to be non-atomically incremented. The following patch converts lock_count to a atomic_t and resolves the warnings. Let me know if you have any feedback or comments! 
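A stand-alone user-space sketch of the race described above (illustrative only, not the kernel code): two threads doing a plain increment on a shared counter can lose updates, so the later decrements can drive the count below what was ever recorded, while an atomic add keeps the accounting exact.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOOPS 1000000

static int plain_count;             /* like the old "int lock_count" */
static atomic_int atomic_count;     /* like the patched "atomic_t lock_count" */

static void *take_locks(void *unused)
{
	(void)unused;
	for (int i = 0; i < LOOPS; i++) {
		plain_count++;                      /* non-atomic read-modify-write */
		atomic_fetch_add(&atomic_count, 1); /* accounted exactly once */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, take_locks, NULL);
	pthread_create(&b, NULL, take_locks, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* plain_count usually falls short of 2*LOOPS; atomic_count never does */
	printf("plain=%d atomic=%d expected=%d\n",
	       plain_count, atomic_load(&atomic_count), 2 * LOOPS);
	return 0;
}

Built with gcc -O2 -pthread, the plain counter typically comes up short of the expected total while the atomic one always matches, which mirrors why the per-task lock_count had to become an atomic_t once multiple readers could account the same rwlock concurrently.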
thanks -john Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: dvhltc <dvhltc@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/sched.h | 2 +- kernel/fork.c | 2 +- kernel/rtmutex-debug.c | 12 ++++++------ 3 files changed, 8 insertions(+), 8 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1250,7 +1250,7 @@ struct task_struct { #define MAX_LOCK_STACK MAX_PREEMPT_TRACE #ifdef CONFIG_DEBUG_PREEMPT - int lock_count; + atomic_t lock_count; # ifdef CONFIG_PREEMPT_RT struct rt_mutex *owned_lock[MAX_LOCK_STACK]; # endif Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -1203,7 +1203,7 @@ static struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_namespaces; #ifdef CONFIG_DEBUG_PREEMPT - p->lock_count = 0; + atomic_set(&p->lock_count, 0); #endif #ifdef CONFIG_PREEMPT_RT Index: linux-2.6.24.7/kernel/rtmutex-debug.c =================================================================== --- linux-2.6.24.7.orig/kernel/rtmutex-debug.c +++ linux-2.6.24.7/kernel/rtmutex-debug.c @@ -176,7 +176,7 @@ void rt_mutex_deadlock_account_lock(struct rt_mutex *lock, struct task_struct *task) { #ifdef CONFIG_DEBUG_PREEMPT - if (task->lock_count >= MAX_LOCK_STACK) { + if (atomic_read(&task->lock_count) >= MAX_LOCK_STACK) { if (!debug_locks_off()) return; printk("BUG: %s/%d: lock count overflow!\n", @@ -185,16 +185,16 @@ rt_mutex_deadlock_account_lock(struct rt return; } #ifdef CONFIG_PREEMPT_RT - task->owned_lock[task->lock_count] = lock; + task->owned_lock[atomic_read(&task->lock_count)] = lock; #endif - task->lock_count++; + atomic_inc(&task->lock_count); #endif } void rt_mutex_deadlock_account_unlock(struct task_struct *task) { #ifdef CONFIG_DEBUG_PREEMPT - if (!task->lock_count) { + if (!atomic_read(&task->lock_count)) { if (!debug_locks_off()) return; printk("BUG: %s/%d: lock count underflow!\n", @@ -202,9 +202,9 @@ void rt_mutex_deadlock_account_unlock(st dump_stack(); return; } - task->lock_count--; + atomic_dec(&task->lock_count); #ifdef CONFIG_PREEMPT_RT - task->owned_lock[task->lock_count] = NULL; + task->owned_lock[atomic_read(&task->lock_count)] = NULL; #endif #endif } ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-vs-slab.patch�������������������������������������������������������������������0000664�0000764�0000764�00000025076�11041673102�016077� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: cpu-hotplug: vs slab From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue, 10 Jun 2008 13:13:00 +0200 Fix up the slab allocator to be cpu-hotplug safe (again, 
pure -rt regression). On -rt we protect per-cpu state by locks instead of disabling preemption/irqs. This keeps all the code preemptible at the cost of possible remote memory access. The race was that cpu-hotplug - which assumes to be cpu local and non- preemptible, didn't take the per-cpu lock. This also means that the normal lock acquire needs to be aware of cpus getting off-lined while its waiting. Clean up some of the macro mess while we're there. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- mm/slab.c | 170 ++++++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 122 insertions(+), 48 deletions(-) Index: linux-2.6.24.7/mm/slab.c =================================================================== --- linux-2.6.24.7.orig/mm/slab.c +++ linux-2.6.24.7/mm/slab.c @@ -125,43 +125,116 @@ * the CPU number of the lock there. */ #ifndef CONFIG_PREEMPT_RT + # define slab_irq_disable(cpu) \ do { local_irq_disable(); (cpu) = smp_processor_id(); } while (0) # define slab_irq_enable(cpu) local_irq_enable() + +static inline void slab_irq_disable_this_rt(int cpu) +{ +} + +static inline void slab_irq_enable_rt(int cpu) +{ +} + # define slab_irq_save(flags, cpu) \ do { local_irq_save(flags); (cpu) = smp_processor_id(); } while (0) # define slab_irq_restore(flags, cpu) local_irq_restore(flags) + /* * In the __GFP_WAIT case we enable/disable interrupts on !PREEMPT_RT, * which has no per-CPU locking effect since we are holding the cache * lock in that case already. - * - * (On PREEMPT_RT, these are NOPs, but we have to drop/get the irq locks.) */ -# define slab_irq_disable_nort(cpu) slab_irq_disable(cpu) -# define slab_irq_enable_nort(cpu) slab_irq_enable(cpu) -# define slab_irq_disable_rt(flags) do { (void)(flags); } while (0) -# define slab_irq_enable_rt(flags) do { (void)(flags); } while (0) +static void slab_irq_enable_GFP_WAIT(gfp_t flags, int *cpu) +{ + if (flags & __GFP_WAIT) + local_irq_enable(); +} + +static void slab_irq_disable_GFP_WAIT(gfp_t flags, int *cpu) +{ + if (flags & __GFP_WAIT) + local_irq_disable(); +} + # define slab_spin_lock_irq(lock, cpu) \ do { spin_lock_irq(lock); (cpu) = smp_processor_id(); } while (0) -# define slab_spin_unlock_irq(lock, cpu) \ - spin_unlock_irq(lock) +# define slab_spin_unlock_irq(lock, cpu) spin_unlock_irq(lock) + # define slab_spin_lock_irqsave(lock, flags, cpu) \ do { spin_lock_irqsave(lock, flags); (cpu) = smp_processor_id(); } while (0) # define slab_spin_unlock_irqrestore(lock, flags, cpu) \ do { spin_unlock_irqrestore(lock, flags); } while (0) -#else + +#else /* CONFIG_PREEMPT_RT */ + +/* + * Instead of serializing the per-cpu state by disabling interrupts we do so + * by a lock. This keeps the code preemptable - albeit at the cost of remote + * memory access when the task does get migrated away. 
+ */ DEFINE_PER_CPU_LOCKED(int, slab_irq_locks) = { 0, }; -# define slab_irq_disable(cpu) (void)get_cpu_var_locked(slab_irq_locks, &(cpu)) -# define slab_irq_enable(cpu) put_cpu_var_locked(slab_irq_locks, cpu) + +static void _slab_irq_disable(int *cpu) +{ + int this_cpu; + spinlock_t *lock; + +again: + this_cpu = raw_smp_processor_id(); + lock = &__get_cpu_lock(slab_irq_locks, this_cpu); + + spin_lock(lock); + if (unlikely(!cpu_online(this_cpu))) { + /* + * Bail - the cpu got hot-unplugged while we were waiting + * for the lock. + */ + spin_unlock(lock); + goto again; + } + + *cpu = this_cpu; +} + +#define slab_irq_disable(cpu) _slab_irq_disable(&(cpu)) + +static inline void slab_irq_enable(int cpu) +{ + spin_unlock(&__get_cpu_lock(slab_irq_locks, cpu)); +} + +static inline void slab_irq_disable_this_rt(int cpu) +{ + spin_lock(&__get_cpu_lock(slab_irq_locks, cpu)); +} + +static inline void slab_irq_enable_rt(int cpu) +{ + spin_unlock(&__get_cpu_lock(slab_irq_locks, cpu)); +} + # define slab_irq_save(flags, cpu) \ do { slab_irq_disable(cpu); (void) (flags); } while (0) # define slab_irq_restore(flags, cpu) \ do { slab_irq_enable(cpu); (void) (flags); } while (0) -# define slab_irq_disable_rt(cpu) slab_irq_disable(cpu) -# define slab_irq_enable_rt(cpu) slab_irq_enable(cpu) -# define slab_irq_disable_nort(cpu) do { } while (0) -# define slab_irq_enable_nort(cpu) do { } while (0) + +/* + * On PREEMPT_RT we have to drop the locks unconditionally to avoid lock + * recursion on the cache_grow()->alloc_slabmgmt() path. + */ +static void slab_irq_enable_GFP_WAIT(gfp_t flags, int *cpu) +{ + slab_irq_enable(*cpu); +} + +static void slab_irq_disable_GFP_WAIT(gfp_t flags, int *cpu) +{ + slab_irq_disable(*cpu); +} + # define slab_spin_lock_irq(lock, cpu) \ do { slab_irq_disable(cpu); spin_lock(lock); } while (0) # define slab_spin_unlock_irq(lock, cpu) \ @@ -170,7 +243,8 @@ DEFINE_PER_CPU_LOCKED(int, slab_irq_lock do { slab_irq_disable(cpu); spin_lock_irqsave(lock, flags); } while (0) # define slab_spin_unlock_irqrestore(lock, flags, cpu) \ do { spin_unlock_irqrestore(lock, flags); slab_irq_enable(cpu); } while (0) -#endif + +#endif /* CONFIG_PREEMPT_RT */ /* * DEBUG - 1 for kmem_cache_create() to honour; SLAB_RED_ZONE & SLAB_POISON. 
@@ -1221,7 +1295,7 @@ cache_free_alien(struct kmem_cache *cach } #endif -static void __cpuinit cpuup_canceled(long cpu) +static void __cpuinit cpuup_canceled(int cpu) { struct kmem_cache *cachep; struct kmem_list3 *l3 = NULL; @@ -1231,7 +1305,7 @@ static void __cpuinit cpuup_canceled(lon struct array_cache *nc; struct array_cache *shared; struct array_cache **alien; - int this_cpu; + int orig_cpu = cpu; cpumask_t mask; mask = node_to_cpumask(node); @@ -1243,31 +1317,30 @@ static void __cpuinit cpuup_canceled(lon if (!l3) goto free_array_cache; - slab_spin_lock_irq(&l3->list_lock, this_cpu); + spin_lock_irq(&l3->list_lock); /* Free limit for this kmem_list3 */ l3->free_limit -= cachep->batchcount; if (nc) free_block(cachep, nc->entry, nc->avail, node, - &this_cpu); + &cpu); if (!cpus_empty(mask)) { - slab_spin_unlock_irq(&l3->list_lock, - this_cpu); + spin_unlock_irq(&l3->list_lock); goto free_array_cache; } shared = l3->shared; if (shared) { free_block(cachep, shared->entry, - shared->avail, node, &this_cpu); + shared->avail, node, &cpu); l3->shared = NULL; } alien = l3->alien; l3->alien = NULL; - slab_spin_unlock_irq(&l3->list_lock, this_cpu); + spin_unlock_irq(&l3->list_lock); kfree(shared); if (alien) { @@ -1276,6 +1349,7 @@ static void __cpuinit cpuup_canceled(lon } free_array_cache: kfree(nc); + BUG_ON(cpu != orig_cpu); } /* * In the previous loop, all the objects were freed to @@ -1290,13 +1364,12 @@ free_array_cache: } } -static int __cpuinit cpuup_prepare(long cpu) +static int __cpuinit cpuup_prepare(int cpu) { struct kmem_cache *cachep; struct kmem_list3 *l3 = NULL; int node = cpu_to_node(cpu); const int memsize = sizeof(struct kmem_list3); - int this_cpu; /* * We need to do this right in the beginning since @@ -1327,11 +1400,11 @@ static int __cpuinit cpuup_prepare(long cachep->nodelists[node] = l3; } - slab_spin_lock_irq(&cachep->nodelists[node]->list_lock, this_cpu); + spin_lock_irq(&cachep->nodelists[node]->list_lock); cachep->nodelists[node]->free_limit = (1 + nr_cpus_node(node)) * cachep->batchcount + cachep->num; - slab_spin_unlock_irq(&cachep->nodelists[node]->list_lock, this_cpu); + spin_unlock_irq(&cachep->nodelists[node]->list_lock); } /* @@ -1368,7 +1441,7 @@ static int __cpuinit cpuup_prepare(long l3 = cachep->nodelists[node]; BUG_ON(!l3); - slab_spin_lock_irq(&l3->list_lock, this_cpu); + spin_lock_irq(&l3->list_lock); if (!l3->shared) { /* * We are serialised from CPU_DEAD or @@ -1383,7 +1456,7 @@ static int __cpuinit cpuup_prepare(long alien = NULL; } #endif - slab_spin_unlock_irq(&l3->list_lock, this_cpu); + spin_unlock_irq(&l3->list_lock); kfree(shared); free_alien_cache(alien); } @@ -1402,7 +1475,18 @@ static int __cpuinit cpuup_callback(stru switch (action) { case CPU_LOCK_ACQUIRE: mutex_lock(&cache_chain_mutex); + return NOTIFY_OK; + case CPU_LOCK_RELEASE: + mutex_unlock(&cache_chain_mutex); + return NOTIFY_OK; + + default: break; + } + + slab_irq_disable_this_rt(cpu); + + switch (action) { case CPU_UP_PREPARE: case CPU_UP_PREPARE_FROZEN: err = cpuup_prepare(cpu); @@ -1444,10 +1528,10 @@ static int __cpuinit cpuup_callback(stru case CPU_UP_CANCELED_FROZEN: cpuup_canceled(cpu); break; - case CPU_LOCK_RELEASE: - mutex_unlock(&cache_chain_mutex); - break; } + + slab_irq_enable_rt(cpu); + return err ? 
NOTIFY_BAD : NOTIFY_OK; } @@ -2898,9 +2982,7 @@ static int cache_grow(struct kmem_cache offset *= cachep->colour_off; - if (local_flags & __GFP_WAIT) - slab_irq_enable_nort(*this_cpu); - slab_irq_enable_rt(*this_cpu); + slab_irq_enable_GFP_WAIT(local_flags, this_cpu); /* * The test for missing atomic flag is performed here, rather than @@ -2930,9 +3012,7 @@ static int cache_grow(struct kmem_cache cache_init_objs(cachep, slabp); - slab_irq_disable_rt(*this_cpu); - if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(*this_cpu); + slab_irq_disable_GFP_WAIT(local_flags, this_cpu); check_irq_off(); spin_lock(&l3->list_lock); @@ -2946,9 +3026,7 @@ static int cache_grow(struct kmem_cache opps1: kmem_freepages(cachep, objp); failed: - slab_irq_disable_rt(*this_cpu); - if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(*this_cpu); + slab_irq_disable_GFP_WAIT(local_flags, this_cpu); return 0; } @@ -3395,16 +3473,12 @@ retry: * We may trigger various forms of reclaim on the allowed * set and go into memory reserves if necessary. */ - if (local_flags & __GFP_WAIT) - slab_irq_enable_nort(*this_cpu); - slab_irq_enable_rt(*this_cpu); + slab_irq_enable_GFP_WAIT(local_flags, this_cpu); kmem_flagcheck(cache, flags); obj = kmem_getpages(cache, flags, -1); - slab_irq_disable_rt(*this_cpu); - if (local_flags & __GFP_WAIT) - slab_irq_disable_nort(*this_cpu); + slab_irq_disable_GFP_WAIT(local_flags, this_cpu); if (obj) { /* ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-vs-page-alloc.patch�������������������������������������������������������������0000664�0000764�0000764�00000004446�11041673102�017160� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: cpu-hotplug: vs page_alloc From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue, 10 Jun 2008 13:13:01 +0200 On -rt we protect per-cpu state by locks instead of disabling preemption/irqs. This keeps all the code preemptible at the cost of possible remote memory access. The race was that cpu-hotplug - which assumes to be cpu local and non- preemptible, didn't take the per-cpu lock. This also means that the normal lock acquire needs to be aware of cpus getting off-lined while its waiting. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- mm/page_alloc.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/mm/page_alloc.c =================================================================== --- linux-2.6.24.7.orig/mm/page_alloc.c +++ linux-2.6.24.7/mm/page_alloc.c @@ -176,7 +176,19 @@ static inline void __lock_cpu_pcp(unsign static inline void lock_cpu_pcp(unsigned long *flags, int *this_cpu) { #ifdef CONFIG_PREEMPT_RT - (void)get_cpu_var_locked(pcp_locks, this_cpu); + spinlock_t *lock; + int cpu; + +again: + cpu = raw_smp_processor_id(); + lock = &__get_cpu_lock(pcp_locks, cpu); + + spin_lock(lock); + if (unlikely(!cpu_online(cpu))) { + spin_unlock(lock); + goto again; + } + *this_cpu = cpu; flags = 0; #else local_irq_save(*flags); @@ -2781,12 +2793,17 @@ static inline void free_zone_pagesets(in struct zone *zone; for_each_zone(zone) { - struct per_cpu_pageset *pset = zone_pcp(zone, cpu); + struct per_cpu_pageset *pset; + unsigned long flags; + + __lock_cpu_pcp(&flags, cpu); + pset = zone_pcp(zone, cpu); + zone_pcp(zone, cpu) = NULL; + unlock_cpu_pcp(flags, cpu); /* Free per_cpu_pageset if it is slab allocated */ if (pset != &boot_pageset[cpu]) kfree(pset); - zone_pcp(zone, cpu) = NULL; } } @@ -2812,6 +2829,7 @@ static int __cpuinit pageset_cpuup_callb default: break; } + return ret; } ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-cpu-up-vs-preempt-rt.patch������������������������������������������������������0000664�0000764�0000764�00000013054�11041673101�020454� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: cpu-hotplug: cpu_up vs preempt-rt From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue, 10 Jun 2008 13:13:02 +0200 On PREEMPT_RT the allocators use preemptible locks, cpu bootstrap must have IRQs disabled because there are no IRQ/exception stacks yet, these we allocate atomically, which is not possible on -rt. Solve this by allocating these stacks on the boot cpu (which already has its stacks). This also allows cpu-up to fail instead of panic on OOM scenarios. I suspect it also fixes a memory leak, as I cannot find the place where cpu_down frees these cpu stacks, but each cpu_up used to allocate new ones. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/setup64.c | 31 ++-------------------- arch/x86/kernel/smpboot_64.c | 57 +++++++++++++++++++++++++++++++++++++++++ include/asm-x86/processor_64.h | 4 ++ 3 files changed, 65 insertions(+), 27 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/setup64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/setup64.c +++ linux-2.6.24.7/arch/x86/kernel/setup64.c @@ -137,19 +137,12 @@ void pda_init(int cpu) pda->pcurrent = &init_task; pda->irqstackptr = boot_cpu_stack; } else { - pda->irqstackptr = (char *) - __get_free_pages(GFP_ATOMIC, IRQSTACK_ORDER); - if (!pda->irqstackptr) - panic("cannot allocate irqstack for cpu %d", cpu); + pda->irqstackptr = (char *)per_cpu(init_tss, cpu).irqstack; } - pda->irqstackptr += IRQSTACKSIZE-64; } -char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ] -__attribute__((section(".bss.page_aligned"))); - extern asmlinkage void ignore_sysret(void); /* May not be marked __init: used by software suspend */ @@ -203,15 +196,13 @@ void __cpuinit cpu_init (void) struct tss_struct *t = &per_cpu(init_tss, cpu); struct orig_ist *orig_ist = &per_cpu(orig_ist, cpu); unsigned long v; - char *estacks = NULL; struct task_struct *me; int i; /* CPU 0 is initialised in head64.c */ if (cpu != 0) { pda_init(cpu); - } else - estacks = boot_exception_stacks; + } me = current; @@ -245,22 +236,8 @@ void __cpuinit cpu_init (void) /* * set up and load the per-CPU TSS */ - for (v = 0; v < N_EXCEPTION_STACKS; v++) { - static const unsigned int order[N_EXCEPTION_STACKS] = { - [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, -#if DEBUG_STACK > 0 - [DEBUG_STACK - 1] = DEBUG_STACK_ORDER -#endif - }; - if (cpu) { - estacks = (char *)__get_free_pages(GFP_ATOMIC, order[v]); - if (!estacks) - panic("Cannot allocate exception stack %ld %d\n", - v, cpu); - } - estacks += PAGE_SIZE << order[v]; - orig_ist->ist[v] = t->ist[v] = (unsigned long)estacks; - } + for (v = 0; v < N_EXCEPTION_STACKS; v++) + orig_ist->ist[v] = t->ist[v] = (unsigned long)t->estacks[v]; t->io_bitmap_base = offsetof(struct tss_struct, io_bitmap); /* Index: linux-2.6.24.7/arch/x86/kernel/smpboot_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smpboot_64.c +++ linux-2.6.24.7/arch/x86/kernel/smpboot_64.c @@ -535,6 +535,60 @@ static void __cpuinit do_fork_idle(struc complete(&c_idle->done); } +static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ] +__attribute__((section(".bss.page_aligned"))); + +static int __cpuinit allocate_stacks(int cpu) +{ + static const unsigned int order[N_EXCEPTION_STACKS] = { + [0 ... 
N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, +#if DEBUG_STACK > 0 + [DEBUG_STACK - 1] = DEBUG_STACK_ORDER +#endif + }; + struct tss_struct *t = &per_cpu(init_tss, cpu); + int node = cpu_to_node(cpu); + struct page *page; + char *estack; + int v; + + if (cpu && !t->irqstack) { + page = alloc_pages_node(node, GFP_KERNEL, + IRQSTACK_ORDER); + if (!page) + goto fail_oom; + t->irqstack = page_address(page); + } + + if (!cpu) + estack = boot_exception_stacks; + + for (v = 0; v < N_EXCEPTION_STACKS; v++) { + if (t->estacks[v]) + continue; + + if (cpu) { + page = alloc_pages_node(node, GFP_KERNEL, order[v]); + if (!page) + goto fail_oom; + estack = page_address(page); + } + estack += PAGE_SIZE << order[v]; + /* + * XXX: can we set t->isr[v] here directly, or will that be + * modified later? - the existance of orig_ist seems to suggest + * it _can_ be modified, which would imply we'd need to reset + * it. + */ + t->estacks[v] = estack; + } + + return 0; + +fail_oom: + return -ENOMEM; +} + /* * Boot one CPU. */ @@ -605,6 +659,9 @@ static int __cpuinit do_boot_cpu(int cpu return PTR_ERR(c_idle.idle); } + if (allocate_stacks(cpu)) + return -ENOMEM; + set_idle_for_cpu(cpu, c_idle.idle); do_rest: Index: linux-2.6.24.7/include/asm-x86/processor_64.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/processor_64.h +++ linux-2.6.24.7/include/asm-x86/processor_64.h @@ -197,6 +197,10 @@ struct tss_struct { * 8 bytes, for an extra "long" of ~0UL */ unsigned long io_bitmap[IO_BITMAP_LONGS + 1]; + + void *irqstack; + void *estacks[N_EXCEPTION_STACKS]; + } __attribute__((packed)) ____cacheline_aligned; ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rcu-backport-rcu-cpu-hotplug-support.patch��������������������������������������������������0000664�0000764�0000764�00000007231�11041673101�021415� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rcu: backport RCU cpu hotplug support From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue, 10 Jun 2008 13:13:03 +0200 backport the RCU cpu-hotplug support from .26-rc to .24-rt Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/rcupreempt.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 55 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -820,6 +820,13 @@ void rcu_offline_cpu_rt(int cpu) smp_mb(); /* Subsequent RCU read-side critical sections */ /* seen -after- acknowledgement. */ } + + __get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0]; + __get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1]; + + per_cpu(rcu_flipctr, cpu)[0] = 0; + per_cpu(rcu_flipctr, cpu)[1] = 0; + cpu_clear(cpu, rcu_cpu_online_map); spin_unlock_irqrestore(&rcu_ctrlblk.fliplock, oldirq); @@ -833,8 +840,9 @@ void rcu_offline_cpu_rt(int cpu) * fix. */ + local_irq_save(oldirq); rdp = RCU_DATA_ME(); - spin_lock_irqsave(&rdp->lock, oldirq); + spin_lock(&rdp->lock); *rdp->nexttail = list; if (list) rdp->nexttail = tail; @@ -866,9 +874,11 @@ void rcu_process_callbacks_rt(struct sof { unsigned long flags; struct rcu_head *next, *list; - struct rcu_data *rdp = RCU_DATA_ME(); + struct rcu_data *rdp; - spin_lock_irqsave(&rdp->lock, flags); + local_irq_save(flags); + rdp = RCU_DATA_ME(); + spin_lock(&rdp->lock); list = rdp->donelist; if (list == NULL) { spin_unlock_irqrestore(&rdp->lock, flags); @@ -951,6 +961,32 @@ int rcu_pending_rt(int cpu) return 0; } +static int __cpuinit rcu_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + long cpu = (long)hcpu; + + switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + rcu_online_cpu_rt(cpu); + break; + case CPU_UP_CANCELED: + case CPU_UP_CANCELED_FROZEN: + case CPU_DEAD: + case CPU_DEAD_FROZEN: + rcu_offline_cpu_rt(cpu); + break; + default: + break; + } + return NOTIFY_OK; +} + +static struct notifier_block __cpuinitdata rcu_nb = { + .notifier_call = rcu_cpu_notify, +}; + void __init rcu_init_rt(void) { int cpu; @@ -972,6 +1008,22 @@ void __init rcu_init_rt(void) rdp->donetail = &rdp->donelist; } rcu_preempt_boost_init(); + register_cpu_notifier(&rcu_nb); + + /* + * We don't need protection against CPU-Hotplug here + * since + * a) If a CPU comes online while we are iterating over the + * cpu_online_map below, we would only end up making a + * duplicate call to rcu_online_cpu() which sets the corresponding + * CPU's mask in the rcu_cpu_online_map. + * + * b) A CPU cannot go offline at this point in time since the user + * does not have access to the sysfs interface, nor do we + * suspend the system. 
+ */ + for_each_online_cpu(cpu) + rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, (void *)(long) cpu); } /* �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-cpu-down-vs-preempt-rt.patch����������������������������������������������������0000664�0000764�0000764�00000010476�11041673101�021004� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: cpu-hotplug: cpu_down vs preempt-rt From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Tue, 10 Jun 2008 13:13:04 +0200 idle_task_exit() calls mmdrop() from the idle thread, but in PREEMPT_RT all the allocator locks are sleeping locks - for obvious reasons scheduling away the idle thread gives some curious problems. Solve this by pushing the mmdrop() into an RCU callback, however we can't use RCU because the CPU is already down and all the local RCU state has been destroyed. Therefore create a new call_rcu() variant that enqueues the callback on an online cpu. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Gregory Haskins <ghaskins@novell.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/mm_types.h | 5 +++++ include/linux/rcupreempt.h | 2 ++ kernel/rcupreempt.c | 29 +++++++++++++++++++++++++++++ kernel/sched.c | 13 +++++++++++++ 4 files changed, 49 insertions(+) Index: linux-2.6.24.7/include/linux/mm_types.h =================================================================== --- linux-2.6.24.7.orig/include/linux/mm_types.h +++ linux-2.6.24.7/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include <linux/rbtree.h> #include <linux/rwsem.h> #include <linux/completion.h> +#include <linux/rcupdate.h> #include <asm/page.h> #include <asm/mmu.h> @@ -222,6 +223,10 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + +#ifdef CONFIG_PREEMPT_RT + struct rcu_head rcu_head; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -83,6 +83,8 @@ extern void FASTCALL(call_rcu_classic(st void (*func)(struct rcu_head *head))); extern void FASTCALL(call_rcu_preempt(struct rcu_head *head, void (*func)(struct rcu_head *head))); +extern void FASTCALL(call_rcu_preempt_online(struct rcu_head *head, + void (*func)(struct rcu_head *head))); extern void __rcu_read_lock(void); extern void __rcu_read_unlock(void); extern void __synchronize_sched(void); Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- 
linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -916,6 +916,35 @@ void fastcall call_rcu_preempt(struct rc } EXPORT_SYMBOL_GPL(call_rcu_preempt); +void fastcall call_rcu_preempt_online(struct rcu_head *head, + void (*func)(struct rcu_head *rcu)) +{ + struct rcu_data *rdp; + unsigned long flags; + int cpu; + + head->func = func; + head->next = NULL; +again: + cpu = first_cpu(cpu_online_map); + rdp = RCU_DATA_CPU(cpu); + + spin_lock_irqsave(&rdp->lock, flags); + if (unlikely(!cpu_online(cpu))) { + /* + * cpu is removed from the online map before rcu_offline_cpu + * is called. + */ + spin_unlock_irqrestore(&rdp->lock, flags); + goto again; + } + + *rdp->nexttail = head; + rdp->nexttail = &head->next; + spin_unlock_irqrestore(&rdp->lock, flags); + +} + /* * Check to see if any future RCU-related work will need to be done * by the current CPU, even if none need be done immediately, returning Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -5888,6 +5888,15 @@ void sched_idle_next(void) spin_unlock_irqrestore(&rq->lock, flags); } +#ifdef CONFIG_PREEMPT_RT +void mmdrop_rcu(struct rcu_head *head) +{ + struct mm_struct *mm = container_of(head, struct mm_struct, rcu_head); + + mmdrop(mm); +} +#endif + /* * Ensures that the idle task is using init_mm right before its cpu goes * offline. @@ -5900,7 +5909,11 @@ void idle_task_exit(void) if (mm != &init_mm) switch_mm(mm, &init_mm, current); +#ifdef CONFIG_PREEMPT_RT + call_rcu_preempt_online(&mm->rcu_head, mmdrop_rcu); +#else mmdrop(mm); +#endif } /* called under rq->lock with disabled interrupts */ ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/re-cpu-hotplug-cpu-down-vs-preempt-rt.patch�������������������������������������������������0000664�0000764�0000764�00000005216�11041673100�021403� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Re: cpu-hotplug: cpu_down vs preempt-rt From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Wed, 11 Jun 2008 08:53:45 +0200 Because 5/5 has a horrible bug... We should only do __mmdrop() from rcu, not mmdrop(). 
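For readers following this fix together with the previous patch, a condensed sketch of the corrected flow (the real diff below spreads the change across include/linux/sched.h, kernel/fork.c and kernel/sched.c; call_rcu_preempt_online() is the helper introduced by the previous patch): the mm_count reference is dropped synchronously, and only the final teardown is deferred to an online CPU's RCU callback list.

	/*
	 * Condensed sketch, not the literal diff below: the reference
	 * drop stays synchronous, only the free is deferred.
	 */
	static void ___mmdrop_rcu(struct rcu_head *head)
	{
		/* runs later, from an online CPU's RCU callback list */
		__mmdrop(container_of(head, struct mm_struct, rcu_head));
	}

	static inline void mmdrop_rcu(struct mm_struct *mm)
	{
		if (atomic_dec_and_test(&mm->mm_count))
			/* queue ___mmdrop_rcu via call_rcu_preempt_online() */
			__mmdrop_rcu(mm);
	}

idle_task_exit() then simply calls mmdrop_rcu(mm) under CONFIG_PREEMPT_RT instead of queueing the whole mmdrop().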
--- --- include/linux/sched.h | 7 +++++++ kernel/fork.c | 12 ++++++++++++ kernel/sched.c | 11 +---------- 3 files changed, 20 insertions(+), 10 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -1832,6 +1832,7 @@ extern struct mm_struct * mm_alloc(void) /* mmdrop drops the mm and the page tables */ extern void FASTCALL(__mmdrop(struct mm_struct *)); extern void FASTCALL(__mmdrop_delayed(struct mm_struct *)); +extern void FASTCALL(__mmdrop_rcu(struct mm_struct *)); static inline void mmdrop(struct mm_struct * mm) { @@ -1845,6 +1846,12 @@ static inline void mmdrop_delayed(struct __mmdrop_delayed(mm); } +static inline void mmdrop_rcu(struct mm_struct * mm) +{ + if (atomic_dec_and_test(&mm->mm_count)) + __mmdrop_rcu(mm); +} + /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); /* Grab a reference to a task's mm, if it is not already going away */ Index: linux-2.6.24.7/kernel/fork.c =================================================================== --- linux-2.6.24.7.orig/kernel/fork.c +++ linux-2.6.24.7/kernel/fork.c @@ -431,6 +431,18 @@ void fastcall __mmdrop(struct mm_struct free_mm(mm); } +#ifdef CONFIG_PREEMPT_RT +static void ___mmdrop_rcu(struct rcu_head *head) +{ + __mmdrop(container_of(head, struct mm_struct, rcu_head)); +} + +void fastcall __mmdrop_rcu(struct mm_struct *mm) +{ + call_rcu_preempt_online(&mm->rcu_head, ___mmdrop_rcu); +} +#endif + /* * Decrement the use count and release all resources for an mm. */ Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -5888,15 +5888,6 @@ void sched_idle_next(void) spin_unlock_irqrestore(&rq->lock, flags); } -#ifdef CONFIG_PREEMPT_RT -void mmdrop_rcu(struct rcu_head *head) -{ - struct mm_struct *mm = container_of(head, struct mm_struct, rcu_head); - - mmdrop(mm); -} -#endif - /* * Ensures that the idle task is using init_mm right before its cpu goes * offline. @@ -5910,7 +5901,7 @@ void idle_task_exit(void) if (mm != &init_mm) switch_mm(mm, &init_mm, current); #ifdef CONFIG_PREEMPT_RT - call_rcu_preempt_online(&mm->rcu_head, mmdrop_rcu); + mmdrop_rcu(mm); #else mmdrop(mm); #endif ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/rt-rwlock-conservative-locking.patch��������������������������������������������������������0000664�0000764�0000764�00000004322�11041657730�020341� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Date: Tue, 15 Jul 2008 20:26:50 -0400 (EDT) From: Steven Rostedt <rostedt@goodmis.org> Subject: [PATCH RT] rwlock: be more conservative in locking reader_lock_count John Stultz was hitting one of the rwlock warnings. This was indeed a bug. 
The assumption of trying not to take locks was incorrect and prone to bugs.
This patch adds a few locks around the needed areas to correct the issue and
make the code a bit more robust.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rtmutex.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: linux-2.6.24.7/kernel/rtmutex.c
===================================================================
--- linux-2.6.24.7.orig/kernel/rtmutex.c
+++ linux-2.6.24.7/kernel/rtmutex.c
@@ -1137,16 +1137,13 @@ rt_rwlock_update_owner(struct rw_mutex *
 	if (own == RT_RW_READER)
 		return;
 
-	/*
-	 * We don't need to grab the pi_lock to look at the reader list
-	 * since we hold the rwm wait_lock. We only care about the pointer
-	 * to this lock, and we own the wait_lock, so that pointer
-	 * can't be changed.
-	 */
+	spin_lock(&own->pi_lock);
 	for (i = own->reader_lock_count - 1; i >= 0; i--) {
 		if (own->owned_read_locks[i].lock == rwm)
 			break;
 	}
+	spin_unlock(&own->pi_lock);
+
 	/* It is possible the owner didn't add it yet */
 	if (i < 0)
 		return;
@@ -1453,7 +1450,6 @@ __rt_read_fasttrylock(struct rw_mutex *r
 		current->owned_read_locks[reader_count].count = 1;
 	} else
 		WARN_ON_ONCE(1);
-	spin_unlock(&current->pi_lock);
 	/*
 	 * If this task is no longer the sole owner of the lock
 	 * or someone is blocking, then we need to add the task
@@ -1463,12 +1459,16 @@ __rt_read_fasttrylock(struct rw_mutex *r
 		struct rt_mutex *mutex = &rwm->mutex;
 		struct reader_lock_struct *rls;
 
+		/* preserve lock order, we only need wait_lock now */
+		spin_unlock(&current->pi_lock);
+
 		spin_lock(&mutex->wait_lock);
 		rls = &current->owned_read_locks[reader_count];
 		if (!rls->list.prev || list_empty(&rls->list))
			list_add(&rls->list, &rwm->readers);
 		spin_unlock(&mutex->wait_lock);
-	}
+	} else
+		spin_unlock(&current->pi_lock);
 	local_irq_restore(flags);
 	return 1;
 }

patches/ftrace-call-function-pointer.patch

Date: Tue, 15 Jul 2008 08:09:29 -0700
From: Josh Triplett <josht@linux.vnet.ibm.com>
Subject: [PATCH] ftrace: Actually call function pointer in ftrace_stop

ftrace_stop used a function pointer as a no-op expression, rather than
actually calling it.
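To make the one-line fix below easier to spot, here is the class of bug in isolation (the struct and member names are simplified stand-ins, not the actual kernel/trace/trace.c definitions): a bare function-pointer expression compiles, usually with at most a "statement with no effect" warning, but it never invokes the function.

	/* Simplified illustration - names are placeholders. */
	struct trace_array;

	struct tracer_ops {
		void (*ctrl_update)(struct trace_array *tr);
	};

	static void broken(struct tracer_ops *ops, struct trace_array *tr)
	{
		ops->ctrl_update;		/* evaluates the pointer, discards it: no call */
	}

	static void fixed(struct tracer_ops *ops, struct trace_array *tr)
	{
		if (ops->ctrl_update)
			ops->ctrl_update(tr);	/* the update hook actually runs */
	}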
Signed-off-by: Josh Triplett <josh@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -3268,7 +3268,7 @@ void ftrace_stop(void) if (tr->ctrl) { tr->ctrl = 0; if (saved_tracer && saved_tracer->ctrl_update) - saved_tracer->ctrl_update; + saved_tracer->ctrl_update(tr); } �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/idle-fix.diff�������������������������������������������������������������������������������0000664�0000764�0000764�00000016535�11041657734�013612� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/arm/kernel/process.c | 2 +- arch/mips/kernel/process.c | 2 +- arch/powerpc/kernel/idle.c | 2 +- arch/powerpc/platforms/iseries/setup.c | 4 ++-- arch/sparc64/kernel/process.c | 2 +- arch/um/kernel/process.c | 2 +- arch/x86/kernel/process_32.c | 2 +- arch/x86/kernel/process_64.c | 2 +- include/linux/tick.h | 5 +++-- kernel/softirq.c | 2 +- kernel/time/tick-sched.c | 13 +++++++++++-- 11 files changed, 24 insertions(+), 14 deletions(-) Index: linux-2.6.24.7/arch/arm/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/arm/kernel/process.c +++ linux-2.6.24.7/arch/arm/kernel/process.c @@ -167,7 +167,7 @@ void cpu_idle(void) if (!idle) idle = default_idle; leds_event(led_idle_start); - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed()) idle(); leds_event(led_idle_end); Index: linux-2.6.24.7/arch/mips/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/mips/kernel/process.c +++ linux-2.6.24.7/arch/mips/kernel/process.c @@ -53,7 +53,7 @@ void __noreturn cpu_idle(void) { /* endless idle loop with no priority at all */ while (1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed()) { #ifdef CONFIG_SMTC_IDLE_HOOK_DEBUG extern void smtc_idle_loop_hook(void); Index: linux-2.6.24.7/arch/powerpc/kernel/idle.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/idle.c +++ linux-2.6.24.7/arch/powerpc/kernel/idle.c @@ -60,7 +60,7 @@ void cpu_idle(void) set_thread_flag(TIF_POLLING_NRFLAG); while (1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed() && !cpu_should_die()) { ppc64_runlatch_off(); Index: linux-2.6.24.7/arch/powerpc/platforms/iseries/setup.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/iseries/setup.c +++ linux-2.6.24.7/arch/powerpc/platforms/iseries/setup.c @@ -563,7 +563,7 @@ static void yield_shared_processor(void) static void iseries_shared_idle(void) { while (1) { - tick_nohz_stop_sched_tick(); + 
tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed() && !hvlpevent_is_pending()) { local_irq_disable(); @@ -595,7 +595,7 @@ static void iseries_dedicated_idle(void) set_thread_flag(TIF_POLLING_NRFLAG); while (1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); if (!need_resched()) { while (!need_resched()) { ppc64_runlatch_off(); Index: linux-2.6.24.7/arch/sparc64/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/sparc64/kernel/process.c +++ linux-2.6.24.7/arch/sparc64/kernel/process.c @@ -93,7 +93,7 @@ void cpu_idle(void) set_thread_flag(TIF_POLLING_NRFLAG); while(1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !cpu_is_offline(cpu)) sparc64_yield(cpu); Index: linux-2.6.24.7/arch/um/kernel/process.c =================================================================== --- linux-2.6.24.7.orig/arch/um/kernel/process.c +++ linux-2.6.24.7/arch/um/kernel/process.c @@ -247,7 +247,7 @@ void default_idle(void) if (need_resched()) schedule(); - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); nsecs = disable_timer(); idle_sleep(nsecs); tick_nohz_restart_sched_tick(); Index: linux-2.6.24.7/arch/x86/kernel/process_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_32.c +++ linux-2.6.24.7/arch/x86/kernel/process_32.c @@ -179,7 +179,7 @@ void cpu_idle(void) /* endless idle loop with no priority at all */ while (1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); Index: linux-2.6.24.7/arch/x86/kernel/process_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/process_64.c +++ linux-2.6.24.7/arch/x86/kernel/process_64.c @@ -212,7 +212,7 @@ void cpu_idle (void) current_thread_info()->status |= TS_POLLING; /* endless idle loop with no priority at all */ while (1) { - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(1); while (!need_resched() && !need_resched_delayed()) { void (*idle)(void); Index: linux-2.6.24.7/include/linux/tick.h =================================================================== --- linux-2.6.24.7.orig/include/linux/tick.h +++ linux-2.6.24.7/include/linux/tick.h @@ -47,6 +47,7 @@ struct tick_sched { unsigned long check_clocks; enum tick_nohz_mode nohz_mode; ktime_t idle_tick; + int inidle; int tick_stopped; unsigned long idle_jiffies; unsigned long idle_calls; @@ -99,12 +100,12 @@ static inline int tick_check_oneshot_cha #endif /* !CONFIG_GENERIC_CLOCKEVENTS */ # ifdef CONFIG_NO_HZ -extern void tick_nohz_stop_sched_tick(void); +extern void tick_nohz_stop_sched_tick(int inindle); extern void tick_nohz_restart_sched_tick(void); extern void tick_nohz_update_jiffies(void); extern ktime_t tick_nohz_get_sleep_length(void); # else -static inline void tick_nohz_stop_sched_tick(void) { } +static inline void tick_nohz_stop_sched_tick(int inidle) { } static inline void tick_nohz_restart_sched_tick(void) { } static inline void tick_nohz_update_jiffies(void) { } static inline ktime_t tick_nohz_get_sleep_length(void) Index: linux-2.6.24.7/kernel/softirq.c =================================================================== --- linux-2.6.24.7.orig/kernel/softirq.c +++ linux-2.6.24.7/kernel/softirq.c @@ -479,7 +479,7 @@ void irq_exit(void) #ifdef CONFIG_NO_HZ /* Make sure that timer wheel updates are propagated */ if 
(!in_interrupt() && idle_cpu(smp_processor_id()) && !need_resched()) - tick_nohz_stop_sched_tick(); + tick_nohz_stop_sched_tick(0); rcu_irq_exit(); #endif __preempt_enable_no_resched(); Index: linux-2.6.24.7/kernel/time/tick-sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/time/tick-sched.c +++ linux-2.6.24.7/kernel/time/tick-sched.c @@ -150,7 +150,7 @@ void tick_nohz_update_jiffies(void) * Called either from the idle loop or from irq_exit() when an idle period was * just interrupted by an interrupt which did not cause a reschedule. */ -void tick_nohz_stop_sched_tick(void) +void tick_nohz_stop_sched_tick(int inidle) { unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags; struct tick_sched *ts; @@ -178,6 +178,11 @@ void tick_nohz_stop_sched_tick(void) if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) goto end; + if (!inidle && !ts->inidle) + goto end; + + ts->inidle = 1; + if (need_resched() || need_resched_delayed()) goto end; @@ -338,8 +343,12 @@ void tick_nohz_restart_sched_tick(void) unsigned long ticks; ktime_t now, delta; - if (!ts->tick_stopped) + if (!ts->inidle || !ts->tick_stopped) { + ts->inidle = 0; return; + } + + ts->inidle = 0; rcu_exit_nohz(); �������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-cpu-down-vs-preempt-rt_fix.patch������������������������������������������������0000664�0000764�0000764�00000000705�11041657733�021661� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- kernel/rcupreempt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/rcupreempt.c =================================================================== --- linux-2.6.24.7.orig/kernel/rcupreempt.c +++ linux-2.6.24.7/kernel/rcupreempt.c @@ -860,7 +860,7 @@ void __devinit rcu_online_cpu_rt(int cpu #else /* #ifdef CONFIG_HOTPLUG_CPU */ -void rcu_offline_cpu(int cpu) +void rcu_offline_cpu_rt(int cpu) { } �����������������������������������������������������������patches/fix_misplaced_mb.patch����������������������������������������������������������������������0000664�0000764�0000764�00000003671�11041657732�015560� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: rcu: fix misplaced mb() From: Paul E. McKenney <paulmck@linux.vnet.ibm.com> In the process of writing up the mechanical proof of correctness for the dynticks/preemptable-RCU interface, I noticed misplaced memory barriers in rcu_enter_nohz() and rcu_exit_nohz(). This patch puts them in the right place and adds a comment. The key thing to keep in mind is that rcu_enter_nohz() is -exiting- the mode that can legally execute RCU read-side critical sections. 
The memory barrier must be between any potential RCU read-side critical sections and the increment of the per-CPU dynticks_progress_counter, and thus must come -before- this increment. And vice versa for rcu_exit_nohz(). The locking in the scheduler is probably saving us for the moment. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcupreempt.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/include/linux/rcupreempt.h =================================================================== --- linux-2.6.24.7.orig/include/linux/rcupreempt.h +++ linux-2.6.24.7/include/linux/rcupreempt.h @@ -105,6 +105,7 @@ DECLARE_PER_CPU(long, dynticks_progress_ static inline void rcu_enter_nohz(void) { + smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */ __get_cpu_var(dynticks_progress_counter)++; if (unlikely(__get_cpu_var(dynticks_progress_counter) & 0x1)) { printk("BUG: bad accounting of dynamic ticks\n"); @@ -113,13 +114,12 @@ static inline void rcu_enter_nohz(void) /* try to fix it */ __get_cpu_var(dynticks_progress_counter)++; } - mb(); } static inline void rcu_exit_nohz(void) { - mb(); __get_cpu_var(dynticks_progress_counter)++; + smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */ if (unlikely(!(__get_cpu_var(dynticks_progress_counter) & 0x1))) { printk("BUG: bad accounting of dynamic ticks\n"); printk(" will try to fix, but it is best to reboot\n"); �����������������������������������������������������������������������patches/fix_sys_sched_rr_get_interval_slice_for_SCHED_FIFO_tasks.patch������������������������������0000664�0000764�0000764�00000002116�11041657735�025332� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched: fix the wrong time slice value for SCHED_FIFO tasks From: Miao Xie <miaox@cn.fujitsu.com> X-Git-Tag: v2.6.25-rc5~13^2~2 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=1868f958eb56fc41c5985c8732e564a400c5fdf5 sched: fix the wrong time slice value for SCHED_FIFO tasks Function sys_sched_rr_get_interval returns wrong time slice value for SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0. 
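For illustration only (not part of the patch), a small userspace check of the corrected behaviour; it assumes a Linux box where the program is allowed to switch itself to SCHED_FIFO:

	#include <sched.h>
	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
		struct sched_param sp = { .sched_priority = 10 };
		struct timespec ts;

		if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
			perror("sched_setscheduler (needs root)");
			return 1;
		}
		if (sched_rr_get_interval(0, &ts) == -1) {
			perror("sched_rr_get_interval");
			return 1;
		}
		/* with the fix: 0.000000000 for SCHED_FIFO */
		printf("timeslice: %ld.%09ld s\n", (long)ts.tv_sec, ts.tv_nsec);
		return 0;
	}

With the fix applied the reported interval is 0 for SCHED_FIFO, a fixed round-robin quantum for SCHED_RR, and the CFS slice for ordinary tasks.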
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.24.7/kernel/sched.c
===================================================================
--- linux-2.6.24.7.orig/kernel/sched.c
+++ linux-2.6.24.7/kernel/sched.c
@@ -5392,7 +5392,7 @@ long sys_sched_rr_get_interval(pid_t pid
 	time_slice = 0;
 	if (p->policy == SCHED_RR) {
 		time_slice = DEF_TIMESLICE;
-	} else {
+	} else if (p->policy != SCHED_FIFO) {
 		struct sched_entity *se = &p->se;
 		unsigned long flags;
 		struct rq *rq;

patches/ftrace-preempt-trace-check.patch

Subject: ftrace: only trace preempt off with preempt tracer
From: Steven Rostedt <srostedt@redhat.com>

When PREEMPT_TRACER and IRQSOFF_TRACER are both configured and irqsoff
tracer is running, the preempt_off sections might also be traced.

Thanks to Andrew Morton for pointing out my mistake of spin_lock disabling
interrupts while he was reviewing ftrace.txt. Seems that my example I used
actually hit this bug.
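Because the diff below is hard to read in this flattened dump, the resulting hooks are sketched here (condensed; preempt_trace() is the predicate already present in kernel/trace/trace_irqsoff.c that reports whether preempt-off tracing is active):

	/* Resulting shape of the hooks after this patch (condensed). */
	void trace_preempt_on(unsigned long a0, unsigned long a1)
	{
		tracing_hist_preempt_stop(0);
		if (preempt_trace())			/* skip when only irqsoff is tracing */
			stop_critical_timing(a0, a1);
	}

	void trace_preempt_off(unsigned long a0, unsigned long a1)
	{
		start_critical_timing(a0, a1);
		if (preempt_trace())
			tracing_hist_preempt_start();
	}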
Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- kernel/trace/trace_irqsoff.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/kernel/trace/trace_irqsoff.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace_irqsoff.c +++ linux-2.6.24.7/kernel/trace/trace_irqsoff.c @@ -358,13 +358,15 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller) void trace_preempt_on(unsigned long a0, unsigned long a1) { tracing_hist_preempt_stop(0); - stop_critical_timing(a0, a1); + if (preempt_trace()) + stop_critical_timing(a0, a1); } void trace_preempt_off(unsigned long a0, unsigned long a1) { start_critical_timing(a0, a1); - tracing_hist_preempt_start(); + if (preempt_trace()) + tracing_hist_preempt_start(); } #endif /* CONFIG_PREEMPT_TRACER */ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix_SCHED_FIFO_spec_violation.patch���������������������������������������������������������0000664�0000764�0000764�00000020424�11041673076�017662� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Enqueue deprioritized RT tasks to head of prio array From: Clark Williams <williams@redhat.com> This patch backports Peter Z's enqueue to head of prio array on de-prioritization to 2.6.24.7-rt14 which doesn't have the enqueue_rt_entity and associated changes. I've run several long running real-time java benchmarks and it's holding so far. Steven, please consider this patch for inclusion in the next 2.6.24.7-rtX release. Peter, I didn't include your Signed-off-by as only about half your original patch applied to 2.6.24.7-r14. If you're happy with this version, would you also sign off? 
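The core of the backport, condensed from the diff below: the enqueue/dequeue sched_class hooks gain a flags argument, and the RT enqueue path honours an ENQUEUE_HEAD flag that task_setprio() sets when a task's prio number is being raised (i.e. the task is de-prioritized), so the task is queued at the head of its new priority list as POSIX requires for SCHED_FIFO.

	/* Condensed from the diff below - flag values as defined in sched.h. */
	#define ENQUEUE_WAKEUP	0x01
	#define ENQUEUE_HEAD	0x02

	static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
	{
		struct rt_prio_array *array = &rq->rt.active;

		if (unlikely(flags & ENQUEUE_HEAD))
			/* de-prioritized task keeps its turn: go to the head */
			list_add(&p->run_list, array->queue + p->prio);
		else
			list_add_tail(&p->run_list, array->queue + p->prio);

		__set_bit(p->prio, array->bitmap);
		inc_rt_tasks(p, rq);
	}

On the caller side, task_setprio() computes down = (prio > p->prio) ? ENQUEUE_HEAD : 0 before updating p->prio and passes it to enqueue_task().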
Signed-off-by: Darren Hart <dvhltc@us.ibm.com> --- include/linux/sched.h | 9 +++++++-- kernel/sched.c | 27 ++++++++++++++------------- kernel/sched_fair.c | 6 ++++-- kernel/sched_idletask.c | 2 +- kernel/sched_rt.c | 13 +++++++++---- 5 files changed, 35 insertions(+), 22 deletions(-) Index: linux-2.6.24.7/include/linux/sched.h =================================================================== --- linux-2.6.24.7.orig/include/linux/sched.h +++ linux-2.6.24.7/include/linux/sched.h @@ -897,11 +897,16 @@ struct uts_namespace; struct rq; struct sched_domain; +#define ENQUEUE_WAKEUP 0x01 +#define ENQUEUE_HEAD 0x02 + +#define DEQUEUE_SLEEP 0x01 + struct sched_class { const struct sched_class *next; - void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup); - void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep); + void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); + void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); void (*yield_task) (struct rq *rq); int (*select_task_rq)(struct task_struct *p, int sync); Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -1046,7 +1046,7 @@ static const u32 prio_to_wmult[40] = { /* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153, }; -static void activate_task(struct rq *rq, struct task_struct *p, int wakeup); +static void activate_task(struct rq *rq, struct task_struct *p, int flags); /* * runqueue iterator, to support SMP load-balancing between different @@ -1155,16 +1155,16 @@ static void set_load_weight(struct task_ p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO]; } -static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup) +static void enqueue_task(struct rq *rq, struct task_struct *p, int flags) { sched_info_queued(p); - p->sched_class->enqueue_task(rq, p, wakeup); + p->sched_class->enqueue_task(rq, p, flags); p->se.on_rq = 1; } -static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) +static void dequeue_task(struct rq *rq, struct task_struct *p, int flags) { - p->sched_class->dequeue_task(rq, p, sleep); + p->sched_class->dequeue_task(rq, p, flags); p->se.on_rq = 0; } @@ -1219,26 +1219,26 @@ static int effective_prio(struct task_st /* * activate_task - move a task to the runqueue. */ -static void activate_task(struct rq *rq, struct task_struct *p, int wakeup) +static void activate_task(struct rq *rq, struct task_struct *p, int flags) { if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; ftrace_event_task_activate(p, cpu_of(rq)); - enqueue_task(rq, p, wakeup); + enqueue_task(rq, p, flags); inc_nr_running(p, rq); } /* * deactivate_task - remove a task from the runqueue. 
*/ -static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep) +static void deactivate_task(struct rq *rq, struct task_struct *p, int flags) { if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; ftrace_event_task_deactivate(p, cpu_of(rq)); - dequeue_task(rq, p, sleep); + dequeue_task(rq, p, flags); dec_nr_running(p, rq); } @@ -1759,7 +1759,7 @@ out_activate: else schedstat_inc(p, se.nr_wakeups_remote); update_rq_clock(rq); - activate_task(rq, p, 1); + activate_task(rq, p, ENQUEUE_WAKEUP); check_preempt_curr(rq, p); success = 1; @@ -3968,7 +3968,7 @@ asmlinkage void __sched __schedule(void) prev->state = TASK_RUNNING; } else { touch_softlockup_watchdog(); - deactivate_task(rq, prev, 1); + deactivate_task(rq, prev, DEQUEUE_SLEEP); } switch_count = &prev->nvcsw; } @@ -4431,7 +4431,7 @@ EXPORT_SYMBOL(sleep_on_timeout); void task_setprio(struct task_struct *p, int prio) { unsigned long flags; - int oldprio, prev_resched, on_rq, running; + int oldprio, prev_resched, on_rq, running, down; struct rq *rq; const struct sched_class *prev_class = p->sched_class; @@ -4472,6 +4472,7 @@ void task_setprio(struct task_struct *p, else p->sched_class = &fair_sched_class; + down = (prio > p->prio) ? ENQUEUE_HEAD : 0; p->prio = prio; // trace_special_pid(p->pid, __PRIO(oldprio), PRIO(p)); @@ -4480,7 +4481,7 @@ void task_setprio(struct task_struct *p, if (running) p->sched_class->set_curr_task(rq); if (on_rq) { - enqueue_task(rq, p, 0); + enqueue_task(rq, p, down); check_class_changed(rq, p, prev_class, oldprio, running); } // trace_special(prev_resched, _need_resched(), 0); Index: linux-2.6.24.7/kernel/sched_fair.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_fair.c +++ linux-2.6.24.7/kernel/sched_fair.c @@ -756,10 +756,11 @@ static inline struct sched_entity *paren * increased. Here we update the fair scheduling stats and * then put the task into the rbtree: */ -static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup) +static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; + int wakeup = flags & ENQUEUE_WAKEUP; for_each_sched_entity(se) { if (se->on_rq) @@ -775,10 +776,11 @@ static void enqueue_task_fair(struct rq * decreased. 
We remove the task from the rbtree and * update the fair scheduling stats: */ -static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep) +static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) { struct cfs_rq *cfs_rq; struct sched_entity *se = &p->se; + int sleep = flags & DEQUEUE_SLEEP; for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); Index: linux-2.6.24.7/kernel/sched_idletask.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_idletask.c +++ linux-2.6.24.7/kernel/sched_idletask.c @@ -31,7 +31,7 @@ static struct task_struct *pick_next_tas * message if some code attempts to do it: */ static void -dequeue_task_idle(struct rq *rq, struct task_struct *p, int sleep) +dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags) { spin_unlock_irq(&rq->lock); printk(KERN_ERR "bad: scheduling from the idle thread!\n"); Index: linux-2.6.24.7/kernel/sched_rt.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched_rt.c +++ linux-2.6.24.7/kernel/sched_rt.c @@ -181,11 +181,16 @@ unsigned long rt_nr_uninterruptible_cpu( return cpu_rq(cpu)->rt.rt_nr_uninterruptible; } -static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup) +static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags) { struct rt_prio_array *array = &rq->rt.active; - list_add_tail(&p->run_list, array->queue + p->prio); + + if (unlikely(flags & ENQUEUE_HEAD)) + list_add(&p->run_list, array->queue + p->prio); + else + list_add_tail(&p->run_list, array->queue + p->prio); + __set_bit(p->prio, array->bitmap); inc_rt_tasks(p, rq); @@ -196,7 +201,7 @@ static void enqueue_task_rt(struct rq *r /* * Adding/removing a task to/from a priority array: */ -static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep) +static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags) { struct rt_prio_array *array = &rq->rt.active; @@ -306,7 +311,7 @@ static void put_prev_task_rt(struct rq * #define RT_MAX_TRIES 3 static int double_lock_balance(struct rq *this_rq, struct rq *busiest); -static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep); +static void deactivate_task(struct rq *rq, struct task_struct *p, int flags); static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) { ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/ppc64-fix-preempt-unsafe-paths-accessing-per_cpu-variables.patch����������������������������0000644�0000764�0000764�00000014115�11043606267�025407� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������From chirag@linux.vnet.ibm.com Wed Jul 9 18:33:02 2008 Date: Wed, 9 Jul 2008 21:35:43 +0530 From: Chirag Jog <chirag@linux.vnet.ibm.com> To: linux.kernel@vger.kernel.org, linux-rt-users@vger.kernel.org, linuxppc-dev@ozlabs.org Cc: Dinakar Guniguntala <dino@in.ibm.com>, Timothy R. 
Chavez <tim.chavez@linux.vnet.ibm.com>, paulmck@linux.vnet.ibm.com, Nivedita Singhvi <niv@us.ibm.com>, Josh Triplett <josht@linux.vnet.ibm.com>, Steven Rostedt <rostedt@goodmis.org> Subject: [PATCH][RT][PPC64] Fix preempt unsafe paths accessing per_cpu variables Hi, This patch fixes various paths in the -rt kernel on powerpc64 where per_cpu variables are accessed in a preempt unsafe way. When a power box with -rt kernel is booted, multiple BUG messages are generated "BUG: init:1 task might have lost a preemption check!". After booting a kernel with these patches applied, these messages don't appear. Also I ran the realtime tests from ltp to ensure the stability. Signed-Off-By: Chirag <chirag@linux.vnet.ibm.com> arch/powerpc/mm/tlb_64.c | 31 ++++++++++++++++--------------- arch/powerpc/platforms/pseries/iommu.c | 14 ++++++++++---- include/asm-powerpc/tlb.h | 5 ++--- 3 files changed, 28 insertions(+), 22 deletions(-) Index: linux-2.6.25.8-rt7/arch/powerpc/mm/tlb_64.c =================================================================== --- linux-2.6.25.8-rt7.orig/arch/powerpc/mm/tlb_64.c 2008-07-09 21:29:21.000000000 +0530 +++ linux-2.6.25.8-rt7/arch/powerpc/mm/tlb_64.c 2008-07-09 21:30:37.000000000 +0530 @@ -38,7 +38,6 @@ * include/asm-powerpc/tlb.h file -- tgall */ DEFINE_PER_CPU_LOCKED(struct mmu_gather, mmu_gathers); -DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); unsigned long pte_freelist_forced_free; struct pte_freelist_batch @@ -48,7 +47,7 @@ pgtable_free_t tables[0]; }; -DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +DEFINE_PER_CPU_LOCKED(struct pte_freelist_batch *, pte_freelist_cur); unsigned long pte_freelist_forced_free; #define PTE_FREELIST_SIZE \ @@ -92,24 +91,21 @@ void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf) { - /* - * This is safe since tlb_gather_mmu has disabled preemption. - * tlb->cpu is set by tlb_gather_mmu as well. 
- */ + int cpu; cpumask_t local_cpumask = cpumask_of_cpu(tlb->cpu); - struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + struct pte_freelist_batch **batchp = &get_cpu_var_locked(pte_freelist_cur, &cpu); if (atomic_read(&tlb->mm->mm_users) < 2 || cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) { pgtable_free(pgf); - return; + goto cleanup; } if (*batchp == NULL) { *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); if (*batchp == NULL) { pgtable_free_now(pgf); - return; + goto cleanup; } (*batchp)->index = 0; } @@ -118,6 +114,9 @@ pte_free_submit(*batchp); *batchp = NULL; } + + cleanup: + put_cpu_var_locked(pte_freelist_cur, cpu); } /* @@ -253,13 +252,15 @@ void pte_free_finish(void) { - /* This is safe since tlb_gather_mmu has disabled preemption */ - struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur); + int cpu; + struct pte_freelist_batch **batchp = &get_cpu_var_locked(pte_freelist_cur, &cpu); - if (*batchp == NULL) - return; - pte_free_submit(*batchp); - *batchp = NULL; + if (*batchp) { + pte_free_submit(*batchp); + *batchp = NULL; + } + + put_cpu_var_locked(pte_freelist_cur, cpu); } /** Index: linux-2.6.25.8-rt7/include/asm-powerpc/tlb.h =================================================================== --- linux-2.6.25.8-rt7.orig/include/asm-powerpc/tlb.h 2008-07-09 21:29:21.000000000 +0530 +++ linux-2.6.25.8-rt7/include/asm-powerpc/tlb.h 2008-07-09 21:29:41.000000000 +0530 @@ -40,18 +40,17 @@ static inline void tlb_flush(struct mmu_gather *tlb) { - struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch); + struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch); /* If there's a TLB batch pending, then we must flush it because the * pages are going to be freed and we really don't want to have a CPU * access a freed page because it has a stale TLB */ if (tlbbatch->index) { - preempt_disable(); __flush_tlb_pending(tlbbatch); - preempt_enable(); } + put_cpu_var(ppc64_tlb_batch); pte_free_finish(); } Index: linux-2.6.25.8-rt7/arch/powerpc/platforms/pseries/iommu.c =================================================================== --- linux-2.6.25.8-rt7.orig/arch/powerpc/platforms/pseries/iommu.c 2008-07-09 21:29:21.000000000 +0530 +++ linux-2.6.25.8-rt7/arch/powerpc/platforms/pseries/iommu.c 2008-07-09 21:29:41.000000000 +0530 @@ -124,7 +124,7 @@ } } -static DEFINE_PER_CPU(u64 *, tce_page) = NULL; +static DEFINE_PER_CPU_LOCKED(u64 *, tce_page) = NULL; static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, @@ -135,12 +135,13 @@ u64 *tcep; u64 rpn; long l, limit; + int cpu; if (npages == 1) return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, direction); - tcep = __get_cpu_var(tce_page); + tcep = get_cpu_var_locked(tce_page, &cpu); /* This is safe to do since interrupts are off when we're called * from iommu_alloc{,_sg}() @@ -148,10 +149,13 @@ if (!tcep) { tcep = (u64 *)__get_free_page(GFP_ATOMIC); /* If allocation fails, fall back to the loop implementation */ - if (!tcep) + if (!tcep) { + put_cpu_var_locked(tce_page, cpu); return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, direction); - __get_cpu_var(tce_page) = tcep; + } + + per_cpu_var_locked(tce_page, cpu) = tcep; } rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT; @@ -188,6 +192,8 @@ printk("\ttce[0] val = 0x%lx\n", tcep[0]); show_stack(current, (unsigned long *)__get_SP()); } + + put_cpu_var_locked(tce_page, cpu); } static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages) -- To 
unsubscribe from this list: send the line "unsubscribe linux-rt-users" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/bz235099-idle-load-fix.patch����������������������������������������������������������������0000664�0000764�0000764�00000010613�11041673075�016006� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Michal Schmidt's fix for load average calculation From: Michal Schmidt <mschmidt@redhat.com> Subject: Re: [PATCH] idle load == # of CPUs This is an attempt to fix https://bugzilla.redhat.com/show_bug.cgi?id=253099 The bug is caused by the fact that the local timer interrupts happen usually at the same time on all CPUs. All CPUs then raise their timer softirqs. When calc_load() runs, it sees not only itself but all the other softirq-timer per-cpu threads running too. And softirq-sched too, which is woken up from scheduler_tick(). In a BZ comment I speculated about three possible solutions: (a) Somehow make sure the timer interrupts are synchronized between CPUs, but with a per-CPU offset, so that they don't fire at the same time. (b) Make the timer softirq threads (and softirq-sched?) special and not take them into loadavg calculation. (c) Don't calculate loadavg from the timer softirq. It should be possible to run calc_load() periodically from a hrtimer. This patch implements (b). Comments welcome. 
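Condensed from the diff below: under CONFIG_PREEMPT_RT the periodic calc_load() call moves out of the timer tick into a dedicated SCHED_FIFO kthread that sleeps LOAD_FREQ worth of ticks on an hrtimer, while the /proc/loadavg readers sample avenrun under the xtime_lock seqlock.

	/* Condensed sketch of the new loadavg thread - see the diff below. */
	static int loadavg_calculator(void *data)
	{
		unsigned long now, last = jiffies;

		while (!kthread_should_stop()) {
			struct timespec delay = {
				.tv_sec  = LOAD_FREQ / HZ,
				.tv_nsec = 0
			};

			hrtimer_nanosleep(&delay, NULL, HRTIMER_MODE_REL,
					  CLOCK_MONOTONIC);
			now = jiffies;
			write_seqlock_irq(&xtime_lock);
			calc_load(now - last);	/* no softirq threads inflating the count */
			write_sequnlock_irq(&xtime_lock);
			last = now;
		}
		return 0;
	}

With this thread in place, update_times() skips calc_load() entirely when PREEMPT_RT is configured.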
Michal --- fs/proc/proc_misc.c | 22 ++++++++++++++++------ kernel/timer.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 64 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/fs/proc/proc_misc.c =================================================================== --- linux-2.6.24.7.orig/fs/proc/proc_misc.c +++ linux-2.6.24.7/fs/proc/proc_misc.c @@ -83,10 +83,15 @@ static int loadavg_read_proc(char *page, { int a, b, c; int len; + unsigned long seq; + + do { + seq = read_seqbegin(&xtime_lock); + a = avenrun[0] + (FIXED_1/200); + b = avenrun[1] + (FIXED_1/200); + c = avenrun[2] + (FIXED_1/200); + } while (read_seqretry(&xtime_lock, seq)); - a = avenrun[0] + (FIXED_1/200); - b = avenrun[1] + (FIXED_1/200); - c = avenrun[2] + (FIXED_1/200); len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n", LOAD_INT(a), LOAD_FRAC(a), LOAD_INT(b), LOAD_FRAC(b), @@ -104,10 +109,15 @@ static int loadavg_rt_read_proc(char *pa extern unsigned long rt_nr_running(void); int a, b, c; int len; + unsigned long seq; + + do { + seq = read_seqbegin(&xtime_lock); + a = avenrun_rt[0] + (FIXED_1/200); + b = avenrun_rt[1] + (FIXED_1/200); + c = avenrun_rt[2] + (FIXED_1/200); + } while (read_seqretry(&xtime_lock, seq)); - a = avenrun_rt[0] + (FIXED_1/200); - b = avenrun_rt[1] + (FIXED_1/200); - c = avenrun_rt[2] + (FIXED_1/200); len = sprintf(page,"%d.%02d %d.%02d %d.%02d %ld/%d %d\n", LOAD_INT(a), LOAD_FRAC(a), LOAD_INT(b), LOAD_FRAC(b), Index: linux-2.6.24.7/kernel/timer.c =================================================================== --- linux-2.6.24.7.orig/kernel/timer.c +++ linux-2.6.24.7/kernel/timer.c @@ -38,6 +38,7 @@ #include <linux/delay.h> #include <linux/tick.h> #include <linux/kallsyms.h> +#include <linux/kthread.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -929,7 +930,7 @@ void update_process_times(int user_tick) static unsigned long count_active_tasks(void) { /* - * On PREEMPT_RT, we are running in the timer softirq thread, + * On PREEMPT_RT, we are running in the loadavg thread, * so consider 1 less running tasks: */ #ifdef CONFIG_PREEMPT_RT @@ -999,6 +1000,50 @@ static inline void calc_load(unsigned lo } } +#ifdef CONFIG_PREEMPT_RT +static int loadavg_calculator(void *data) +{ + unsigned long now, last; + + last = jiffies; + while (!kthread_should_stop()) { + struct timespec delay = { + .tv_sec = LOAD_FREQ / HZ, + .tv_nsec = 0 + }; + + hrtimer_nanosleep(&delay, NULL, HRTIMER_MODE_REL, + CLOCK_MONOTONIC); + now = jiffies; + write_seqlock_irq(&xtime_lock); + calc_load(now - last); + write_sequnlock_irq(&xtime_lock); + last = now; + } + + return 0; +} + +static int __init start_loadavg_calculator(void) +{ + struct task_struct *p; + struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2 }; + + p = kthread_create(loadavg_calculator, NULL, "loadavg"); + if (IS_ERR(p)) { + printk(KERN_ERR "Could not create the loadavg thread.\n"); + return 1; + } + + sched_setscheduler(p, SCHED_FIFO, ¶m); + wake_up_process(p); + + return 0; +} + +late_initcall(start_loadavg_calculator); +#endif + /* * Called by the local, per-CPU timer interrupt on SMP. 
*/ @@ -1027,7 +1072,9 @@ static inline void update_times(void) ticks = jiffies - last_tick; if (ticks) { last_tick += ticks; +#ifndef CONFIG_PREEMPT_RT calc_load(ticks); +#endif } write_sequnlock_irqrestore(&xtime_lock, flags); } ���������������������������������������������������������������������������������������������������������������������patches/raw-spinlocks-for-nmi-print.patch�����������������������������������������������������������0000664�0000764�0000764�00000002356�11041673075�017563� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������change spinlock to be raw spinlock when serializing prints from NMI From: Clark Williams <williams@redhat.com> --- arch/x86/kernel/nmi_32.c | 2 +- arch/x86/kernel/nmi_64.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/nmi_32.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_32.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_32.c @@ -401,7 +401,7 @@ nmi_watchdog_tick(struct pt_regs * regs, } if (cpu_isset(cpu, backtrace_mask)) { - static DEFINE_SPINLOCK(lock); /* Serialise the printks */ + static DEFINE_RAW_SPINLOCK(lock); /* Serialise the printks */ spin_lock(&lock); printk("NMI backtrace for cpu %d\n", cpu); Index: linux-2.6.24.7/arch/x86/kernel/nmi_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/nmi_64.c +++ linux-2.6.24.7/arch/x86/kernel/nmi_64.c @@ -393,7 +393,7 @@ nmi_watchdog_tick(struct pt_regs * regs, } if (cpu_isset(cpu, backtrace_mask)) { - static DEFINE_SPINLOCK(lock); /* Serialise the printks */ + static DEFINE_RAW_SPINLOCK(lock); /* Serialise the printks */ spin_lock(&lock); printk("NMI backtrace for cpu %d\n", cpu); ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/fix-a-previously-reverted-fix.patch���������������������������������������������������������0000664�0000764�0000764�00000002304�11043075114�020076� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: Fix a previously reverted "fix" From: Chirag Jog <chirag@linux.vnet.ibm.com> Date: Thu, 10 Jul 2008 22:34:56 +0530 This patch reintroduces a "fix" that got reverted. 
Here was the original patch http://lkml.org/lkml/2007/5/22/133 Here is the new patch This patch also fixes OOPS reported here: http://lkml.org/lkml/2008/6/19/146 >From tsutomu.owa@toshiba.co.jp Signed-Off-By: Chirag <chirag@linux.vnet.ibm.com> --- arch/powerpc/kernel/entry_64.S | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/entry_64.S =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/entry_64.S +++ linux-2.6.24.7/arch/powerpc/kernel/entry_64.S @@ -580,14 +580,9 @@ do_work: cmpdi r0,0 crandc eq,cr1*4+eq,eq bne restore - /* here we are preempting the current task */ 1: - li r0,1 - stb r0,PACASOFTIRQEN(r13) - stb r0,PACAHARDIRQEN(r13) - ori r10,r10,MSR_EE - mtmsrd r10,1 /* reenable interrupts */ - bl .preempt_schedule + /* preempt_schedule_irq() expects interrupts disabled. */ + bl .preempt_schedule_irq mfmsr r10 clrrdi r9,r1,THREAD_SHIFT rldicl r10,r10,48,1 /* disable interrupts again */ ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/powerpc-xics-move-the-call-to-irq-radix-revmap-from-xics-startup-to-xics-host-map.patch�����0000664�0000764�0000764�00000005616�11041675522�031674� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: powerpc - XICS: move the call to irq_radix_revmap from xics_startup to xics_host_map From: Sebastien Dugue <sebastien.dugue@bull.net> Date: Wed, 23 Jul 2008 17:00:24 +0200 From: Sebastien Dugue <sebastien.dugue@bull.net> Date: Tue, 22 Jul 2008 13:05:24 +0200 Subject: [PATCH][RT] powerpc - XICS: move the call to irq_radix_revmap from xics_startup to xics_host_map This patch moves the insertion of an irq into the reverse mapping radix tree from xics_startup() into xics_host_map(). The reason for this change is that xics_startup() is called with preemption disabled (which is not the case for xics_host_map()) which is a problem under a preempt-rt kernel as we cannot even allocate GFP_ATOMIC memory for the radix tree nodes. 
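Put differently: inserting into the reverse-mapping radix tree may have to allocate tree nodes, and on a preempt-rt kernel that allocation may sleep, so it has to happen from the host ->map() hook (preemptible) rather than from ->startup() (preemption disabled). A condensed sketch of the resulting split; the example_* names are hypothetical stand-ins for the xics_startup() and xics_host_map_*() functions changed below, and the chip/handler setup is omitted:

/* ->startup() is called with preemption disabled: keep it allocation-free */
static unsigned int example_startup(unsigned int virq)
{
        xics_unmask_irq(virq);          /* just unmask, no radix tree work */
        return 0;
}

/* ->map() runs preemptible: safe place to prime the reverse-map cache,
 * even if the radix tree insert has to allocate (and thus sleep on -rt) */
static int example_host_map(struct irq_host *h, unsigned int virq,
                            irq_hw_number_t hw)
{
        unsigned int irq = (unsigned int)irq_map[virq].hwirq;

        irq_radix_revmap(xics_host, irq);
        get_irq_desc(virq)->status |= IRQ_LEVEL;
        return 0;
}

Nothing changes for the interrupt fast path; only the context in which the cache entry is inserted moves.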
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Cc: Tim Chavez <tinytim@us.ibm.com> Cc: Jean Pierre Dion <jean-pierre.dion@bull.net> Cc: linuxppc-dev@ozlabs.org Cc: paulus@samba.org Cc: Gilles Carry <Gilles.Carry@ext.bull.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <michael@ellerman.id.au> --- arch/powerpc/platforms/pseries/xics.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) Index: linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/platforms/pseries/xics.c +++ linux-2.6.24.7/arch/powerpc/platforms/pseries/xics.c @@ -262,12 +262,6 @@ static void xics_mask_irq(unsigned int v static unsigned int xics_startup(unsigned int virq) { - unsigned int irq; - - /* force a reverse mapping of the interrupt so it gets in the cache */ - irq = (unsigned int)irq_map[virq].hwirq; - irq_radix_revmap(xics_host, irq); - /* unmask it */ xics_unmask_irq(virq); return 0; @@ -488,8 +482,14 @@ static int xics_host_match(struct irq_ho static int xics_host_map_direct(struct irq_host *h, unsigned int virq, irq_hw_number_t hw) { + unsigned int irq; + pr_debug("xics: map_direct virq %d, hwirq 0x%lx\n", virq, hw); + /* force a reverse mapping of the interrupt so it gets in the cache */ + irq = (unsigned int)irq_map[virq].hwirq; + irq_radix_revmap(xics_host, irq); + get_irq_desc(virq)->status |= IRQ_LEVEL; set_irq_chip_and_handler(virq, &xics_pic_direct, handle_fasteoi_irq); return 0; @@ -498,8 +498,14 @@ static int xics_host_map_direct(struct i static int xics_host_map_lpar(struct irq_host *h, unsigned int virq, irq_hw_number_t hw) { + unsigned int irq; + pr_debug("xics: map_direct virq %d, hwirq 0x%lx\n", virq, hw); + /* force a reverse mapping of the interrupt so it gets in the cache */ + irq = (unsigned int)irq_map[virq].hwirq; + irq_radix_revmap(xics_host, irq); + get_irq_desc(virq)->status |= IRQ_LEVEL; set_irq_chip_and_handler(virq, &xics_pic_lpar, handle_fasteoi_irq); return 0; ������������������������������������������������������������������������������������������������������������������patches/powerpc-make-the-irq-reverse-mapping-radix-tree-lockless.patch������������������������������0000664�0000764�0000764�00000014262�11041675522�025210� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: powerpc - Make the irq reverse mapping radix tree lockless From: Sebastien Dugue <sebastien.dugue@bull.net> Date: Wed, 23 Jul 2008 17:01:02 +0200 From: Sebastien Dugue <sebastien.dugue@bull.net> Date: Tue, 22 Jul 2008 11:56:41 +0200 Subject: [PATCH][RT] powerpc - Make the irq reverse mapping radix tree lockless The radix tree used by interrupt controllers for their irq reverse mapping (currently only the XICS found on pSeries) have a complex locking scheme dating back to before the advent of the concurrent radix tree on preempt-rt. Take advantage of this and of the fact that the items of the tree are pointers to a static array (irq_map) elements which can never go under us to simplify the locking. 
Concurrency between readers and writers are handled by the intrinsic properties of the concurrent radix tree. Concurrency between the tree initialization which is done asynchronously with readers and writers access is handled via an atomic variable (revmap_trees_allocated) set when the tree has been initialized and checked before any reader or writer access just like we used to check for tree.gfp_mask != 0 before. Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Cc: Tim Chavez <tinytim@us.ibm.com> Cc: Jean Pierre Dion <jean-pierre.dion@bull.net> Cc: linuxppc-dev@ozlabs.org Cc: paulus@samba.org Cc: Gilles Carry <Gilles.Carry@ext.bull.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> --- arch/powerpc/kernel/irq.c | 102 ++++++++++++---------------------------------- 1 file changed, 27 insertions(+), 75 deletions(-) Index: linux-2.6.24.7/arch/powerpc/kernel/irq.c =================================================================== --- linux-2.6.24.7.orig/arch/powerpc/kernel/irq.c +++ linux-2.6.24.7/arch/powerpc/kernel/irq.c @@ -404,8 +404,7 @@ void do_softirq(void) static LIST_HEAD(irq_hosts); static DEFINE_RAW_SPINLOCK(irq_big_lock); -static DEFINE_PER_CPU(unsigned int, irq_radix_reader); -static unsigned int irq_radix_writer; +static atomic_t revmap_trees_allocated = ATOMIC_INIT(0); struct irq_map_entry irq_map[NR_IRQS]; static unsigned int irq_virq_count = NR_IRQS; static struct irq_host *irq_default_host; @@ -548,57 +547,6 @@ void irq_set_virq_count(unsigned int cou irq_virq_count = count; } -/* radix tree not lockless safe ! we use a brlock-type mecanism - * for now, until we can use a lockless radix tree - */ -static void irq_radix_wrlock(unsigned long *flags) -{ - unsigned int cpu, ok; - - spin_lock_irqsave(&irq_big_lock, *flags); - irq_radix_writer = 1; - smp_mb(); - do { - barrier(); - ok = 1; - for_each_possible_cpu(cpu) { - if (per_cpu(irq_radix_reader, cpu)) { - ok = 0; - break; - } - } - if (!ok) - cpu_relax(); - } while(!ok); -} - -static void irq_radix_wrunlock(unsigned long flags) -{ - smp_wmb(); - irq_radix_writer = 0; - spin_unlock_irqrestore(&irq_big_lock, flags); -} - -static void irq_radix_rdlock(unsigned long *flags) -{ - local_irq_save(*flags); - __get_cpu_var(irq_radix_reader) = 1; - smp_mb(); - if (likely(irq_radix_writer == 0)) - return; - __get_cpu_var(irq_radix_reader) = 0; - smp_wmb(); - spin_lock(&irq_big_lock); - __get_cpu_var(irq_radix_reader) = 1; - spin_unlock(&irq_big_lock); -} - -static void irq_radix_rdunlock(unsigned long flags) -{ - __get_cpu_var(irq_radix_reader) = 0; - local_irq_restore(flags); -} - static int irq_setup_virq(struct irq_host *host, unsigned int virq, irq_hw_number_t hwirq) { @@ -753,7 +701,6 @@ void irq_dispose_mapping(unsigned int vi { struct irq_host *host; irq_hw_number_t hwirq; - unsigned long flags; if (virq == NO_IRQ) return; @@ -785,15 +732,20 @@ void irq_dispose_mapping(unsigned int vi if (hwirq < host->revmap_data.linear.size) host->revmap_data.linear.revmap[hwirq] = NO_IRQ; break; - case IRQ_HOST_MAP_TREE: + case IRQ_HOST_MAP_TREE: { + DEFINE_RADIX_TREE_CONTEXT(ctx, &host->revmap_data.tree); + /* Check if radix tree allocated yet */ - if (host->revmap_data.tree.gfp_mask == 0) + if (atomic_read(&revmap_trees_allocated) == 0) break; - irq_radix_wrlock(&flags); - radix_tree_delete(&host->revmap_data.tree, hwirq); - irq_radix_wrunlock(flags); + + radix_tree_lock(&ctx); + radix_tree_delete(ctx.tree, hwirq); + 
radix_tree_unlock(&ctx); + break; } + } /* Destroy map */ smp_mb(); @@ -846,22 +798,20 @@ unsigned int irq_radix_revmap(struct irq struct radix_tree_root *tree; struct irq_map_entry *ptr; unsigned int virq; - unsigned long flags; WARN_ON(host->revmap_type != IRQ_HOST_MAP_TREE); - /* Check if the radix tree exist yet. We test the value of - * the gfp_mask for that. Sneaky but saves another int in the - * structure. If not, we fallback to slow mode - */ - tree = &host->revmap_data.tree; - if (tree->gfp_mask == 0) + /* Check if the radix tree exist yet. */ + if (atomic_read(&revmap_trees_allocated) == 0) return irq_find_mapping(host, hwirq); - /* Now try to resolve */ - irq_radix_rdlock(&flags); + /* + * Now try to resolve + * No rcu_read_lock(ing) needed, the ptr returned can't go under us + * as it's referencing an entry in the static irq_map table. + */ + tree = &host->revmap_data.tree; ptr = radix_tree_lookup(tree, hwirq); - irq_radix_rdunlock(flags); /* Found it, return */ if (ptr) { @@ -872,9 +822,10 @@ unsigned int irq_radix_revmap(struct irq /* If not there, try to insert it */ virq = irq_find_mapping(host, hwirq); if (virq != NO_IRQ) { - irq_radix_wrlock(&flags); - radix_tree_insert(tree, hwirq, &irq_map[virq]); - irq_radix_wrunlock(flags); + DEFINE_RADIX_TREE_CONTEXT(ctx, tree); + radix_tree_lock(&ctx); + radix_tree_insert(ctx.tree, hwirq, &irq_map[virq]); + radix_tree_unlock(&ctx); } return virq; } @@ -985,14 +936,15 @@ void irq_early_init(void) static int irq_late_init(void) { struct irq_host *h; - unsigned long flags; - irq_radix_wrlock(&flags); list_for_each_entry(h, &irq_hosts, link) { if (h->revmap_type == IRQ_HOST_MAP_TREE) INIT_RADIX_TREE(&h->revmap_data.tree, GFP_ATOMIC); } - irq_radix_wrunlock(flags); + + /* Make sure the radix trees inits are visible before setting the flag */ + smp_mb(); + atomic_set(&revmap_trees_allocated, 1); return 0; } ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/trace-do-not-wakeup-when-irqs-disabled.patch������������������������������������������������0000664�0000764�0000764�00000002017�11041724723�021525� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: trace-do-not-wakeup-when-irqs-disabled.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Wed, 23 Jul 2008 23:52:47 +0200 When PREEMPT_RT is enabled then the wakeup code (including the tracer) can be called with interrupts disabled which triggers the might sleep check in rt_spin_lock_fastlock(). Do not call wakeup when interrupts are disabled in the PREEMPT_RT case. 
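The underlying constraint, in short: on PREEMPT_RT a wait queue's internal lock is a sleeping (rtmutex based) spinlock, so wake_up() must not be called from a context that cannot sleep, such as an interrupts-off region. A minimal sketch of the guard pattern (hypothetical example_* names; the real change to trace_wake_up() is in the hunk below):

#include <linux/wait.h>
#include <linux/irqflags.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wait);

static void example_wake_readers(void)
{
#ifdef CONFIG_PREEMPT_RT
        /* wake_up() takes example_wait.lock, which can sleep on -rt:
         * skip the wakeup while interrupts are disabled */
        if (irqs_disabled())
                return;
#endif
        wake_up(&example_wait);
}

Skipping the wakeup is harmless compared to tripping the might-sleep check; a later call made with interrupts enabled wakes the waiters.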
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/trace/trace.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -151,7 +151,10 @@ void trace_wake_up(void) * have for now: */ if (!(trace_flags & TRACE_ITER_BLOCK) && !runqueue_is_locked()) - wake_up(&trace_wait); +#ifdef CONFIG_PREEMPT_RT + if (!irqs_disabled()) +#endif + wake_up(&trace_wait); } #define ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(struct trace_entry)) �����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/acpi-fix-enter-c1.patch���������������������������������������������������������������������0000664�0000764�0000764�00000003232�11042036645�015373� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: acpi-fix-enter-c1.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Thu, 24 Jul 2008 01:13:43 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- drivers/acpi/processor_idle.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6.24.7/drivers/acpi/processor_idle.c =================================================================== --- linux-2.6.24.7.orig/drivers/acpi/processor_idle.c +++ linux-2.6.24.7/drivers/acpi/processor_idle.c @@ -209,7 +209,7 @@ static void acpi_safe_halt(void) * test NEED_RESCHED: */ smp_mb(); - if (!need_resched() || !need_resched_delayed()) + if (!need_resched() && !need_resched_delayed()) safe_halt(); current_thread_info()->status |= TS_POLLING; } @@ -382,7 +382,7 @@ static void acpi_processor_idle(void) * Check whether we truly need to go idle, or should * reschedule: */ - if (unlikely(need_resched())) { + if (need_resched() || need_resched_delayed()) { local_irq_enable(); return; } @@ -472,7 +472,7 @@ static void acpi_processor_idle(void) * test NEED_RESCHED: */ smp_mb(); - if (need_resched()) { + if (need_resched() || need_resched_delayed()) { current_thread_info()->status |= TS_POLLING; local_irq_enable(); return; @@ -1378,6 +1378,19 @@ static int acpi_idle_enter_c1(struct cpu if (unlikely(!pr)) return 0; + local_irq_disable(); + + /* Do not access any ACPI IO ports in suspend path */ + if (acpi_idle_suspend) { + acpi_safe_halt(); + return 0; + } + + if (need_resched() || need_resched_delayed()) { + local_irq_enable(); + return 0; + } + if (pr->flags.bm_check) acpi_idle_update_bm_rld(pr, cx); 
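A note on acpi-fix-enter-c1.patch just above, since it carries no changelog: in the first hunk the old test '!need_resched() || !need_resched_delayed()' is true whenever at least one of the two flags is clear, so the CPU could halt even though a reschedule was already pending. The intended condition is "no reschedule of either flavour pending", which after De Morgan is the '&&' form the patch installs. Restated for clarity only (this is the patched acpi_safe_halt() logic, not an addition):

        current_thread_info()->status &= ~TS_POLLING;
        smp_mb();       /* TS_POLLING must be visibly clear before testing NEED_RESCHED */
        if (!need_resched() && !need_resched_delayed()) /* neither flavour pending */
                safe_halt();
        current_thread_info()->status |= TS_POLLING;

The remaining hunks add need_resched_delayed() to the positive checks, so a delayed reschedule request also keeps the CPU from entering the idle state, and acpi_idle_enter_c1() now disables interrupts and bails out early in the suspend and resched-pending cases.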
����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/hotplug-smp-boot-fix.patch������������������������������������������������������������������0000664�0000764�0000764�00000003117�11042036666�016270� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/head64.c | 1 + arch/x86/kernel/smpboot_64.c | 2 +- include/asm-x86/proto.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.24.7/arch/x86/kernel/head64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/head64.c +++ linux-2.6.24.7/arch/x86/kernel/head64.c @@ -70,6 +70,7 @@ void __init x86_64_start_kernel(char * r cpu_pda(i) = &boot_cpu_pda[i]; pda_init(0); + allocate_stacks(0); copy_bootdata(__va(real_mode_data)); #ifdef CONFIG_SMP cpu_set(0, cpu_online_map); Index: linux-2.6.24.7/arch/x86/kernel/smpboot_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smpboot_64.c +++ linux-2.6.24.7/arch/x86/kernel/smpboot_64.c @@ -538,7 +538,7 @@ static void __cpuinit do_fork_idle(struc static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ] __attribute__((section(".bss.page_aligned"))); -static int __cpuinit allocate_stacks(int cpu) +int __cpuinit allocate_stacks(int cpu) { static const unsigned int order[N_EXCEPTION_STACKS] = { [0 ... 
N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, Index: linux-2.6.24.7/include/asm-x86/proto.h =================================================================== --- linux-2.6.24.7.orig/include/asm-x86/proto.h +++ linux-2.6.24.7/include/asm-x86/proto.h @@ -10,6 +10,7 @@ struct pt_regs; extern void start_kernel(void); extern void pda_init(int); +extern int allocate_stacks(int cpu); extern void early_idt_handler(void); �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/cpu-hotplug-fix-fix-fix.patch���������������������������������������������������������������0000664�0000764�0000764�00000010503�11043075113�016653� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������--- arch/x86/kernel/head64.c | 1 arch/x86/kernel/setup64.c | 54 +++++++++++++++++++++++++++++++++++++++++++ arch/x86/kernel/smpboot_64.c | 54 ------------------------------------------- arch/x86/kernel/traps_64.c | 1 4 files changed, 55 insertions(+), 55 deletions(-) Index: linux-2.6.24.7/arch/x86/kernel/head64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/head64.c +++ linux-2.6.24.7/arch/x86/kernel/head64.c @@ -70,7 +70,6 @@ void __init x86_64_start_kernel(char * r cpu_pda(i) = &boot_cpu_pda[i]; pda_init(0); - allocate_stacks(0); copy_bootdata(__va(real_mode_data)); #ifdef CONFIG_SMP cpu_set(0, cpu_online_map); Index: linux-2.6.24.7/arch/x86/kernel/setup64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/setup64.c +++ linux-2.6.24.7/arch/x86/kernel/setup64.c @@ -143,6 +143,60 @@ void pda_init(int cpu) pda->irqstackptr += IRQSTACKSIZE-64; } +static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ] +__attribute__((section(".bss.page_aligned"))); + +int __cpuinit allocate_stacks(int cpu) +{ + static const unsigned int order[N_EXCEPTION_STACKS] = { + [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, +#if DEBUG_STACK > 0 + [DEBUG_STACK - 1] = DEBUG_STACK_ORDER +#endif + }; + struct tss_struct *t = &per_cpu(init_tss, cpu); + int node = cpu_to_node(cpu); + struct page *page; + char *estack; + int v; + + if (cpu && !t->irqstack) { + page = alloc_pages_node(node, GFP_KERNEL, + IRQSTACK_ORDER); + if (!page) + goto fail_oom; + t->irqstack = page_address(page); + } + + if (!cpu) + estack = boot_exception_stacks; + + for (v = 0; v < N_EXCEPTION_STACKS; v++) { + if (t->estacks[v]) + continue; + + if (cpu) { + page = alloc_pages_node(node, GFP_KERNEL, order[v]); + if (!page) + goto fail_oom; + estack = page_address(page); + } + estack += PAGE_SIZE << order[v]; + /* + * XXX: can we set t->isr[v] here directly, or will that be + * modified later? - the existance of orig_ist seems to suggest + * it _can_ be modified, which would imply we'd need to reset + * it. 
+ */ + t->estacks[v] = estack; + } + + return 0; + +fail_oom: + return -ENOMEM; +} + extern asmlinkage void ignore_sysret(void); /* May not be marked __init: used by software suspend */ Index: linux-2.6.24.7/arch/x86/kernel/smpboot_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/smpboot_64.c +++ linux-2.6.24.7/arch/x86/kernel/smpboot_64.c @@ -535,60 +535,6 @@ static void __cpuinit do_fork_idle(struc complete(&c_idle->done); } -static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ] -__attribute__((section(".bss.page_aligned"))); - -int __cpuinit allocate_stacks(int cpu) -{ - static const unsigned int order[N_EXCEPTION_STACKS] = { - [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER, -#if DEBUG_STACK > 0 - [DEBUG_STACK - 1] = DEBUG_STACK_ORDER -#endif - }; - struct tss_struct *t = &per_cpu(init_tss, cpu); - int node = cpu_to_node(cpu); - struct page *page; - char *estack; - int v; - - if (cpu && !t->irqstack) { - page = alloc_pages_node(node, GFP_KERNEL, - IRQSTACK_ORDER); - if (!page) - goto fail_oom; - t->irqstack = page_address(page); - } - - if (!cpu) - estack = boot_exception_stacks; - - for (v = 0; v < N_EXCEPTION_STACKS; v++) { - if (t->estacks[v]) - continue; - - if (cpu) { - page = alloc_pages_node(node, GFP_KERNEL, order[v]); - if (!page) - goto fail_oom; - estack = page_address(page); - } - estack += PAGE_SIZE << order[v]; - /* - * XXX: can we set t->isr[v] here directly, or will that be - * modified later? - the existance of orig_ist seems to suggest - * it _can_ be modified, which would imply we'd need to reset - * it. - */ - t->estacks[v] = estack; - } - - return 0; - -fail_oom: - return -ENOMEM; -} - /* * Boot one CPU. */ Index: linux-2.6.24.7/arch/x86/kernel/traps_64.c =================================================================== --- linux-2.6.24.7.orig/arch/x86/kernel/traps_64.c +++ linux-2.6.24.7/arch/x86/kernel/traps_64.c @@ -1138,6 +1138,7 @@ void __init trap_init(void) /* * Should be a barrier for any external CPU state. */ + allocate_stacks(0); cpu_init(); } ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/sched-fix-dequeued-race.patch���������������������������������������������������������������0000664�0000764�0000764�00000001316�11042113616�016633� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: sched-fix-dequeued-race.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Thu, 24 Jul 2008 16:14:31 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/kernel/sched.c =================================================================== --- linux-2.6.24.7.orig/kernel/sched.c +++ linux-2.6.24.7/kernel/sched.c @@ -3769,7 +3769,7 @@ void scheduler_tick(void) rq->clock = next_tick; rq->tick_timestamp = rq->clock; update_cpu_load(rq); - if (curr != rq->idle) /* FIXME: needed? 
*/ + if (curr != rq->idle && curr->se.on_rq) curr->sched_class->task_tick(rq, curr); spin_unlock(&rq->lock); ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/x86-64-fix-compile.patch��������������������������������������������������������������������0000664�0000764�0000764�00000001162�11043075113�015337� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: x86-64-fix-compile.patch From: Thomas Gleixner <tglx@linutronix.de> Date: Fri, 25 Jul 2008 15:54:04 +0200 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/trace/trace.h | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6.24.7/kernel/trace/trace.h =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.h +++ linux-2.6.24.7/kernel/trace/trace.h @@ -7,6 +7,10 @@ #include <linux/clocksource.h> #include <linux/mmiotrace.h> +#ifdef CONFIG_X86_64 +#include <asm/asm-offsets.h> +#endif + enum trace_type { __TRACE_FIRST_TYPE = 0, ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/trace-ktime-scalar.patch��������������������������������������������������������������������0000664�0000764�0000764�00000010051�11043075113�015716� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: ftrace: print ktime values in readable form From: Thomas Gleixner <tglx@linutronix.de> Date: Sat, 26 Jul 2008 23:01:37 +0200 Printing the tv64 member of the ktime_t expiry/timestamp is unreadable on 32bit machines which don't have KTIME_SCALAR set. Convert the ktime_t value to a timespec instead and print sec.nsec value, which makes the time values much easier to read also on those machines which use the 64bit scalar nsec storage. 
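Background on why .tv64 is unreadable there: ktime_t is a union, and on 32bit without CONFIG_KTIME_SCALAR the same 64 bits hold a sec/nsec pair rather than a plain nanosecond count, so printing .tv64 runs the two fields together into one meaningless number. Roughly, condensed from include/linux/ktime.h (shown for reference only):

union ktime {
        s64     tv64;                   /* plain nanoseconds on 64bit and KTIME_SCALAR builds */
#if BITS_PER_LONG != 64 && !defined(CONFIG_KTIME_SCALAR)
        struct {
# ifdef __BIG_ENDIAN
                s32     sec, nsec;
# else
                s32     nsec, sec;
# endif
        } tv;                           /* sec/nsec pair sharing the same storage */
#endif
};

Going through ktime_to_timespec() hides that difference, which is why the new trace_print_ktime() helper converts to a timespec and prints sec.nsec on every configuration.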
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/trace/trace.c | 47 +++++++++++++++++++++++++++++++---------------- 1 file changed, 31 insertions(+), 16 deletions(-) Index: linux-2.6.24.7/kernel/trace/trace.c =================================================================== --- linux-2.6.24.7.orig/kernel/trace/trace.c +++ linux-2.6.24.7/kernel/trace/trace.c @@ -1633,6 +1633,13 @@ extern unsigned long ia32_sys_call_table # define IA32_NR_syscalls (ia32_syscall_end - ia32_sys_call_table) #endif +static void trace_print_ktime(struct trace_seq *s, ktime_t t) +{ + struct timespec ts = ktime_to_timespec(t); + + trace_seq_printf(s, " (%ld.%09ld)", ts.tv_sec, ts.tv_nsec); +} + static int print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu) { @@ -1728,23 +1735,23 @@ print_lat_fmt(struct trace_iterator *ite break; case TRACE_TIMER_SET: seq_print_ip_sym(s, entry->timer.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%p)\n", - entry->timer.expire, entry->timer.timer); + trace_print_ktime(s, entry->timer.expire); + trace_seq_printf(s, " (%p)\n", entry->timer.timer); break; case TRACE_TIMER_TRIG: seq_print_ip_sym(s, entry->timer.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%p)\n", - entry->timer.expire, entry->timer.timer); + trace_print_ktime(s, entry->timer.expire); + trace_seq_printf(s, " (%p)\n", entry->timer.timer); break; case TRACE_TIMESTAMP: seq_print_ip_sym(s, entry->timestamp.ip, sym_flags); - trace_seq_printf(s, " (%Ld)\n", - entry->timestamp.now.tv64); + trace_print_ktime(s, entry->timestamp.now); + trace_seq_puts(s, "\n"); break; case TRACE_PROGRAM_EVENT: seq_print_ip_sym(s, entry->program.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%Ld)\n", - entry->program.expire, entry->program.delta); + trace_print_ktime(s, entry->program.expire); + trace_seq_printf(s, " (%Ld)\n", entry->program.delta); break; case TRACE_TASK_ACT: seq_print_ip_sym(s, entry->task.ip, sym_flags); @@ -1822,6 +1829,14 @@ static int print_trace_fmt(struct trace_ ret = trace_seq_printf(s, "[%02d] ", iter->cpu); if (!ret) return 0; + + ret = trace_seq_printf(s, "%c%c %2d ", + (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : '.', + ((entry->flags & TRACE_FLAG_NEED_RESCHED) ? 
'N' : '.'), + entry->preempt_count); + if (!ret) + return 0; + ret = trace_seq_printf(s, "%5lu.%06lu: ", secs, usec_rem); if (!ret) return 0; @@ -1908,23 +1923,23 @@ static int print_trace_fmt(struct trace_ break; case TRACE_TIMER_SET: seq_print_ip_sym(s, entry->timer.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%p)\n", - entry->timer.expire, entry->timer.timer); + trace_print_ktime(s, entry->timer.expire); + trace_seq_printf(s, " (%p)\n", entry->timer.timer); break; case TRACE_TIMER_TRIG: seq_print_ip_sym(s, entry->timer.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%p)\n", - entry->timer.expire, entry->timer.timer); + trace_print_ktime(s, entry->timer.expire); + trace_seq_printf(s, " (%p)\n", entry->timer.timer); break; case TRACE_TIMESTAMP: seq_print_ip_sym(s, entry->timestamp.ip, sym_flags); - trace_seq_printf(s, " (%Ld)\n", - entry->timestamp.now.tv64); + trace_print_ktime(s, entry->timestamp.now); + trace_seq_puts(s, "\n"); break; case TRACE_PROGRAM_EVENT: seq_print_ip_sym(s, entry->program.ip, sym_flags); - trace_seq_printf(s, " (%Ld) (%Ld)\n", - entry->program.expire, entry->program.delta); + trace_print_ktime(s, entry->program.expire); + trace_seq_printf(s, " (%Ld)\n", entry->program.delta); break; case TRACE_TASK_ACT: seq_print_ip_sym(s, entry->task.ip, sym_flags); ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/nfs-stats-miss-preemption.patch�������������������������������������������������������������0000664�0000764�0000764�00000002343�11043075113�017326� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: nfs: fix missing preemption check From: Thomas Gleixner <tglx@linutronix.de> Date: Sun, 27 Jul 2008 00:54:19 +0200 NFS iostats use get_cpu()/put_cpu_no_preempt(). That misses a preemption check for no good reason and introduces long latencies when a wakeup of a higher priority task happens in the preempt disabled region. 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- fs/nfs/iostat.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6.24.7/fs/nfs/iostat.h =================================================================== --- linux-2.6.24.7.orig/fs/nfs/iostat.h +++ linux-2.6.24.7/fs/nfs/iostat.h @@ -125,7 +125,7 @@ static inline void nfs_inc_server_stats( cpu = get_cpu(); iostats = per_cpu_ptr(server->io_stats, cpu); iostats->events[stat] ++; - put_cpu_no_resched(); + put_cpu(); } static inline void nfs_inc_stats(struct inode *inode, enum nfs_stat_eventcounters stat) @@ -141,7 +141,7 @@ static inline void nfs_add_server_stats( cpu = get_cpu(); iostats = per_cpu_ptr(server->io_stats, cpu); iostats->bytes[stat] += addend; - put_cpu_no_resched(); + put_cpu(); } static inline void nfs_add_stats(struct inode *inode, enum nfs_stat_bytecounters stat, unsigned long addend) ���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������patches/version.patch�������������������������������������������������������������������������������0000664�0000764�0000764�00000001172�11043606220�013736� 0����������������������������������������������������������������������������������������������������ustar �tglx����������������������������tglx�������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������Subject: add -rt extra-version From: Ingo Molnar <mingo@elte.hu> add -rt extra-version. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> --- Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.24.7/Makefile =================================================================== --- linux-2.6.24.7.orig/Makefile +++ linux-2.6.24.7/Makefile @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 24 -EXTRAVERSION = .7 +EXTRAVERSION = .7-rt17 NAME = Err Metey! A Heury Beelge-a Ret! 
# *DOCUMENTATION*