nvidia-drivers + edge-triggered + noapic
by HidekiAI on Sep.14, 2008, under Technology Opinions
It’s interesting how much I learn when problems arise. That’s almost always how we educate ourselves: we run into an obstacle. It’s not always a welcome learning experience, but it was educational and I’m glad I encountered this issue (though hopefully I never have to deal with it for more than 30 minutes next time).
Just a week ago, I decided to upgrade to the latest kernel, v2.6.26-r1 (Gentoo). Usually I just carry over the old config (“make oldconfig”), run “make menuconfig” to verify that what is most critical to me is still configured the way I need, and then “make all” (followed by a reboot and a rebuild of all the critical modules and applications that need rebuilding, such as nvidia-drivers, openvpn, wine, etc).
I used to have alsa built as a separate module rather than into the kernel, loading the module only when needed, but I had some incompatibility issues back in 2.6.25 or .24, so I stopped doing that.
I love nvidia. Despite complaints from some that the drivers are closed source, what counts to me and to most users is the end result. As long as it works (you get glx, twinview (multi-monitor), opengl, etc), that’s all that matters. And they (Nvidia) have been quite good and reliable with it.
On my 2.6.25 and earlier kernels, out of ignorance of how things work, I had nvidiafb (the kernel framebuffer driver for Nvidia cards) compiled as a module, and I never had any collisions or problems with my nvidia-drivers.
In any case, after rebuilding my nvidia-drivers against the latest 2.6.26-r1 Gentoo kernel, my X (xdm, kdm, startx, you name the method of starting it up, I’ve tried it) stopped working. I’ve become somewhat spoiled by having Thunderbird as my mail client, as well as by my recent discovery of Kdevelop, which is a kick-ass IDE (way better than Eclipse; since I’m spoiled by Visual Studio, Kdevelop “feels like” Visual Studio even in debugging mode), and I had to have my X server!
When I peeked into /var/log/kdm.log (when I started via xdm -> kdm) or into /var/log/Xorg.0.log, both indicated that the nvidia driver was having issues with an edge-triggered interrupt and that I should look at Chapter 8 for more details on solving this. Wonderful… Why the sudden complaints, I don’t know…
So here’s the section from Chapter 8 from Nvidia:
My X server fails to start, and my X log file contains the error:
(EE) NVIDIA(0): The interrupt for NVIDIA graphics device PCI:x:x:x
(EE) NVIDIA(0): appears to be edge-triggered. Please see the COMMON
(EE) NVIDIA(0): PROBLEMS section in the README for additional information.
An edge-triggered interrupt means that the kernel has programmed the interrupt as edge-triggered rather than level-triggered in the Advanced Programmable Interrupt Controller (APIC). Edge-triggered interrupts are not intended to be used for sharing an interrupt line between multiple devices; level-triggered interrupts are the intended trigger for such usage. When using edge-triggered interrupts, it is common for device drivers using that interrupt line to stop receiving interrupts. This would appear to the end user as those devices no longer working, and potentially as a full system hang. These problems tend to be more common when multiple devices are sharing that interrupt line.
This occurs when ACPI is not used to program interrupt routing in the APIC. This often occurs on 2.4 Linux kernels, which do not fully support ACPI, or 2.6 kernels when ACPI is disabled or fails to initialize. In these cases, the Linux kernel falls back to tables provided by the system BIOS. In some cases the system BIOS assumes ACPI will be used for routing interrupts and configures these tables to incorrectly label all interrupts as edge-triggered. The current interrupt configuration can be found in /proc/interrupts.
Available workarounds include: updating to a newer system BIOS, trying a 2.6 kernel with ACPI enabled, or passing the ‘noapic’ option to the kernel to force interrupt routing through the traditional Programmable Interrupt Controller (PIC). Newer kernels also provide an interrupt polling mechanism to attempt to work around this problem. This mechanism can be enabled by passing the ‘irqpoll’ option to the kernel.
Currently, the NVIDIA driver will attempt to detect edge triggered interrupts and X will purposely fail to start (to avoid stability issues). This behavior can be overridden by setting the “NVreg_RMEdgeIntrCheck” NVIDIA Linux kernel module parameter. This parameter defaults to “1”, which enables the edge triggered interrupt detection. Set this parameter to “0” to disable this detection.
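Since the README points at /proc/interrupts, here is a minimal sketch of how one might check what the kernel reports for the nvidia interrupt line. The function names are made up for illustration, and the "-edge" label is an assumption based on how 2.6-era kernels print IO-APIC lines (for example "IO-APIC-edge"); other kernels may word it differently.

```shell
#!/bin/sh
# Print the /proc/interrupts lines that mention a driver (default: nvidia).
# The optional second argument points at a saved copy of the file instead.
show_irq_lines() {
    driver="${1:-nvidia}"
    file="${2:-/proc/interrupts}"
    grep -i -- "$driver" "$file"
}

# Succeed if any of those lines is labelled as edge-triggered.
# Assumption: edge-triggered lines carry a "-edge" suffix in the
# controller column, as in "IO-APIC-edge".
irq_is_edge() {
    show_irq_lines "$@" | grep -q -- '-edge'
}
```

Running `show_irq_lines nvidia` prints the raw lines so you can read the trigger column yourself, and `irq_is_edge nvidia && echo "edge-triggered"` gives a quick yes/no answer.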
When I don’t understand something, my first instinct is to absorb as much as I can from the mish-mash and look for anything useful in layman’s terms. The only thing that made sense (as a solution to this problem) was the last paragraph, about NVreg_RMEdgeIntrCheck.
So I immediately googled that term to see if anybody had done the same. One post I found said that its author had succeeded, so I went and tried it:
modprobe -r nvidiafb
modprobe -r nvidia
modprobe --verbose nvidia NVreg_RMEdgeIntrCheck=0
/etc/init.d/xdm start
I got a little further: xdm showed up, but then it hung… It was worth a try…
After that, I found on other forums that you cannot have both the nvidiafb and nvidia drivers loaded; if you want a framebuffer driver (so your consoles look good), you’re supposed to use vesafb (although I did read that the nvidia driver does well with the console). So I made sure that I had only the vesa framebuffer as a module and disabled nvidiafb. Recompiled the kernel and modules, did the rain-dance and so on. And yes, I recompiled nvidia-drivers too. No go…
I then downloaded the driver shell installer directly from Nvidia and tried running the .sh file. It complained that there was already an nvidia.ko file, and about needing to recompile something. So I deleted the file from my /lib/modules folder (specific to that kernel version) and tried again. This time it seemed to install, but after another rain-dance and a reboot, the dmesg log said that it could not find i2c_add and i2c_delete or something. Why all of a sudden, I’ve no clue… It turned out I didn’t even have i2c_core set in my kernel .config. After a few more tries (the 2nd try made me realize that I needed i2c_nvidia as a module), I no longer got the error in dmesg, but still no X.
So I gave up and played “Metal Gear Solid 4” in hard-boss mode… The bosses (especially the octopus boss and Liquid Snake) were frustrating (but I finished the hard-boss mode), but not as frustrating as the (at that time) foreign gibberish of this Chapter 8 issue and my X not working all of a sudden (because it worked fine up to 2.6.25)…
So I had to understand what Nvidia’s Chapter 8 was trying to say, which meant doing some research and learning what all this meant. My first confusion (and I’ve seen forum posts from others with the same confusion) was the difference between ACPI and APIC. What bothered me so much was that one of the paragraphs used these two acronyms in a single sentence, and I thought it was a typo.
So I grep’d my /usr/src/linux/.config file for APIC and found nothing, but found a few hits for ACPI. It turns out ACPI is the power management interface, and I had it disabled (and it still is disabled) because power management kept crashing my X (maybe that was because of APIC; perhaps I should re-enable power management now).
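Because the two acronyms are so easy to mix up, a small helper that lists them separately may help. This is only a sketch: `config_lines` is a name I made up, and it assumes the usual .config layout where an option is either set (`CONFIG_FOO=y`) or commented out (`# CONFIG_FOO is not set`).

```shell
#!/bin/sh
# List the kernel .config entries whose option name contains a given word,
# whether the option is set or explicitly "not set".
# Usage: config_lines WORD [path-to-.config]
config_lines() {
    word="$1"
    file="${2:-/usr/src/linux/.config}"
    grep -E "^(# )?CONFIG_[A-Z0-9_]*${word}[A-Z0-9_]*(=| is not set)" "$file"
}
```

Calling `config_lines APIC` and then `config_lines ACPI` shows the interrupt-controller options and the power-management options side by side, so a disabled ACPI is not mistaken for a missing APIC (or the other way around).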
What’s most important is APIC, and what I am quite glad about is Wikipedia. Once I found it there, I next read about PIC, then learned about Northbridge and Southbridge. Then I got interested in reading about nForce1, nForce2, nForce3, and nForce4. I’ve even read about Super I/O.
Armed with knowledge and no longer ignorant about what APIC means, I finally understood what Nvidia was saying… If you bumped into my page because you’re having the same problem as I had, treat yourself to Wikipedia and enjoy the education. It was (at least for me) a treat to learn this.
So my current solution?
I modified my /boot/grub/grub.conf and added “noapic” to my kernel parameters…
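For anyone wanting to try the same thing, here is a rough sketch of what such an entry could look like and how to confirm after a reboot that the option took effect. The title, kernel image name and root device in the comment are placeholders, not my actual values, and `has_kernel_param` is just an illustrative helper name.

```shell
#!/bin/sh
# Hypothetical grub.conf (GRUB legacy) entry; title, kernel image name and
# root device are placeholders. Only the trailing "noapic" is the change:
#
#   title  Gentoo Linux 2.6.26-r1
#   root   (hd0,0)
#   kernel /boot/kernel-2.6.26-gentoo-r1 root=/dev/sda3 noapic

# Succeed if the running kernel was booted with the given parameter.
# The optional second argument points at a saved copy of /proc/cmdline.
has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Split the command line into one word per line, then match whole words
    # only, so "noapic" does not match something like "noapictimer".
    tr -s ' ' '\n' < "$file" | grep -qx -- "$param"
}
```

After rebooting, `has_kernel_param noapic && echo "booted with noapic"` tells you whether the kernel actually saw the option.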