Failed to initialize NVML: Driver/library version mismatch

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 570.172

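Today, when I ran nvidia-smi -l to check GPU usage, it no longer displayed anything and printed the message above instead. After some searching, I learned that the error is caused by a version mismatch between the NVIDIA user-space libraries (NVML here) and the NVIDIA kernel module that is currently loaded.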

1. Check recent NVIDIA-related updates: sure enough, the log shows a driver upgrade last week (September 16).

grep nvidia /var/log/dpkg.log

2025-09-16 06:56:23 upgrade nvidia-driver-570-server:amd64 570.133.20-0ubuntu0.24.04.1 570.172.08-0ubuntu0.24.04.1
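
To cut the log down to upgrade events only, something like the following sketch may help. It assumes the usual dpkg.log layout (date, time, action, package, old version, new version); the function name nvidia_upgrades is my own, not part of any tool.

```shell
# Print "date package old -> new" for every NVIDIA package upgrade
# found in dpkg.log-formatted text on stdin.
# Assumed field layout: date time action package old-version new-version
nvidia_upgrades() {
    awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $4, $5, "->", $6 }'
}

# Only read the real log when it exists and is readable.
if [ -r /var/log/dpkg.log ]; then
    nvidia_upgrades < /var/log/dpkg.log
fi
```

Older entries may have been rotated into /var/log/dpkg.log.1 or compressed .gz files, so an empty result does not necessarily mean there was no upgrade.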

2. Check the currently installed NVIDIA driver packages: the installed driver version is 570.172.08.

dpkg -l | grep nvidia

ii  libnvidia-cfg1-570-server:amd64       570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-570-server           570.172.08-0ubuntu0.24.04.1             all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-570-server:amd64    570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-decode-570-server:amd64     570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64          1:1.1.17-0ubuntu0~gpu24.04.1            amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-570-server:amd64     570.172.08-0ubuntu0.24.04.1             amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-570-server:amd64      570.172.08-0ubuntu0.24.04.1             amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-570-server:amd64       570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-570-server:amd64         570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-570-server       570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA compute utilities
ii  nvidia-dkms-570-server                570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA DKMS package
ii  nvidia-driver-570-server              570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA Server Driver metapackage
ii  nvidia-firmware-570-server-570.172.08 570.172.08-0ubuntu0.24.04.1             amd64        Firmware files used by the kernel module
ii  nvidia-kernel-common-570-server       570.172.08-0ubuntu0.24.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-570-server       570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA kernel source package
ii  nvidia-utils-570-server               570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-570-server  570.172.08-0ubuntu0.24.04.1             amd64        NVIDIA binary Xorg driver
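
With this many packages it is easy to miss one that was left behind at an older version. A minimal sketch for listing the distinct versions within one driver branch follows. The function name driver_versions and the branch string 570-server are illustrative assumptions taken from this machine; adjust them for yours.

```shell
# Read `dpkg -l` output on stdin and print the distinct versions of all
# installed ("ii") packages whose name contains the given branch string.
driver_versions() {
    awk -v branch="$1" '$1 == "ii" && index($2, branch) { print $3 }' | sort -u
}

# Only query the package database when dpkg is available.
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | driver_versions 570-server
fi
```

A single line of output means every package in the branch is at the same version. Packages outside the branch, such as libnvidia-egl-wayland1, are versioned independently and are ignored here.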

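3. Check the version of the loaded NVIDIA kernel module: it is 570.133.20, which is older than the installed 570.172.08. The packages were upgraded on disk, but the kernel module still in memory is the old one, so it lags behind the user-space libraries.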

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  570.133.20  Sun Apr 13 04:50:56 UTC 2025
GCC version:  gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
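
The comparison can also be scripted. The sketch below pulls the version out of the NVRM line and compares it with what modinfo reports for the module file on disk. The function names loaded_version and compare_versions are my own, and the parsing assumes the version is the first dotted number on the NVRM line, as in the output above.

```shell
# Extract the kernel module version from text formatted like
# /proc/driver/nvidia/version (read from stdin).
loaded_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)+$/) { print $i; exit }
    }'
}

# Compare the loaded version ($1) with the on-disk version ($2).
compare_versions() {
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: loaded=$1 installed=$2"
    fi
}

# Run the check only on a machine where the NVIDIA module is loaded.
if [ -r /proc/driver/nvidia/version ] && command -v modinfo >/dev/null 2>&1; then
    loaded=$(loaded_version < /proc/driver/nvidia/version)
    installed=$(modinfo -F version nvidia 2>/dev/null)
    compare_versions "$loaded" "$installed"
fi
```

modinfo -F version nvidia reads the module file built for the running kernel, so it reflects what would be loaded next, not what is in memory right now.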

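4. Reload the kernel module: a reboot would normally load the new module as well, but I did not want to reboot. Instead, I unloaded the NVIDIA modules and let them be loaded again, as shown below. After that, the GPU information was displayed again.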

sudo rmmod nvidia_uvm #remove the nvidia_uvm module
sudo rmmod nvidia_drm #remove the nvidia_drm module
sudo rmmod nvidia_modeset #remove the nvidia_modeset module
sudo rmmod nvidia #remove the nvidia module last, since the others depend on it

sudo nvidia-smi   #when nvidia-smi finds no kernel module loaded, it loads the module automatically
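
rmmod refuses to remove a module that is still in use, so it can help to look at the use counts first. The following is a small sketch under that assumption. The function name nvidia_module_usage is hypothetical; it only reformats the standard lsmod columns (module, size, use count).

```shell
# Read `lsmod` output on stdin and print "module use-count" for every
# module whose name starts with "nvidia".
nvidia_module_usage() {
    awk '$1 ~ /^nvidia/ { print $1, $3 }'
}

# Only inspect the live module list when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | nvidia_module_usage
fi
```

If a count stays above zero, something is still holding the GPU. On my understanding the usual candidates are running CUDA jobs, a display server, or the nvidia-persistenced service; sudo lsof /dev/nvidia* lists the processes that have the device files open. Those need to be stopped before the rmmod commands above can succeed.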

References:
"[Solved] [nvidia-smi] Failed to initialize NVML: Driver/library version mismatch: how to fix it"
"Driver version does not match the library files (Failed to initialize NVML: Driver/library version mismatch), so the NVIDIA driver cannot run: a fix without rebooting"