Directly check the location of libcuda.so where the linker expects to find it #185

Open
ocaisa wants to merge 1 commit into EESSI:main from ocaisa:correct_cuda_checks

ocaisa commented Mar 23, 2026

Fixes #184

ocaisa commented Mar 23, 2026

Tested this on Vega, where the GPU drivers are available for EESSI/2023.06 but not for EESSI/2025.06:

[eualano@gn06 ~]$ source /cvmfs/software.eessi.io/versions/2025.06/init/lmod/bash
Modules purged before initialising EESSI
Module for EESSI/2025.06 loaded successfully
EESSI has selected x86_64/amd/zen2 as the compatible CPU target for EESSI/2025.06
EESSI has selected accel/nvidia/cc80 as the compatible accelerator target for EESSI/2025.06
(for debug information when loading the EESSI module, set the environment variable EESSI_MODULE_DEBUG_INIT)

# Without the PR it (incorrectly) loads the module
{EESSI/2025.06} [eualano@gn06 ~]$ module load OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0

# Enable the PR
{EESSI/2025.06} [eualano@gn06 ~]$ export LMOD_PACKAGE_PATH=$PWD/software-layer-scripts/generate/.lmod
{EESSI/2025.06} [eualano@gn06 ~]$ module load OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0
Lmod has detected the following error:
You requested to load UCX-CUDA  which relies on the CUDA runtime environment and driver libraries. In order to be able to use the module, you will need to make sure EESSI can find
the GPU driver libraries on your host system. The file being checked for on your system is
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia/libcuda.so
You can override this check by setting the environment variable EESSI_OVERRIDE_GPU_CHECK but the loaded application will not be able to execute on your system.
For more information on how to do this, see https://www.eessi.io/docs/site_specific_config/gpu/.

While processing the following module(s):
...

{EESSI/2025.06} [eualano@gn06 ~]$ module purge

# This resets LMOD_PACKAGE_PATH
[eualano@gn06 ~]$ module load EESSI/2023.06
Module for EESSI/2023.06 loaded successfully
{EESSI/2023.06} [eualano@gn06 ~]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0

# Still works with the PR enabled
{EESSI/2023.06} [eualano@gn06 ~]$ export LMOD_PACKAGE_PATH=$PWD/software-layer-scripts/generate/.lmod
{EESSI/2023.06} [eualano@gn06 ~]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
{EESSI/2023.06} [eualano@gn06 ~]$
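
For readers who want to reproduce the check outside Lmod, the behaviour shown in the session above can be approximated in bash. This is only a rough sketch, not the hook itself (which appears to live in .lmod/SitePackage.lua): the function name `eessi_gpu_driver_available` and its compat-layer argument are hypothetical, the `lib/nvidia/libcuda.so` suffix is taken from the error message, and treating any non-empty value of EESSI_OVERRIDE_GPU_CHECK as an override is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the check reported in the Lmod error message.
# The function name and its argument are hypothetical, not part of EESSI.

# Returns 0 if libcuda.so is present under the given compat layer prefix,
# or if the user has asked to skip the check; returns 1 otherwise.
eessi_gpu_driver_available() {
    local compat_dir="$1"
    local libcuda="${compat_dir}/lib/nvidia/libcuda.so"

    # Assumption: any non-empty value counts as an override.
    if [ -n "${EESSI_OVERRIDE_GPU_CHECK:-}" ]; then
        return 0
    fi

    # -e follows symlinks, so a dangling link is treated as missing.
    if [ -e "${libcuda}" ]; then
        return 0
    fi

    echo "GPU driver library not found: ${libcuda}" >&2
    return 1
}

# Example (path copied from the error message above):
#   eessi_gpu_driver_available /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64
```

Testing the file itself, rather than a directory that may exist for other reasons, is what makes a dangling symlink or an unpopulated driver directory show up as a failure.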

ocaisa commented Mar 23, 2026

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

eessi-bot-deucalion bot commented Mar 23, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2023.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_185/1065206

date                      job status  comment
Mar 23 16:57:43 UTC 2026  submitted   job id 1065206 awaits release by job manager
Mar 23 16:58:35 UTC 2026  released    job awaits launch by Slurm scheduler
Mar 23 16:59:38 UTC 2026  running     job 1065206 is running
Mar 23 17:09:06 UTC 2026  finished    😁 SUCCESS
Details
✅ job output file slurm-1065206.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-aarch64-a64fx-17742853250.tar.zst
size: 0 MiB (4385 bytes)
entries: 1
modules under 2023.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2023.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/a64fx
.lmod/SitePackage.lua
Mar 23 17:09:06 UTC 2026  test result  😁 SUCCESS
ReFrame Summary
[ SKIP ] ( 1/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 2/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 3/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ SKIP ] ( 4/10) Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed
[ OK ] ( 5/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:a64fx+default
P: perf: 579.829 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:a64fx+default
P: perf: 525.855 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:a64fx+default
P: latency: 1.64 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:a64fx+default
P: latency: 1.67 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:a64fx+default
P: bandwidth: 8778.83 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:a64fx+default
P: bandwidth: 8081.74 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 6/10 test case(s) from 10 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-1065206.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot-deucalion bot commented Mar 23, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_185/1065207

date                      job status  comment
Mar 23 16:57:49 UTC 2026  submitted   job id 1065207 awaits release by job manager
Mar 23 16:58:32 UTC 2026  released    job awaits launch by Slurm scheduler
Mar 23 16:59:40 UTC 2026  running     job 1065207 is running
Mar 23 17:05:59 UTC 2026  finished    😁 SUCCESS
Details
✅ job output file slurm-1065207.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17742852720.tar.zst
size: 0 MiB (4388 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/a64fx/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/a64fx
.lmod/SitePackage.lua
Mar 23 17:05:59 UTC 2026  test result  😁 SUCCESS
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.89 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 8084.31 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1065207.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

Development

Successfully merging this pull request may close these issues.

Hook for checking if CUDA driver is available does not work for EESSI/2025.06