Nvidia Industry · Engineering

Senior Networking Solution Test Engineer – AI Cluster Debugging

CHF 150'000 – 170'000 / year

Description

We are looking for a Senior Networking Test Engineer with strong system‑level debugging skills to join our End‑to‑End Verification team! You will work on pioneering NVLink, Ethernet and InfiniBand‑based AI clusters, and you will own complex issues across hardware, system software and AI workloads.

Responsibilities:

  • Design and review test and product requirements across the NVLink, Ethernet and InfiniBand / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior.
  • Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
  • Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
  • Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
  • Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
  • Define tests and guide the automation team to implement robust, debuggable suites that produce actionable logs, metrics and traces.
  • Run regression, performance, functional and scale testing, analyze the results and provide clear, data‑driven reports to collaborators.
  • Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.

Qualifications:

  • B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience.
  • 8+ years of hands‑on networking or system‑level testing and debugging on Linux.
  • Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2).
  • Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure.
  • Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).
  • Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging.
  • Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes.
  • Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment orchestration.
  • Fast learner, familiar with modern AI tools and workflows, able to adapt quickly.
  • Excellent analytical, problem‑solving and communication skills, with strong ownership and a collaborative approach.

Nice to have:

  • Hands‑on debugging of collective communication libraries (for example NCCL) or large‑scale LLM training / inference clusters.
  • Experience with large cluster environments (from tens to thousands of GPUs or nodes), including incident response and post‑mortem analysis.
  • Deep expertise in tuning and debugging congestion control and lossless Ethernet for AI workloads (for example DCQCN, ECN, PFC).
  • Familiarity with NVIDIA networking technologies (for example BlueField / BF3, ConnectX NICs) and their software stack and diagnostics.
  • Experience debugging issues that span multiple layers (L2/L3, transport, AI frameworks) or contributing to open‑source networking / AI systems.