Training Performance Engineer - Acceleration
Description
Kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics. Kaiko trains its own foundation models for clinical work. The program runs on open-weight MoE bases in the hundreds-of-billions to trillion-parameter range. You own throughput on our Blackwell training cluster: you instrument runs, identify utilization gaps, and ship optimizations that raise MFU and uptime and cut wall-clock time. You work alongside research as new architectures and phases land on the cluster.
You will be based in either the Netherlands or Switzerland, with the expectation of spending at least 50% of your time at the office.
Responsibilities
- Instrument and analyze runs (MFU, throughput, uptime) and close gaps against predicted wall-clock times.
- Benchmark NCCL collectives over InfiniBand and NVLink, including rail/topology behaviour and congestion at scale, and keep a current picture of what the fabric delivers.
- Drive low-precision training in our stack and validate the speed-up.
- Tune MoE parallelism (TP / PP / CP / EP / DP) per phase and characterise expert-parallel comm cost on the cluster fabric.
- Land custom attention-variant kernels (e.g. hybrid, latent-attention) into the training stack.
Qualifications
- Deep GPU systems experience, with kernel-level CUDA / Triton work and comfort with CUTLASS, FlashAttention, PyTorch, and Nsight profiling.
- Production experience with NCCL on InfiniBand or equivalent high-bandwidth interconnects.
- Parallelism literacy: TP / PP / CP / EP / DP under memory, comm, and MFU constraints.
- A habit of tracking the relevant systems literature and bringing it into the stack.
Nice to have:
- Low-precision training (FP8, expert-only quant, dynamic loss scaling).
- Sparse / hybrid / MLA attention at the kernel level.
- Experience shipping large-scale MoE training in production (pre-training, SFT, or RL).
- Stack experience with Megatron, NeMo, or a comparable framework.
Why kaiko
At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We've built a team of international experts where your work has a direct impact. Here's what we value:
- Ownership: You'll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
- Collaboration: You'll approach disagreement with curiosity, build on common ground, and create solutions together.
- Ambition: You'll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.
In addition, we offer:
- An attractive and competitive salary, a good pension plan, and 25 vacation days per year.
- Great offsites and team events to strengthen the team and celebrate successes together.
- A EUR 1000 learning and development budget to help you grow.
- Autonomy to do your work the way that works best for you, whether you have kids or prefer early mornings.
- An annual commuting subsidy.