Member of Technical Staff, AI Networking - MAI Superintelligence Team
Description
Microsoft AI is hiring a Member of Technical Staff, AI Networking to design and scale the world’s most advanced high-performance networks powering Copilot and next-generation AI systems. Join the team building the fabric that connects frontier-class datacenters, enables multi-gigawatt AI supercomputers, and supports the training of the most sophisticated AI models on the planet. As an AI Networking Engineer, you’ll shape the end-to-end networking architecture, from the link layer to fabric-wide systems, for hyperscale AI training clusters. You’ll design, bring up, and scale the distributed Ethernet and InfiniBand fabrics that connect hundreds of thousands of GPUs across multi-megawatt data halls, and you’ll benchmark, profile, debug, and tune AI training and inference workloads running in production clusters. You’ll engineer ultra-low-latency RoCE networks, design congestion-free transport mechanisms, optimize lossless fabrics at 10k–100k+ GPU scale, and partner deeply across Azure, Microsoft AI, and datacenter teams to turn cutting-edge ideas into running global infrastructure.
Microsoft Superintelligence Team’s mission is to empower every person and every organization on the planet to achieve more. This role is part of Microsoft AI's Superintelligence Team (MAIST), a startup-like team inside Microsoft AI created to push the boundaries of AI toward Humanist Superintelligence: ultra-capable systems that remain controllable, safety-aligned, and anchored to human values.
Responsibilities
- Advanced RoCE transport design, congestion control, and ECN/WRED/DCTCP tuning
- Fabric architecture, topology planning, network modeling, and scaling strategy
- Telemetry, observability, reliability engineering, and automated troubleshooting
- Develop novel routing techniques and tune their deployment to achieve reliability in large networks
- Work with world-class network designers such as NVIDIA and Broadcom, as well as in-house silicon/network co-design teams
- AI training and inference cluster bring-up, performance benchmarking, and root-cause analysis
- Gather data and insights to develop the pretraining compute roadmap
- Find a path past roadblocks to get your work into the hands of users quickly and iteratively
- Enjoy working in a fast-paced, design-driven product development cycle
- Embody our Culture and Values
Qualifications
Required
- Bachelor's Degree in Computer Science or related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
Preferred
- Master's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree in Computer Science or related technical field AND 12+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.