Infrastructure Engineer

Position Overview
We are seeking an experienced Infrastructure Engineer to architect and manage our AI computing infrastructure. The ideal candidate will have extensive experience in building and scaling ML infrastructure, with particular emphasis on distributed training systems and GPU cluster management.

Key Responsibilities
- Design and implement high-performance computing infrastructure for large-scale AI model training
- Manage and optimize GPU clusters for distributed training workloads
- Build and maintain container orchestration systems for ML workflows
- Implement efficient resource allocation and scheduling systems
- Design and maintain monitoring and alerting systems for compute infrastructure
- Optimize infrastructure costs while maintaining performance
- Collaborate with ML teams to support their computing needs
- Ensure system reliability, security, and scalability

Required Qualifications
- Master's degree in Computer Science, Systems Engineering, or a related field
- 8+ years of experience in infrastructure engineering, with a focus on ML/AI infrastructure
- Strong experience with:
  - GPU cluster management and optimization
  - Kubernetes and container orchestration
  - Linux system administration
  - Infrastructure as Code (IaC)
- Proven track record in building large-scale computing systems
- Experience with major cloud providers (AWS, GCP, Azure, Alibaba Cloud, Tencent Cloud, etc.)

Preferred Qualifications
- Experience with ML infrastructure at major tech companies
- Knowledge of distributed training systems (PyTorch DDP, Horovod)
- Familiarity with ML frameworks and their infrastructure requirements
- Experience with high-performance networking (InfiniBand, RDMA)
- Background in performance optimization and troubleshooting
- Understanding of ML workload characteristics
- Bilingual proficiency (English/Chinese)

Technical Skills
Computing Infrastructure
- GPU Clusters: NVIDIA DGX, GPU management tools
- Distributed Systems: Slurm, Kubernetes
- ML Platforms: Kubeflow, Ray
- Job Scheduling: YARN, Slurm

Cloud & Networking
- Cloud Platforms: International: AWS, GCP, Azure; China: Alibaba Cloud, Tencent Cloud
- Networking: InfiniBand, RDMA, TCP/IP optimization
- Load Balancing: HAProxy, NGINX

Infrastructure Management
- Container Technologies: Docker, Kubernetes, Singularity
- IaC: Terraform, Ansible, CloudFormation
- CI/CD: Jenkins, GitLab CI
- Monitoring: Prometheus, Grafana, ELK Stack

Development
- Languages: Python, Go, Shell scripting
- Version Control: Git
- Documentation: Markdown, Confluence

What We Offer
- Opportunity to build cutting-edge AI infrastructure
- Competitive salary and equity package
- Access to the latest hardware and technologies
- Professional development opportunities
- Comprehensive health benefits
- Learning and conference budget

Location
Hong Kong (on-site, Hong Kong Science and Technology Park)

Expected Impact
- Design and implement next-generation AI computing infrastructure
- Optimize resource utilization and cost efficiency
- Improve training speed and efficiency for AI models
- Build scalable and reliable systems

Projects You'll Work On
- Building automated GPU cluster management systems
- Implementing efficient resource scheduling for ML workloads
- Optimizing distributed training infrastructure
- Setting up monitoring and observability systems
- Designing disaster recovery and backup solutions
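As a rough illustration of the GPU cluster management and monitoring work this role covers (a minimal sketch, not part of the posting), the snippet below uses the official Kubernetes Python client to report per-node allocatable GPUs. It assumes the cluster's GPU nodes run the NVIDIA device plugin, which exposes the "nvidia.com/gpu" extended resource; the function name and cluster setup are placeholders.

```python
# Minimal sketch (illustrative only): report allocatable GPUs per node using
# the official Kubernetes Python client. Assumes GPU nodes run the NVIDIA
# device plugin, which advertises the "nvidia.com/gpu" extended resource.
from kubernetes import client, config


def report_gpu_capacity() -> None:
    # Use the local kubeconfig; inside a pod, config.load_incluster_config() would apply.
    config.load_kube_config()
    core = client.CoreV1Api()

    total = 0
    for node in core.list_node().items:
        # node.status.allocatable is a dict of resource name -> quantity string
        gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
        total += gpus
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
    print(f"cluster total: {total} GPU(s)")


if __name__ == "__main__":
    report_gpu_capacity()
```

In practice an inventory check like this would feed a scheduler or a Prometheus exporter rather than printing to stdout, but it shows the kind of cluster-level tooling the responsibilities describe.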
AI Research Scientist

Position Overview
We are seeking an experienced AI Research Scientist to lead foundation model development initiatives. The ideal candidate will have hands-on experience in training large-scale models at major tech companies and a proven track record of advancing the state of the art in foundation models.

Key Responsibilities
- Lead the architecture design and training of large-scale foundation models
- Develop and optimize model training pipelines for distributed systems
- Drive research initiatives in model scaling, efficiency, and performance
- Implement innovative approaches to improve model capabilities and training efficiency
- Collaborate with the engineering team to productionize research breakthroughs
- Guide technical decisions related to model architecture and training strategies

Required Qualifications
- Ph.D. in Computer Science, Machine Learning, or a related field
- 3+ years of experience in training large-scale models at major tech companies, including:
  - International tech leaders (e.g., Google, Meta, Microsoft, OpenAI, Anthropic), or
  - Leading Chinese tech companies (e.g., ByteDance, Alibaba, Baidu, Tencent, SenseTime, Huawei)
- Proven experience with distributed training systems and large-scale model optimization
- Deep understanding of transformer architectures and their variants
- Strong track record in developing and training foundation models
- Extensive experience with PyTorch and/or JAX
- Publication record in top-tier conferences (NeurIPS, ICML, ICLR)

Preferred Qualifications
- Experience with both Chinese and international AI ecosystems
- Familiarity with Chinese AI infrastructure (e.g., ModelArts, PAI, ByteMLab)
- Background in scaling laws and efficient training strategies
- Experience with video generation models or multimodal architectures
- Track record of open-source contributions to major ML frameworks
- Experience with ML infrastructure design and implementation
- Familiarity with mixed-precision training and model parallelism
- Experience with custom CUDA kernels and optimization

Technical Expertise
- Large-Scale Training: Distributed training frameworks, model parallelism strategies
- Infrastructure: International cloud platforms (AWS, GCP); Chinese cloud platforms (Alibaba Cloud, Tencent Cloud, Huawei Cloud)
- Languages: Python, CUDA, C++ (optional)
- Frameworks: PyTorch, JAX, DeepSpeed, Megatron-LM; Chinese ecosystem: PaddlePaddle, MindSpore (a plus)
- Development Tools: Git, Docker, Kubernetes
- Monitoring: Weights & Biases, MLflow, or similar tools

What We Offer
- Opportunity to shape the future of foundation models in video generation
- Leadership role in technical decision-making
- Access to substantial computing resources and infrastructure
- Competitive compensation package including equity
- Regular collaboration with top researchers in the field
- Support for conference attendance and research publication
- International exposure and collaboration opportunities

Location
Hong Kong (on-site, Hong Kong Science and Technology Park)

Expected Impact
- Drive the development of next-generation foundation models
- Lead research initiatives that push the boundaries of model capabilities
- Build and mentor a world-class research team
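For context on the distributed training and mixed-precision experience this posting asks for (an illustrative sketch, not material from the posting), the snippet below shows a bare-bones PyTorch DistributedDataParallel training step with automatic mixed precision. The model, data, and hyperparameters are toy placeholders, and it assumes launch via torchrun so that RANK, LOCAL_RANK, and WORLD_SIZE are set in the environment.

```python
# Minimal sketch (illustrative only): training-step structure for PyTorch DDP
# with automatic mixed precision. Assumes launch via
# `torchrun --nproc_per_node=<gpus> train.py`, which sets the rank env vars.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    dist.init_process_group(backend="nccl")            # NCCL backend for multi-GPU training
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)     # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()               # loss scaling for fp16

    for step in range(10):                             # placeholder data and loop length
        x = torch.randn(32, 1024, device=local_rank)
        target = torch.randn(32, 1024, device=local_rank)

        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():                # forward pass in mixed precision
            loss = nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()                  # gradients all-reduce across ranks here
        scaler.step(optimizer)
        scaler.update()

        if dist.get_rank() == 0 and step % 5 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

At foundation-model scale this loop would typically sit behind DeepSpeed or Megatron-LM tensor/pipeline parallelism rather than plain DDP, but the rank, device, and precision plumbing it shows stays the same.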