SRE/Deployment Team Lead

Comet.ml

United States · Remote
Posted on Jul 6, 2024

Join us to advance data science and machine learning.

Comet is accelerating the machine learning development process for data science and ML teams. From the individual data scientist tracking training runs to the enterprise team moving hundreds of models into production, Comet is the platform used by some of the most innovative builders in the industry. We started Comet to make it possible for teams to manage and optimize models across the complete ML lifecycle and achieve business value faster.

You’re welcome here

Working in Comet’s fast, dynamic startup environment is challenging and fun. We are looking for people who are customer-focused, work collaboratively, and want to be a voice in advancing Comet’s leadership in the marketplace. If you are excited about empowering technology innovators around the globe in creating world-changing machine learning models, Comet is the right place for you.

Comet is backed by more than $63 million in venture-capital funding, and we are the MLOps platform of choice for teams at Ancestry, The RealReal, Uber, WorkFusion, and Zappos. We are a remote-first company with offices in New York City (U.S.A.) and Tel Aviv (Israel). And we’re just getting started. CRN featured Comet as one of the 10 hottest machine learning and data science startups in 2021.

Comet is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees without regard to race, religion, color, sex, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship status, uniform service member status, marital status, pregnancy, age, medical condition, physical or mental disability, genetic information/characteristics, and any other characteristic protected by State or Federal law.

We are seeking an experienced and dynamic SRE/Deployment Team Lead to join our growing team. The ideal candidate will have a strong background in software engineering and system administration, and a passion for automation, reliability, and performance. This role will be located in NYC or fully remote in the USA, with some flexibility in work hours required to collaborate with a global team based in Tel Aviv and Europe. As a lead, you will be responsible for designing, implementing, and maintaining our deployments, ensuring the stability and scalability of our infrastructure, and leading a team of talented engineers.

Key Responsibilities:

  • Oversee the deployment, monitoring, and maintenance of production systems.
  • Develop and maintain all deployment options for Comet, including multi-cloud, on-premises, and bare-metal deployments, on a single Linux server or with containerization technologies such as Kubernetes.
  • Quickly identify and resolve infrastructure bugs, ensuring high system availability and reliability.
  • Implement and maintain infrastructure as code using tools such as Terraform, Ansible, or similar.
  • Ensure high availability, scalability, and reliability of services and applications.
  • Work closely with customers to understand their deployment needs and provide effective support for deploying and maintaining Comet on their infrastructure.
  • Collaborate with cross-functional teams, including development, QA, support, and other teams, to ensure seamless integration and successful deployment of new features and updates.
  • Mentor and lead a team of DevOps, SRE, and deployment engineers.
  • Conduct regular performance tuning, troubleshooting, and root cause analysis.
  • Stay updated with the latest industry trends, technologies, and best practices in DevOps and SRE.
  • Implement and manage observability tools for monitoring, logging, and alerting.

Qualifications:

  • 5+ years of experience in a DevOps, SRE, or similar role.
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Proven experience leading and mentoring a team of engineers.
  • Proficient in Linux system internals, scripting, and configuration management tools (Bash/Python/Ansible).
  • Strong expertise in cloud platforms such as AWS, GCP, or Azure.
  • Proficiency in scripting languages (e.g., Python, Bash, Go).
  • Experience with containerization and orchestration tools such as Docker and Kubernetes.
  • Familiarity with cloud-based infrastructure services such as EC2, RDS, S3, and VPC, and with related tools such as CloudFormation and Terraform.
  • In-depth knowledge of CI/CD tools like Jenkins, GitLab CI, CircleCI, or similar.
  • Experience with monitoring applications such as Prometheus, Grafana, or ELK stack.
  • Solid understanding of networking concepts, security best practices, and system architecture.
  • Excellent communication skills, both verbal and written, to effectively collaborate with team members and clients.
  • Passion for troubleshooting and investigating issues in unfamiliar environments.
  • Excellent problem-solving skills and the ability to work under pressure.

Preferred Qualifications:

  • Experience with microservices architecture and serverless computing.
  • Knowledge of configuration management tools (e.g., Chef, Puppet).
  • Understanding of database management and optimization.
  • Certifications in relevant technologies or platforms (e.g., AWS Certified DevOps Engineer).

What We Offer:

  • Competitive salary and benefits package.
  • Flexible working hours and remote work options.
  • Opportunities for professional growth and development.
  • A collaborative and innovative work environment.
  • The chance to work with cutting-edge technologies and projects.