Site Reliability Engineer - CareerBuilder Premium Subscription

Boca Raton, FL

Job description:

We are a cutting-edge biomedical startup preparing for our first product release. This is a unique opportunity to join a rapidly growing biomedical company on the ground floor. We are a tight-knit, agile group with many capable engineering, medical, and business personnel on both the team and the board. We are looking to expand our team by adding a strong software development arm to the company.

Current Project:

Client's Humero Tech C1 changes the way shoulder injuries are rehabilitated with our innovative strength-building and sensor-based technology. Our rotator cuff machine tracks patients' efforts as they work through strength-based exercises. At the end of each session, the user gets a set of in-depth metrics to help inform the next steps for recovery.
Client is at the very beginning of device rollout into the field, and thus Titin is searching for a talented Site Reliability Engineer to ensure customers have a smooth experience while working with our software and their data.
Additionally, a strong and positive personality is critical because this person will communicate directly with our customers.

System Monitoring and Incident Management

Set up and maintain monitoring tools to track system performance, availability, and reliability.
Respond to incidents, troubleshoot issues, and ensure fast recovery to minimize downtime.
Implement alerting mechanisms to proactively identify potential issues before they impact end users.

Automation and Efficiency

Automate manual operations and repetitive tasks to improve system reliability and speed.
Write scripts and create tools to streamline deployment, monitoring, and scaling processes.
Work with Continuous Integration/Continuous Deployment (CI/CD) management tools.

Infrastructure Management

Performance Optimization

Analyze system performance and work on tuning to meet predefined service level objectives (SLOs).
Optimize resource usage, including compute, memory, and storage, to ensure cost-efficiency without sacrificing performance.

Disaster Recovery and High Availability

Develop, test, and implement disaster recovery plans.
Ensure high availability by using redundancy, failover mechanisms, and geographical distribution of systems.

Security and Compliance

Collaboration and Communication

Work closely with development teams to integrate reliability into the software development lifecycle.
Participate in post-incident reviews to identify root causes and prevent future occurrences.
Provide technical support to teams and help to build a culture of reliability across the organization.

Documentation

Document incident response processes, infrastructure architecture, and SRE best practices.
Maintain clear, accessible records for troubleshooting, deployments, and maintenance tasks.
Generate work instructions to document tasks and enable smooth team expansion.

Continuous Improvement

Identify opportunities for process improvements and performance enhancements.
Keep up to date with the latest technology trends and industry practices, and adopt relevant innovations.

Application Question(s):

Education:

Required Experience:

Preferred Experience:

Experience with Linux/Unix systems.
Experience with version control systems like Git.
Understanding of AWS IAM.
Expertise in system administration tasks, such as patching, user management, and system performance tuning.
Familiarity with securing infrastructure, including access control, encryption, and vulnerability management.
