Staff I Reliability Engineer - BlackLine
Los Angeles, CA 91367
About the Job
The Staff Site Reliability Engineer is responsible for assessing, testing, tracking, predicting, and reporting all related performance aspects of a suite of production applications from a performance, responsiveness, capacity, and availability perspective.
You'll Get To:
Multi-Environment:
- You will manage multiple environments (development, staging, production) across global regions. It is all about ensuring the systems stay consistent and reliable no matter where they are running.
- You will also build failover strategies and backups to handle anything we throw at them.
- Create and maintain a continuous testing framework that observes, records, and trends real-time availability data for all our clients.
- Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities with existing code to address application availability, performance, observability, efficiency, and security challenges.
- Serve as technical lead for large projects, determining objectives and approaches to critical assignments, and oversee multiple projects concurrently as needed.
- Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve.
CI/CD:
- You will lead on improving our CI pipelines using Jenkins, ensuring that code moves smoothly from development to production with no downtime.
- You will also work with Ansible, Nomad, and ArgoCD to streamline the deployment process across all environments.
- Work cross-functionally to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
Automation:
- We want to automate everything. You will help build the tools that make deploying infrastructure and code as hands-off as possible. Efficiency is vital; you will automate workflows to ensure we do not do anything manually.
- Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
Monitoring & Tuning:
- You will ensure our systems always perform at their best. Using tools under the New Relic stack, you will monitor everything and ensure we can quickly address any issues before they become real problems.
- Establish and maintain Key Performance Indicators for the overall health of the service and build tools to exercise and evaluate whether these KPIs are being met.
- Support integration of performance data into customer experience analytics tools and reporting.
Incident Response:
- When things go wrong, you will take charge of incident response. You will work through the chaos, solve the problem, and then figure out how we can avoid it happening again.
- Participate in our on-call rotation and conduct incident reviews (RCAs).
- Publish performance findings, conclusions, and recommendations.
Security & Compliance:
- You will ensure we are always secure and compliant with industry standards like PCI and GDPR. This includes monitoring security best practices and working with the team to ensure everything is locked down.
Lead by Example:
- We are looking for someone who is not only great at tech but also enjoys mentoring and leading others.
- You will work closely with junior engineers, helping them grow while setting operational excellence standards. Think of it as building a team of rock stars alongside you.
- Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers through mentorship and training, solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
- 7+ years working in SRE, DevOps, or a similar role, and you have managed global, multi-tenant environments before.
- You know your way around Terraform, Jenkins, and cloud platforms like GCP.
- Demonstrated history of developing or operating production web applications and solid understanding of HTTP(S), HTML, JavaScript, CSS, and XML.
- Advanced-level knowledge of IIS and Windows Server or Linux and Apache.
- Strong knowledge of TCP/IP and networking concepts, OSI networking layers.
- Proficiency with statistical concepts: confidence intervals, hypothesis testing, and sampling.
Leadership:
- You know how to lead a team and guide junior engineers, helping them grow while still being a hands-on contributor.
- Considerable experience in a lead role on a software development team.
- Baseline understanding of project management processes and procedures, with experience in Agile and Waterfall.
Tech Stack:
- You are comfortable with Ansible, Nomad, and ArgoCD, and you have built and maintained CI/CD pipelines in a production environment.
- Solid scripting skills in Python or Bash to automate anything you touch.
- Extensive knowledge of managing cloud platforms and cloud native tools.
Problem-Solver:
- When incidents happen, you stay calm, figure out what is wrong, and lead the charge to fix it. You do not just fix it; you ensure it does not happen again.
- Must possess the ability to handle multiple goals concurrently and function in a fast-paced, demanding, ever-changing high-growth environment.
- Understanding of operating system concepts such as CPU, memory, and disk queues, and of graphing and analyzing these over time.
Scalability:
- You have dealt with scaling systems before and know how to ensure that the infrastructure can grow with us without falling apart.
- Advanced-level knowledge of deploying and managing observability tools such as New Relic, Datadog, Prometheus, Grafana, and Jaeger.
- Ability to understand modern technologies quickly and adapt these into daily work and goals.
Security-First:
- You have experience working in highly regulated industries and know how to keep systems secure, all while ensuring compliance with PCI, GDPR, and other standards.
We’re Even More Excited If You Have:
- Experience with other cloud platforms like AWS or Azure.
- Knowledge of service mesh architectures (like Istio or Linkerd).
- Familiarity with Kubernetes operators.
- Strong networking skills across DNS, VPN, and load balancers.
- Experience optimizing databases like PostgreSQL, Cassandra, or MongoDB.
- A technology-based company with a sense of adventure and a vision for the future. Every door at BlackLine is open. Just bring your brains, your problem-solving skills, and be part of a winning team at the world's most trusted name in Finance Automation!
- A culture that is kind, open, and accepting. It's a place where people can embrace what makes them unique, and the mix of cultural backgrounds and varying interests cultivates diverse thought and perspectives.
- A culture where BlackLiners' continued growth and learning are empowered. BlackLine offers a wide variety of professional development seminars and inclusive affinity groups to celebrate and support our diversity.
BlackLine is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity or expression, race, ethnicity, age, religious creed, national origin, physical or mental disability, ancestry, color, marital status, sexual orientation, military or veteran status, status as a victim of domestic violence, sexual assault or stalking, medical condition, genetic information, or any other protected class or category recognized by applicable equal employment opportunity or other similar laws.
BlackLine recognizes that the ways we work and the workplace itself have shifted. We innovate in a workplace that optimizes a combination of virtual and in-person interactions to maximize collaboration and nurture our culture. Candidates who live within a reasonable commute to one of our offices will work in the office at least 2 days a week.
Salary Range: USD $147,000.00 - USD $196,000.00