Site Reliability Engineer (SRE)-Observability & Incident Response (Mid-Level & Sr. Level)
Location: Virtual-Remote, District Of Columbia US
Requisition Number: 1985
Location: Remote - Multiple roles available
For a large financial services client, we are seeking Site Reliability Engineers (SREs) with expertise in the AWS cloud. In this role, your responsibilities span traditional IT and software development for a portfolio of one or more applications as you bridge the gap between developers and IT operations in a mature DevOps culture.
As an SRE, your responsibilities for your application portfolio are ever-expanding, ranging from observability and incident response to automation and software development that improves resiliency and application functionality.
The ideal candidate will come from a Java software engineering background and will have solid experience in AWS, error budgeting, reliability models, toil elimination, observability, and incident management.
- Work with architecture and development teams to create resilient and reliable architecture and fault-tolerant application design using performance engineering and chaos engineering principles.
- Develop and deploy tools and utilities to automate manual operational tasks in production.
- Analyze production utilization and incident patterns, identify improvement areas, and implement automation to improve productivity and avoid manual tasks and recurring incidents.
- Work with application stakeholders and define non-functional requirements covering performance, scalability, availability, resiliency and reliability including Service Level Objectives, Service Level Indicators and Error Budgets.
- Independently determine the needs of the customer while identifying and resolving conflicting or complementary needs across customer groups.
- Develop strategies to address non-functional requirements (NFRs) throughout the software or product development life cycle.
- Own incidents related to NFRs: update SOPs to capture the right set of metrics and logs for root cause analysis (RCA), perform RCA of incidents, identify solutions, and ensure permanent closure of the incidents.
- Apply advanced skill, knowledge, and experience to design and develop software solutions that meet customer needs.
- Use a process-driven approach to lead solution design.
- Implement new software technology and coordinate simultaneous implementation tasks across teams.
- May maintain or oversee the maintenance of existing software.
Required Experience and Qualifications:
- Bachelor’s degree and 4+ years of relevant professional experience.
- 3-4 years of experience in designing / implementing cloud applications using containerized, serverless microservices architecture in AWS services including Fargate
- 2 years of experience establishing Service Level Indicators and Objectives (SLIs & SLOs)
- 2 years of experience in building observability / monitoring dashboards and alerts in Splunk, Dynatrace, or CloudWatch
- 2 years of experience in disaster recovery planning and failover testing
- 2 years of experience in collaborating cross-functionally with other stakeholder groups
- Ability to work independently with minimal guidance
- Current or recent Site Reliability Engineer experience (2+ years)
- Experience with incident management and response (2+ years)
- Observability experience with Splunk, Dynatrace, or Datadog, including building dashboards (2+ years)
- Excellent verbal and written communication skills with experience presenting information and/or ideas to an audience in a way that is engaging and easy to understand
Additional Helpful Experience:
- AWS Certified Solutions Architect, AWS Certified SysOps Administrator, Splunk Certified Developer, Dynatrace, Sun Certified Java Programmer.
- Automation experience with Selenium, Blue Prism, or Ansible.
- Expertise in resiliency and in building fault-tolerant design patterns.
- Experience collaborating cross-functionally on availability / performance issues in order to identify root cause, determine areas for improvement, and drive those actions to closure through effective solutions.
- Skilled as a full stack developer with a focus on cross-platform optimization and responsiveness of applications.
- Working knowledge of at least one Unix operating system.
- Experience implementing resiliency design patterns using Hystrix, Resilience4j, a service mesh, or similar frameworks, and validating them with Chaos Monkey-style frameworks.
- Experience defining, measuring, and improving Reliability Metrics (SLO/SLI), Observability (Monitoring, Logging-Tracing solutions), Operations Processes (Incident, Problem Management), and Operations Toil Reduction through Automation.
- Experience designing, building, and implementing dashboards covering application and infrastructure health, using tools such as Splunk, Dynatrace, and Datadog, to provide a single-pane view of all critical business and operational information to relevant stakeholders.
- Certification as AWS Solutions Architect Associate or Developer Associate, Splunk Certified Developer, or Sun Certified Java Developer
- 3 years of experience in programming languages including Java or Python, and frameworks including Spring Boot / Spring Cloud
- 2 years of experience working with Agile methodologies, including Scrum and Kanban
- 2 years of experience working with code repositories such as Bitbucket / GitLab
- 2 years of experience implementing fault tolerant design patterns
- 2 years of experience in Failure Mode Effect Analysis (FMEA)
- 2 years of experience in Performance Testing
- 2 years of experience in Chaos Engineering / Testing
- 2 years of experience in production operations support, including incident response and root-cause analysis
- 2 years of experience in error budgeting
- 2 years of experience in value stream mapping and / or toil reduction
- 2 years of experience in operations process automation with tools such as Blue Prism / Selenium
- 2 years of experience validating non-functional requirements (NFRs)
- 2 years of experience working with DevOps / CI/CD pipeline tools such as Jenkins or UrbanCode Deploy (UCD)
The Oakleaf Group is a premier advisory firm with expertise in risk management and financial modeling for the mortgage and banking industries. We serve publicly traded and privately held banks and non-bank mortgage firms, government agencies, law firms, insurance companies, institutional asset managers and hedge funds. Founded in 2007, our firm has more than 100 professionals located in the Washington, DC and New York City metro areas, serving clients across North America and Europe.
We differentiate ourselves through our approach to client relationships. We begin with the belief that each client relationship will be permanent and ongoing, spanning across engagements. We invest in communication and research to ensure that we fully understand the drivers of every client’s short- and long-term success. We align our goals to those of our clients, and we continuously monitor and adjust to ensure that the relationship stays strong.
It’s on the foundation of strong client relationships and aligned objectives that we provide expertise-infused advisory services and technology-aware implementation assistance that drive client success.
As a condition of employment with The Oakleaf Group, any successful job applicant will be required to successfully complete a background investigation, which may also include a pre-employment drug screen and/or a credit check for positions in some areas of our business. The Oakleaf Group is an equal opportunity employer. Applicants are considered for positions without regard to race, religion, gender, native origin, age, disability, or any other category protected by applicable federal, state, or local laws.