SRE Strategic Lead

Location: Virtual-Remote, District Of Columbia US


This position is no longer open.

Requisition Number: 1940

For a large financial services client, we are seeking a hands-on thought leader to help build the enterprise Site Reliability Engineering (SRE) program. SREs draw on their extensive Java, AWS, automation, and incident response expertise to help developers build better software, with a heavy focus on reliability and reducing toil. In this role you will coach and develop the team, develop strategy and vision, and advocate for the initiative to leadership and key stakeholders. Prior experience launching and building an SRE function is critical to this role.


  • Partner with existing SRE leadership team to develop strategy, mission, and vision for the enterprise function. Advise and advocate for the program to senior leadership and key stakeholders.
  • Mentor SRE team to develop their expertise across the five pillars.
  • Lead an auditing initiative to evaluate opportunities for SRE support across the enterprise.
  • Lead training initiatives for SREs, developers, and stakeholders to educate on SRE principles and vision.
  • Independently determine the needs of the customer while identifying and resolving conflicting or complementary needs across customer groups.
  • Work with application stakeholders to define non-functional requirements (NFRs) covering performance, scalability, availability, resiliency, and reliability, including Service Level Objectives, Service Level Indicators, and Error Budgets.
  • Develop strategies to address the non-functional requirements throughout the software or product development life cycle.
  • Work with architecture and development teams to create performant, highly resilient, and reliable architecture and design using performance engineering and chaos engineering principles.
  • Work with architecture and development teams to implement resiliency constructs, build fault tolerance, and develop optimal code.
  • Own incidents related to NFRs: update SOPs to capture the right set of metrics and logs for root cause analysis (RCA), perform RCA on the incidents, identify solutions, and ensure permanent closure of the incidents.
  • Analyze production utilization and incident patterns, identify improvement areas, and implement automation to improve productivity and avoid manual tasks and recurring incidents.
  • Apply advanced skill, knowledge, and experience to design and develop software solutions that meet customer needs.
  • Use a process-driven approach to leading design solutions.
  • Implement new software technology and coordinate simultaneous implementation tasks across teams.

Must Have Requirements:

  • 10+ years of relevant professional experience, including at least 4 years of experience as a full stack Java software engineer.
  • Experience building or leading an SRE function, building the SRE strategy, and advocating for the solution to leadership.
  • SRE thought leadership experience: white papers, conference speaking engagements, published articles, etc.
  • Extensive knowledge of principles, advanced techniques, and theories to suggest and implement solutions on a specific project, program, or product.
  • Influencing skills to include negotiation, persuasion of others, meeting facilitation, and conflict resolution.
  • Must have strong professional experience with automation (Selenium, Blue Prism, Ansible) and incident response.
  • Prefer observability experience with Splunk, Dynatrace, or Datadog, including building dashboards.
  • Prefer strong AWS cloud experience.

Additional Experience Requirements:

  • Excellent verbal and written communication skills with experience presenting information and/or ideas to an audience in a way that is engaging and easy to understand.
  • Expertise with resiliency and fault-tolerant design patterns.
  • Experience collaborating cross-functionally on availability / performance issues to identify root cause, determine areas for improvement, and drive those actions to closure through effective solutions.
  • Adept at managing project plans, resources, and people to ensure successful project completion in an Agile / Scrum environment, and at facilitating the design and development of performance engineering and resiliency methodologies by collaborating with engineering and product teams to implement shift-left techniques in test design and automation.
  • Experience mentoring teams in the writing of Performance and Chaos Engineering strategies and scripts with a strong emphasis on automated deployment, infrastructure automation solutions, and continuous integration & delivery processes.
  • Skilled as a full stack developer with a focus on cross-platform optimization and responsiveness of applications.
  • Strong understanding and knowledge of Java/J2EE technologies and frameworks: UI/JavaScript frameworks, Spring Boot/Spring Cloud, REST, microservices, and server-side frameworks.
  • Experience working with at least one cloud platform (AWS, GCP, or Azure).
  • Knowledge of cloud technologies and containerization using Docker and Kubernetes.
  • Excellent understanding of and demonstrated experience with DevOps/CI/CD tools such as Jenkins and Jules, and with automated deployment tools.
  • Working knowledge of at least one Unix operating system.
  • Knowledge of performance tuning for enterprise-level Java/J2EE applications (web and application server configuration, JVM parameter tuning, GC and heap size, message broker).
  • Experience implementing resiliency design patterns using Hystrix, Resilience4j, a service mesh, or similar frameworks, and validating them using Chaos Monkey-type frameworks.
  • Experience with performance engineering tools: monitoring tools, performance testing tools, and analysis tools.
  • Experience troubleshooting performance, scalability, and availability issues in production environments.
  • Skilled in cloud technologies and cloud computing to include Amazon Web Services (AWS) offerings, development, and networking platforms.
  • Experience defining, measuring, and improving reliability metrics (SLO/SLI), observability (monitoring, logging, and tracing solutions), operations processes (incident and problem management), and operations toil reduction through automation.
  • Experience designing, building, and implementing dashboards covering application and infrastructure health using tools such as Splunk, Dynatrace, and Datadog to provide a single-pane view of all critical business and operational information to relevant stakeholders.

Desired Experience:

  • Bachelor’s Degree or Equivalent
  • Relevant certifications such as AWS Certified Solutions Architect, AWS Certified SysOps Administrator, Splunk Certified Developer, Dynatrace, Sun Certified Java Programmer, etc.

Company Profile:

The Oakleaf Group is a premier advisory firm with expertise in risk management and financial modeling for the mortgage and banking industries. We serve publicly traded and privately held banks and non-bank mortgage firms, government agencies, law firms, insurance companies, institutional asset managers, and hedge funds. Founded in 2007, our firm has more than 100 professionals located in the Washington, DC and New York City metro areas, serving clients across North America and Europe.

We differentiate ourselves through our approach to client relationships. We begin with the belief that each client relationship will be permanent and ongoing, spanning across engagements. We invest in communication and research to ensure that we fully understand the drivers of every client’s short- and long-term success. We align our goals to those of our clients, and we continuously monitor and adjust to ensure that the relationship stays strong.

It’s on the foundation of strong client relationships and aligned objectives that we provide expertise-infused advisory services and technology-aware implementation assistance that drive client success.

As a condition of employment with The Oakleaf Group, any successful job applicant will be required to successfully complete a background investigation, which may also include a pre-employment drug screen and/or a credit check for positions in some areas of our business. The Oakleaf Group is an equal opportunity employer. Applicants are considered for positions without regard to race, religion, gender, native origin, age, disability, or any other category protected by applicable federal, state, or local laws.

7315 Wisconsin Avenue, East Tower 10th Floor
Bethesda, MD 20814
P. (202) 684-2800 • F. (202) 684-2803

© 2018 The Oakleaf Group, All Rights Reserved.