Site Reliability Engineer (SRE)
Location: Virtual-Remote, District Of Columbia US
Requisition Number: 1899
UPDATE: We are now able to consider C2C contractors as well as full-time salaried candidates.
For a large financial services client, we are seeking Site Reliability Engineers (SREs) with expertise in AWS cloud and Java. In this role, you will support the entire development lifecycle to incorporate service reliability best practices and reduce downtime. The ideal candidate will come from a Java software engineering background and will have solid experience in AWS, error budgeting, reliability models, toil elimination, observability, and incident management.
Responsibilities:
- Independently determine the needs of the customer while identifying and resolving conflicting or complementary needs across customer groups.
- Work with application stakeholders and define non-functional requirements covering performance, scalability, availability, resiliency and reliability including Service Level Objectives, Service Level Indicators and Error Budgets.
- Develop strategies to address the non-functional requirements throughout the software or product development life cycle.
- Work with architecture and development teams in creating performant, highly resilient and reliable architecture and design using performance engineering & chaos engineering principles.
- Work with architecture and development teams in implementing resiliency constructs, building fault tolerance, and developing optimal code.
- Develop tools and utilities to automate manual operational tasks in production.
- Own incidents related to non-functional requirements (NFRs): update SOPs to capture the right set of metrics and logs for root cause analysis (RCA), perform root cause analysis of the incidents, identify solutions, and ensure permanent closure of the incidents.
- Analyze production utilization and incident patterns, identify improvement areas, and implement automation to improve productivity and avoid manual tasks and recurring incidents.
- Apply advanced skills, knowledge, and experience to design and develop software solutions that meet customer needs.
- Use a process-driven approach to lead design solutions.
- Implement new software technology and coordinate simultaneous implementation tasks across teams.
- May maintain or oversee the maintenance of existing software.
Must Have Requirements:
- 10+ years of relevant professional experience, including at least 4 years of experience as a full-stack Java software engineer
- Current or recent Site Reliability Engineer experience
- Strong professional experience with incident management and response
- Automation experience with Selenium, Blue Prism, or Ansible
- Observability experience with Splunk, Dynatrace, or Datadog, including building dashboards
- Strong AWS cloud experience
Additional Experience Requirements:
- Excellent verbal and written communication skills with experience presenting information and/or ideas to an audience in a way that is engaging and easy to understand.
- Experience collaborating cross-functionally on availability / performance issues in order to identify root cause, determine areas for improvement, and drive those actions to closure through effective solutions.
- Extensive knowledge of principles, advanced techniques, and theories to suggest and implement solutions on a specific project, program, or product.
- Influencing skills to include negotiation, persuasion of others, meeting facilitation, and conflict resolution.
- Adept at managing project plans, resources, and people to ensure successful project completion in an Agile / Scrum environment, facilitating the design and development of performance engineering and resiliency methodologies, and collaborating with engineering and product teams to implement shift-left techniques in test design and automation.
- Experience mentoring teams in the writing of Performance and Chaos Engineering strategies and scripts with a strong emphasis on automated deployment, infrastructure automation solutions, and continuous integration & delivery processes.
- Skilled as a full-stack developer with a focus on cross-platform optimization and responsiveness of applications.
- Experience working with at least one cloud platform (AWS, GCP, or Azure).
- Knowledge of cloud technologies and containerization using Docker and Kubernetes.
- Excellent understanding of, and demonstrated experience with, DevOps and CI/CD tools such as Jenkins and Jules, as well as automated deployment tools.
- Working knowledge of at least one Unix operating system.
- Automation experience with Blue Prism, Selenium, or Ansible playbooks, and with programming or scripting languages such as Java, Perl, Python, or PowerShell.
- Knowledge of performance tuning for enterprise-level Java/J2EE applications (web and application server configuration, JVM parameter tuning, GC and heap sizing, message brokers).
- Experience implementing resiliency design patterns using Hystrix, Resilience4j, a service mesh, or similar frameworks, and validating them with Chaos Monkey-style frameworks.
- Experience with performance engineering tools: monitoring tools, performance testing tools, and analysis tools.
- Experience troubleshooting performance, scalability, and availability issues in production environments.
- Skilled in cloud technologies and cloud computing to include Amazon Web Services (AWS) offerings, development, and networking platforms.
- Experience defining, measuring, and improving reliability metrics (SLOs/SLIs), observability (monitoring, logging, and tracing solutions), operations processes (incident and problem management), and operations toil reduction through automation.
- Experience designing, building, and implementing dashboards for application and infrastructure health using tools such as Splunk, Dynatrace, and Datadog, providing a single-pane view of all critical business and operational information to relevant stakeholders.
- Bachelor’s Degree or Equivalent
- Relevant certifications such as AWS Certified Solutions Architect, AWS Certified SysOps Administrator, Splunk Certified Developer, Dynatrace, Sun Certified Java Programmer, etc.
The Oakleaf Group is a premier advisory firm with expertise in risk management and financial modeling for the mortgage and banking industries. We serve publicly traded and privately held banks and non-bank mortgage firms, government agencies, law firms, insurance companies, institutional asset managers and hedge funds. Founded in 2007, our firm’s over 100 professionals are located in the Washington, DC and New York City metro areas, serving clients across North America and Europe.
We differentiate ourselves through our approach to client relationships. We begin with the belief that each client relationship will be permanent and ongoing, spanning engagements. We invest in communication and research to ensure that we fully understand the drivers of every client’s short- and long-term success. We align our goals to those of our clients, and we continuously monitor and adjust to ensure that the relationship stays strong.
It’s on the foundation of strong client relationships and aligned objectives that we provide expertise-infused advisory services and technology-aware implementation assistance that drive client success.
As a condition of employment with The Oakleaf Group, any successful job applicant will be required to successfully complete a background investigation, which may also include a pre-employment drug screen and/or a credit check for positions in some areas of our business. The Oakleaf Group is an equal opportunity employer. Applicants are considered for positions without regard to race, religion, gender, native origin, age, disability, or any other category protected by applicable federal, state, or local laws.