For the past two years, I have been writing technical articles. Today, though, I want to share my experiences and my career journey towards becoming a Senior Site Reliability Engineer (SRE). If you dream of becoming an SRE, I hope this article helps you gain the skills you need.
My path towards becoming an SRE traces back to an essential internship experience. Although my academic focus at university was on Software Engineering practices, a significant shift occurred as I delved into the world of DevOps. This transition happened during my internship, which took place in 2019 in a DevOps-related team at a well-established company.
During those formative days, my efforts were directed to cloud-centric projects, with a primary emphasis on AWS cloud infrastructure. Beyond that, I learned about Docker containers and pipeline orchestration. Guided by my teammates, I started learning about this area from scratch and sharpened my skills and knowledge.
Upon completing my internship, I returned to the university determined to complete my degree in Software Engineering. Finishing that degree marked a turning point in 2021, when I rejoined the same company where I had done my internship. This time I joined as a dedicated Site Reliability Engineer.
Working at this company had always been a dream of mine. It has only been two years on this team, but I have learned a lot in that time. I would like to share what I learned with you so that you can explore this area (SRE) too.
First, I would like to share the tech stack that I mastered during this time.
Upon rejoining the company, I noticed the predominant use of Azure. While I had a bit of experience with AWS, Azure was uncharted territory for me. Determined to bridge this gap, I started aligning AWS concepts with their Azure counterparts. To solidify my understanding, I took the AZ-900 exam, relying on a comprehensive YouTube video for guidance. This exam not only deepened my grasp of Azure’s fundamentals but also cost me nothing, thanks to a free workshop.
While on the SRE team, my journey led me to embrace another pivotal technology: Terraform. Tasked with managing cloud resources through Infrastructure as Code (IaC), Terraform became an essential technology. Its capabilities allow seamless provisioning of resources, a skill I quickly realized was vital for my career progression.
A technology that truly took center stage was Kubernetes. Despite having prior knowledge of Docker, Kubernetes was new to me, so I started learning its concepts from scratch. The more I worked with it, the more my fondness for Kubernetes grew. I understood how useful it is for running microservice architectures. Scaling, self-healing, and load balancing are some of its most useful features.
In my pursuit of deepening my knowledge of Kubernetes, I embarked on a journey to obtain a professional certification. This path led me to the Certified Kubernetes Application Developer (CKAD) exam, which became a significant milestone in my quest for expertise in Kubernetes. The CKAD certification not only validated my skills but also opened doors to new opportunities, solidifying my commitment to mastering this powerful technology.
I got an opportunity to provision a full cloud deployment: Azure infrastructure using Terraform, CI/CD pipelines using Azure DevOps, and Kubernetes deployments using YAML manifests. I then concluded the project by adding monitoring and alerting. This project was very helpful because it let me cover all the technologies I need to be successful in my career.
Now, I’ll discuss some of the responsibilities I took on as an SRE within the team. I am sharing the following points because I was actively involved in them.
Maintaining uptime is the key responsibility of the team. To maintain high availability, we need to think about the infrastructure side of the deployment and about how updates and migrations get rolled out to the environment. Meeting the SLA is the most important thing we do as a team. Being on-call is an added responsibility for each individual.
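To make the SLA idea a bit more concrete, here is a minimal sketch of how an availability target translates into a downtime budget. The function name, the 30-day window, and the example targets are my own assumptions for illustration, not figures from our team.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    sla_percent is the availability target (for example 99.9), and
    period_days is the measurement window (assumed to be 30 days here).
    """
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be greater than 0 and at most 100")
    total_minutes = period_days * 24 * 60
    # The budget is the fraction of the window that may be unavailable.
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Illustrative targets only; real SLAs depend on the contract.
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {budget:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, which is why the way updates and migrations are rolled out matters so much.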
Keeping track of the cloud bill and optimizing cloud cost is one of the key things that an SRE does. Deleting unused or test-only cloud resources and applying reservations for cloud instances are some of the tasks that we carry out. In the context of Kubernetes, setting appropriate memory and CPU requests and limits for each deployment is another activity that we do. This helps eliminate underutilized VMs and allows us to choose the most fitting VM sizes for our deployments.
SREs face a significant amount of toil on a daily basis. Automating repetitive tasks is one of the main ways an SRE can reduce that toil.
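As a rough sketch of the right-sizing idea, the snippet below sums the CPU and memory requested by a set of workloads and picks the cheapest VM size that can hold them. The `VmSize` type, the size names, and the prices are all hypothetical, and the model is simplified: a real node also reserves capacity for the system, which is ignored here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class VmSize:
    name: str            # hypothetical label, not a real cloud SKU
    cpu_millicores: int  # total CPU capacity in millicores
    memory_mib: int      # total memory capacity in MiB
    hourly_cost: float   # made-up price, for illustration only


def total_requests(workloads: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Sum (cpu_millicores, memory_mib) pairs across all workloads."""
    cpu = sum(w[0] for w in workloads)
    mem = sum(w[1] for w in workloads)
    return cpu, mem


def cheapest_fit(cpu_m: int, mem_mib: int, sizes: List[VmSize]) -> Optional[VmSize]:
    """Return the cheapest size that covers the requests, or None if none fits."""
    fitting = [
        s for s in sizes
        if s.cpu_millicores >= cpu_m and s.memory_mib >= mem_mib
    ]
    return min(fitting, key=lambda s: s.hourly_cost, default=None)


if __name__ == "__main__":
    # Assumed per-deployment requests: (CPU in millicores, memory in MiB).
    workloads = [(250, 512), (500, 1024), (250, 256)]
    sizes = [
        VmSize("tiny", 1000, 1024, 0.05),
        VmSize("small", 2000, 4096, 0.10),
        VmSize("medium", 4000, 8192, 0.20),
    ]
    cpu, mem = total_requests(workloads)
    choice = cheapest_fit(cpu, mem, sizes)
    if choice is None:
        print("No single VM size fits these requests")
    else:
        cpu_util = 100 * cpu / choice.cpu_millicores
        print(f"{choice.name}: CPU requests use {cpu_util:.0f}% of capacity")
```

Comparing the summed requests against each candidate size makes it easier to spot VMs that are far larger than the workloads actually need.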
Now I would like to share some of the soft skills I gained over the course of my career.
Seeing a project through to the end on my own was one of my accomplishments. Self-management is one of the most important skills an SRE needs. While I was doing the cloud deployment project, I created an epic by myself and added the subtasks under it. This helped me a lot in tracking the progress of the project.
Communicating progress to the team leads is also vital for career growth. If a blocker persists for a long time, it is good practice to keep the leads informed about it.
Helping teammates is another soft skill worth developing. Helping others drastically improves our own knowledge as well. During my career here, I have helped several interns. While debugging the issues in their projects, I also improved my own debugging skills.
I hope I have shared the most valuable experiences from my career so far. This article might help you if you are also interested in the SRE path.
Cheers to another article 🥂