With the evolution of distributed architectures, we see many clients who have taken the idea of autonomy to something that is closer to anarchy. Understandably, companies are trying to replicate the level of innovation that companies like Amazon and Google have achieved at scale, seemingly without having to sacrifice overall organizational and team-level efficiency.
The fundamental problem is usually that while organizations drive towards maximizing team-level autonomy as part of driving innovation, they often fail to balance autonomy with an optimal level of consistency in engineering processes and in the technologies employed to solve common problems.
This lack of any standardization significantly adds to cost and complexity as each engineering team often is solving the same problem with completely different approaches and tools.
This can look like having 4-5 different databases in use for the same basic use case (e.g. relational databases for transaction processing) as part of the product architecture or using completely different tools and automation for continuous integration and continuous deployment (CI/CD).
The impacts to the business are multi-fold:
- Complexity to operate the application is higher because how the application functions ‘in concert’ is less well understood as there will be more disparate components that are only understood in silos. This results in more customer impacting incidents (outages and degradations) and increases time to restore service.
- Time to Market (TTM) is longer than optimal because more issues arise in integrating product components together due to the lack of common underlying approaches and architecture.
- Cost increases along two dimensions as more third-party technologies are purchased and engineers do the same work multiple times to create solutions that are used locally.
As companies drive towards the holy grail of scrum team level autonomy and fast TTM, balanced with efficiency, one approach is to create Centers of Excellence focused on areas where you want to deliberately drive a level of standardization across multiple teams.
What is a Center of Excellence (COE)?
Wikipedia describes the basic COE idea as follows:
“A center of excellence (COE or CoE), also called excellence center, is a team, a shared facility or an entity that provides leadership, best practices, research support or training for a focus area.”
In Engineering organizations these types of shared capabilities focused teams can be applied to many areas, some of the most common include:
- DevOps including Continuous Integration/Continuous Deployment (CI/CD) practices - these are the tools and automation used to move code from an individual developer’s environment through quality and security gates and deploy it to production. In most organizations it makes sense to make CI/CD as common as is practically possible across engineering teams by selecting tools and creating frameworks to make it easy for new services to ‘onboard’ onto the CI/CD platform. The central portion of this common capability may also manage pre-production testing environments.
- Common architecture – ideally most architecture work happens at the scrum team or pod level. However, it is critical in scaled organizations to create guidelines for architecture approaches and choices. Examples include:
  - “Choose from one of these two options for your relational database needs and these other two for document databases”
  - “Use a specific application logging format and xyz tool”
  - “Service to Service calls must go through this API Gateway”
- Automated testing – Although quality engineering practices should be focused at the scrum team level, there is significant opportunity for cost, quality and efficiency improvements by:
  - Specifying tools to use (most commonly a choice of 1-3 automation frameworks for different parts of the application)
  - Creating frameworks that accelerate the actual use of automation at the team level
  - Standardizing quality practices and metrics where appropriate across teams
- Security Architecture – although secure coding should – again – be primarily implemented at the scrum team level, there is often a need for Engineering security experts to consult with teams when they are making significant changes and enhancements to the application. New significant functionality often drives new security risks and challenges.
- Site Reliability Engineering (SRE) - SRE in a company is generally focused on the tools and processes used to manage a product in production, including monitoring tools, incident and problem management processes etc.
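To make the architecture guideline examples above concrete, here is a minimal sketch of how a COE might turn a "choose from these options" rule into an automated check. Everything in it is an assumption for illustration: the `APPROVED` catalog, the category names, the product names (placeholders, not recommendations) and the idea that each service declares its choices in a simple manifest.

```python
# Hypothetical guideline catalog: category -> approved options.
# The product names are placeholders, not recommendations.
APPROVED = {
    "relational_db": {"postgresql", "mysql"},
    "document_db": {"mongodb", "couchbase"},
    "log_format": {"json"},
}


def check_service(manifest: dict[str, str]) -> list[str]:
    """Return one human-readable violation per choice outside the approved set."""
    violations = []
    for category, choice in manifest.items():
        allowed = APPROVED.get(category)
        if allowed is None:
            # No guideline exists for this category, so the team decides.
            continue
        if choice.lower() not in allowed:
            violations.append(
                f"{category}: '{choice}' is not approved (choose from {sorted(allowed)})"
            )
    return violations


if __name__ == "__main__":
    # Hypothetical manifest for a checkout service.
    checkout = {"relational_db": "oracle", "log_format": "json", "cache": "redis"}
    for line in check_service(checkout):
        print(line)
```

Note that categories without a guideline pass through untouched. That is deliberate: the point is to standardize only where you have chosen to, and leave the remaining decisions with the team.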
One very important note on the COE candidates listed above: there should always be direct alignment to the Engineering scrum teams. More to come on this, but in short, ideally there are experts in the above capabilities that are either embedded directly into the Engineering teams OR there are COE team members who are aligned/matrixed to 1-2 engineering teams on an ongoing basis.
Other characteristics for COE team members include:
- Being experts in their field. A scaled COE may bring in people newer to the field who are passionate about the area and becoming proficient/expert, but initially the COE must be seeded with members that know the subject matter well.
- Being passionate about continuously learning and researching developments in their area.
- Having a customer service orientation, which means they are genuinely interested in problem solving at the team level and are not wedded to very specific approaches or tools. NO Ivory Towers!
Why are COEs important?
COEs become more important with the acceleration towards distributed (à la ‘microservices’) architectures. As previously mentioned, a common misstep is for companies to confuse the idea of team autonomy with anarchy. Without a COE approach, each team WILL create its own solution for the same problem, adding cost and complexity.
In addition, COEs are important for new capabilities that will just not be well understood by most engineers for a period of time (usually on the order of years). In these cases, reinventing the wheel for each capability at the team level is even more risky as the ‘non-experts’ are likely to design and build a suboptimal implementation of the newer technology – and this will happen for every team!
Lastly, COEs also support continuous improvement across the Engineering organization by developing outcome-based metrics that guide the approach in a given area (e.g. quality practices, DevOps). Standardized metrics assist in gauging team-level maturity and therefore help focus improvement efforts across the organization.
How do you optimize the setup of a COE?
In order to define the scope of a COE, it’s important to put the COE into the context of the overall organization. Creating a vernacular to describe ALL types of teams in your organization helps bring clarity to the role of the COE within the broader team. The popular book Team Topologies defines the types of teams within most organizations and is commonly used throughout the industry, so it is a great place to start.
Following are the four types of groups that Team Topologies describes:
- Stream-aligned team: aligned to a flow of work from (usually) a segment of the business domain
- Enabling team: helps a Stream-aligned team to overcome obstacles. Also detects missing capabilities.
- Complicated Subsystem team: where significant mathematics/calculation/technical expertise is needed.
- Platform team: a grouping of other team types that provide a compelling internal product to accelerate delivery by Stream-aligned teams
How do those four types of teams align to the idea of a COE? In short, COEs are most closely associated with the Enabling team concept BUT many COEs also create frameworks and building blocks, i.e. Platforms, that are used by all Engineering scrum/pod teams as well. Within a typical Engineering organization, using the Team Topologies vernacular, most of the scrum/pod teams are Stream-aligned to domains of the product (e.g. Login, Checkout, Product Search), while some may also be Platform teams who work on foundational capabilities. For example, a Platform team that owns and manages an API gateway that is used by all Stream-aligned teams may still use capabilities from multiple COEs, such as a Quality Engineering COE for test automation frameworks and a DevOps COE for the standard CI/CD platform.
This leads to our recommendation to name these COE teams according to their main area of focus and/or use “Enabling” (or a synonym) in the team name. The reason is that Center of Excellence is a more antiquated industry term and often evokes a picture of an “Ivory Tower” team.
A well-intentioned COE team devolves into an Ivory Tower when:
- The COE team members are out of touch with what is actually happening/needed at the engineering scrum team level
  - This happens when the COE team members are not working directly and hands-on (doing real work!) with the Engineering teams they are building for on a daily basis
  - When this happens, the COE ends up building frameworks/tools that no one uses or wants to use
- Having too much separation of the COE team from the Engineering teams the COE is intended to “enable” also increases the level of Affective (bad) conflict between the COE and Scrum teams as it makes it less likely the COE understands the needs in detail.
Therefore, we strongly recommend that specialists for all COE areas are also embedded in the engineering scrum teams (in addition to an appropriately sized, usually small, central Enabling group). This means that there is a dedicated expert from the COE that is part of the scrum team 100% of the time. Their direct line reporting relationship may even be to the scrum team/Eng org with a dotted line back to the COE they represent. This approach is critical to success as it achieves three important outcomes:
- Reduces/removes risk of “Ivory Tower Syndrome” as the specialists are fully immersed and hands on in their engineering team.
- Improves speed of implementation and innovation as the learning cycle is faster with experts working side by side with each Engineering team
- Ensures truly shared ownership of approaches and tools between the central portion of the COE and the COE experts embedded in engineering teams.
If individual engineering teams don’t have enough work to justify a dedicated specialist in a COE area, an expert can be shared between 2 (or more) engineering teams. What is most important is that the expert is dedicated to a small number of specific teams for a durable period of time (we recommend at least 2 years) so that they become experts in a domain of the product and are fully part of the scrum team. In many organizations, it also works well to rotate central COE team members to scrum teams and back to the central portion of the organization. This helps ensure all members of the COE understand the challenges at the scrum team level.
As a capability improves in a given area, normally the size of the central portion of the COE can shrink. For example, when setting up a Quality Engineering COE, once tools and frameworks are selected and created, there will likely be less demand from engineering teams as they can work with the tools provided in a steady-state of quality practices. At that point, the COE can continue to improve and expand their foundational frameworks, research and experiment with emerging tools and approaches, etc. but they are likely done with the initial heavy lifting.
In addition to the size of the COE, organizational maturity in a given COE discipline (e.g. DevOps, DB engineering etc.) is a consideration in the ideal modes of interaction and collaboration between the central portion of the COE and the scrum teams. Again, Team Topologies does an excellent job of defining interaction models:
- Collaboration: working together for a defined period of time to discover new things (APIs, practices, technologies, etc.). This is most common when the teams are exploring areas new to the organization or innovations in an established area.
- X-as-a-Service: one team provides and one team consumes something “as a Service”
- Facilitation: one team helps and mentors another team
Normally, the communication and collaboration modes involve a combination of the approaches above. X-as-a-Service becomes more common as an organization’s capabilities improve, because scrum teams are more capable of using self-service, well-documented tools provided by a central COE.
The important points in understanding interaction models between the COE and the scrum teams it ‘serves’ are as follows:
- Be deliberate about what mode(s) of interaction are optimal based on your needs and maturity in a given area. It will most certainly shift over time.
- Plan your roadmap of what the COE provides/builds based on your maturity.
- Monitor to see if adjustments are needed. For example, if you think that X-as-a-Service is optimal for your needs, but in monitoring Slack you see a LOT of communication between the COE and the Scrum teams, then the COE likely has not provided true as-a-Service tools yet. The tools may not be complete for basic needs or the documentation may be lacking or some combination thereof.
What is most important is to think about what interaction models will work best, monitor over time and adjust as needed.
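As a rough illustration of the monitoring idea above, the sketch below flags teams that ask the COE an unusually high number of questions in any one week. It assumes you have already exported support questions into `(team, week)` pairs; how you get that data out of your chat tool, the function name and the threshold are all hypothetical and would need tuning for your organization.

```python
from collections import Counter


def teams_needing_attention(messages, threshold):
    """Flag teams whose questions to the COE exceeded `threshold` in any week.

    `messages` is an iterable of (team, week) tuples, one per support question.
    """
    per_team_week = Counter(messages)
    flagged = {team for (team, _week), n in per_team_week.items() if n > threshold}
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical sample: checkout asked 12 questions in week 1.
    sample = [("checkout", 1)] * 12 + [("search", 1)] * 3 + [("login", 2)] * 2
    print(teams_needing_attention(sample, threshold=10))
```

A flagged team is a prompt for a conversation, not a verdict. High volume may mean the tooling or documentation has gaps, or simply that the team is in a Collaboration phase where frequent contact is exactly what you want.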
How do I know if the COE is effective?
Effectiveness of the COE should be gauged by quantitative and qualitative measures. For example, if you are creating a COE to help drive improvements in the DevOps area with the objective of decreasing both effort and time to release code to production, then TTM and the time to commit a change to the main line of code and run automated tests are good metrics to consider.
As part of setting objective metrics aligned to your desired outcomes, it’s critical to make sure metrics are shared and aligned between the central COE and the scrum teams. For example, if the primary objective is improving TTM WITH a specified level of quality, then the metrics for BOTH the COE and Engineering scrum teams should reflect both TTM AND quality (e.g. number and impact of customer incidents).
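One way to keep speed and quality side by side is a single scorecard that both the COE and the scrum teams are measured against. The sketch below is only an illustration under stated assumptions: the `Release` record and its fields are hypothetical, and commit-to-production hours is used as a simple stand-in for TTM.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Release:
    team: str
    committed: datetime  # change merged to the main line
    deployed: datetime   # change live in production
    incidents: int       # customer-impacting incidents attributed to the release


def scorecard(releases):
    """Per team: median commit-to-production hours and total incidents."""
    by_team = {}
    for r in releases:
        by_team.setdefault(r.team, []).append(r)
    return {
        team: {
            "median_lead_time_hours": median(
                (r.deployed - r.committed).total_seconds() / 3600 for r in rs
            ),
            "incidents": sum(r.incidents for r in rs),
        }
        for team, rs in by_team.items()
    }
```

Because both numbers sit in the same record, neither group can look good by improving one at the expense of the other.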
As previously mentioned, you can also monitor more qualitative aspects of the COE by looking at the amount and type of communication between the central portion of the COE and the embedded COE experts within teams as well as the rest of the engineering scrum teams. It’s also useful to survey the organization periodically to understand what is working/not working well.
In summary, the COE concept is a powerful tool to gain leverage across your organization, but like all approaches to organizing your teams – it also has risks and common pitfalls. We work with Engineering teams regularly to tune and optimize their organizations. Give us a call and we can help!