November 1, 2018 | Posted By: Pete Ferguson
Eating your own dog food is a common phrase that is cynical from the start – unless you like eating dog food! A more positive, but often overused cliche, is “Be the Customer.” Regardless of how you want to phrase it, the goal is to create solutions that win your customers (which by the way hopefully include your engineers!) over.
We recently had an opportunity to walk in the shoes of one of our clients’ customers, and it was painfully obvious within our first minute that user input to the ultimate design of the product was not considered. The methodology for feedback involved Post-it notes and paper forms as opposed to a simple feedback button on the application appliance. The end users were frustrated, and recurring problems required creative manipulation by the person closest to the customer while software developers were insulated from valuable input.
The rise and fall of companies from top dog to B player (or worse) occurs at an ever quickening pace. Some companies get a second chance, but it takes a lot of effort to get even close to catching up. The best scenario of course is to never lose the number one spot, and the key to staying ahead lies in understanding your customers and innovating and providing appealing solutions for them to be successful.
Tenure, stock options, and other “Golden Handcuffs” are meant for retention, but can often backfire into lulling employees into a comfortable complacency.
So, how do you combat complacency or customer disservice? Many companies take creative routes to hold hackathons and other contests to improve user experience. The best companies allow their customers to vote on which products should be prioritized in the pipeline. But the most effective path to success is to ensure an open dialogue between engineers and those on the front lines using your products.
When I was at eBay there was an issue that had been dogging (no pun intended) customers for a long time and was frustrating customer service agents to no end. John Donahoe, then CEO, was visiting the customer service location in Utah and at a lunch Q&A the frustration came out. John had someone contact the California-based engineers responsible during the luncheon and arranged to have them all fly to the CS center the next morning if not that evening. It was communicated to John that the flights home for the engineers at the end of the week were sold out so he rearranged his travel and took them back on the corporate jet.
I was a driver to get them to the executive terminal at the airport. John raced out of the car and stood at the foot of the red carpet to salute and shake the engineers’ hands after a few grueling days of eating their dog food by sitting on customer calls and meeting with very frustrated agents.
The message was clear to all involved, “no more finger pointing” – engineers were tied at the hip with customer service reps on the front lines with the customers.
Avoid complacency by ensuring your developers and product managers walk in the customer’s shoes and hear from customers regularly. The better informed they are, the better the solutions they will develop.
Redefine the Definition of “Done”
An important aspect of incorporating customers into the development process is an OKRs (Objectives [at AKF we prefer “Outcomes”] and Key Results) focus. What is the desired customer behavior for new features, fixing existing features, etc.? It’s the big “so what” question that needs to be asked often.
If you are trying to increase customer engagement by 10% for a new product or service, then the project is not “done” because of a code release for a new product or features – it is “done” when customer engagement is increased by 10%! So that is when you have the party, not when code is released.
Word usage may seem trivial, but it is our experience that clarity must be uniform and consistent with actions. Team members pay attention to the little things and make the correlations between the words a CEO speaks at an All Hands and what behaviors are observed day in and day out. Having goals around work performed instead of changed customer behavior will likely result in a lot of code being released with very little “so what” for your customers which provides space for an upstart or competitor to edge their way in. If AOL, WebCrawler, Yahoo, AskJeeves, or Excite had provided great consistent search for their customers, Google wouldn’t have been able to take over and dominate. And if Google doesn’t continue to provide great search, the next “Google” will find a space to wiggle in and dominate in the future.
Customers have a balance of wanting their current needs filled with a look to the future. Be clear on what “done” means for your customers on each project and hold off celebrating success – not just effort exerted – until your customers can see that you are done.
Stay Focused on the Bigger Picture
The now infamous quote from Henry Ford is “If I had asked people what they wanted, they would have said ‘faster horses.’”
Often we see teams getting microfocused on a fringe case and creating solutions for the minority of end users. While your product will demonstrate value for a few, you may miss the boat for the majority of your customers and they may move on.
I recall touring a facility in the US for a security software integrator. I walked by a set of cubicles with hundreds of security cameras set up and asked what they were doing. I was told the team of 30+ engineers were testing each camera to ensure it works with their software.
Meanwhile their software was not capable of two-factor authentication, which was fast becoming a major blindside for their organization. If making sure every brand of camera worked was really a profit center, they should have outsourced it offshore for the same result for a fraction of the cost and put their top engineers on something with customer value and profitability. I doubt supporting hundreds of cameras was a major differentiator - certainly not enough to tie up top-paid engineers. As the consumer, if I knew they supported 30-60 cameras perfectly, I’d be fine with picking one from the list. Lack of two-factor authentication was causing major roadblocks for my team in getting infosec approval to continue using their software.
It is important to step back often and look at where teams are exerting the most effort and to ask the simple question “why are we doing this?” If your answer is “because we’ve always done it that way” you are likely not maximizing customer value. If your reasoning aligns with how to maximize customer value for longer term value, then you are on track.
Stay Two Steps Ahead
One of my first jobs was as an apprentice for a building contractor. Chris had a very small crew and specialized in high-end home remodeling and additions. My first day on the job he said “see that pile of trash?” Yes, I replied. “See those dumpsters out that window?” Yes, I said again. “Stop standing around and get to work!” Over the summer I learned that when we showed up on a new job site the saws and compressor and hoses and nail guns needed to be set up as soon as his truck slowed to a stop. My job was to anticipate what we would be doing next and make sure the right equipment was set up and ready to go before we needed it.
Steve Jobs’s famous modernization of Henry Ford’s faster horses quote: “It’s really hard to design products by focus groups. A lot of times, people don’t know what they want until you show it to them.”
Much to the complaint of many customers, Jobs dragged everyone into USB with the original iMac (while eliminating the floppy disk) and Apple has continued to drag customers into faster adoption of Bluetooth, SSDs, and now USB-C. For the most part, the gambles have paid off and customers adopt and enjoy single cable interfaces, faster transfer speeds, etc. (and Apple generates a lot of revenue on dongles for themselves and others …).
Agile is all about quickly adapting and changing. Similarly, as you look at where your customers are today, don’t lose sight of where you need to take them tomorrow. Provide the vision two steps ahead of where you are today.
Look, it is easy – and entertaining – to get lost in pet projects that provide a challenge and are good career development, but if they aren’t providing customer value, save these projects for when there is overflow time. In any large corporation it can be easy to lose sight of customer and end user needs and get lost in endless meetings, pet projects, and other seemingly urgent, but not important, activities.
The main thing is to keep the main thing the main thing … and your customers/end users are THE Main Thing! Agile development provides the ongoing opportunity to comb the backlog and prioritize projects that have the most customer (shareholder, and hopefully employee) value. So make sure the efforts of your team and your organization are laser focused on what will immediately provide the maximum customer value. This will provide the needed profits to hire more staff and provide room to add in non-functional requirements and R&D projects as part of the larger, ongoing development process.
For a good laugh, visit demotivators.com.
October 12, 2018 | Posted By: Bill Armelin
Understanding Technical Debt
During the course of our client engagements, there are a few common topics or themes that are always discussed, and the clients themselves usually introduce them. One such area is technical debt. Every team has it, every team believes they have too much of it, and every team struggles to explain why it’s important to address it.
Let’s start by defining what technical debt means. It is the difference between doing something the “desired” or “best” way and doing something quickly (i.e. reduce time to market). The difference results in the company taking on “debt” within the solution. Technical debt requires acting with forethought. In other words, you only assume technical debt knowingly and with commission. Acts of omission (forgetting to plan or do something) do not count as debt. Our partners in business may think we are hiding the truth if we do not clearly delineate the difference between debt (known assumptions) and mistakes, failures or other issues related to maintenance.
The following list provides examples of things that are not tech debt:
- Software defects (unless we decide to NOT fix them for an extended period of time – but defects are still human failures – not debt.)
- Failures in design that are not previously tagged as debt.
- Failures to identify scalability bottlenecks.
- Poor choices in technology components that fail to scale.
- Failure to properly identify infrastructure failures, or high failure rates of vendors in infrastructure.
A Financial Analogy for Tech Debt
When you hear the words “technical debt”, it invokes a negative connotation. However, the judicious use of tech debt is a valuable addition to your product development process. Tech debt is analogous to financial debt. Companies can raise capital to grow their business by either issuing equity or issuing debt. Issuing equity means giving up a percentage of ownership in the company and dilutes current shareholder value. Issuing debt requires the payment of interest but does not give up ownership or dilute shareholder value. Issuing debt is good, until you can’t service it. Once you have too much debt and cannot pay the interest, you are in trouble.
Tech debt operates in the same manner. Companies use tech debt to defer performing work on a product. As we develop our minimum viable product, we build a prototype, gather feedback from the market, and iterate. The parts of the product that didn’t meet the definition of minimum or the decisions/shortcuts made during development represent the tech debt that was taken on to get to the MVP. This is the debt that we must service in later iterations. In fact, our definition of done must include the servicing of the resulting tech debt. Taking on tech debt early can pay big dividends by getting your product to market faster. However, like financial debt, you must service the interest. If you don’t, you will begin to see scalability and availability issues. At that point, refactoring the debt becomes more difficult and time critical. It begins to affect your customers’ experience.
Many development teams have a hard time convincing leadership that technical debt is a worthy use of their time. Why spend time refactoring something that already “works” when you could use that time to build new features customers and markets are demanding now? The danger with this philosophy is that by the time technical debt manifests itself into a noticeable customer problem, it’s often too late to address it without a major undertaking. It’s akin to not having a disaster recovery plan when a major availability outage strikes. To get the business on-board, you must make the case using language business leaders understand – again this is often financial in nature. Be clear about the cost of such efforts and quantify the business value they will bring by calculating their ROI. Demonstrate the cost avoidance that is achieved by addressing critical debt sooner rather than later - calculate how much cost would be in the future if the debt is not addressed now. The best practice is to get leadership to agree and commit to a certain percentage of development time that can be allocated to addressing technical debt on an on-going basis. If they do, it’s important not to abuse this responsibility. Do not let engineers alone determine what technical debt should be paid down and at what rate – it must have true business value that is greater than or equal to spending that time on other activities.
Just as with debt that a company assumes, in and of itself, technical debt is not bad. It can be looked at as a leveraging tool to optimize the technology resources in the short term - delaying a hardware tech refresh or the release date for HTML 5. Delaying attention to address technical issues allows greater resources to be focused on higher priority endeavors. The absence of technical debt probably means missed business opportunities. Use technical debt as a tool to best meet the needs of the business. However, excessive technical debt will cause availability and scalability issues, and can choke business innovation (too much engineering time dealing with debt rather than focusing on the product).
Develop a technology balance sheet and profit and loss (income) statement to discuss tech debt with the business in a manner they understand – finance. Let’s first look at the balance sheet, where Assets = Liabilities + Equity. Our assets are the engineering time spent creating the product. Liabilities are the principal of the tech debt (i.e. the difference between “desired” and “actual”). Equity is the remainder, or the engineering resources spent creating the product while not contributing to tech debt.
Here is an example of a technology balance sheet:
To further the financial analogy, we need to have a technology P&L statement. Here, the interest on tech debt is the difficulty or increased level of effort in modifying something in subsequent releases. This manifests as a reduction in developer productivity per value created. The more debt you take on, or the less principal you pay down, the higher your interest payment becomes, and with it the cost to the organization.
Dedicating resources on an ongoing basis to service technical debt can be a challenging discussion with the business. Resources are always limited and employing them in the manner which best benefits the business is a critical business priority decision. Similar to the notion of debt within business, you should never take on technical debt without a plan to pay the interest (increased future cost of development) and principal (fixing the difference between appropriate and as-is). Relating technical debt to financial debt can help those outside of your technology organization grasp the concept and understand the need to keep technical debt under control.
One way to make the concept of debt real is to estimate, for any debt item, the amount of “interest” one will need to pay in the future to modify the solution in question.
- For the benefit of time to market, you decide to “hard code” a number of “display strings” that you’d rather set aside in a resource file to modify and translate later.
- You save 2 weeks of development time, creating a 2-week liability on your balance sheet. You have a 2-week principal to fix.
- You estimate that for all future string modifications (or translations) it will take an additional day of development. Your interest is 1 day, payable for each modification.
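The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the `DebtItem` class and its field names are hypothetical, and the 5-day working week is our assumption for converting the 2-week principal into days.

```python
from dataclasses import dataclass

# Assumption: a 5-day working week, used only to convert weeks into days.
WORKING_DAYS_PER_WEEK = 5


@dataclass
class DebtItem:
    """One knowingly assumed shortcut, tracked like a loan (hypothetical model)."""
    name: str
    principal_days: float            # balance sheet: effort to fix it properly
    interest_days_per_change: float  # income statement: extra effort per modification

    def interest_paid(self, changes: int) -> float:
        """Total extra effort spent after `changes` modifications."""
        return self.interest_days_per_change * changes

    def break_even_changes(self) -> float:
        """Modifications after which cumulative interest equals the principal."""
        return self.principal_days / self.interest_days_per_change


# The hard-coded display strings example: 2 weeks saved, 1 extra day per change.
hard_coded_strings = DebtItem(
    name="hard-coded display strings",
    principal_days=2 * WORKING_DAYS_PER_WEEK,
    interest_days_per_change=1,
)

print(hard_coded_strings.interest_paid(4))
print(hard_coded_strings.break_even_changes())
```

Under these assumptions, the interest paid matches the original 2-week saving after ten string modifications or translations, which is the kind of number a business partner can weigh against other uses of that time.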
Just as retiring all financial liabilities at once does not make good business sense, trying to wipe out technical debt in one fell swoop is a bad idea. Continuous service to the technical debt is required to prevent technical liabilities from wiping out technical equity. An informed decision to increase debt service to reduce the principal will result in more productive product development time (smaller debt requires less on-going service). A short-term decision to reduce tech debt service in favor of a critical product launch may be viable if not used often. Keep track of both your principal (balance sheet) and your interest payments (income statement). Use these to help your business partners with debt related decisions.
Do NOT mix the cost of defects, or other infrastructure and software mistakes with tech debt. Doing so creates two very big problems:
- It becomes harder for the technology team to learn from past mistakes. Mistakes are mistakes and we should use them as learning opportunities. Debt is taken thoughtfully. Track them separately and treat them differently.
- Using the debt term for non-debt related items will lower the level of trust between you and the business. Businesses don’t for instance “mistakenly” take on debt. Mixing these terms can cause relationship problems.
Additionally, be clear about how you define technical debt, so time spent paying it down is not comingled with other activities. Bugs in your code are not technical debt. Refactoring your code base to make it more scalable, however, would be. A good test is to ask if the path you chose was a conscious or unconscious decision, meaning you decided to go in one direction knowing that you would later need to refactor. You are making a specific decision to do or not to do something knowing that you will need to address it later. Bugs are found in sloppy code, and that is not tech debt, it is just bad code.
So how do you decide what tech debt should be addressed and how do you prioritize? If you have been tracking work with Agile storyboards and product backlogs, you should have an idea where to begin. Also, if you track your problems and incidents like we recommend, then this will show elements of tech debt that have begun to manifest themselves as scalability and availability concerns. Set a budget and begin paying down the debt. If you are spending less than 12% of your development effort, you are not spending enough. If you are spending over 25%, you are probably fixing issues that have already manifested themselves, and you are trying to catch up. Setting an appropriate budget and maintaining it over the course of your development efforts will pay down the interest and help prevent issues from arising.
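As a rough sketch, the budget check can be automated against whatever effort tracking you already have. The function name, its return labels, and the choice of days as the unit are our own illustrative assumptions; only the 12% and 25% thresholds come from the guidance above.

```python
# Thresholds taken from the guidance above; everything else is illustrative.
MIN_DEBT_SHARE = 0.12
MAX_DEBT_SHARE = 0.25


def assess_debt_budget(debt_days: float, total_days: float) -> str:
    """Classify the share of development effort spent servicing tech debt."""
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    share = debt_days / total_days
    if share < MIN_DEBT_SHARE:
        return "under-investing"
    if share > MAX_DEBT_SHARE:
        return "catching up"
    return "within budget"


# Example: a 50 person-day sprint with 8 days spent on debt (16%).
print(assess_debt_budget(debt_days=8, total_days=50))
```

Running a check like this per sprint or per quarter makes drift visible early, before the debt shows up as incidents.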
Taking on technical debt to fund your product development efforts is an effective method to get your product to market quicker. But, just like financial debt, you need to take on an appropriate amount of tech debt that you can service by making the necessary interest and principal payments to reduce the outstanding balance. Failing to set an appropriate budget will result in a technical “bankruptcy” that will be much harder to dig yourself out of later.
Tech Debt Takeaways
Here is a list of our tech debt takeaways:
September 19, 2018 | Posted By: Greg Fennewald
Cloud hosting is growing rapidly, with many companies leveraging the cloud to deliver all or a portion of their products and services. This trend is unlikely to change any time soon as cloud hosting has commoditized digital infrastructure.
One of the concerns with cloud hosting we often hear from our clients is security – security of data stored in the cloud, access controls for the compute resources, and even physical access concerns. While these concerns are valid to a certain extent, they are all rooted in misconceptions about cloud hosting.
Stripped of all marketing glitz, buzzword bingo points, and misconceptions, cloud hosting is a passel of servers, switches, and storage devices living in a large data center. Who owns and maintains the hardware and facility is really the primary difference between cloud hosting and company owned data centers or traditional colocation services.
Let’s look at some of the common cloud security misconceptions:
Data Security and System Access - there is a fear that energy drink guzzling teenagers will steal your sensitive data if you store it in the cloud. Your sensitive data is encrypted at rest, right? If not, you’re right in thinking that cloud is not for you. Neither is technology. Polish up that resume. Encrypting data is an industry best practice that is rapidly becoming a base expectation, but it does not relieve you of the obligation to notify those potentially impacted by a breach.
The appropriate risk management approach is the set of policies and procedures controlling system access and thus access to data. In addition to your own policies, the major players in cloud hosting have proven policies and procedures that comply with multiple regulatory requirements and have been repeatedly audited. They are most likely better at it than you. The security certifications of major cloud hosting providers can be found here and here. How does that compare to your program? How much would it cost for your company to achieve and maintain the same level of certification? Are your security requirements drastically different from other companies already using cloud hosting? Chances are that the cloud provider capabilities and your own security program can meet your security needs.
Physical Security - concerns about physical security at cloud hosting locations are typically the result of a lack of topical knowledge. Cloud data centers have fewer people entering them each day as compared to a traditional colocation data center, where customers bring in their own hardware and work on it inside the shared data center. Cloud hosting customers do not have physical access to the cloud data centers. Those entering a cloud data center on a daily basis are either provider employees or service partners - people who have undergone mature access control procedures.
Major cloud hosting providers operate dozens of data centers. Physical security policies and safeguards have evolved over time and are thoroughly tested. Just as with system access controls, cloud providers are most likely better at physical security than you.
Economies of Scale
A key reason behind cloud providers being good at logical access control, regulatory compliance, and physical security is the scale at which the major players operate. They can afford the talent, technology, tools, and oversight.
The economies of scale that enable cloud providers to deliver the capacity and service quality the market demands are at work in the security arena as well. Combined with the broad regulatory compliance needs of their customers, these economies of scale enable cloud providers to be better than most across the board in security.
Regardless of where the infrastructure is hosted, a sound security program should include practices such as:
- Secure coding standards
- Role based access control
- Multi-factor authentication
- Logged access to systems and data
- Data encryption at rest
- Data classification procedure
- Network segmentation
- Data egress monitoring
- Security threat matrix
- Incident response plan
Combined with the security capabilities of cloud providers, a sound security program should enable nearly any company to make use of cloud hosting in a manner that benefits the business.
Interested in cloud options, but unsure how to proceed? AKF Partners has helped many clients with cloud strategy and SaaS transition. More about our services can be found here.
September 18, 2018 | Posted By: Pete Ferguson
As part of our Technical Due Diligence and Architectural reviews, we always want to see a company’s system architecture, understand their process, and review their org chart. Without ever stepping foot at a client we can begin to see the forensic evidence of potential problems.
Like that ugly couch you bought when you were in college and still have in your front room, often inefficiencies in architecture, process, and organization are nostalgic memories that have long since outlived their purpose – and while you have become used to the ugly couch, outsiders look in and recognize it as the eyesore it is immediately and often customers feel the inefficiencies through slow page loads and shopping cart issues. “That’s how it has always been” is never a good motto when designing systems, processes, and organizations for flexibility, availability, and scalability.
It is always interesting to hear companies talk with the pride of a parent about their unruly kid when they use words like “our architecture/organization is very complex” or “our systems/organization has a lot of interdependent components” – as if either of these things are something special or desirable! Great architectures are sketched out on a napkin in seconds, not hours.
All systems fail. Complex systems fail miserably, and – like dominoes – take down neighboring systems as well, resulting in latency, down time, and/or flat out failure.
ARCHITECTURE & SOFTWARE
Some common observations in hardware/software we repeatedly see:
Problem: Overloaded F5 or other similar firewalls are trying to encrypt all data because Personally Identifiable Information (PII) is stored in plain text, usually left over from a business decision made long ago that no one can quite recall, and an auditor once said “encrypt everything” to protect it. Because no one person is responsible for a 30,000 foot view of the architecture, each team happily works in their silo and the decision to encrypt is held up like a trophy without seeing that the F5 is often running hot, causing latency, and is now a bottleneck (resulting in costly requests for more F5s) doing something it has no business doing in the first place.
Solution: Segregate all PII, tokenize it and only encrypt the data that needs to be encrypted, speeding up throughput and better isolating and protecting PII.
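To make the idea concrete, here is a minimal sketch of tokenization in Python. Everything in it is an illustrative assumption: the `TokenVault` class, the `PII_FIELDS` set, and the `tok_` prefix are hypothetical, and the in-memory dictionaries stand in for what would really be a separate, access-controlled, encrypted data store.

```python
import secrets


class TokenVault:
    """In-memory stand-in for an isolated PII store (hypothetical).

    In a real system only this store would hold raw PII, and it is the
    part that needs encryption and tight access control.
    """

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for `value`, reusing the token on repeats."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_urlsafe(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._token_to_value[token]


# Assumption: which fields count as PII is decided by your data classification.
PII_FIELDS = {"email", "ssn"}


def tokenize_record(record: dict[str, str], vault: TokenVault) -> dict[str, str]:
    """Replace PII fields with tokens and leave other fields untouched."""
    return {
        key: vault.tokenize(value) if key in PII_FIELDS else value
        for key, value in record.items()
    }


vault = TokenVault()
order = {"order_id": "A-1001", "email": "pat@example.com", "ssn": "000-00-0000"}
safe_order = tokenize_record(order, vault)
print(safe_order["order_id"])                           # unchanged, non-PII
print(vault.detokenize(safe_order["email"]) == order["email"])
```

The tokenized record can flow through the rest of the stack without special handling, so the encryption workload is confined to the vault rather than applied to all traffic.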
Integration (or Rather Lack Thereof) Of Mergers & Acquisitions
Problem: A recent (and often not so recent) flurry of acquisitions is resulting in cross data center calls in and out of firewalls. Purchased companies are still in their own data center or public cloud and the entire workflow of a customer request is crisscrossing the country multiple times not only causing latency, but if one thing goes wrong (remember, everything fails …) timeouts result in customer frustration and lost transactions.
Solution: Integrate services within one isolated stack or swim lane – either hosted or public cloud – to avoid cross data center calls. Replicate services so that each datacenter or cloud instance has everything it needs.
Problem: As the company grew and gained more market share, the search for bigger and better has resulted in a monolithic database that is slow, requires specialized hardware, specialized support, ongoing expensive software licenses, and maintenance fees. As a result, during peak times the database slows everyone and everything down. The temptation is to buy bigger and better hardware and pay higher monthly fees for more bandwidth.
Solution: Break down databases by customer, region, or other Z-Axis splits on the AKF Scale Cube. This has multiple wins – you can use commodity servers instead of large complex file storage, failure for one database will not affect the others, you can place customer data closest to the customer by region, and adding additional servers does not require a long lead time or request for substantial capital expenditure.
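A customer-based split needs some rule for deciding which database holds a given customer. The sketch below shows one simple option, a hash of the customer ID; the shard names and the `shard_for_customer` function are hypothetical, and a lookup table or region-based mapping would be equally valid choices.

```python
import hashlib

# Hypothetical shard names; in practice these would be connection targets.
SHARDS = ["customers-db-0", "customers-db-1", "customers-db-2", "customers-db-3"]


def shard_for_customer(customer_id: str, shards: list[str] = SHARDS) -> str:
    """Map a customer ID to one shard, stably across processes and restarts."""
    # A cryptographic hash is used for its even spread, not for secrecy.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for customer in ("customer-17", "customer-42", "customer-99"):
    print(customer, "->", shard_for_customer(customer))
```

One trade-off to be aware of: with plain modulo hashing, changing the number of shards remaps most customers, so teams that expect to add shards often keep an explicit customer-to-shard directory instead.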
PROCESSES & ORGANIZATION
What sets AKF apart is that we don’t just look at systems, we always want to understand the people and organization supporting the system architecture as well and here there are additional multiplicative effects of failure. We have considerable expertise working for and with Fortune 100 companies, startups, and agencies in many different competencies. The common mistakes we see on the organization side of the equation:
Lack of Cross Functional Teams
Problem: Agile Scrum teams do not have all the resources needed within the team to be self sufficient and autonomous. As a result, teams are waiting on other internal resources for approvals or answers to questions in order to complete a Sprint - or keep these items on the backlog because effort estimation is too high. This results in slower time to market, losing what could have been a competitive advantage, and lower revenue.
Solution: Create cross-functional teams so that each Sprint can be completed with necessary access to security, architecture, QA, and other resources. This doesn’t mean each team needs a dedicated resource from each discipline – one resource can support multiple teams. The information needed can be greatly augmented by creating guilds where the subject matter expert (SME) can “deputize” multiple people on what is required to meet policy. Guilds utilize published standards and provide a dedicated channel of communication to the SME, greatly simplifying and speeding up the approval process.
Lack of Automation
Problem: It isn’t done enough! As a result, people are waiting on other people for needed approvals. Often the excuse is that there isn’t enough time or resources. In most cases when we do the math, the cost of not automating far outweighs the short-term investment with a continuous long-term payout that automation would bring. We often see that the individual with the deployment knowledge is insecure and doesn’t want automation as they feel their job is threatened. This is a very short-sighted approach that requires coaching for them to see how much more valuable they can be to the organization by getting out of the way of stifling progress!
Solution: Automate everything possible from testing, quality assurance, security compliance, code compliance (which means you need a good architectural review board and standards), etc! Automation is the gift that keeps on giving and is part of the “secret sauce” of top companies who are our clients.
Not Empowering Teams to Get Stuff Done!
Problem: Often teams work in a silo, only focused on their own tasks and are quick to blame others for their lack of success. They have been delegated tasks, but do not have the ability to get stuff done.
Solution: Similar to cross functional teams, each team must also be given the authority to make decisions (hence why you want the right people from a variety of dependencies on the team) and get stuff done. An empowered team will iterate much faster and likely with a lot more innovation.
While each organization will have many variables both enabling and hindering success, the items listed here are common denominators we see time and time again often needing an outside perspective to identify. Back to the ugly couch analogy, it is often easy to walk into someone else’s house and immediately spot their ugly couch!
Pay attention to those you have hired away from the competition in their early days and seek their opinions and input as your organization’s old bad habits likely look ridiculous to them. Of course only do this with an intent to listen and to learn – getting defensive or stubbornly trying to explain why things are the way they are will not only bring a dead end to you learning, but will also abruptly stop any budding trust with your new hire.
And of course, we are always more than happy to pop the hood and take a look at your organization just as we have been doing for the top banks, Fortune 100, healthcare, and many other organizations. Put our experience to work for you!
September 17, 2018 | Posted By: Bill Armelin
Everything fails! This is a mantra that we are always espousing at AKF. At some point, these failures will manifest themselves as an outage. In a SaaS world, restoring service as quickly as possible is critical. It requires having the right people available and being able to communicate with them effectively. A lack of good communications can cause an incident to drag on.
For startups and smaller companies, problems with communications during incidents are less of an issue. Systems tend to be smaller or monolithic. Teams supporting these systems also tend to be small. When something happens, everyone jumps on a call to figure out the problem. As companies grow, the number of people needed to resolve an incident grows. Coordinating communications between a large group of people becomes difficult. Adding to the chaos are executives joining the conference bridges demanding updates about service restoration.
In order to minimize the time to restore a system during an incident, companies need the right people on the call. For large, complex systems, identifying the right resources to solve a problem can be difficult. We recommend swarming an issue with everyone that could be needed to resolve an incident, and then releasing those that are no longer needed. But, with such a large number of people, it can be difficult to coordinate communications, especially on a single conference call bridge.
Managing the communications of a large group of people working an incident is critical to minimizing the restoration time. We recommend a communication method that many of us at AKF learned in the military. It involves using multiple voice and chat channels to coordinate work and the flow of information. Before we get into the details of managing communications, we need to first look at the leadership required to effectively work the incident.
Technical Incident Manager and Incident Communications Manager
Managing a large incident is usually too much for a single individual. She cannot manage coordinating the work occurring to resolve the incident, as well as reporting status to and answering questions from executives eager to know what is going on. We recommend that companies manage incidents with two people. The first person is the individual that is responsible for directing all activities geared towards restoration of service. We call this person the Technical Incident Manager. This individual’s main job is to reduce the mean time to restoration. She needs an overall architectural knowledge of the product and systems to direct the work. She is responsible for leading the call and deescalating after diagnosis informs who needs to be involved. She identifies and diagnoses the service issues and engages the appropriate subject matter experts to assist in restoration.
The second individual is the Incident Communications Manager. He is responsible for supporting the Technical Incident Manager by listening to the technical resolution chatter and summarizing it for a non-technical audience. His focus is on communications speed, quality, and accuracy. He is the primary communications channel for both internal and external messaging. He owns the incident communications process.
Incident Communications Process
This process involves using multiple communication channels to control information and work performed. The first channel established is the Control Channel. This is in the form of a conference bridge and a chat channel. The Technical Incident Manager controls both of these channels. The second channel created is the Status Channel. This also has a voice bridge and a chat channel. The Incident Communications Manager is responsible for managing this channel.
The Control Channel is used for all communication related to the restoration of service. People only use the voice channel for immediate communication and to announce work that is occurring or address immediate questions that need to be answered. Detailed work conducted is placed in the chat channel. This reduces the chatter on the voice channel to command and control messages. It also serves as a record of actions taken that can be referenced in the post mortem/RCA process. If specific teams need to discuss the work they are performing, separate voice and chat breakout channels are created for them. They move off the main channel into their breakout channels to perform the work. The leader of these teams periodically communicates status back up to the control channel.
As the work is progressing, the Incident Communications Manager monitors the Control Channel to provide the basis for his messaging. He formulates updates that he delivers over the Status bridge and chat channel. He keeps executives and customers informed of progress and status, keeping the control channel free of requests for frequent updates and dedicated to restoring service.
This method of communications has worked well in the military for years and has been adopted by many large companies to manage their incident communications. While it is overkill for small companies, it becomes an effective process as companies grow and systems become more complex.
September 14, 2018 | Posted By: Larry Steinberg
It’s important to acknowledge that a core competency for hackers is hiding their tracks and maintaining dormancy for long periods of time after they’ve infiltrated an environment. They also could be utilizing exploits which you have not protected against - so given all of this potential how do you know that you are not currently compromised by the bad guys? Hackers are great hidden operators and have many ‘customers’ to prey on. They will focus on a customer or two at a time and then shut down activities to move on to another unsuspecting victim. It’s in their best interest to keep their profile low and you might not know that they are operating (or waiting) in your environment and have access to your key resources.
Most international hackers are well organized, well educated, and have development skills that most engineering managers would admire if not for the malevolent subject matter. Rarely are these hacks performed by bots; most are carried out by humans setting up a chain of software elements across unsuspecting entities, enabling inbound and outbound access.
What can you do? To start, don’t get complacent with your security. Even if you have never been compromised, or have been and eradicated what you know about, you’ll never know for sure whether you are currently compromised. As a practice, it’s best to always assume that you are, to keep looking for evidence of it, and to keep identifying ways to keep attackers out. Hacking is dynamic and threats are constantly evolving.
There are standard practices of good security habits to follow - the NIST Cybersecurity Framework and OWASP Top 10. Further, for your highest value environments here are some questions that you should consider: would you know if these systems had configuration changes? Would you be aware of unexpected processes running? If you have interesting information in your operating or IT environment and the bad guys get in, it’s of no value unless they get that information back out of the environment; where is your traffic going? Can you model expected outbound traffic and monitor this? The answer should be yes. Then you can look for abnormalities and even correlate this traffic with other activities in your environment.
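As a rough illustration of modeling expected outbound traffic, here is a minimal Python sketch. The destination names, byte ceilings, and the shape of the flow records are all assumptions made for the example; in practice the baseline would come from your own firewall or flow logs.

```python
from collections import Counter

# Hypothetical baseline: destinations this environment is expected to contact,
# with a rough ceiling on bytes sent per monitoring window.
EXPECTED_OUTBOUND = {
    "payments.example.com": 5_000_000,
    "backups.example.com": 50_000_000,
}


def find_outbound_anomalies(flows, baseline=EXPECTED_OUTBOUND):
    """Return (destination, bytes_sent, reason) for traffic outside the baseline.

    `flows` is an iterable of (destination, bytes_sent) pairs for one
    monitoring window, e.g. parsed from firewall or flow logs.
    """
    totals = Counter()
    for destination, bytes_sent in flows:
        totals[destination] += bytes_sent

    anomalies = []
    for destination, total in sorted(totals.items()):
        if destination not in baseline:
            anomalies.append((destination, total, "unexpected destination"))
        elif total > baseline[destination]:
            anomalies.append((destination, total, "volume above baseline"))
    return anomalies
```

Anything this returns is only a starting point for investigation; the value comes from correlating it with configuration changes and unexpected processes in the same window.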
Just as you and your business are constantly evolving to service your customers and to attract new ones, the bad guys are evolving their practices too. Some of their approaches are rudimentary because we allow it but when we buckle down they have to get more innovative. Ensure that you are constantly identifying all the entry points and close them. Then remain diligent to new approaches they might take.
Don’t forget the most common attack vector - humans. Continue evolving your training and keep the awareness high within your staff - technical and non-technical alike.
Your default mental model should be that you don’t know what you don’t know. Utilize best practices for security and continue to evolve. Utilize external or build internal expertise in the security space and ensure that those skills are dynamic and expanding. Utilize recurring testing practices to identify vulnerabilities in your environment and to prepare against emerging attack patterns.
Open Source Software as a malware on ramp
5 Focuses for a Better Security Culture
3 Practices Your Security Program Needs
Security Considerations for Technical Due Diligence
September 10, 2018 | Posted By: Robin McGlothin
The Scalability Cube – Your Guide to Evaluating Scalability
Perhaps the most common question we get at AKF Partners when performing technical due diligence on a company is, “Will this thing scale?” After all, investors want to see a return on their investment in a company, and a common way to achieve that is to grow the number of users on an application or platform. How do they ensure that the technology can support that growth? By evaluating scalability.
Let’s start by defining scalability from the technical perspective. The Wikipedia definition of “scalability” is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. That definition is accurate when applied to common investment objectives. The question is, what are the key attributes of software that allow it to scale, along with the anti-patterns that prevent scaling? Or, in other words, what do we look for at AKF Partners when determining scalability?
While an exhaustive list is beyond the scope of this blog post, we can quickly use the Scalability Cube and apply the analytical methodology that helps us quickly determine where the application will experience issues.
AKF Partners introduced the scalability cube, a scale design model for building resilient application architectures using patterns and practices that apply broadly to any application. This is a best practices model that describes all scale dimensions from “The Art of Scalability” book (AKF Partners – Abbott, Keeven & Fisher Partners).
The “Scale Cube” is composed of an X-axis, Y-axis, and Z-axis:
1. Technical Architectural Layering (X-Axis) – No single points of failure. Duplicate everything.
2. Functional Decomposition Segmentation (Y-Axis) – Componentization to Modules & Microservices. Split Report, Message, Locate, Forms, Calendar into fault isolated swim lanes.
3. Horizontal Data Partitioning (Z-Axis) – Shards. Beginning with pilot users, start with “podding” users for scalability and availability.
The Scale Cube helps teams keep critical dimensions of system scale in mind when solutions are designed. Scalability is all about the capability of a design to support ever growing client traffic without compromising performance. It is important to understand there are no “silver bullets” in designing scalable solutions.
An architecture is scalable if each layer in the multi-layered architecture is scalable. For example, a well-designed application should be able to scale seamlessly as demand increases and decreases and be resilient enough to withstand the loss of one or more computer resources.
Let’s start by looking at the typical monolithic application. A large system that must be deployed holistically is difficult to scale. In the case where your application was designed to be stateless, scale is possible by adding more machines, virtual or physical. However, adding instances requires powerful machines that are not cost-effective to scale. Additionally, you have the added risk of extensive regression testing because you cannot update small components on their own. Instead, we recommend a microservices-based architecture using containers (e.g. Docker) that allows for independent deployment of small pieces and the scale of individual services instead of one big application.
Monolithic applications have other negative effects, such as development complexity. What is “development complexity”? As more developers are added to the team, be aware of the effects of Brooks’ Law, which states that adding more software developers to a late project makes the project even later. For example, one large solution loaded in the development environment can slow down a developer, and it gets worse as more developers add components. This causes slower and slower load times on development machines, and developers stomping on each other with changes (or creating complex merges) as they modify the same files.
Another example of a development complexity issue is a large, outdated piece of the architecture or database where one person is the expert. That person becomes a bottleneck to changes in a specific part of the system. They are also a SPOF (single point of failure) if they are the only resource that understands the monolithic beast. The monolithic complexity and the rate of code change make it hard for any developer to know all the idiosyncrasies of the system, hence more defects are introduced. A decoupled system with small components helps prevent this problem.
When validating database design for appropriate scale, there are some key anti-patterns to check. For example:
• Do synchronous database accesses block other connections to the database when retrieving or writing data? This design can end up blocking queries and holding up the application.
• Are queries written efficiently? Large data footprints, with significant locking, can quickly slow database performance to a crawl.
• Is there a heavy report function in the application that relies on a single transactional database? Report generation can severely hamper the performance of critical user scenarios. Separating out read-only data from read-write data can positively improve scale.
• Can the data be partitioned across different load databases and/or database servers (sharding)? For example, Customers in different geographies may be partitioned to various servers more compatible with their locations. In turn, separating out the data allows for enhanced scale since requests can be split out.
• Is the right database technology being used for the problem? Storing BLOBs in a relational database has negative effects – instead, use the right technology for the job, such as a NoSQL document store. Forcing less structured data into a relational database can also lead to waste and performance issues, and here, a NoSQL solution may be more suitable.
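To make the read-only versus read-write separation from the checklist above concrete, here is a minimal Python sketch of a router that keeps writes on a primary database and spreads reporting reads across replicas. The class name and the server names in the test are hypothetical, and replication lag is ignored for simplicity.

```python
class QueryRouter:
    """Send read-only reporting queries to a replica, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def target_for(self, read_only):
        # Writes, and any request when no replica exists, go to the primary.
        if not read_only or not self.replicas:
            return self.primary
        # Rotate across replicas so no single replica carries every report.
        target = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return target
```

With this shape, a heavy report no longer competes with critical user transactions for the same database.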
We also look for mixed presentation and business logic. A software anti-pattern that can be prevalent in legacy code is not separating out the UI code from the underlying logic. This practice makes it impossible to scale individual layers of the application and takes away the capability to easily do A/B testing to validate different UI changes. Layer separation allows putting just enough hardware against each layer for more minimal resource usage and overall cost efficiency. The separation of the business logic from SPROCs (stored procedures) also improves the maintainability and scalability of the system.
Another key area we dig for is stateful application servers. Designing an application that stores state on an individual server is problematic for scalability. For example, if some business logic runs on one server and stores user session information (or other data) in a cache on only one server, all user requests must use that same server instead of a generic machine in a cluster. This prevents adding new machine instances that can field any request that a load balancer passes its way. Caching is a great practice for performance, but it cannot interfere with horizontal scale.
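One common way to avoid this is to move session state off the application server and into a store that every server can reach. The Python sketch below uses an in-memory dictionary as a stand-in for such a store; the class names are illustrative, and a real deployment would use a networked cache or database instead.

```python
class SharedSessionStore:
    """Stand-in for an external session store shared by all application servers."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """A stateless server: all session data lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = list(session.get("cart", []))
        cart.append(item)
        self.store.put(session_id, {**session, "cart": cart})
        return cart
```

Because neither server holds the cart itself, a load balancer can send each request for the same session to a different machine and the user still sees a consistent cart.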
Finally, long-running jobs and/or synchronous dependencies are key areas for scalability issues. Actions on the system that trigger processing times of minutes or more can affect scalability (e.g. execution of a report that requires large amounts of data to generate). Continuing to add machines to the set doesn’t help the problem as the system can never keep up in the presence of many requests. Blocking operations exacerbate the problem. Look for solutions that queue up long-running requests, execute them in the background, send events when they are complete (asynchronous communication) and do not tie up key application and database servers. Communication with dependent systems for long-running requests using synchronous methods also affects performance, scale, and reliability. Common solutions for intersystem communication and asynchronous messaging include RabbitMQ and Kafka.
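The queue-and-process-in-the-background pattern can be sketched with nothing but the Python standard library. This is an in-process illustration only: the `jobs` queue stands in for a broker such as RabbitMQ or Kafka, the `completed` dictionary stands in for a completion event, and `build_report` is a hypothetical placeholder for the expensive work.

```python
import queue
import threading

jobs = queue.Queue()
completed = {}  # job_id -> result; stands in for a "job finished" event


def worker():
    """Pull long-running requests off the queue and run them in the background."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            jobs.task_done()
            break
        job_id, func, args = job
        completed[job_id] = func(*args)
        jobs.task_done()


def submit(job_id, func, *args):
    """Enqueue work and return immediately so the request thread is not blocked."""
    jobs.put((job_id, func, args))
    return job_id


def build_report(rows):
    # Hypothetical stand-in for an expensive report query.
    return sum(rows)


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()
```

A caller would `submit("r1", build_report, [1, 2, 3])`, respond to the user right away, and pick up the result once the worker has recorded it.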
Again, the list above is not exhaustive but outlines some key areas that AKF Partners look for when evaluating an architecture for scalability. If you’re looking for a checklist to help you perform your own diligence, feel free to use ours. If you’re wondering more about our diligence practice, you may be interested in our thoughts on best practices, or our beliefs around diligence and how to get it right. We’ve performed technical diligence for seed rounds, A-series and beyond, carve-outs, strategic investments and taking public companies private. From $5 million invested to over $1 billion. No matter the size of company or size of the investment, we can help.
September 6, 2018 | Posted By: James Fritz
“An incident is a terrible thing to waste” is a common mantra that AKF repeats during its Engagements. And rightfully so as many companies have an incident response plan in place but stop there. Why are incidents so important? What is the true value in doing a proper Post Mortem and actually learning from an incident?
Incidents identify issues in your product. But if that is all you take out of an incident then you are missing out on so much more information that an incident can provide. An incident is the first step to identifying a problem that exists in your product, infrastructure processes, and perhaps, people. “But aren’t incidents and problems the same thing?” Not necessarily. An incident is a one time event. It can occur multiple times if you never address the problem, but it is not isolated.
Conducting a Post Mortem
Gather as many data points as possible shortly after an incident concludes and schedule a Post Mortem review meeting.
Start with the incident timeline. Sufficiently logging events over time provides ready access to the needed data for forensic analysis. From this information you can then start to identify what went wrong, when it went wrong and how quickly you were able to respond to it. The below definitions are all factors that need to be identified:
- Time To Detect: How quickly did you identify that an incident had occurred?
- Time To Escalate: How quickly did you get everyone necessary to fix the incident involved?
- Time To Isolate: How quickly did you stop the incident from affecting other portions of the system?
- Time To Restore: How quickly did the system get brought back up?
- Time To Repair: How quickly did you fix the incident?
This all leads to the Incident Timeline Analysis.
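If the timeline is logged with timestamps, these measures fall out of simple subtraction. The Python sketch below assumes each measure is counted in minutes from the moment the incident began, and the milestone key names are invented for the example; adjust both to match how your own timeline is recorded.

```python
from datetime import datetime


def timeline_metrics(events):
    """Compute elapsed minutes between incident start and each milestone.

    `events` maps milestone names to datetimes. The key names are assumptions
    for this sketch: started, detected, escalated, isolated, restored, repaired.
    Milestones missing from `events` are skipped.
    """
    start = events["started"]
    milestones = {
        "time_to_detect": "detected",
        "time_to_escalate": "escalated",
        "time_to_isolate": "isolated",
        "time_to_restore": "restored",
        "time_to_repair": "repaired",
    }
    return {
        metric: (events[key] - start).total_seconds() / 60
        for metric, key in milestones.items()
        if key in events
    }
```

Collecting these dictionaries across several incidents makes it easy to see which phase consistently takes the longest.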
If you can gather information from several incidents and look at them in your Post Mortem review, then you can figure out where your biggest issues are when it comes to incidents and getting the system back up and running. It is not uncommon for us to see that it takes longer to detect the incident than to restore from it. This could be mitigated with more monitoring at more appropriate positions than you currently have.
Or maybe the time to escalate is an issue. Why does it take so long to get the proper engineers involved? Maybe a real-time alert system is required or a phone tree. And it is important to track and measure the total time of an incident as beginning with when it occurred (not when it was reported) all the way through to when customers were back up at 100% (not just when your systems were restored).
Problem vs. Incident
How do you know if your incident is also a problem? It’s actually fairly easy to determine. If you have an incident, you have a problem. The scale of the problem may vary by incident but every incident is caused because of something larger than itself.
During our Technical Due Diligences we always want to know how companies categorize incidents vs. problems. If the company properly categorizes problems related to incidents, they will be able to answer “Can you rank your problems to show which cause the most customer impact?” Many times, they can’t - but that ranking is critical to show which problems to attack first.
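Once each incident is tagged with the problem behind it, producing that ranking is a small aggregation. The Python sketch below assumes customer impact is recorded as minutes of impact per incident; the problem labels used in any input are whatever your Post Mortems assign.

```python
from collections import defaultdict


def rank_problems(incidents):
    """Rank problems by total customer impact across the incidents they caused.

    `incidents` is an iterable of (problem, customer_impact_minutes) pairs.
    Returns (problem, total_minutes) pairs, highest impact first.
    """
    impact = defaultdict(float)
    for problem, minutes in incidents:
        impact[problem] += minutes
    return sorted(impact.items(), key=lambda item: item[1], reverse=True)
```

The top of the resulting list is a reasonable candidate for which problem to attack first.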
An incident, at its core, is caused by a problem. If your product crashes anytime someone attempts to access it via an unapproved protocol, the incident is the attempted access. The problem may be an improper review of your architecture. Or it may be lack of QA. Identifying the problem is much more difficult than identifying the incident. Imagine you find a termite on your deck. This small pest could be considered an incident. If you deal with the incident and get rid of the termite everything is good, right? If you don’t look any further than the incident you can’t identify the problem. And in this case the problem could be exposed, untreated wood allowing termites to slowly eat away at the inside of your house.
If you are keeping proper documentation each time you conduct a Post Mortem review, then you will have a history that will start to paint a picture of ongoing and recurring problems that exist. Remedying only the incident will stop it from occurring in the same exact way in the future, but small variations of the incident can still occur. If you fix the problem, then you are stopping future iterations of that incident from happening again.
September 6, 2018 | Posted By: James Fritz
In our experience, Agile practices provide organizations within successful companies many benefits, which is leading more and more companies to adopt Agile frameworks outside of software development. Whether they are looking for reduced risk, higher product quality, or even the capability to “fail fast” and rectify mistakes, Agile provides many benefits, particularly in management.
While effort has been expended to identify how to create Agile product delivery teams (Organizing Product Teams for Innovation) and conversely why they fail (The Top Five Most Common Agile PDLC Failures) – a lot of the focus is on the successes and failures of the delivery teams themselves. But the delivery is only as good as the group that surrounds that team.
So how does Agile work beyond your delivery teams? An essay published in 1970 by Robert K. Greenleaf, The Servant as Leader, is credited with introducing the idea of a Servant-Leader, someone who puts their employees’ needs ahead of their own. This is counter-intuitive to a normal management style where management has a list of needs that require completion.
Looking at an Agile team, the concept of waiting for management to drive needs is not conducive to meeting the requirements of the market. A highly competent Agile team has all the necessary tools and authority to get the job done that is required of them. If normal management tactics sit over an Agile team, failure is going to occur.
This is where the philosophy of Servant-Leadership comes into play. If managers, all the way to the C-Suite, understand that they work for their employees, but their employees are accountable to them, then everyone is working towards one goal: the needs of the market. Management needs to be focused on securing the resources necessary for product delivery teams to meet the demands of the market, whether from a high level of the CEO and CFO for additional funding or further down with ensuring that technical debt and other tasks are assigned out appropriately to meet delivery goals. This empowerment for teams may seem risky, but the morale improvement and greater innovation that can be achieved far exceeds the level of risk that would be accepted.
Embracing Agile throughout a company is key to the company being able to survive beyond the first couple sprints. Small changes in management can play a huge role in that. Asking simple questions like, “what do you need to meet your goals”, or “what factors stand in your way of accomplishment” help to enable employees instead of limiting them. Asking yourself why you are successful as a company also helps to identify what segment is responsible for your success.
If the delivery of your services is what customers buy, then identifying ways to enable employees who create those services is vital. This isn’t to say that other roles in the company aren’t important. Without support from the entire company, no one particular segment can succeed. This is why it is so vital for Agile to permeate throughout your entire organization. If you need assistance in identifying gaps in Agile and figuring out how to employ it, feel free to reach out to AKF.
September 5, 2018 | Posted By: Pete Ferguson
Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems. While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize. Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.
We see more and more new startups in AWS or Azure – in addition to assisting well-established companies in making the transition to the cloud. Regardless of the hosting platform, in our technical due diligence reviews we often see the same scalability gaps common to hosted solutions written about in our first edition of “Scalability Rules.” (Abbott, Martin L. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)
This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.
Scalability Rules – Chapter 2: Distribute Your Work
Using ServiceNow as an example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped it scale in early development. More than a decade later, the once little-known company has been able to keep up with fast growth, with over $2B in revenue and some forecasts expecting that number to climb to $15B in the coming years.
So how did they do it? ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.
“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keeven (Founding Partner, AKF Partners)
The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.
At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability. The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube). Sometimes, it’s easier to see these three axes without the confined space of the cube.
X Axis – Horizontal Duplication
The X Axis allows transaction volumes to increase easily and quickly. If data is starting to become unwieldy on databases, distributed architecture allows for reducing the degree of multi-tenancy (Z Axis) or split discrete services off (Y Axis) onto similarly sized hardware.
A simple example of X Axis splits is cloning web servers and application servers and placing them behind a load balancer. This cloning allows the distribution of transactions across systems evenly for horizontal scale. Cloning of application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed. Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions as memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time. To solve these memory constraints we’ll look to the Y and Z Axes of our scale cube.
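An X Axis split can be pictured as a load balancer rotating requests across identical clones. The Python sketch below uses simple round-robin selection, which is only one of several balancing strategies, and the class name is made up for the illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distribute requests evenly across identical clones of a service."""

    def __init__(self, servers):
        self._servers = cycle(list(servers))

    def route(self, request):
        # Every clone can serve every request, so the request content is ignored.
        return next(self._servers)
```

Adding capacity is then just a matter of adding another clone to the list, which is what makes this axis the easiest to implement.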
Y Axis – Split by Function, Service, or Resource
Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy. The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.
In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption. This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.
Y Axis splits also apply to a noun approach. Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.
While Y axis splits are most useful in scaling data sets, they are also useful in scaling code bases. Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well. This works very well for small Agile development teams as each team can become experts in subsets of larger systems and don’t need to worry about or become experts on every other part of the system.
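A Y Axis split shows up in routing as a lookup from the verb being requested to the pool of servers that owns it. In this Python sketch the pool names and the URL layout (verb as the first path segment) are assumptions chosen to match the e-commerce example.

```python
# Hypothetical service pools, one per verb, each scaled independently.
SERVICE_POOLS = {
    "search": ["search-1", "search-2"],
    "cart": ["cart-1"],
    "checkout": ["checkout-1", "checkout-2"],
}


def pool_for(path):
    """Pick the server pool for a request based on the first path segment."""
    verb = path.strip("/").split("/")[0]
    if verb not in SERVICE_POOLS:
        raise KeyError(f"no service pool registered for {verb!r}")
    return SERVICE_POOLS[verb]
```

Each pool can then be sized, deployed, and owned by a separate team without touching the others.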
Z Axis – Separate Similar Things
Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology. Z Axis separation is useful for containerizing customers or a geographical replication of data. If Y Axis splits are the layers in a cake with each verb or noun having its own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geography, or other subset of data.
This means that each larger customer or geography could have its own dedicated Web, application, and database servers. Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.
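The mix of dedicated and shared shards described above can be sketched as a two-step lookup: check whether the customer has been isolated, and otherwise hash it onto a shared shard. The shard and customer names in this Python example are hypothetical.

```python
import hashlib

# Hypothetical large customer that has been isolated onto its own shard.
DEDICATED_SHARDS = {"big-retailer": "shard-dedicated-1"}
# Shared shards that each host many small customers.
SHARED_SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for(customer_id):
    """Return the shard that holds this customer's data."""
    if customer_id in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[customer_id]
    # A stable hash keeps a customer on the same shard across processes.
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARED_SHARDS[int(digest, 16) % len(SHARED_SHARDS)]
```

Hashing is used here only to keep the sketch short. An explicit customer-to-shard directory is often easier to live with, because moving a growing customer to its own shard becomes a single entry change.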
For hyper-growth companies the speed with which any request can be answered is at least partially determined by the cache hit ratio of near and distant caches. This speed in turn indicates how many transactions any given system can process, which in turn determines how many systems are needed to process a number of requests.
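The chain from cache hit ratio to system count can be approximated with back-of-the-envelope arithmetic. The Python sketch below is a deliberately simplified model of our own, not a formula from the book: it assumes a fixed cost for a cache hit and a cache miss, and that each system handles a fixed number of requests concurrently.

```python
import math


def systems_needed(requests_per_sec, hit_ratio, cache_ms, origin_ms, workers_per_system):
    """Estimate how many systems are needed for a given cache hit ratio.

    Simplifying assumptions: every request costs `cache_ms` on a hit and
    `origin_ms` on a miss, and each system serves `workers_per_system`
    requests at a time.
    """
    avg_ms = hit_ratio * cache_ms + (1 - hit_ratio) * origin_ms
    per_system_rps = workers_per_system * 1000 / avg_ms
    return math.ceil(requests_per_sec / per_system_rps)
```

Trying the function with a high and a low hit ratio for the same request rate shows how quickly a falling hit ratio inflates the number of systems required.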
Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets. In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.
This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented. We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands). Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!