The pillars of “good” that eventually lead us to great software and products.
I have a pretty long list of books to read this year, along with a never-ending stack of books on my nightstand to get through.
One of the books I started reading is Designing Data-Intensive Applications by Dr. Martin Kleppmann. Separately, Amazon Web Services organizes its Well-Architected Framework around a set of pillars. With some overlap between the two, I want to cover four pillars: reliability, scalability, maintainability, and security, and why it’s critical to get them right in an organization’s software and products. To be clear, there are many other important pillars. This is by no means an exhaustive list, just what I am choosing to focus on in this piece. Let’s go through each pillar, starting with reliability.
Reliability
We as businesses, and our customers, need our systems to be available, responsive, and working correctly all the time, even when something unplanned or unexpected happens with infrastructure, software, process, people, security, or a specific region of a public cloud provider. The bottom line is that these failures are inevitable, so it’s critical to get ahead of them and to be thoughtful in designing systems that are fault-tolerant or otherwise resilient.
Team reviews, or more randomized methods such as Chaos Monkey, which randomly terminates running instances, can help teams find areas where a system could be made more resilient, as far as budget and other requirements and constraints allow.
Infrastructure/Hardware
One of the many benefits of virtualization, cloud, and containers is the ability for a system to remain online, even in a possibly degraded state. While the abstracted hardware or physical layers are remedied behind the scenes, apps and services stay online and can then be restored to their normal operational state. Extending this idea further, the concept of redundancy takes the stage. With redundancy, we are talking about building and deploying an app, service, or capability on multiple target machines or nodes. The idea, again, is that if one node or machine has an issue, the app or service doesn’t go offline or dark. When I think of critical services where this matters, orders and authentication come to mind.
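To make the redundancy idea concrete, here is a minimal sketch of a caller that tries each redundant node in turn. The function names, the two example nodes, and the choice of `ConnectionError` as the failure signal are all assumptions made for illustration; a real deployment would usually put a load balancer or service mesh in this role.

```python
from typing import Callable, Sequence


class AllNodesFailed(Exception):
    """Raised when none of the redundant nodes could serve the request."""


def call_with_failover(nodes: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each node in order and return the first successful response."""
    failures = 0
    for node in nodes:
        try:
            return node(request)
        except ConnectionError:
            # This node is unavailable, so fall through to the next one.
            failures += 1
    raise AllNodesFailed(f"{failures} node(s) failed for request {request!r}")


def broken_node(request: str) -> str:
    """Hypothetical node that is currently offline."""
    raise ConnectionError("node offline")


def healthy_node(request: str) -> str:
    """Hypothetical node that is serving traffic normally."""
    return f"handled {request}"
```

Calling `call_with_failover([broken_node, healthy_node], "order-1")` still returns a response, because the second node picks up the request when the first one fails.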
Software/Coding
Having clear visibility into logs and events is critical. It gives engineers and other stakeholders insight into commits, and helps them identify, correlate, and hopefully resolve errors or other conditions that may be causing failures or poor performance.
As systems become more distributed and are designed with scale in mind, it can become much more difficult to find issues and correlate them across the other services and technologies in a stack that may be contributing to errors or other app and service failures.
If an app or service does go down or offline, incoming requests need to be stored somewhere off the affected server so that they can be handled when the system comes back online. This prevents requests from being lost or having to be manually reentered later. One possible solution is to use a message queue.
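One common way to make cross-service correlation easier is to attach the same identifier to every log line a request produces. The sketch below assumes JSON log lines and a field called `correlation_id`; both the field name and the helper functions are hypothetical, not taken from any particular logging product.

```python
import json
import uuid
from typing import Dict, Iterable, List


def new_correlation_id() -> str:
    """Create an ID once at the edge and pass it to every downstream service."""
    return uuid.uuid4().hex


def log_event(service: str, correlation_id: str, message: str) -> str:
    """Emit one structured log line and return it so it can be collected."""
    line = json.dumps(
        {"service": service, "correlation_id": correlation_id, "message": message},
        sort_keys=True,
    )
    print(line)
    return line


def correlate(lines: Iterable[str], correlation_id: str) -> List[Dict[str, str]]:
    """Return every event, across all services, that belongs to one request."""
    events = (json.loads(line) for line in lines)
    return [event for event in events if event["correlation_id"] == correlation_id]
```

With this in place, a failing order can be traced by filtering the combined logs of all services on a single ID instead of matching timestamps by hand.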
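The queueing idea can be sketched with Python’s built-in `sqlite3` module standing in for a real message broker. This is an illustration under assumptions: the `DurableQueue` class, the `requests` table, and the use of `ConnectionError` to signal that the downstream service is still offline are all hypothetical, and a production system would more likely use a dedicated queueing service.

```python
import sqlite3
from typing import Callable


class DurableQueue:
    """Store requests outside the app so they survive an outage."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path to persist across restarts; ":memory:" is for demos.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS requests ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, body: str) -> None:
        """Save a request instead of dropping it."""
        self.db.execute("INSERT INTO requests (body) VALUES (?)", (body,))
        self.db.commit()

    def pending(self) -> int:
        """Number of requests still waiting to be handled."""
        return self.db.execute("SELECT COUNT(*) FROM requests").fetchone()[0]

    def drain(self, handler: Callable[[str], None]) -> int:
        """Handle pending requests oldest first; stop if the service is down."""
        handled = 0
        rows = self.db.execute("SELECT id, body FROM requests ORDER BY id").fetchall()
        for row_id, body in rows:
            try:
                handler(body)
            except ConnectionError:
                break  # Service still offline: keep this and later requests queued.
            # Remove the request only after it was handled successfully.
            self.db.execute("DELETE FROM requests WHERE id = ?", (row_id,))
            self.db.commit()
            handled += 1
        return handled
```

Deleting a row only after the handler succeeds means a request is never lost, at the cost that a handler may see the same request twice if a crash happens between handling and deleting.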
The human element
There have been plenty of cases where a person has pushed an update or change and later found that it contributed to an app or service going offline or running in a degraded state. People make mistakes; we are human after all, not machines. Even after an entire team has reviewed an update or change, we still manage at times to miss something.
One practice that helps is for a Dev, Ops, or other team (for example, a separate platform team) to build and deploy systems using Infrastructure as Code rather than building and deploying them manually. One other quick note here: whether a team follows Agile, Scrum, or Kanban, if the chosen method is too restrictive, doesn’t work well, or an individual or group simply lacks training, team members may resort to manual updates or changes to the infrastructure. The running infrastructure will then be out of sync with the code that has been checked into the code repository.
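The out-of-sync problem is often called drift, and the check for it is conceptually simple: compare what the repository declares with what is actually running. The sketch below assumes both sides can be flattened into plain dictionaries; the `detect_drift` function and the setting names in the example are hypothetical, and real Infrastructure as Code tools provide their own drift or plan commands.

```python
from typing import Any, Dict

MISSING = "<missing>"  # Marker for a setting that exists on only one side.


def detect_drift(declared: Dict[str, Any], running: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return every setting whose running value differs from the declared one."""
    drift = {}
    for key in sorted(declared.keys() | running.keys()):
        want = declared.get(key, MISSING)
        have = running.get(key, MISSING)
        if want != have:
            drift[key] = {"declared": want, "running": have}
    return drift
```

For example, if the repository declares `{"instance_type": "t3.small", "port": 443}` and someone manually resized the machine and switched on a debug flag, `detect_drift` reports both `instance_type` and `debug`, which is a signal to either revert the change or capture it in code.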
Scalability
When we think about scalability, we may think about how a system responds, positively or negatively, when we increase the load, such as the number of requests or the number of users accessing an app or service. During load testing, a team may set the total number of users to simulate, the spawn rate, and so on. The team can then analyze the data from a load test to identify potential failures and make updates as needed to accommodate scaling up.
Some systems have batch and queue processing for jobs that need to complete quickly or interactively and are tightly aligned with a business process, while other jobs may run longer or even be scheduled to run overnight. With either job type, we are interested in how long it takes to process the job(s), how many jobs can be completed per minute or per hour, and whether throughput decreases as the number of concurrent jobs increases. Maybe you have business requirements where specific financial jobs must finish within 24 hours for month-end processing. No matter what other ad-hoc jobs run, those scheduled or overnight jobs must complete within 24 hours, so perhaps they need their own dedicated queue or system resources allocated at specific times.
Locust (locust.io) is an open-source, Python-based tool for load testing a website, API, or app. Locust reports statistics per request type and name, including the number of failures and the median, average, minimum, and maximum response times in milliseconds (ms).
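As a rough illustration of what a load test does, the sketch below simulates a number of users calling a service concurrently and records failures and latencies. It uses only the standard library. `handle_request` is a hypothetical stand-in for the system under test, with an artificial failure for every tenth user, and the spawn rate is deliberately not modeled.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def handle_request(user_id: int) -> str:
    """Hypothetical stand-in for the app or service under test."""
    time.sleep(0.001)  # Pretend to do some work.
    if user_id % 10 == 0:
        raise RuntimeError("simulated failure")
    return "ok"


def timed_call(user_id: int) -> Tuple[bool, float]:
    """Call the service once and return (succeeded, latency in ms)."""
    start = time.perf_counter()
    try:
        handle_request(user_id)
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, (time.perf_counter() - start) * 1000.0


def run_load_test(total_users: int, concurrency: int) -> Dict[str, float]:
    """Simulate total_users requests with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(1, total_users + 1)))
    latencies_ms = [ms for _, ms in results]
    return {
        "requests": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "max_ms": max(latencies_ms),
    }
```

Running `run_load_test(50, 5)` and then raising the concurrency shows whether failures or latencies climb with load. A dedicated tool adds the realistic ramp-up and reporting that this sketch leaves out.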
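The 24-hour requirement lends itself to a quick back-of-the-envelope check. The sketch below assumes every job takes the same time, workers are dedicated to these jobs, and there is no contention between them; the function names and the numbers in the example are made up for illustration.

```python
import math


def hours_to_finish(job_count: int, minutes_per_job: float, workers: int) -> float:
    """Estimate wall-clock hours when `workers` jobs run at the same time."""
    batches = math.ceil(job_count / workers)  # Rounds of jobs run back to back.
    return batches * minutes_per_job / 60.0


def fits_window(job_count: int, minutes_per_job: float, workers: int,
                window_hours: float = 24.0) -> bool:
    """True if the estimated run time fits inside the processing window."""
    return hours_to_finish(job_count, minutes_per_job, workers) <= window_hours
```

With 600 month-end jobs at 12 minutes each, 4 dedicated workers need 150 rounds, or 30 hours, which misses the window. Raising the pool to 6 workers brings that down to 100 rounds, or 20 hours, which fits.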
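To show what those statistics mean, here is a small standard-library sketch that computes the same kind of summary from raw results. It does not use Locust itself; the `summarize` function and the `(succeeded, latency_ms)` input shape are assumptions made for this example.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(results: List[Tuple[bool, float]]) -> Dict[str, float]:
    """Summarize (succeeded, latency_ms) pairs for one request type."""
    latencies_ms = [ms for _, ms in results]
    return {
        "requests": len(results),
        "fails": sum(1 for ok, _ in results if not ok),
        "median_ms": statistics.median(latencies_ms),
        "average_ms": statistics.mean(latencies_ms),
        "min_ms": min(latencies_ms),
        "max_ms": max(latencies_ms),
    }
```

For five requests with latencies of 120, 80, 100, 400, and 90 ms, the median is 100 ms while the average is 158 ms. The gap between the two is a quick hint that a single slow request is skewing the average.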
Maintainability
The maintainability pillar should be about following best practices, which include keeping documentation in a central location and identifying current SMEs and assigning them to the various portions of a system. This is especially true of critical systems and often of legacy systems, where many if not all of the original team may no longer be there or have been reassigned to other projects and work. The Dev or DevOps teams still need to keep the system up and running and keep performance at the desired level. As I mentioned earlier when covering logs, monitoring the health of the system is really important. Sometimes you can predict, and be alerted to, a situation that is developing and could cause an unplanned outage or downtime. We also want to be proactive, which includes keeping a system up to date with regular security patches and updates.
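As one example of predicting a developing situation, the sketch below extrapolates disk usage in a straight line and raises an alert if the disk is projected to fill within a chosen horizon. The metric, the linear-growth assumption, and the function names are all hypothetical choices for illustration. Real monitoring tools offer more robust forecasting.

```python
from typing import List, Optional, Tuple


def hours_until_full(samples: List[Tuple[float, float]], capacity_gb: float) -> Optional[float]:
    """Project time to full from (hour, used_gb) samples ordered by time."""
    (first_hour, first_used), (last_hour, last_used) = samples[0], samples[-1]
    growth_per_hour = (last_used - first_used) / (last_hour - first_hour)
    if growth_per_hour <= 0:
        return None  # Usage is flat or shrinking, so there is nothing to predict.
    return (capacity_gb - last_used) / growth_per_hour


def should_alert(samples: List[Tuple[float, float]], capacity_gb: float,
                 horizon_hours: float = 24.0) -> bool:
    """Alert when the disk is projected to fill within the horizon."""
    remaining = hours_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= horizon_hours
```

If usage grew from 70 GB to 80 GB over ten hours on a 100 GB disk, the projection is 20 hours until full, so a 24-hour horizon triggers the alert before the outage rather than after it.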
Security
The Security pillar is a very important one. In fact, it’s part of the Amazon Web Services Well-Architected Framework. This pillar focuses heavily on protecting information, systems, and assets while still delivering business value, using various mitigation strategies.
Taking a deeper dive into the AWS Security pillar, there are five main areas: Identity & Access Management, Data Protection, Detective Controls, Infrastructure Protection, and Incident Response. I’m focusing heavily on AWS services in this section, but some of the fundamentals apply outside of AWS.
- Identity & Access Management: covers AWS services such as AWS IAM, AWS Directory Service, and AWS Organizations.
- Data Protection: covers AWS KMS and AWS CloudHSM.
- Detective Controls: covers AWS CloudTrail, AWS Config, AWS Security Hub, and Amazon GuardDuty.
- Infrastructure Protection: covers Amazon VPC, AWS WAF, and AWS Systems Manager.
- Incident Response: includes AWS CloudTrail, Amazon SNS, and Amazon CloudWatch.
Expanding further on the five areas of the AWS Security pillar, each has some important but simple best practices.
Identity & Access Management
- When an AWS account is established, a root user is created. The root user should not be used for everyday tasks; avoiding it reduces the overall attack surface.
- Enable MFA on the root account and on all IAM user accounts.
- Use IAM permissions boundaries. Regardless of the permissions assigned to a role, the permissions boundary restricts the effective permissions.
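The effect of a permissions boundary can be pictured as a set intersection: an action is allowed only if both the role’s own policy and the boundary allow it. The sketch below is a deliberate simplification that treats policies as plain sets of action names. Real IAM evaluation also handles wildcards, explicit denies, resource policies, and organization-level policies, and the `effective_permissions` function is hypothetical.

```python
from typing import Set


def effective_permissions(identity_policy: Set[str], boundary: Set[str]) -> Set[str]:
    """Allow only actions granted by BOTH the role's policy and the boundary."""
    return identity_policy & boundary


# Example: the role is granted more than the boundary permits.
role_policy = {"s3:GetObject", "s3:DeleteObject", "ec2:TerminateInstances"}
boundary = {"s3:GetObject", "s3:PutObject"}
allowed = effective_permissions(role_policy, boundary)
```

Here `allowed` contains only `s3:GetObject`. The boundary does not grant `s3:PutObject` on its own, and it prevents the role from using the delete and terminate actions it was otherwise given.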
Detective Controls
- Identify security misconfigurations
- Identify threats, threat actors, or other unexpected behavior
- Provide alerting, metrics, and event notifications
Enable and use AWS Security Hub to collect security data from across AWS accounts, services, and supported third-party partner products, and to help analyze the findings for trends. The integrated AWS services include the following:
- Amazon GuardDuty
- Amazon Macie
- AWS Firewall Manager
- AWS Config
- Amazon Inspector
- IAM Access Analyzer