From 588a46cc75106623d93970d4eef3dde5ad6ad85a Mon Sep 17 00:00:00 2001 From: Larry Peterson Date: Tue, 3 Jun 2025 14:02:04 -0700 Subject: [PATCH] all 2nd print changes --- arch.rst | 34 +++++++++---------- authors.rst | 78 ++++++++++++++++++++++--------------------- control.rst | 42 +++++++++++------------ intro.rst | 66 +++++++++++++++++++----------------- lifecycle.rst | 92 +++++++++++++++++++++++++++++++++------------------ monitor.rst | 38 ++++++++++----------- preface.rst | 55 ++++++++++++++++-------------- provision.rst | 27 +++++++-------- 8 files changed, 235 insertions(+), 197 deletions(-) diff --git a/arch.rst b/arch.rst index a6fd6b0..55ad965 100644 --- a/arch.rst +++ b/arch.rst @@ -187,10 +187,10 @@ cluster built out of bare-metal components, each of the SD-Core CP subsystems shown in :numref:`Figure %s ` is actually deployed in a logical Kubernetes cluster on a commodity cloud. The same is true for AMP. Aether’s centralized components are able to run -in Google Cloud Platform, Microsoft Azure, and Amazon’s AWS. They also +in Google Cloud Platform, Microsoft Azure, and Amazon’s AWS. They can also run as an emulated cluster implemented by a system like KIND—Kubernetes in Docker—making it possible for developers to run -these components on their laptop. +these components on their laptops. To be clear, Kubernetes adopts generic terminology, such as “cluster” and “service”, and gives it a very specific meaning. In @@ -239,8 +239,8 @@ There is a potential third stakeholder of note—third-party service providers—which points to the larger issue of how we deploy and manage additional edge applications. To keep the discussion tangible—but remaining in the open source arena—we use OpenVINO as an illustrative -example. OpenVINO is a framework for deploying AI inference models, -which is interesting in the context of Aether because one of its use +example. OpenVINO is a framework for deploying AI inference models. +It is interesting in the context of Aether because one of its use cases is processing video streams, for example to detect and count people who enter the field of view of a collection of 5G-connected cameras. @@ -274,11 +274,11 @@ but for completeness, we take note of two other possibilities. One is that we extend our hybrid architecture to support independent third-party service providers. Each new edge service acquires its own isolated Kubernetes cluster from the edge cloud, and then the -3rd-party provider subsumes all responsibility for managing the +3rd-party provider takes over all responsibility for managing the service running in that cluster. From the perspective of the cloud operator, though, the task just became significantly more difficult because the architecture would need to support Kubernetes as a managed -service, which is sometimes called *Container-as-a-Service (CaaS)*.\ [#]_ +service, which is sometimes called *Containers-as-a-Service (CaaS)*.\ [#]_ Creating isolated Kubernetes clusters on-demand is a step further than we take things in this book, in part because there is a second possible answer that seems more likely to happen. @@ -355,7 +355,7 @@ Internally, each of these subsystems is implemented as a highly available cloud service, running as a collection of microservices. The design is cloud-agnostic, so AMP can be deployed in a public cloud (e.g., Google Cloud, AWS, Azure), an operator-owned Telco cloud, (e.g, -AT&T’s AIC), or an enterprise-owned private cloud. For the current pilot +AT&T’s AIC), or an enterprise-owned private cloud. 
For the pilot deployment of Aether, AMP runs in the Google Cloud. The rest of this section introduces these four subsystems, with the @@ -485,9 +485,9 @@ Given this mediation role, Runtime Control provides mechanisms to model (represent) the abstract services to be offered to users; store any configuration and control state associated with those models; apply that state to the underlying components, ensuring they remain in -sync with the operator’s intentions; and authorize the set API calls -users try to invoke on each service. These details are spelled out in -Chapter 5. +sync with the operator’s intentions; and authorize the set of API +calls that users try to invoke on each service. These details are +spelled out in Chapter 5. 2.4.4 Monitoring and Telemetry @@ -526,13 +526,13 @@ diagnostics and analytics. This overview of the management architecture could lead one to conclude that these four subsystems were architected, in a rigorous, -top-down fashion, to be completely independent. But that is not -the case. It is more accurate to say that the system evolved bottom -up, solving the next immediate problem one at a time, all the while +top-down fashion, to be completely independent. But that is not the +case. It is more accurate to say that the system evolved bottom up, +solving the next immediate problem one at a time, all the while creating a large ecosystem of open source components that can be used -in different combinations. What we are presenting in this book is a -retrospective description of an end result, organized into four -subsystems to help make sense of it all. +in different combinations. What this book presents is a retrospective +description of the end result, organized into four subsystems to help +make sense of it all. There are, in practice, many opportunities for interactions among the four components, and in some cases, there are overlapping concerns @@ -686,7 +686,7 @@ own. The Control and Management Platform now has its own DevOps team(s), who in addition to continually improving the platform, also field operational events, and when necessary, interact with other teams (e.g., the SD-RAN team in Aether) to resolve issues that come -up. They are sometimes called System Reliability Engineers (SREs), and +up. They are sometimes called Site Reliability Engineers (SREs), and in addition to being responsible for the Control and Management Platform, they enforce operational discipline—the third aspect of DevOps discussed next—on everyone else. diff --git a/authors.rst b/authors.rst index 55859dc..ec8d718 100644 --- a/authors.rst +++ b/authors.rst @@ -6,47 +6,49 @@ Science, Emeritus at Princeton University, where he served as Chair from 2003-2009. His research focuses on the design, implementation, and operation of Internet-scale distributed systems, including the widely used PlanetLab and MeasurementLab platforms. He is currently -contributing to the Aether access-edge cloud project at the Open -Networking Foundation (ONF), where he serves as Chief Scientist. -Peterson is a member of the National Academy of Engineering, a Fellow -of the ACM and the IEEE, the 2010 recipient of the IEEE Kobayashi -Computer and Communication Award, and the 2013 recipient of the ACM -SIGCOMM Award. He received his Ph.D. degree from Purdue University. +contributing to the Aether access-edge cloud project at the Linux +Foundation. 
Peterson is a member of the National Academy of +Engineering, a Fellow of the ACM and the IEEE, the 2010 recipient of +the IEEE Kobayashi Computer and Communication Award, and the 2013 +recipient of the ACM SIGCOMM Award. He received his Ph.D. degree from +Purdue University. -**Scott Baker** is a Cloud Software Architect at Intel, which he -joined as part of Intel's acquisition of the Open Networking -Foundation (ONF) engineering team. While at ONF, he led the Aether -DevOps team. Prior to ONF, he worked on cloud-related research -projects at Princeton and the University of Arizona, including -PlanetLab, GENI, and VICCI. Baker received his Ph.D. in Computer -Science from the University of Arizona in 2005. +**Scott Baker** is a Cloud Software Architect at Intel, where he works +on the Open Edge Platform. Prior to joining Intel, he was on the Open +Networking Foundation (ONF) engineering team that built Aether, +leading the runtime control effort. Baker has also worked on +cloud-related research projects at Princeton and the University of +Arizona, including PlanetLab, GENI, and VICCI. He received his +Ph.D. in Computer Science from the University of Arizona in 2005. -**Andy Bavier** is a Cloud Software Engineer at Intel, which he joined -as part of Intel's acquisition of the Open Networking Foundation (ONF) -engineering team. While at ONF, he worked on the Aether project. Prior -to joining ONF, he was a Research Scientist at Princeton University, -where he worked on the PlanetLab project. Bavier received a BA in -Philosophy from William & Mary in 1990, and MS in Computer Science -from the University of Arizona in 1995, and a PhD in Computer Science -from Princeton University in 2004. +**Andy Bavier** is a Cloud Software Engineer at Intel, where he works +on the Open Edge Platform. Prior to joining Intel, he was on the Open +Networking Foundation (ONF) engineering team that built Aether, +leading the observability effort. Bavier has also been a Research +Scientist at Princeton University, where he worked on the PlanetLab +project. He received a BA in Philosophy from William & Mary in 1990, +and MS in Computer Science from the University of Arizona in 1995, and +a PhD in Computer Science from Princeton University in 2004. -**Zack Williams** is a Cloud Software Engineer at Intel, which he -joined as part of Intel's acquisition of the Open Networking -Foundation (ONF) engineering team. While at ONF, he worked on the -Aether project, and led the Infrastructure team. Prior to joining ONF, -he was a systems programmer at the University of Arizona. Williams -received his BS in Computer Science from the University of Arizona -in 2001. +**Zack Williams** is a Cloud Software Engineer at Intel, where he +works on the Open Edge Platform. Prior to joining Intel, he was on the +Open Networking Foundation (ONF) engineering team that built +Aether, leading the infrastructure provisioning effort. Williams has also +been a systems programmer at the University of Arizona. He received +his BS in Computer Science from the University of Arizona in 2001. **Bruce Davie** is a computer scientist noted for his contributions to -the field of networking. He is a former VP and CTO for the Asia -Pacific region at VMware. He joined VMware during the acquisition of -Software Defined Networking (SDN) startup Nicira. Prior to that, he -was a Fellow at Cisco Systems, leading a team of architects -responsible for Multiprotocol Label Switching (MPLS). 
Davie has over -30 years of networking industry experience and has co-authored 17 -RFCs. He was recognized as an ACM Fellow in 2009 and chaired ACM -SIGCOMM from 2009 to 2013. He was also a visiting lecturer at the -Massachusetts Institute of Technology for five years. Davie is the -author of multiple books and the holder of more than 40 U.S. Patents. +the field of networking. He began his networking career at Bellcore +where he worked on the Aurora Gigabit testbed and collaborated with +Larry Peterson on high-speed host-network interfaces. He then went to +Cisco where he led a team of architects responsible for Multiprotocol +Label Switching (MPLS). He worked extensively at the IETF on +standardizing MPLS and various quality of service technologies. He +also spent five years as a visiting lecturer at the Massachusetts +Institute of Technology. In 2012 he joined Software Defined Networking +(SDN) startup Nicira and was then a principal engineer at VMware +following the acquisition of Nicira. In 2017 he took on the role of VP +and CTO for the Asia Pacific region at VMware. He is a Fellow of the +ACM and chaired ACM SIGCOMM from 2009 to 2013. Davie is the author of +multiple books and the holder of more than 40 U.S. patents. diff --git a/control.rst b/control.rst index 75821b6..bd63be8 100644 --- a/control.rst +++ b/control.rst @@ -81,7 +81,7 @@ deployments of 5G, and to that end, defines a *user* to be a principal that accesses the API or GUI portal with some prescribed level of privilege. There is not necessarily a one-to-one relationship between users and Core-defined subscribers, and more importantly, not all -devices have subscribers, as would be the case with IoT devices that +devices have subscribers; a concrete example would be IoT devices that are not typically associated with a particular person. 5.1 Design Overview @@ -115,7 +115,7 @@ Central to this role is the requirement that Runtime Control be able to represent a set of abstract objects, which is to say, it implements a *data model*. While there are several viable options for the specification language used to represent the data model, for Runtime -Control we use YANG. This is for three reasons. First, YANG is a rich +Control Aether uses YANG. This is for three reasons. First, YANG is a rich language for data modeling, with support for strong validation of the data stored in the models and the ability to define relations between objects. Second, it is agnostic as to how the data is stored (i.e., @@ -155,7 +155,7 @@ that we can build upon. from (1) a GUI, which is itself typically built using another framework, such as AngularJS; (2) a CLI; or (3) a closed-loop control program. There are other differences—for example, - Adapters (a kind of Controller) use gNMI as a standard + Adaptors (a kind of Controller) use gNMI as a standard interface for controlling backend components, and persistent state is stored in a key-value store instead of a SQL DB—but the biggest difference is the use of a declarative rather than an @@ -168,11 +168,11 @@ x-config, in turn, uses Atomix (a key-value store microservice), to make configuration state persistent. Because x-config was originally designed to manage configuration state for devices, it uses gNMI as its southbound interface to communicate configuration changes to -devices (or in our case, software services). An Adapter has to be +devices (or in our case, software services). An Adaptor has to be written for any service/device that does not support gNMI -natively. 
These adapters are shown as part of Runtime Control in +natively. These adaptors are shown as part of Runtime Control in :numref:`Figure %s `, but it is equally correct to view each -adapter as part of the backend component, responsible for making that +adaptor as part of the backend component, responsible for making that component management-ready. Finally, Runtime Control includes a Workflow Engine that is responsible for executing multi-step operations on the data model. This happens, for example, when a change @@ -428,8 +428,8 @@ models are changing due to volatility in the backend systems they control, then it is often the case that the models can be distinguished as "low-level" or "high-level", with only the latter directly visible to clients via the API. In semantic versioning terms, -a change to a low-level model would then effectively be a backwards -compatible PATCH. +a change to a low-level model would then effectively be a +backward-compatible PATCH. 5.2.3 Identity Management @@ -467,15 +467,15 @@ the case of Aether, Open Policy Agent (OPA) serves this role. `__. -5.2.4 Adapters +5.2.4 Adaptors ~~~~~~~~~~~~~~ Not every service or subsystem beneath Runtime Control supports gNMI, -and in the case where it is not supported, an adapter is written to +and in the case where it is not supported, an adaptor is written to translate between gNMI and the service’s native API. In Aether, for -example, a gNMI :math:`\rightarrow` REST adapter translates between +example, a gNMI :math:`\rightarrow` REST adaptor translates between the Runtime Control’s southbound gNMI calls and the SD-Core -subsystem’s RESTful northbound interface. The adapter is not +subsystem’s RESTful northbound interface. The adaptor is not necessarily just a syntactic translator, but may also include its own semantic layer. This supports a logical decoupling of the models stored in x-config and the interface used by the southbound @@ -484,15 +484,15 @@ Control to evolve independently. It also allows for southbound devices/services to be replaced without affecting the northbound interface. -An adapter does not necessarily support only a single service. An -adapter is one means of taking an abstraction that spans multiple +An adaptor does not necessarily support only a single service. An +adaptor is one means of taking an abstraction that spans multiple services and applying it to each of those services. An example in Aether is the *User Plane Function* (the main packet-forwarding module in the SD-Core User Plane) and *SD-Core*, which are jointly -responsible for enforcing *Quality of Service*, where the adapter +responsible for enforcing *Quality of Service*, where the adaptor applies a single set of models to both services. Some care is needed to deal with partial failure, in case one service accepts the change, -but the other does not. In this case, the adapter keeps trying the +but the other does not. In this case, the adaptor keeps trying the failed backend service until it succeeds. 5.2.5 Workflow Engine @@ -519,7 +519,7 @@ ongoing development. gNMI naturally lends itself to mutual TLS for authentication, and that is the recommended way to secure communications between components that speak gNMI. For example, communication between x-config and -its adapters uses gNMI, and therefore, uses mutual TLS. Distributing +its adaptors uses gNMI, and therefore, uses mutual TLS. Distributing certificates between components is a problem outside the scope of Runtime Control. 
It is assumed that another tool will be responsible for distributing, revoking, and renewing certificates. @@ -738,7 +738,7 @@ that it supports the option of spinning up an entirely new copy of the SD-Core rather than sharing an existing UPF with another Slice. This is done to ensure isolation, and illustrates one possible touch-point between Runtime Control and the Lifecycle Management subsystem: -Runtime Control, via an Adapter, engages Lifecycle Management to +Runtime Control, via an Adaptor, engages Lifecycle Management to launch the necessary set of Kubernetes containers that implement an isolated slice. @@ -802,7 +802,7 @@ Giving enterprises the ability to set isolation and QoS parameters is an illustrative example in Aether. Auto-generating that API from a set of models is an attractive approach to realizing such a control interface, if for no other reason than it forces a decoupling of the -interface definition from the underlying implementation (with Adapters +interface definition from the underlying implementation (with Adaptors bridging the gap). .. sidebar:: UX Considerations @@ -839,7 +839,7 @@ configuration change requires a container restart, then there may be little choice. But ideally, microservices are implemented with their own well-defined management interfaces, which can be invoked from either a configuration-time Operator (to initialize the component at -boot time) or a control-time Adapter (to change the component at +boot time) or a control-time Adaptor (to change the component at runtime). For resource-related operations, such as spinning up additional @@ -847,7 +847,7 @@ containers in response to a user request to create a *Slice* or activate an edge service, a similar implementation strategy is feasible. The Kubernetes API can be called from either Helm (to initialize a microservice at boot time) or from a Runtime Control -Adapter (to add resources at runtime). The remaining challenge is +Adaptor (to add resources at runtime). The remaining challenge is deciding which subsystem maintains the authoritative copy of that state, and ensuring that decision is enforced as a system invariant.\ [#]_ Such decisions are often situation-dependent, but our experience is diff --git a/intro.rst b/intro.rst index 0e04589..16b27c1 100644 --- a/intro.rst +++ b/intro.rst @@ -71,8 +71,8 @@ like. Our approach is to focus on the fundamental problems that must be addressed—design issues that are common to all clouds—but then couple this conceptual discussion with specific engineering choices made while operationalizing a specific enterprise cloud. Our example -is Aether, an ONF project to support 5G-enabled edge clouds as a -managed service. Aether has the following properties that make it an +is Aether, an open source edge cloud that supports 5G connectivity as +a managed service. Aether has the following properties that make it an interesting use case to study: * Aether starts with bare-metal hardware (servers and switches) @@ -111,7 +111,7 @@ because each of these three domains brings its own conventions and terminology to the table. But understanding how these three stakeholders approach operationalization gives us a broader perspective on the problem. We return to the confluence of enterprise, -cloud, access technologies later in this chapter, but we start by +cloud, and access technologies later in this chapter, but we start by addressing the terminology challenge. .. _reading_aether: @@ -232,8 +232,9 @@ terminology. 
process and Operational requirements silos, balancing feature velocity against system reliability. As a practice, it leverages CI/CD methods and is typically associated with container-based - (also known as *cloud native*) systems, as typified by *Site - Reliability Engineering (SRE)* practiced by cloud providers like + (also known as *cloud native*) systems. There is some overlap + between DevOps and *Site + Reliability Engineering (SRE)* as practiced by cloud providers such as Google. * **In-Service Software Upgrade (ISSU):** A requirement that a @@ -374,10 +375,10 @@ manageable: * Zero-Touch Provisioning is more tractable because the hardware is commodity, and hence, (nearly) identical. This also means the vast - majority of configuration involves initiating software parameters, + majority of configuration involves initializing software parameters, which is more readily automated. -* Cloud native implies a set of best-practices for addressing many of +* Cloud native implies a set of best practices for addressing many of the FCAPS requirements, especially as they relate to availability and performance, both of which are achieved through horizontal scaling. Secure communication is also typically built into cloud RPC @@ -386,7 +387,7 @@ manageable: Another way to say this is that by rearchitecting bundled appliances and devices as horizontally scalable microservices running on commodity hardware, what used to be a set of one-off O&M problems are -now solved by widely applied best-practices from distributed systems, +now solved by widely applied best practices from distributed systems, which have in turn been codified in state-of-the-art cloud management frameworks (like Kubernetes). This leaves us with the problem of (a) provisioning commodity hardware, (b) orchestrating the container @@ -482,10 +483,10 @@ software components, which we describe next. Collectively, all the hardware and software components shown in the figure form the *platform*. Where we draw the line between what's *in the platform* and what runs *on top of the platform*, and why it is important, will -become clear in later chapters, but the summary is that different -mechanisms will be responsible for (a) bringing up the platform and -prepping it to host workloads, and (b) managing the various workloads -that need to be deployed on that platform. +become clear in later chapters. The summary is that one mechanism is +responsible for bringing up the platform and preparing it to host +workloads, and a different mechanism is responsible for managing the +various workloads that are deployed on that platform. 1.3.2 Software Building Blocks @@ -504,7 +505,7 @@ commodity processors in the cluster: interconnected to build applications. These are all well known and ubiquitous, and so we only summarize them -here. Links to related information for anyone that is not familiar +here. Links to related information for anyone who is not familiar with them (including excellent hands-on tutorials for the three container-related building blocks) are given below. @@ -578,7 +579,7 @@ these open building blocks can be assembled into a comprehensive cloud management platform. We describe each tool in enough detail to appreciate how all the parts fit together—providing end-to-end coverage by connecting all the dots—plus links to full documentation -for those that want to dig deeper into the details. +for those who want to dig deeper into the details. .. 
List: NexBox, Ansible, Netplan, Terraform, Rancher, Fleet, @@ -710,22 +711,27 @@ describe how to introduce VMs as an optional way to provision the underlying infrastructure for that PaaS. Finally, the Aether edge cloud we use as an example is similar to many -other edge cloud platforms now being promoted as an enabling -technology for Internet-of-Things. That Kubernetes-based on-prem/edge -clouds are becoming so popular is one reason they make for such a good -case study. For example, *Smart Edge Open* (formerly known as -OpenNESS) is another open source edge platform, unique in that it -includes several Intel-specific acceleration technologies (e.g., DPDK, -SR-IOV, OVS/OVN). For our purposes, however, the exact set of -components that make up the platform is less important than how the -platform, along with all the cloud services that run on top of it, are -managed as a whole. The Aether example allows us to be specific, but -hopefully not at the expense of general applicability. +other cloud platforms being built to support on-prem deployments. +The dominant use case shifts over time—with Artificial Intelligence +(AI) recently overtaking Internet-of-Things (IoT) as the most +compelling justification for edge clouds—but the operational +challenge remains the same. For example, *Open Edge Platform* recently +open sourced by Intel includes example AI applications and a +collection of AI libraries, but also an *Edge Management Framework* +that mirrors the one describe this book. It starts with a Kubernetes +foundation, and includes tools for provisioning edge servers, +orchestrating edge clusters using those servers, lifecycle managing +edge applications, and enabling observability. Many of the engineering +choices are the same as in Aether (some are different), but the +important takeaway is that Kubernetes-based edge clouds are quickly +becoming commonplace. That's the reason they are such a good case +study. .. admonition:: Further Reading - `Smart Edge Open - `__. + `Open Edge Platform `__. + + `Edge Management Framework `__. 1.4 Future of the Sysadmin -------------------------- @@ -743,7 +749,7 @@ Cloud providers, because of the scale of the systems they build, cannot survive with operational silos, and so they introduced increasingly sophisticated cloud orchestration technologies. Kubernetes and Helm are two high-impact examples. These -cloud best-practices are now available to enterprises as well, but +cloud best practices are now available to enterprises as well, but they are often bundled as a managed service, with the cloud provider playing an ever-greater role in operating the enterprise’s services. Outsourcing portions of the IT responsibility to a cloud provider is an @@ -756,9 +762,9 @@ within the enterprise, deployed as yet another cloud service. The approach this book takes is to explore a best-of-both-worlds opportunity. It does this by walking you through the collection of subsystems, and associated management processes, required to -operationalize an on-prem cloud, and then provide on-going support for +operationalize an on-premises cloud, and then provide on-going support for that cloud and the services it hosts (including 5G connectivity). Our hope is that understanding what’s under the covers of cloud-managed services will help enterprises better share responsibility for -managing their IT infrastructure with cloud providers, and potentially +managing their IT infrastructure with cloud providers, and potentially with MNOs. 
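
To make the container-orchestration building block from Section 1.3.2
concrete, here is a minimal sketch of the kind of declarative
specification Kubernetes consumes. The workload name and image are
hypothetical placeholders, not anything drawn from Aether; the point is
simply that availability and performance are expressed as a desired
replica count that Kubernetes continually reconciles, the
horizontal-scaling pattern this chapter appeals to.

.. code-block:: yaml

   # Hypothetical microservice declaration (not part of Aether)
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: example-inference
   spec:
     replicas: 3                 # horizontal scaling: run three identical copies
     selector:
       matchLabels:
         app: example-inference
     template:
       metadata:
         labels:
           app: example-inference
       spec:
         containers:
         - name: inference
           image: registry.example.com/inference:1.0.0   # placeholder image
           ports:
           - containerPort: 8080

Helm charts, introduced above as one of the building blocks, are
essentially templated bundles of declarations like this one, which is
why they reappear in Chapter 4 as the unit by which applications are
deployed.
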
diff --git a/lifecycle.rst b/lifecycle.rst index 4638f93..9a5d6da 100644 --- a/lifecycle.rst +++ b/lifecycle.rst @@ -10,15 +10,14 @@ assume the base platform includes Linux running on each server and switch, plus Docker, Kubernetes, and Helm, with SD-Fabric controlling the network. -While we could take a narrow view of Lifecycle Management, and assume -the software we want to roll out has already gone through an off-line -integration-and-testing process (this is the traditional model of -vendors releasing a new version of their product), we take a more -expansive approach that starts with the development process—the creation -of new features and capabilities. Including the “innovation” step -closes the virtuous cycle depicted in :numref:`Figure %s`, -which the cloud industry has taught us leads to greater *feature -velocity*. +Traditionally, software would go through an offline integration and +testing process before any effort to roll it out in production could +begin. However, the approach taken in most modern cloud environments, +including ours, is more expansive: it starts with the development +process—the creation of new features and capabilities. Including the +“innovation” step closes the virtuous cycle depicted in +:numref:`Figure %s`, which the cloud industry has taught us +leads to greater *feature velocity*. .. _fig-cycle: .. figure:: figures/Slide9.png @@ -185,7 +184,7 @@ effective use of automation. This section introduces an approach to test automation, but we start by talking about the overall testing strategy. -The best-practice for testing in the Cloud/DevOps environment is to +The best practice for testing in the Cloud/DevOps environment is to adopt a *Shift Left* strategy, which introduces tests early in the development cycle, that is, on the left side of the pipeline shown in :numref:`Figure %s `. To apply this principle, you first @@ -312,16 +311,14 @@ switches). Example Testing Frameworks used in Aether. -Some of the frameworks shown in :numref:`Figure %s -` were co-developed with the corresponding software -component. This is true of TestVectors and TestON, which put -customized workloads on Stratum (SwitchOS) and ONOS (NetworkOS), -respectively. Both are open source, and hence available to pursue for -insights into the challenges of building a testing framework. In -contrast, NG40 is a proprietary framework for emulating 3GPP-compliant -cellular network traffic, which due to the complexity and value in -demonstrating adherence to the 3GPP standard, is a closed, commercial -product. +Some of the frameworks shown in :numref:`Figure %s ` were +co-developed with the corresponding software component. This is true +of TestVectors and TestON, which put customized workloads on Stratum +(SwitchOS) and ONOS (NetworkOS), respectively. Both are open source, +and hence available to be perused for insights into the challenges of +building a testing framework. In contrast, NG40 is a +close source, proprietary framework for emulating 3GPP-compliant +cellular network traffic. Selenium and Robot are the most general of the five examples. Each is an open source project with an active developer community. Selenium is a @@ -476,7 +473,7 @@ publish a new Docker image, triggered by a change to a ``VERSION`` file stored in the code repo. (We'll see why in Section 4.5.) 
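
Before turning to Aether's actual Groovy pipelines, a hedged sketch of
the trigger pattern just described may help. It is written as a GitHub
Actions workflow rather than as a Jenkins job (GitHub Actions is the
alternative CI mechanism discussed in a sidebar later in this section),
and the component name, registry, and secret are hypothetical
placeholders.

.. code-block:: yaml

   # Hypothetical per-component publish job (GitHub Actions syntax)
   name: publish-image
   on:
     push:
       branches: [ main ]
       paths: [ 'VERSION' ]      # run only when the version file changes
   jobs:
     publish:
       runs-on: ubuntu-latest
       steps:
         - uses: actions/checkout@v4
         - name: Read version
           run: echo "VERSION=$(cat VERSION)" >> "$GITHUB_ENV"
         - name: Build and push container image
           run: |
             docker build -t registry.example.com/component:${VERSION} .
             echo "${{ secrets.REGISTRY_TOKEN }}" | \
               docker login registry.example.com -u ci --password-stdin
             docker push registry.example.com/component:${VERSION}

The ``paths`` filter plays the same role as the Jenkins trigger
described above: nothing is published until the ``VERSION`` file
itself changes.
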
As an illustrative example, the following is from a Groovy script that -defines the pipeline for testing the Aether API, which as we'll see in +defines the pipeline for testing the Aether API, which, as we'll see in the next chapter, is auto-generated by the Runtime Control subsystem. We're interested in the general form of the pipeline, so omit most of the details, but it should be clear from the example what @@ -514,6 +511,34 @@ patch set. .. literalinclude:: code/trigger-event.yaml + +.. sidebar:: Balancing DIY Tools with Cloud Services + + *Aether uses Jenkins as our CI tool, but another popular option is + GitHub Actions. This is a relatively new feature of GitHub (the + cloud service, not to be confused with the software tool Git). GitHub Actions augment + the code repo with a set of workflows that can be executed every + time a patch is submitted. In this setting, a workflow is roughly + analogous to a Groovy pipeline.* + + *GitHub actions are especially convenient for open source projects + because they include spinning up a container in which the workflow + runs (for free, but with limits). A mixed strategy would be to run + simple GitHub Actions for unit and smoke tests when code is + checked in, but then use Jenkins to manage complex integration + tests that require additional testing resources (e.g., a full QA + cluster).* + + *GitHub Actions are not unique. Many of the open source options + described in this book are paired with a cloud service + counterpart. The key consideration is how much you want to depend + on a service someone else provides versus depending entirely on + services you install and manage yourself. The former can be + easier, but comes with the risk that the provider changes (or + discontinues) the service. The same can be said of open source + projects, but having access to source code gives you more + control over your fate.* + The important takeaway from this discussion is that there is no single or global CI job. There are many per-component jobs that independently publish deployable artifacts when conditions dictate. @@ -533,10 +558,10 @@ Config Repo, which includes both the set of Terraform Templates that specify the underlying infrastructure (we've been calling this the cloud platform) and the set of Helm Charts that specify the collection of microservices (sometimes called applications) that are to be -deployed on that infrastructure. We already know about Terraform from +deployed on that infrastructure. We discussed Terraform in Chapter 3: it's the agent that actually "acts on" the infrastructure-related forms. For its counterpart on the application -side we use an open source project called Fleet. +side Aether uses an open source project called Fleet. :numref:`Figure %s ` shows the big picture we are working towards. Notice that both Fleet and Terraform depend on the @@ -630,10 +655,11 @@ when. overloaded the repo. A "polling-frequency" parameter change improved the situation, but led people to wonder why Jenkins' trigger mechanism hadn't caused the same problem. The answer is - that Jenkins is better integrated with the repo (specifically, - Gerrit running on top of Git), with the repo pushing event - notifications to Jenkins when a file check-in actually occurs. - There is no polling.* + that Jenkins is better integrated with the repo, with a GitHub + webhook pushing event notifications to Jenkins when a file + check-in actually occurs. There is no polling. 
(Polling can also + be disabled in Fleet, in favor of webhooks, but polling is the + default.)* This focus on Fleet as the agent triggering the execution of Helm Charts should not distract from the central role of the charts @@ -677,8 +703,8 @@ Our starting point is to adopt the widely-accepted practice of version number *MAJOR.MINOR.PATCH* (e.g., ``3.2.4``), where the *MAJOR* version increments whenever you make an incompatible API change, the *MINOR* version increments when you add functionality in a -backward compatible way, and the *PATCH* corresponds to a backwards -compatible bug fix. +backward-compatible way, and the *PATCH* corresponds to a +backward-compatible bug fix. .. _reading_semver: .. admonition:: Further Reading @@ -704,7 +730,7 @@ the software lifecycle: * The commit that does correspond to a finalized patch is also tagged (in the repo) with the corresponding semantic version number. In - git, this tag is bound to a hash that unambiguously identifies the + Git, this tag is bound to a hash that unambiguously identifies the commit, making it the authoritative way of binding a version number to a particular instance of the source code. @@ -770,7 +796,7 @@ chapter. 4.6 Managing Secrets -------------------- -The discussion up this point has glossed over one important detail, +The discussion up to this point has glossed over one important detail, which is how secrets are managed. These include, for example, the credentials Terraform needs to access remote services like GCP, as well as the keys used to secure communication among microservices @@ -825,7 +851,7 @@ Controller to use its sealing key to help them unlock those secrets. While this approach is less general than the first (i.e., it is specific to protecting secrets within a Kubernetes cluster), it has -the advantage of taking humans completely out-of-the-loop, with the +the advantage of taking humans completely out of the loop, with the sealing key being programmatically generated at runtime. One complication, however, is that it is generally preferable for that secret to be written to persistent storage, to protect against having @@ -871,7 +897,7 @@ with a particular set of use cases in mind, but it is later integrated with other software to build entirely new cloud apps that have their own set of abstractions and features, and correspondingly, their own collection of configuration state. This is true for Aether, where the -SD-Core subsystem was originally implemented for use in global +SD-Core subsystem, for example, was originally implemented for use in global cellular networks, but is being repurposed to support private 4G/5G in enterprises. diff --git a/monitor.rst b/monitor.rst index 4eb27ac..034558e 100644 --- a/monitor.rst +++ b/monitor.rst @@ -77,7 +77,7 @@ closed-loop control where the automated tool not only detects problems but is also able to issue corrective control directives. For the purpose of this chapter, we give examples of the first two (alerts and dashboards), and declare the latter two (analytics and close-loop -control) as out-of-scope (but likely running as applications that +control) as out of scope (but likely running as applications that consume the telemetry data outlined in the sections that follow). 
Third, when viewed from the perspective of lifecycle management, @@ -96,9 +96,9 @@ Finally, because the metrics, logs, and traces collected by the various subsystems are timestamped, it is possible to establish correlations among them, which is helpful when debugging a problem or deciding whether or not an alert is warranted. We give examples of how -such telemetry-wide functions are implemented in practice today, as -well as discuss the future future of generating and using telemetry -data, in the final two sections of this chapter. +such telemetry-wide functions are implemented in practice today, and +discuss the future of generating and using telemetry data, in the +final two sections of this chapter. 6.1 Metrics and Alerts ------------------------------- @@ -170,7 +170,7 @@ to the central location (e.g., to be displayed by Grafana as described in the next subsection). This is appropriate for metrics that are both high-volume and seldom viewed. One exception is the end-to-end tests described in the previous paragraph. These results are immediately -pushed to the central site (bypassing the local Prometheus), because +pushed to the central site (bypassing the local Prometheus instance), because they are low-volume and may require immediate attention. 6.1.2 Creating Dashboards @@ -179,7 +179,7 @@ they are low-volume and may require immediate attention. The metrics collected by Prometheus are visualized using Grafana dashboards. In Aether, this means the Grafana instance running as part of AMP in the central cloud sends queries to some combination of -the central Prometheus and a subset of the Prometheus instances +the central Prometheus instance and a subset of the Prometheus instances running on edge clusters. For example, :numref:`Figure %s ` shows the summary dashboard for a collection of Aether edge sites. @@ -447,12 +447,12 @@ foreseeable future. `__. With respect to mechanisms, Jaeger is a widely used open source -tracing tool originally developed by Uber. (Jaeger is not currently -included in Aether, but was utilized in a predecessor ONF edge cloud.) -Jaeger includes instrumentation of the runtime system for the -language(s) used to implement an application, a collector, storage, -and a query language that can be used to diagnose performance problems -and do root cause analysis. +tracing tool originally developed by Uber. (Jaeger is not included in +Aether, but was utilized in a predecessor edge cloud.) Jaeger +includes instrumentation of the runtime system for the language(s) +used to implement an application, a collector, storage, and a query +language that can be used to diagnose performance problems and do root +cause analysis. 6.4 Integrated Dashboards ------------------------- @@ -497,9 +497,9 @@ SD-Core, which augments the UPF performance data shown in in a Grafana dashboard. Second, the runtime control interface described in Chapter 5 provides -a means to change various parameters of a running system, but having -access to the data needed to know what changes (if any) need to be -made is a prerequisite for making informed decisions. To this end, it +a means to change various parameters of a running system, but to make +informed decisions about what changes (if any) need to be +made, it is necessary to have access to the right data. To this end, it is ideal to have access to both the "knobs" and the "dials" on an integrated dashboard. 
This can be accomplished by incorporating Grafana frames in the Runtime Control GUI, which, in its simplest form, @@ -515,7 +515,7 @@ certainly possible.) Example control dashboard showing the set of Device Groups defined for a fictional set of Aether sites. -For example, :numref:`Figure %s ` shows the current set +For example, :numref:`Figure %s ` shows the set of device groups for a fictional set of Aether sites, where clicking on the "Edit" button pops up a web form that lets the enterprise admin modify the corresponding fields of the `Device-Group` model (not @@ -584,9 +584,9 @@ Chapter 1. A Service Mesh framework such as Istio provides a means to enforce fine-grained security policies and collect telemetry data in cloud native applications by injecting "observation/enforcement points" between microservices. These injection points, called -*sidecars*, are typically implemented by a container that "runs along -side" the containers that implement each microservice, with all RPC -calls from Service A to Service B passing through their associated +*sidecars*, are typically implemented by a container that "runs +alongside" the containers that implement each microservice, with all +RPC calls from Service A to Service B passing through their associated sidecars. As shown in :numref:`Figure %s `, these sidecars then implement whatever policies the operator wants to impose on the application, sending telemetry data to a global collector and diff --git a/preface.rst b/preface.rst index a3d4f56..fee4128 100644 --- a/preface.rst +++ b/preface.rst @@ -9,13 +9,20 @@ Microsoft, Amazon and the other cloud providers do for us, and they do a perfectly good job of it. The answer, we believe, is that the cloud is becoming ubiquitous in -another way, as distributed applications increasing run not just in +another way, as distributed applications increasingly run not just in large, central datacenters but at the edge. As applications are -disaggregated, the cloud is expanding from hundreds of datacenters to tens of -thousands of enterprises. And while it is clear that the commodity -cloud providers are eager to manage those edge clouds as a logical -extension of their datacenters, they do not have a monopoly on the -know-how for making that happen. +disaggregated, the cloud is expanding from hundreds of datacenters to +tens of thousands of enterprises. And while it is clear that the +commodity cloud providers are eager to manage those edge clouds as a +logical extension of their datacenters, they do not have a monopoly on +the know-how for making that happen. + +At the same time edge applications are moving to the forefront, +increasing importance is also being placed on *digital sovereignty*, +the ability of nations and organizations to control their destiny and +their data. Cloud technology is important for running today's +workloads, but access to that technology does not necessarily have to +be bundled with outsourcing operational control. This book lays out a roadmap that a small team of engineers followed over the course of a year to stand up and operationalize an edge cloud @@ -78,13 +85,12 @@ The good news is that there is a wealth of open source components that can be assembled to help manage cloud platforms and scalable applications built on those platforms. That's also the bad news. 
With several dozen cloud-related projects available at open source -consortia like the Linux Foundation, Cloud Native Computing -Foundation, Apache Foundation, and Open Networking Foundation, -navigating the project space is one of the biggest challenges we faced -in putting together a cloud management platform. This is in large part -because these projects are competing for mindshare, with both -significant overlap in the functionality they offer and extraneous -dependencies on each other. +consortia such as the Linux Foundation, Cloud Native Computing +Foundation, and Apache Foundation, navigating the project space is one +of the biggest challenges we faced in putting together a cloud +management platform. This is in large part because these projects are +competing for mindshare, with both significant overlap in the +functionality they offer and dependencies on each other. One way to read this book is as a guided tour of the open source landscape for cloud control and management. And in that spirit, we do @@ -94,7 +100,7 @@ provide, but instead include links to project-specific documentation include snippets of code from those projects, but these examples are chosen to help solidify the main points we're trying to make about the management platform as a whole; they should not be interpreted as an -attempt to document the inner-working of the individual projects. Our +attempt to document the inner working of the individual projects. Our goal is to explain how the various puzzle pieces fit together to build an end-to-end management system, and in doing so, identify both various tools that help and the hard problems that no amount of @@ -112,21 +118,22 @@ foundational. Acknowledgements ------------------ -The software described in this book is due to the hard work of the ONF -engineering team and the open source community that works with +*Aether*, the example edge cloud this book uses to illustrate how to +operationalize a cloud, was built by the Open Networking Foundation +(ONF) engineering team and the open source community that worked with them. We acknowledge their contributions, with a special thank-you to Hyunsun Moon, Sean Condon, and HungWei Chiu for their significant contributions to Aether's control and management platform, and to Oguz -Sunay for his influence on its overall design. Suchitra Vemuri's +Sunay for his influence on Aether's overall design. Suchitra Vemuri's insights into testing and quality assurance were also invaluable. -This book is still very much a work-in-progress, and we will happily -acknowledge everyone that provides feedback. Please send us your -comments using the `Issues Link -`__. Also see the -`Wiki `__ for the TODO -list we're currently working on. +The ONF is no longer active, but Aether continues as an open source +project of the Linux Foundation. Visit https://aetherproject.org to +learn about the ongoing project. We will also happily accept feedback +to this book. Please send us your comments using the `Issues Link +`__, or submit a Pull +Request with suggested changes. | Larry Peterson, Scott Baker, Andy Bavier, Zack Williams, and Bruce Davie -| June 2022 +| April 2025 diff --git a/provision.rst b/provision.rst index 3993c7d..8158778 100644 --- a/provision.rst +++ b/provision.rst @@ -28,7 +28,7 @@ infrastructure, which has inspired an approach known as *Configuration-as-Code* concept introduced in Chapter 2. 
The general idea is to document, in a declarative format that can be "executed", exactly what our infrastructure is to look like; how it is to be -configured. We use Terraform as our open source approach to +configured. Aether uses Terraform as its approach to Infrastructure-as-Code. When a cloud is built from a combination of virtual and physical @@ -37,7 +37,7 @@ seamless way to accommodate both. To this end, our approach is to first overlay a *logical structure* on top of hardware resources, making them roughly equivalent to the virtual resources we get from a commercial cloud provider. This results in a hybrid scenario similar -to the one shown in :numref:`Figure %s `. We use NetBox as +to the one shown in :numref:`Figure %s `. NetBox is our open source solution for layering this logical structure on top of physical hardware. NetBox also helps us address the requirement of tracking physical inventory. @@ -316,14 +316,14 @@ goal is to minimize manual configuration required to onboard physical infrastructure like that shown in :numref:`Figure %s `, but *zero-touch* is a high bar. To illustrate, the bootstrapping steps needed to complete provisioning for our example -deployment currently include: +deployment include: * Configure the Management Switch to know the set of VLANs being used. * Configure the Management Server so it boots from a provided USB key. -* Run Ansible roles and playbooks needed to complete configuration +* Run Ansible playbooks needed to complete configuration onto the Management Server. * Configure the Compute Servers so they boot from the Management @@ -364,14 +364,11 @@ parameters that NetBox maintains. The general idea is as follows. For every network service (e.g., DNS, DHCP, iPXE, Nginx) and every per-device subsystem (e.g., network -interfaces, Docker) that needs to be configured, there is a corresponding -Ansible role and playbook.\ [#]_ These configurations are applied to the -Management Server during the manual configuration stage summarized above, once -the management network is online. - -.. [#] We gloss over the distinction between *roles* and *playbooks* - in Ansible, and focus on the general idea of there being a - *script* that runs with a set of input parameters. +interfaces, Docker) that needs to be configured, there is a +corresponding Ansible role (set of related playbooks). These +configurations are applied to the Management Server during the manual +configuration stage summarized above, once the management network is +online. The Ansible playbooks install and configure the network services on the Management Server. The role of DNS and DHCP are obvious. As for iPXE and Nginx, @@ -435,14 +432,14 @@ Kubernetes cluster. For starters, the API needs to provide a means to install and configure Kubernetes on each physical cluster. This includes specifying which version of Kubernetes to run, selecting the right combination of Container Network Interface (CNI) plugins -(virtual network adapters), and connecting Kubernetes to the local +(virtual network adaptors), and connecting Kubernetes to the local network (and any VPNs it might need). This layer also needs to provide a means to set up accounts (and associated credentials) for accessing and using each Kubernetes cluster, and a way to manage independent projects that are to be deployed on a given cluster (i.e., manage namespaces for multiple applications). 
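
The cluster-management layer described above automates all of this,
but it may help to see the kind of per-project resources such a layer
ultimately creates in each cluster. The following is a minimal sketch;
the project name and enterprise group are hypothetical placeholders,
and a real deployment would add quotas, network policies, and
credentials on top of it.

.. code-block:: yaml

   # Hypothetical per-project isolation a cluster-management layer might create
   apiVersion: v1
   kind: Namespace
   metadata:
     name: video-analytics           # placeholder project/application name
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: video-analytics-admins
     namespace: video-analytics
   subjects:
   - kind: Group
     name: video-analytics-team      # placeholder enterprise user group
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: admin                     # built-in role, scoped here to one namespace
     apiGroup: rbac.authorization.k8s.io

In Kubernetes RBAC terms, binding the built-in ``admin`` ClusterRole
inside a single namespace grants a project team broad rights over its
own namespace without granting anything cluster-wide.
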
-As an example, Aether currently uses Rancher to manage Kubernetes on +As an example, Aether uses Rancher to manage Kubernetes on the bare-metal clusters, with one centralized instance of Rancher being responsible for managing all the edge sites. This results in the configuration shown in :numref:`Figure %s `, which to @@ -546,7 +543,7 @@ some running at the edges on bare-metal and some instantiated in GCP) are to be instantiated, and how each is to be configured—and then automate the task of making calls against the programmatic API to make it so. This is the essence of Infrastructure-as-Code, and as we've -already said, we use Terraform as our open source example. +already said, Terraform is our open source example. Since Terraform specifications are declarative, the best way to understand them is to walk through a specific example. In doing so,