Building a multi-account, multi-runtime service-oriented architecture

Cole Morrison: Welcome, everyone, to Building a Multi-Account, Multi-Runtime, and, Surprise, Multi-Region Service-Oriented Architecture. Yes, that is a very big topic. There's a lot in it. But if you're here, it's probably because you understand why this is important; this problem probably resonates with you. Before we get into that, let's connect some faces with names.

My name's Cole Morrison. I'm a Developer Advocate at HashiCorp. Prior to that, I spent 10 years in both software and DevOps engineering across internet of things, gaming, startups, and the like. And my wonderful co-speaker here, Rosemary, please introduce yourself.

Rosemary Wang: Hi, I'm Rosemary. I'm also a Developer Advocate at HashiCorp, and I wrote a book called Infrastructure as Code, Patterns and Practices, which is basically a reflection of all the things that I learned building lots of infrastructure systems and scaling them across organizations and development teams. And this is a topic that is near and dear to me, because it turns out it's really challenging to do multi-account, multi-runtime, and, well, multi-region. Because why not challenge ourselves?

But the reality is that this is something you actually see a lot, and because scaling across large organizations is so challenging, it becomes very necessary after a period of time. You can work between one or two people building infrastructure as code, maybe going into a console, clicking and creating things. But eventually you'll need to scale to thousands of developers, and that's where this becomes a little more daunting, because people don't always want to learn things, right?

So people may not want to learn ECS, people may not want to learn EKS, they may not want to learn Terraform or infrastructure as code; they may not even care, and some development teams just want to develop and be done. So you have a huge variety of people in your organization with different skill sets and different reasons for the things they want to do, which then challenges the next segment of this, which is process, right?

Process is challenged because you don't have consistency of automation. Not only do we not all know the same things, we may not have the time to learn all of them. And all of this has to roll up to one end-user-facing effort, or internal developer-facing effort, such as a product. Regardless of what you're working on, that's going to challenge your process. When you're in just one runtime, the pipeline you write, the code, the automation you create can be pretty straightforward, until you start adding all these other ones.

And then you've got to start thinking about things like automating the automation. And sometimes you go another level deep and you're automating the automation of automation. You have to start accounting for all of these different skill sets. How are you going to manage this? What process are you going to put in place so that this is all possible?

And of course, as hackers and engineers, developers, ops, whatever you call yourself, you start wondering what tools and technologies you can use to get this done. Tell us about that, Cole.

Cole Morrison: Yeah. So rather than going another level deep into automating your automation, maybe consider some technology, right? Technology can help you provide a common interface or an architectural foundation for extending and scaling the system. That doesn't mean you should always be reaching for the technologies out there, but at some point you will scale to a degree at which you cannot maintain your automation anymore, and it makes sense to adopt or at least look at technologies that will help you.

So we have a lot of technologies we're featuring today. We've got so many, in fact, that we're not even going to be able to go into all of them. So we're going to use umbrellas, if you will, to group them, and you've already seen two of them in the title. Let's talk about the technologies from a multi-account standpoint. The name kind of gives it away: it's going to be the AWS account. Whether you consider that a technology or not, it becomes a core building block in this process, because it's the perfect unit on AWS to sandbox different teams, not only to sandbox them from a security standpoint, but also to give them the autonomy to work in their own runtime.

So we're going to take a look at how we can use AWS Organizations to create a parent account that then watches over a variety of child accounts, each of which maps to a runtime. We're going to see four different runtimes here, all working together: straight-up EC2, so plain virtual machines; Amazon ECS; Kubernetes via EKS; and a single-page web app front end deployed using CloudFront. And if that wasn't a lot, we're going to do it multi-region, and it's going to feel like we're traveling through multiple universes, seeing a lot of the different networking problems, both layer-four ones and layer-seven ones, and how we connect all of those different things.
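
To make the parent-and-children idea concrete, here is a minimal Terraform sketch of what that organization shape could look like. The account names, the email pattern, and creating the accounts with `for_each` are our illustrative assumptions here, not the exact code from the repo.

```hcl
# Hypothetical sketch: one child account per runtime under an AWS Organization.
resource "aws_organizations_organization" "root" {
  feature_set = "ALL"
}

resource "aws_organizations_account" "runtime" {
  for_each  = toset(["ecs", "eks", "ec2", "frontend", "shared-services"])
  name      = "runtime-${each.key}"
  email     = "aws+${each.key}@example.com" # placeholder address pattern
  parent_id = aws_organizations_organization.root.roots[0].id
}
```

Modeled this way, onboarding another runtime later is a one-line addition to the list.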

And we're going to use some wonderful HashiCorp tooling to get that done, because, obviously, we're up here. Rosemary, tell us about that.

Rosemary Wang: Yeah. So you need some technologies that pull all of this cloud automation together, and it doesn't matter which ones you end up with. What we're here to show you are some patterns that you can look at and apply to your own systems. A lot of HashiCorp's products are around cloud infrastructure, automation, and security, which lends itself very well to this kind of scale. We're going to focus today on Terraform for infrastructure as code, Boundary for modern privileged access management, Vault for secrets management, and Consul for service mesh.

The idea is that all of these provide a common interface to organize the processes and people provisioning and configuring services and infrastructure. So we're adding a layer of services on top of all our infrastructure in the hope that it brings some organization, letting us scale and extend existing systems while giving people the ability to build more and build what they want.

Speaking of building, our scenario is that we're going to start with two runtimes on Amazon, and over the course of this talk we're going to build more runtimes on top of them, all of which will span three regions, with each runtime in its own account.

So if you're curious about this demo today, we have a massive monorepo for you.

Cole Morrison: Yeah. It runs a bunch of code across us-east-1, us-west-2, and eu-west-1, so check it out. It's a reference for you to examine some patterns and potentially engineer into your organization. It has connections across Consul, Boundary, Terraform, and the services you'll see today, plus some we're not going to mention explicitly.

Rosemary Wang: All right. So in sequencing this, Cole and I needed some guiding principles when we built this repository, right? The idea is that if we don't agree on some principles, we cannot extend it ourselves, and if two people can't extend this, you probably won't be able to extend it across your organization. So it's not just about copying the code; it's about how we applied the principles and patterns we established when writing it, and how those patterns help us build more.

And we're going to go through creating a production runtime; observing, measuring, and managing production; adding a feature and fixing some bugs; securing multi-user access; and customizing and adding a new runtime.

Now, you're looking at this and, to be honest, it's ambitious. The first section, creating production infrastructure, reflects what you're actually going to struggle with: if you're creating a minimum viable production infrastructure, that's going to be the bulk of your time, and the remaining sections will be much shorter. So keep an eye out for all of the new features and concepts we're introducing there.

All right. So, Cole, we talked about principles for building this. When we went into this, we wanted to make sure we agreed on how we were going to extend it. So what are the principles, and what does service-oriented architecture have to do with it?

Cole Morrison: So that was actually the first question: what is service-oriented architecture? And we decided there is no set definition for service-oriented architecture. So instead, let's come up with some guidelines, some guideposts we can follow while we build this stuff, instead of just making up platitudes that sound nice. They do sound nice though, don't they? We wanted to look at things that appear almost like emergent properties in a large system, both throughout our careers and at HashiCorp with the different users we have, at the highest, mid, and lowest levels. These things begin to appear whether you intend for them to or not. After you've done something 10, 20, 100 times, you've got to look at how you're going to standardize it, right?

When you have a number of teams, you want them to be able to work independently so they can get things done. And in terms of loose coupling, we want a module, for example, that we can use across our different code bases and assets. So let's step into the first principle, and that's autonomy: self-contained, independently changeable without disruption. This is keeping things such that they can work on their own; not just the people having their freedom, but the code and the modules that you create. And we're using an analogy here to building blocks. Those are totally just generic building blocks; I have no idea if you think they're something else, but they're building blocks.

Instead of building everything out of raw plastic, you create these blocks and start building with those. And that's going to bridge us into our next one. Rosemary?

Rosemary Wang: And that's standardization, because automation works best with consistency, right? Edge cases cause automation to break, so standardization is key. We want to make sure that we are using common formats, protocols, and conventions for usability, so that when Cole looks at something I built, he recognizes immediately: OK, this is a standard that I must follow as well.

So there are reasons to establish some kind of common format: at the very least, your automation gets some consistency and you don't run into edge cases.

And if autonomy and standardization were like having the different types of bricks you're going to build with, then loose coupling would be the studs and the pips on the bricks that make connecting them easy, but also make taking them apart and reusing them with other things easy. For this loose coupling principle, we need well-defined interfaces, right?

If we zoom out, that's going to be like a very well-defined API through which other things can consume your code module, your service, and the like. That takes us into the last one, though, and that is discoverability, because you cannot keep track of everything in your system. Systems at scale have many resources, components, and assets, and it's really hard, from both a security standpoint and an automation standpoint, to keep track of them and identify what's happening.

So make sure each of these components has a clearly identified purpose and that they are self-registering, because you cannot manually register them or go through and reconcile all the information about these assets. Document them, make sure they're self-registering; that's how you scale. Discoverability is really key when you have a bunch of services, but also a bunch of infrastructure components related to them.

And that brings us back to where we said there's no set definition. For our purposes, we're going to say that service-oriented architecture is building an aggregate system from independent components. So instead of building everything from a raw material, from raw plastic, we're going to first build sets of building blocks and construct what we're trying to build from those. These are what we kept in mind as we built out what we're going to show you today, and we'll call back to them. So how are we going to get started here?

All right. So let's create some production infrastructure.

"We're going to divide this section into three sections that are according to this title, multi account, multi run time and multi region.

We're going to show you how we evolved our code base: multi-account first, then multi-runtime, and then multi-region. What we're looking for is a base production infrastructure that runs our systems well enough, so we can add more and build on it later.

So we're going to dive into the demo pretty much immediately. If you thought it was going to be a bunch of slides, there aren't many more; it's going to be us going through a variety of code.

So let's talk about it from a multi-account standpoint. What are you going to need to do? Well, unsurprisingly, you're going to need a root account. We went with AWS Organizations, as I said, and the beginning was a root account from which we could go into Organizations and create a variety of child accounts that represent your different runtimes.

We're going to see how to add these when you want to add new teams in the future. But the way to think about each of these runtimes: even though we've named them after the platform we're using, they could just as well be named after the business logic that your team handles.

So this is where we began. We created the different accounts, and then we started thinking about how we wanted to work with them. Since we're going to do everything in infrastructure as code, it would be nice to have a directory structure and workflow that mirrors it.

So this is our GitHub repo, and it looks very similar to that organization account structure. This is by design: we wanted it such that if you made a modification to runtime-ecs, it would map to the ECS account and only do things there, and the same for the rest of these. The tree looks roughly like the sketch below.
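
As a rough picture of that mapping (the exact directory names here are illustrative, not copied verbatim from the repo), the layout mirrors the account structure:

```
.
├── modules/            # global, reusable building blocks (network, boundary, ...)
├── runtime-ecs/        # maps to the ECS child account
├── runtime-eks/        # maps to the EKS child account
├── runtime-ec2/        # maps to the EC2 child account
├── frontend/           # maps to the front-end child account
└── shared-services/    # hub account: HCP Vault, HCP Consul, transit gateways
```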

So you type that code out, you go ahead and you get it working. But how do we make this all come to reality? What tool do we use so that we're not just running terraform apply on one lonely machine sitting in a basement?

The two of us are in different parts of the country, and we needed a way to collaborate on this and make sure we're not stepping on each other's toes, so to speak.

So we ended up using Terraform Cloud as a CI framework. You can use any CI framework you want for integrating all your changes in infrastructure as code, but you'll notice that any time Cole or I committed something, the Terraform Cloud workspace would run and apply those changes.

So you can see the difference. For example, I'm updating EKS here and I get plans; I can identify what's happening. That's the benefit of using the workspaces here.

Notice that each workspace corresponds to an AWS account. You can subdivide this further: if you have many business units, you can even allocate different workspaces to teams in the form of projects. There's also role-based access control here, so you can subdivide people's access with Terraform Cloud.

But the point is that someone can go in and do what they need to in Terraform without necessarily running it on their local machine, and without setting up additional CI frameworks for themselves, if you opt to do this.

So that's one way we controlled our changes into production, and it became very necessary, especially because we were working across so many runtimes with different dependencies.

The other important thing is that because we have different accounts, we need to handle credentials; Terraform has to access Amazon in order to configure it, right? What we did was set up dynamic credentials directly to AWS.

If you're not familiar with dynamic credentials in Terraform Cloud, that's all right. The idea is that we're able to link an identity to our account, and that allows us to use an IAM role to retrieve a session token. It provides just-in-time access for the workspace to AWS.

What this means is that I can link up the workspace name; I could even do some wildcarding if certain teams need access to certain things. So I can subdivide by project and workspace, and that means I don't have to pass AWS access keys or secret access keys or anything to Terraform Cloud per workspace.

Instead, I get credentials dynamically and I don't have to worry about them. It also isolates access per account. That's the important thing when you're trying to run infrastructure as code: you want to isolate with as much least privilege as possible. A rough shape of that trust relationship is sketched below.
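
Here is a hedged sketch of what that looks like on the AWS side. The organization, project, and workspace names are placeholders; the `app.terraform.io` audience and subject claim format follow Terraform Cloud's dynamic provider credentials documentation.

```hcl
# Trust Terraform Cloud's OIDC identity provider.
data "tls_certificate" "tfc" {
  url = "https://app.terraform.io"
}

resource "aws_iam_openid_connect_provider" "tfc" {
  url             = "https://app.terraform.io"
  client_id_list  = ["aws.workspace"] # TFC's default AWS audience
  thumbprint_list = [data.tls_certificate.tfc.certificates[0].sha1_fingerprint]
}

# A role that only runs from one workspace (any project, any run phase) may assume.
resource "aws_iam_role" "tfc_workspace" {
  name = "tfc-runtime-ecs" # illustrative name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.tfc.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = { "app.terraform.io:aud" = "aws.workspace" }
        StringLike = {
          "app.terraform.io:sub" = "organization:my-org:project:*:workspace:runtime-ecs:run_phase:*"
        }
      }
    }]
  })
}
```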

All right. So that's multi-account. What about multi-runtime? With this in place, we now have the workflow that was really important for us to get up and running, which is that I can now make changes.

We go back to our code here. I can now make changes in any one of these folders and it will map to that particular account and deploy infrastructure just there, and we can of course keep track of it.

So let's zoom in a bit more on what it looks like when we build out an individual runtime; we'll see some of these principles start taking shape. Within ECS, for example, the pattern we took when creating this code is that we wanted the top level to represent all of the global infrastructure you're going to create.

Now, some of these things are just global by default, such as IAM and Route 53. But since we wanted to deploy regionally, we didn't want to repeat ourselves with all the code for every single region.

So we created a region module, such that when developers come in here and start coding their infrastructure, they can quit thinking about multiple regions; they just think about how they want it in one region. This becomes a module, and if we go back up to the global area here in main, we'll see this module being used.

We'll see all regional infrastructure being deployed to us-east-1, with perhaps a couple of switches in here to customize it to that particular region, and the same for us-west-2 and eu-west-1. The power of this is when you want to expand to another region.

The workflow here is very simple, and whatever workflow you're following for your developers doesn't change per region. If they want to add a new feature, after they've written it, it will of course come through and land in every one of these module instances as well. The shape of it looks something like the sketch below.
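
In practice, a runtime's top-level main can instantiate that region module once per region, along these lines. The module path, provider aliases, and the `cidr_block` variable are assumptions for illustration; the aliased providers would be declared elsewhere in the configuration.

```hcl
# One instance of the shared "region" module per region; expanding to a new
# region is another block like these. CIDRs must not overlap (see the transit
# gateway caveat later).
module "us_east_1" {
  source     = "./region"
  providers  = { aws = aws.us_east_1 }
  cidr_block = "10.0.0.0/16"
}

module "us_west_2" {
  source     = "./region"
  providers  = { aws = aws.us_west_2 }
  cidr_block = "10.1.0.0/16"
}

module "eu_west_1" {
  source     = "./region"
  providers  = { aws = aws.eu_west_1 }
  cidr_block = "10.2.0.0/16"
}
```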

Now, the other thing you'll see, beyond just our regional module, is that we wanted to start giving developers the ability to build those smaller building blocks, and we'll touch on networking a bit more. For example, when they go to make their network, we've created an internal networking module that lives at the root, because, if you haven't guessed, modules and assets here are not mapped to AWS accounts; these are global modules that can be reused across any runtime.

The nice part about that is we get some of that standardization and control as the platform team, the person or team running the whole show, to spill over into the rest of the accounts.

And speaking of modules, I didn't want to do any of the HCP stuff, right? You did most of that; you made a module, and that's a pretty good bridge into what happened from a runtime perspective there.

So HCP stands for HashiCorp Cloud Platform. It's a set of managed services for Vault, Consul, and Boundary. Rather than run our own server instances, I decided I would use HCP, and it kind of helps us because it means we isolate our Consul and Vault clusters away from the rest of the runtimes.

So it's shared services, and that's why it lives under our shared-services directory. Shared services is a separate account that we use for exactly that: it could hold your CI frameworks, or managed services that you reach out to but don't directly manage per runtime.

Shared services becomes sort of a hub for everything else; that's why we put HCP in it. In this section, at least, we'll talk about Vault and Consul.

Vault is a secrets manager. We needed to get secrets into other systems. For example, I use Vault to issue certificates to my Consul cluster, my service mesh. This means I don't have to load in certificates myself, and I don't have to use the one that comes out of the box; instead, I can just use whatever Vault is issuing.

So if I check out my Consul namespace in Vault: Consul gets its own isolated section in Vault for secrets management, and it's able to say, I only get these certificates from here, from this namespace. And you can subdivide your Vault namespaces by runtime, by team, and more.

So this namespace is maybe just for Consul; Consul can retrieve its certificates from this PKI secrets engine. But other namespaces are separated out for, say, static values in ECS, or even for my own EKS cluster.

All of these you can subdivide further, separate out, and authenticate to differently. Secrets engines, for example: for EKS we use a database secrets engine, which is dynamic, meaning you get a new username and password every time, and they expire after an hour.
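
As a sketch of what that could look like in Terraform (the connection URL, role name, and SQL statements are placeholders, not the repo's exact code), a database secrets engine role with a one-hour TTL:

```hcl
# Mount the database secrets engine.
resource "vault_mount" "db" {
  path = "database"
  type = "database"
}

# Point it at the PostgreSQL instance it manages credentials for.
resource "vault_database_secret_backend_connection" "postgres" {
  backend       = vault_mount.db.path
  name          = "products"
  allowed_roles = ["eks-app"]

  postgresql {
    connection_url = "postgresql://{{username}}:{{password}}@db.internal:5432/products"
  }
}

# Every read of this role mints a fresh user that expires after an hour.
resource "vault_database_secret_backend_role" "eks_app" {
  backend     = vault_mount.db.path
  name        = "eks-app"
  db_name     = vault_database_secret_backend_connection.postgres.name
  default_ttl = 3600 # seconds; credentials rotate out after an hour
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"
  ]
}
```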

Hopefully you'll see that happening today. But the idea is that every time the credentials rotate, every time someone retrieves a database credential, they're accessing a separate namespace that isolates them.

You can share secrets between different namespaces by creating a shared namespace, for example. But the idea is that you separate your secrets early, and that's why we set the foundation with Vault first.

Next is Consul, because we have to connect between runtimes, right? If you want to go from services in ECS to services in EKS, you could go over the public internet, or you can keep it self-contained. That's why we decided to put a service mesh on top.

And the reason is that we needed more fine-grained control between runtimes. So we separated runtimes in Consul using something called partitions; a partition roughly maps to a runtime cluster that you have.

For example, we have ECS and EKS here. Each of them has its own services registered into it; they have their own isolation of what's going on and how services communicate with each other. A service self-registers into its partition because it's on a specific runtime.

That means you can actually control who accesses what across partitions, and that's where we use something called sameness groups. We add new sameness groups as we add new services: if these services are common across runtimes, they belong to the same sameness group.

The helpful part about sameness groups is that if you have instances of the same service spread across runtimes, you can group them together across partitions. That's why Consul is particularly useful here. A sameness-group entry looks something like the sketch below.
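
For reference, a Consul sameness-group config entry (Consul 1.16 and later) has roughly this shape; the group name and the exact partition names here are illustrative rather than lifted from the repo:

```hcl
Kind               = "sameness-group"
Name               = "runtimes"
Partition          = "ecs" # the partition this entry is written in
DefaultForFailover = true  # let members act as failover targets for each other

Members = [
  { Partition = "ecs" },
  { Partition = "eks" },
]
```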

Something worth noting is that if you're doing multiple runtimes like this, you must have an IP per service. That's why we're using CNI plugins and allocating an IP address per pod and per container across ECS and EKS. You may not be able to support that in your systems;

it sort of depends. But it makes things a lot easier from a routing standpoint. We're not routing publicly; instead, we're routing internally within that region, and that's why it's helpful to have a unique IP address per pod or per container.

All right. So we talked about multi-runtime. Now, the one that we dreaded: multi-region, because we did it by accident. The funny part is that we submitted the abstract, and when we were discussing what we were going to build, I told her it was multi-region, and we were both like, oh, it's definitely multi-region. Then we went back and looked after we had already started, and it turns out that wasn't in there. So it was as much a surprise to us as it is to you, and it came with an entirely different set of challenges. Interestingly, as is the case with a lot of things in AWS, the infrastructure to get it all set up and working, the clicks, the code you write, is actually pretty minimal; the hard part is understanding what the right things to put in there are.

So if we go into our code here and go up to the top, you'll see we have that shared-services account. It is its own separate runtime, but the way we wanted to approach it was like a hub: a hub that everything plugs into, both from a networking perspective and from a services perspective, for anything all of your runtimes need. And the first thing we needed to do across all of these was create a network that everyone could communicate across while still having control of their own network.

We saw that these different runtimes each use their own VPC module per region. But there's actually something special about this particular module: when used, it sets up Resource Access Manager sharing, so it can share resources, specifically a transit gateway, between the shared-services account and its own runtime network. If we take a look at the code here, you can see what some of that looks like. Now, I'm not going to make you all sit here at this time of day in Vegas and pore through every single line, so you can review it later, but do know that it's here. And of course, the rest of the hooking up happens down here in shared services, in the main configuration.

Now, instead of diving through every single one of these routes, let's talk about what the topology looks like. You've got all of your accounts; every one of these accounts has a VPC, and every one of them also spans multiple regions. Meanwhile, over here you've got the shared-services account. It has the same model: a VPC in each region, and into each of these regions we're going to put a transit gateway. Whether you know what that is or not, it's the alternative we chose over VPC peering, purely from a workflow perspective. And now this truly becomes a hub: we plug every one of these accounts into the transit gateway through a transit gateway attachment, and this makes it so the private networks can begin communicating across and between each other. One caveat here: this isn't hard to code, there's not a lot of code for it, but when you take this approach, the thing Rosemary just talked about regarding private IPs is something you have to keep in mind, because as soon as you use transit gateways, you have to think about splitting up private IP address ranges across all of them. The rough shape of the wiring is sketched below.
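
Here is a rough Terraform sketch of that hub-and-spoke wiring, assuming RAM sharing with the whole organization. Resource names are illustrative, and `aws_vpc.runtime` / `aws_subnet.private` stand in for whatever the runtime's VPC module actually creates.

```hcl
# In the shared-services account: the regional hub.
resource "aws_ec2_transit_gateway" "hub" {
  description = "shared-services hub"
}

# Share the transit gateway to the organization via Resource Access Manager.
resource "aws_ram_resource_share" "tgw" {
  name                      = "transit-gateway"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "tgw" {
  resource_arn       = aws_ec2_transit_gateway.hub.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "org" {
  principal          = aws_organizations_organization.root.arn # the org's ARN
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

# In each runtime account: attach that account's VPC to the shared gateway.
resource "aws_ec2_transit_gateway_vpc_attachment" "runtime" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.runtime.id
  subnet_ids         = aws_subnet.private[*].id
}
```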

But when we made this hub, we were able to start plugging in other shared services, like the HashiCorp Cloud Platform ones. Rosemary? So, HashiCorp offers what is called a HashiCorp Virtual Network: a VPC that any cluster, Vault or Consul, sits on. We have three HashiCorp Virtual Networks, one per region, and that's partly why we have so many clusters: six in total, three Vault and three Consul, duplicated across each region. I'll explain that a bit more later. Each virtual network is attached to the transit gateway and peered with another region, to allow inter-region communication as well as peering into our VPCs. So you're starting to see that as you include a managed service, you're going to have to start peering or adding some routing if you decide to do private networking for it.

More importantly, the question you might be asking at this point is: what did we do to replicate information across all these systems and clusters responsible for automation? In the case of Vault, Vault has something called performance replication. You have a primary, and it replicates secrets to the secondaries; anything you update in a secondary also replicates to the primary. The idea is that if you create secrets in a single region, you can have them across regions, or you can opt to localize them, which means they do not replicate. You can also exclude certain namespaces in Vault from replication. So the point is, across multiple regions, you have the ability to segment out the types of secrets being used and how they're accessed. Each of these will replicate and change depending on which region you're accessing, and it also means you don't have to route out of a region; it stays intra-region traffic when an application accesses a secret.

Then you have Consul. Consul clusters don't necessarily connect to each other, and they don't replicate. Instead, we're using something called cluster peering, which literally peers Consul clusters; in this case, we've established cluster peering across each region. That means you can export services cross-region into another region. For example, I have a store service that I've exported, and it's accessible across us-east-1, us-west-2, and eu-west-1. The idea is that if you want to allow a service to fail over cross-region, you can use the sameness group and do that cross-region as well. It's a powerful way to add another layer of service networking: Cole just went through the base layer, and you can keep that relatively flat, then add a layer of service networking on top, so you control traffic between regions as well as between services using a service abstraction. An exported-services entry looks roughly like the sketch below.
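
An exported-services config entry for that store example could look roughly like this; the peer names depend on what the cluster peerings were actually called, so treat them as placeholders:

```hcl
Kind = "exported-services"
Name = "default" # the partition doing the exporting

Services = [
  {
    Name = "store"
    Consumers = [
      { Peer = "cluster-us-west-2" }, # placeholder peering names
      { Peer = "cluster-eu-west-1" },
    ]
  }
]
```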

So all of that is in place for anyone to use. If they need to onboard a new service, they don't have to handle the registration of the service metadata, IP addressing, and so on; Consul handles that for you. And the result is that we actually do have a service that runs across regions (we'll show that later), across multiple runtimes, and across accounts. We go from a front end to an API and a couple of services in ECS, into services on EKS in us-west-2, plus more services there, and we also have a database, a global RDS cluster, so it's replicating data across regions. That's what the service is accessing here, and we're able to go through this whole call tree cross-runtime and cross-region.

So all of this is in place, and we finally have our first application running on production infrastructure, which, overall, took a while, but we got there. The idea is that multiple accounts offer autonomy: if you have different folks in your organization who want to do different things, you can enable multiple accounts with different privileges and different service offerings. Loose coupling we achieved through a couple of interfaces, including Terraform and Consul. Infrastructure as code is really helpful here, especially if you need to reproduce across regions and across runtimes, and there are things we transfer across runtimes, like networking modules, that help ensure we're loosely coupling these dependencies. Consul is helpful here too: we add a service layer on top, so any time someone registers a service, we automatically get that abstraction and can control traffic between services without additional infrastructure configuration.

Once we standardized with infrastructure as code, it became really easy for me to create a runtime network. Cole basically said, just use this module, and I could create a runtime network for EKS and deploy the cluster on top of it. And when I wanted a database, I created that module myself, so if Cole wants a database, he can use my database module too. This improves reproducibility. But what about discoverability? We talked about how important it is to understand what's happening in a system, but right now we don't really know what's actually going on. So how do we find out?

Yes. Before we dive into that: that was the longest section, because we really wanted you to feel how much more time we put into building all of that than everything else. This section and the ones going forward are going to be a lot shorter, because now we start realizing the return on all that setup investment. One of the first things you're going to do, as Rosemary just foreshadowed with discoverability, is observe, measure, and manage it. Let's go straight to one of the top-level things you'll want, which is why the AWS account, and specifically AWS Organizations, wound up being so useful: seeing who did what, when, and where. Maybe not why; in fact, there may be some odd whys in here. What we set up at the root was CloudTrail. With CloudTrail set up for every single one of these accounts, we can go into any given account and track the event history of exactly what's going on. And because we're using AWS Organizations, every one of the child accounts gets a special role that the top-level organization owner can use to dive into an individual account. So even though these are just the events for the root account, if we go into, say, our EKS account... oh, never mind, we won't go into that one because it's not logged in. We'll go into the ECS account and into CloudTrail, and we'll see what's happening from an events perspective: some targets registered, some CloudWatch streams created, some network interfaces deleted. And if you're the administrator of the ECS account, you only have read privileges here; you can't do anything other than see what's going on. The organization trail looks roughly like the sketch below.
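
The organization trail itself is a small amount of Terraform in the root account. This is an approximate shape, with the bucket, log group, and IAM role as placeholders, and the required S3 bucket policy omitted for brevity:

```hcl
resource "aws_cloudtrail" "org" {
  name                  = "org-trail"
  s3_bucket_name        = aws_s3_bucket.trail.id       # placeholder bucket
  is_organization_trail = true  # capture events from every child account
  is_multi_region_trail = true  # and from every region

  # Stream events to CloudWatch Logs for the fine-grained JSON view.
  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.trail.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.trail.arn  # placeholder role
}
```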

Now, from the top level as the root owner, you need more fine-grained detail on what's going on. We did set up CloudWatch logging for everything here, so you can go in and, for example, look at one of the log streams. We'll just pick one and see what's happening.

And this gives us a cue, not only into which accounts people are going into. If we open this up, we'll see some very fine-grained JSON; we'll see everything happening. We'll see roles being assumed (or not), we'll see EKS making calls across the account, we'll see a role being assumed by the EKS node group to do things, when and where, and this is across every single one of the accounts.

So you'll be able to take this and use CloudWatch, or some other observability tool outside of AWS, to see everything that's going on. Now, this is the who-did-it perspective, but that's probably going to come after you've noticed something happened. So what tools do we have to look into that?

Yes. So the benefit of putting a service mesh in place is that you get metrics from each proxy. If you're not familiar with service meshes, the idea is that a proxy runs alongside each service; this proxy collects metrics and surfaces them. It does depend on whether you decide to implement a service mesh, but when you do, you get the metrics out of the box; you don't have to add application libraries to surface metrics about the HTTP servers you're running or the calls you make.

I'm going to show a demo of this over video, because it takes some time to generate enough load for it to show up. But notice that if you want a binary healthy/unhealthy view of services, a service catalog is there for you to examine. It's a good way to get started: if we're talking about the bare minimum, whether something is good or bad, that's when you probably want to think about a service catalog.

And if you're working cross-runtime, a catalog of everything across regions and across runtimes is really useful. That's why you'll see multi-region and multi-runtime involved here: you can tell "default" is EKS, and there's maybe some ECS in here as well.

So let's check out the store service. I issued a couple of requests to it, and you'll notice there were some healthy and some not-so-healthy, failed attempts, and eventually I took it down. These metrics I didn't do anything special for: I have a proxy, and all I did was funnel its telemetry to my metrics server and forward it to an APM. The idea is that if you want something out of the box for a lot of services, think about a proxy. It helps aggregate all of this telemetry, gives you a standard set of things to look for, like requests, server connections, and so on, and it also provides a map of how your services connect to each other, which is particularly useful if you have a set of microservices.

So rather than deploying a bunch of metrics libraries across applications, what you can do at the service level is put a proxy in front and surface all that information with a standardized set of metrics, which means a standardized view of services and systems is available to you more or less out of the box. This is just enough observability for now; I'm not saying it's perfect. But if you want to debug your systems further, you have the option, and you can customize metrics. And we're separating everything by concern, so we're loosely coupling the logging from the other telemetry and everything else.

All right. So this is the one that's a little more challenging: adding features and fixing bugs. Why don't we first create some bugs live, because why not? You push the button there. I actually didn't know which button it was. Ah, but it is up and live, and we no longer need that.

So first up, while I'm talking here, I'm going to go into the ECS account and take a risk. We tried this once, so we'll see if we can do it again: I'm going to just start killing things. (I'm surprised he didn't show up in the audit logs we looked at before, when he went in and deleted a bunch of things. Yeah.)

So let's go ahead and get up here. This is definitely us-west-2, right? By the way, if you navigate to that page, it is up, so yes, it is live. I'm going to go into the main API here, and I'm going to do what I shouldn't do and just start killing stuff. There goes one, there goes two. I hope this does what it's supposed to. Cole is doing this in us-west-2, but keep in mind our application has dependencies across the runtimes as well. So if there is a service that, say, has delays or errors in EKS, which is upstream, this will fail too. It's going to be a little bit of a challenge, because we're up here doing a live demo and there is some aggressive caching on the load balancer at the front end and in Route 53 DNS.

While we're waiting for that to happen: we've shown a lot of high-level stuff. Again, the value is going to be in the repo and seeing how this is all glued together, all those edge cases and things you're going to spend a lot of time on. Now, when you want to fix a bug or add a feature, the workflow is already here. Let's say there's a bug in ECS; perhaps we released a Docker image, or a tag on a Docker image, that wasn't ready to go out yet, and we just need to come back in here and change this to, say, 25. All you would do is change it in the code here. Terraform Cloud would pick it up in the particular workspace, ECS in this case; you'd see the changes happen up here, look at the plan, see every single thing that was about to happen or not happen, approve it, push it through, and lo and behold, you've added your feature or fixed your bug.

Now let's see what's happening in the meantime. Oh look, failover actually happened. Despite the fact that we broke things in us-west-2, we see failover happening over into us-east-1. We have a variety of different failovers; Rosemary is going to show what happens within the mesh in a bit. The failover you're seeing here at the top level is happening at the Route 53, DNS level: we just have some latency-based rules set up in Route 53 pointing at the different regional load balancers, and it will detect if one is unhealthy and shift to another. A rough shape of one regional leg of that is sketched below. But if you want to add a new feature and, say, test it in portions, Rosemary, how can we get that done?
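
Here is a hedged sketch of one regional record in that latency-plus-health-check setup. The zone, load balancer, domain, and health-check path are all placeholders; the other regions would each get a matching record with their own `set_identifier`.

```hcl
# Health check against the regional load balancer.
resource "aws_route53_health_check" "us_west_2" {
  fqdn              = aws_lb.us_west_2.dns_name # placeholder LB reference
  type              = "HTTPS"
  port              = 443
  resource_path     = "/health" # assumed health endpoint
  failure_threshold = 3
  request_interval  = 30
}

# Latency-routed record; Route 53 skips it while the health check fails,
# so traffic shifts to the next-closest healthy region.
resource "aws_route53_record" "api_us_west_2" {
  zone_id         = aws_route53_zone.main.zone_id # placeholder zone
  name            = "api.example.com"
  type            = "CNAME"
  ttl             = 60
  records         = [aws_lb.us_west_2.dns_name]
  set_identifier  = "us-west-2"
  health_check_id = aws_route53_health_check.us_west_2.id

  latency_routing_policy {
    region = "us-west-2"
  }
}
```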

Yeah. So imagine you have an inventory service and you need a second version of it. For the most part, you could deploy the new version and hope it all works; I've done that before, unfortunately. But if you just hope it all works, sometimes it doesn't work out. You probably want a finer-grained way to test whether that service is actually working.

So let's say I deploy a version two of the service. That means I now have a version two running in my EKS cluster. And then what I'll do is say: OK, I will split traffic between them. Consul has the ability to use proxy filters that allow splitting or routing of traffic.

If you want HTTP header routing, or splitters, and more, you can do that. The idea is that splitting the traffic gives you a nice way to do a very basic canary deployment of an application in one region. So for example, having deployed v2, I have now split that traffic, and if I keep refreshing this, you'll eventually see it divide between inventory version one in us-east-1 and inventory version two.

And this traffic split means I can do additional testing and have even finer-grained control over the routing and testing I want to do when adding a new feature to production. The other nice part is that it gives you a lot of options: if you want to do canary deployments based on metrics, if you want to do progressive delivery, this sets the foundation for it. You don't have to start off right away with splitting and such, but if you have many upstream services and want a bit more control, you can do that. The pair of config entries looks roughly like the sketch below.
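
The canary is driven by two Consul config entries: a service-resolver that defines the subsets and a service-splitter that weights traffic between them. The metadata filter and the 90/10 weights below are illustrative choices, not the repo's exact values.

```hcl
# service-resolver: carve the inventory service into v1/v2 subsets
# based on service instance metadata.
Kind          = "service-resolver"
Name          = "inventory"
DefaultSubset = "v1"

Subsets = {
  v1 = { Filter = "Service.Meta.version == v1" }
  v2 = { Filter = "Service.Meta.version == v2" }
}
```

```hcl
# service-splitter (a separate config entry): send 10% of traffic
# to the v2 canary.
Kind = "service-splitter"
Name = "inventory"

Splits = [
  { Weight = 90, ServiceSubset = "v1" },
  { Weight = 10, ServiceSubset = "v2" },
]
```

Promoting the canary is then just a matter of shifting the weights and eventually removing the split.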

We also have the ability to specify these across regions if we want, so we could fail over on the upstream across regions. There's a lot going on here, but if you want to introduce new features to production with a bit more control, it works the same way whether you're doing infrastructure or service deployments; we just apply it all through infrastructure as code.

Hopefully we'll see that become healthy again; we'll come back and refresh it later. So you get autonomy of deployment for services: anybody who deploys services can control traffic between them, whether through Route 53 or a service mesh capability. The additional discoverability from automatic service registration helps quite a bit across regions and runtimes. And then you can standardize your deployment workflow to use Consul traffic management, right?

The thought is that if you do traffic management at this fine-grained level, you have the ability to standardize deployment and delivery workflows for services as well.

All right. So we have two more sections. We didn't really think about securing user access, did we? No, because we wanted machines to do all the work for us, right? And that would have been great. But eventually there comes a time when the logs aren't enough and you have to get into the box and check things out. And as things currently stand, I'm not sure about you, but it's very easy and convenient to just set up a jump box, throw that SSH key down, pass it all over the place, and hop in.

However, that's going to create a lot of problems. You have to figure out who's using the SSH key, where it went; you have to make new ones. Do you have controls to figure out who's doing what? All of that needs to be in place when you have multiple runtimes: teams are going to need to access their infrastructure, and you need to know what's going on. So we have a tool here that Rosemary is going to walk through that shows exactly how you can control all of that better.

So Boundary is a modern privileged access management tool. Basically, if you need to access any TCP endpoint within your infrastructure, you have the ability to onboard that endpoint into Boundary, whether through discovery or manually through other mechanisms.

And then you can access that target. It could be a web endpoint, a database, whatever you choose; today I'll show SSH. You also have the ability to use credential injection, which I use for my database: I don't actually know my database username and password, because I just let Boundary handle them.

The thought is you can also load your SSH keys into a secrets manager and inject them through Boundary, so no one has to handle them directly. We did not go that route because we were debugging. But when you SSH into an instance, you have the ability to onboard different kinds of instances.

So here I have the different runtimes, separated by project scope; Boundary has this idea of scopes, and you can onboard all of these different target endpoints under them. I have EKS endpoints, ECS endpoints, and a database endpoint in here, but we'll just go into ECS, and what I want to do is SSH into the ECS container instances in us-west-2.

So I'll just do that: I'll copy a target ID, which I've already set, and I'm also going to use the SSH key for us-west-2. Once I connect through SSH, which hopefully will let me do that, I get into that machine.

The benefit of this is that you're standardizing the workflow. No one has to figure out how to get into EKS, no one has to figure out how to get into ECS, no one has to figure out how to get into a database. It's one place: a developer team has access to the targets they need, and just the targets they need, and they can use the same interface to access them. Defined in Terraform, the scope and target look something like the sketch below.
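
Through Boundary's Terraform provider, the scope-per-runtime and target setup can be sketched like this. The parent scope and host set references are hypothetical names, not the repo's exact resources.

```hcl
# A project scope per runtime, under a hypothetical org scope.
resource "boundary_scope" "ecs" {
  name                     = "runtime-ecs"
  scope_id                 = boundary_scope.org.id # assumed parent scope
  auto_create_admin_role   = true
  auto_create_default_role = true
}

# An SSH-over-TCP target for the ECS container instances in one region.
resource "boundary_target" "ecs_ssh" {
  name            = "ecs-instances-us-west-2"
  type            = "tcp"
  scope_id        = boundary_scope.ecs.id
  default_port    = 22
  host_source_ids = [boundary_host_set_static.ecs.id] # assumed host set
}
```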

So it standardizes all of that. And the neat part is that it's just-in-time. From a security standpoint, you can audit the session as well; there is session recording available, so if you want to audit who's running commands on a box, you have it. And if I decide I don't want the session anymore, because I'm a little concerned about who's accessing my system, I can cancel it, and that will close the connection on the client side.

So a developer who shouldn't have access can have that access stopped after a period of time. Overall, it's a short introduction to what this looks like, but the thought process is that you offer autonomy through just-in-time least privilege. Rather than sitting there trying to configure a bunch of least-privilege policies across the board and figuring out who's going into EKS and ECS, if you standardize on one interface, it helps you ensure that development teams get the access they need without being blocked, and then you can enable audit logging for discoverability.

We have been forwarding those logs to CloudWatch. We also have standardized definitions of identity, and we loosely couple identity from access. So if a contractor comes into our organization and needs temporary access to the system, we can give them temporary access through Boundary and then revoke it afterward by removing them from the project scope, so they have no further access.

There's a lot of automation you can do around this that helps. Finally, what about the thing we promised, which is adding a new runtime? This is when you really start realizing the gains: you've added a new team, maybe you've brought in a whole brand-new company, and they have their own specialized set of skills. You want them to work with you, but also have all of those principles in place.

So let's head over to our code here, and let's pretend that EC2 and front end don't exist. What steps are you going to take to onboard this new team and this new account, and hook them up to everything else we have here? Let's go through the practical steps, if you're using this as a template and a reference.

Well, first and foremost, you're going to make the account. That's pretty straightforward: you head into AWS Organizations, make the account, designate an administrator; they get the sign-on and the credentials, and they're good. The second thing is to make a folder: you create a new directory in the repo and follow the patterns we've established.

Let's go into EC2, for example. They're going to follow the same standardized workflow we talked about earlier: they'll write their global code here at the top, and they'll write the regional module, which then deploys to every single one of the regions. But they're going to get onboarded a lot quicker, because we have reusable modules that are already tested and vetted across all of the other runtimes.

So they'll be able to set up their own network, and they'll be able to set up Boundary, like we just saw, because we have this module here that just accepts the values they configure. Then not only will they be able to access their infrastructure, but you'll be able to as well. And of course, we'll hook them up with Terraform Cloud.

So we'll see the workspace up here, set up for EC2 using the same approach, including the dynamic credentials. They'll also instantly get access to all the other metrics we talked about, the observability in Consul, as well as the CloudTrail data from an organizational perspective. And that will be it.

So: new account, new directory, Terraform Cloud set up. You lay down your networking, you use the other modules for the HCP services, and that gets them all set up to start building the next runtime.

Now, when we got to this point, we actually decided: well, we've got like two days, why don't we add two new runtimes? So we added a front-end one and an EC2 one. The front-end one gives you an idea of what it looks like from a software developer's standpoint. The infrastructure for the front-end runtime is relatively simple, just a single-page web application, but let's take a look at what it involves.

They'll probably have their own Git repo that, in this case, compiles down into some HTML, CSS, and JavaScript, or, if it's an application that needs a full runtime, down into, say, a Docker image if you're on Kubernetes or ECS, and deploying would just be swapping out the image being used. From the front-end perspective here, it's just an S3 bucket that takes that static web application and deploys it behind CloudFront, so it's made global, and then it reaches out to the public API. A minimal shape of it is sketched below.
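
Here is a minimal sketch of that front-end infrastructure, assuming a public bucket with no origin access identity for simplicity; the bucket name and cache behavior are placeholders rather than the repo's exact configuration.

```hcl
# Static assets live in S3.
resource "aws_s3_bucket" "frontend" {
  bucket = "frontend-spa-assets" # placeholder name
}

# CloudFront serves them globally.
resource "aws_cloudfront_distribution" "frontend" {
  enabled             = true
  default_root_object = "index.html"

  origin {
    domain_name = aws_s3_bucket.frontend.bucket_regional_domain_name
    origin_id   = "s3-frontend"

    # Simplified: public bucket, no origin access identity.
    s3_origin_config {
      origin_access_identity = ""
    }
  }

  default_cache_behavior {
    target_origin_id       = "s3-frontend"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    forwarded_values {
      query_string = false
      cookies { forward = "none" }
    }
  }

  restrictions {
    geo_restriction { restriction_type = "none" }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
  }
}
```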

Now, from an EC2 perspective: what was that like, Rosemary? I mean, it was a bit of a non-event, to be honest. Because we had all of this in place, it wasn't too difficult to grab the modules we had and implement them for EC2. The main difference is that EC2 needed a Consul client per service, so that each service self-registers to Consul; you'll see there are services here, payments and reports. And for the most part, because we'd already set everything up in Consul, the sameness groups collect any commonly named applications across all these regions.

For the most part, I didn't need additional settings, which is kind of surprising; I even surprised myself. But even after this, I wanted to make sure that if I wanted to route to my payments application in EC2, I had the ability to do so. That's where cross-partition access and cross-partition service communication became rather important.

I created something called a service resolver, which tells my partition, the EKS partition, to go to EC2 if it's looking for payments. And what that looks like is that we're cross-runtime: we're back in us-west-2 with all of our tasks, and, thank you Cole, now we have an EC2 instance serving our payments application, and this entire call tree can be controlled from one place. The resolver looks roughly like the sketch below.
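
That resolver is a small Consul config entry, roughly this shape, with the partition names matching how this demo uses them (treat them as illustrative):

```hcl
# Requests for "payments" made from the eks partition are redirected
# to the payments service registered in the ec2 partition.
Kind      = "service-resolver"
Name      = "payments"
Partition = "eks"

Redirect {
  Service   = "payments"
  Partition = "ec2"
}
```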

And if we wanted to add yet another runtime or another service, we build out another directory here, add more information, and let someone customize it. We even have a database in here as well; it's in the EKS directory. But if we had other databases we wanted to add, you could add those as separate runtimes too and scale them accordingly.

So we're coming to the end of this. What are some takeaways? Well, what we were hoping was that, between the two of us, we'd at least have the autonomy to set up a new runtime without needing the other, or asking each other questions to figure out what's going on.

Granted, there were some things we had to figure out from a dependency standpoint, but for the most part we were able to set up new runtimes without asking each other. We could use each other's modules without too many questions, and if we needed to extend them for our own purposes, we were able to do that.

And standardization was particularly important for us in this exercise. The Boundary module and the runtime network module are standardized; we matched each other's approaches, we formatted them to make sure they were readable, and we can continue growing, extending, and building on them.

Loose coupling was pretty helpful in this environment, right? Most of the runtimes don't really know about each other, and for the most part we don't say go to EC2 or go to EKS; instead, we use these abstractions to surface information per service and group things at the service level. Take payments, for example: it doesn't matter whether it's on EC2 or ECS, I don't really care. It's the payments service, and that's what I care about most.

And from a discoverability standpoint, the self-registration here helps quite a bit. All of this helps us extend the infrastructure and services we've shown here. Now, if you're curious to learn more, because there's a lot going on in this project, do reach out to us.

We have a repository available, and definitely ask us questions. Again, there's far more in there than we were able to describe in this talk, including database credential rotation and credential injection through Boundary. If you want more educational material, check out our developer portal, HashiCorp Developer, and check out HashiCorp Cloud Platform if you're not really interested in running Consul, Vault, and Boundary on your own, like some of what we showed today.

You're welcome to check that out: we'll run the server instances, and you can connect up to them, experiment, and get the benefit from it.

So with that, if you have questions, we're going to be right outside. We're not going to take them on the floor here, so if you do have any, we'll be standing right outside. We appreciate you joining us today, and good luck with your own multi-account, multi-runtime, and hopefully multi-region deployments. Thank you. Thank you.
