Getting started building serverless SaaS architectures

We are here today to learn about the fundamentals of building software-as-a-service, or SaaS, applications on AWS using serverless.

Now, I actually work with software vendors. I've come over from London, where I'm a senior solutions architect, and across all the conversations I've had with my customers over the last few years, there's been a common story, a common trend: they want to save costs, they want to modernize legacy code, and they want to increase their operational efficiency.

I'm going to show you how you can do that today with serverless SaaS.

So before we get started, a quick poll from the audience: who here is already running serverless SaaS in production today? Okay, quite a few of you. How many people are looking to modernize something to serverless SaaS? Okay. How many people are looking to build something new with serverless SaaS? Okay, great. Well, the good news is you're all in the right place.

Now, SaaS is not just a technical conversation; it's also a business mindset shift. But today we're going to focus more on the technical side; this is a level 200 technical talk. So we're going to start by looking at some of the drivers: why serverless, and why SaaS? Then we're going to move on to talk about serverless in a multi-tenanted model, so how you can do things like a pooled model where all tenants share the same resources.

And we're also going to look at serverless in some common integration patterns to help us solve some architecture challenges. Hopefully, by the end of today, you'll have some knowledge to take back to your own business use cases: apply this shift away from managing servers and get straight to writing the business logic that delights your customers with exciting new features.

So we're going to do everything today through the lens of a fictional scenario. Who here has been to a restaurant before? Some people, which is good. This is the Any Company restaurant management software; there are no prizes for creativity in the name. It's essentially business-to-business software that gives restaurant businesses everything they need to run and manage a restaurant.

So think about things like taking orders on site and sending those through to the kitchen to be processed; creating, managing and publishing menus; and managing the supplies of food and drink, so placing orders for new stock. Now, this is just an example to help frame the context today, but think about how this translates to your own businesses.

Here's Any Company's journey, and this may be familiar to some of you in the audience. The product was originally launched as a software download back in the early 2000s. Customers would go and download it, or receive a CD-ROM in the post; they'd pay a perpetual license fee and then install, configure and manage the software themselves. And they typically did this badly: they'd put it on a single machine in a cupboard somewhere, on failing hardware, maybe in the kitchen where things got very hot and greasy. And this led to a bad experience.

So fast forward: Any Company decided to change to offering a managed hosted solution, running the software on behalf of their own customers, and they chose to do this on AWS. Now, the software is much the same as it was back in the early 2000s; it still has some of the legacy code in there. So there's a future desire to modernize this to serverless and also move to a full software-as-a-service model, and we're going to explore how this can be done today.

Now, let's think about the customers of Any Company, right? These are restaurants and they come in all shapes and sizes, anywhere from a very small organization that is just a single independent site with a handful of employees all the way through to a very large enterprise which has multiple restaurant chains and sites around the world.

Now, from now on we're going to refer to these customers as tenants, and a tenant is a fundamental concept when building a SaaS application. Every customer of the system is a tenant, and you should think about who the tenant is in your business. If we look at one of our large enterprise customers, a tenant that is much larger than Any Company itself, these can be very complex organizations: perhaps an umbrella company that has acquired multiple sub-organizations, with multiple restaurant chains (maybe a pizza restaurant and a sushi restaurant) and multiple sites around the world.

Now, why am I telling you this? Because you should consider who the tenant is in your organization. At which point do we define the tenant in our large enterprise use case? Is it the outer umbrella company, or is it as granular as a single site? How you define the tenant shapes how you manage authentication, permissions, data and more, so it's really important to think about this early in the design process.

Now, this is the managed hosting offering as it stands today. Remember, I said Any Company is running the software on behalf of their customers, their tenants. Each tenant gets their own copy of the software, and this software runs on a set of EC2 instances inside a VPC.

Now, this comes with some challenges, hence the desire to move to serverless SaaS. One of them is idle resources for smaller customers, smaller tenants. Some of these businesses are just a single site, and some days they barely use the system; there are just a few requests in a day. Remember that these are restaurants as well, so they generally operate at meal times, lunch and dinner, but outside of that the activity is fairly low.

Now, with EC2, even with small instance sizes and good auto scaling, we've still got to keep multiple instances running 24/7 across multiple Availability Zones to ensure the solution is always available and resilient. The next challenge is that we're slow to deliver new features. This code is much the same as it was back in the early 2000s, and every time new features have been added, the complexity has increased. It's becoming more and more spaghetti code, and the chances of deploying now without any bugs are fairly low.

New tenant onboarding is hard. Every time a new tenant comes into the system, we have to provision a new set of infrastructure: new VPCs and new EC2 instances. And even with infrastructure as code and automation, this takes time, because we have to go through a right-sizing exercise to make sure it's scaled for that particular tenant's needs.

The last one is around third-party integration. There's a desire to integrate with partners, for example delivery companies, to reach a wider audience, and also with some of the suppliers of food and drink, so we can place orders quickly and get real-time price updates from the wholesalers.

So the desire is to move to serverless and modernize to SaaS.

So firstly, why serverless? Because managing servers is hard. I used to manage a lot of servers; I did it for 10-plus years. I spent a lot of time choosing the best operating system for the company, hardening it to industry standards, automating patching, building image-baking pipelines, monitoring it, and responding to out-of-hours calls when things went wrong, all to find that the operating system was nearing end of life and the whole process was starting again.

And when I think back, all the business really wanted was a running application that was highly available, secure and cost effective. And this is what serverless gives us. So we get straight to writing our application code without having to think about the servers underneath.

We get the benefits of paying for requests or events that flow through the system rather than an hourly charge. And as more events come into the system, the infrastructure automatically scales up behind the scenes. Similarly, it scales back down again when we're not using the system as much. And in most cases, we don't pay for idle. So when there are no requests or events, we're not paying anything for the infrastructure.

So why SaaS? Because it's what our customers want, right? For the same reasons that we want to use serverless. Our customers want to use a SaaS offering. They want to get straight to using the software for the features that it gives them. They don't want to have to think about installing, configuring, scaling, operating it.

For us as a software vendor, Any Company, SaaS gives us the benefits of increased operational efficiency, which means we can react to changes in the market faster and we can also increase our operating margins.

So when we talk about software as a service at AWS: if you've been to any other talks on SaaS, or seen some of the conversations from the SaaS Factory team we have at AWS, you may have heard them talk about the importance of a control plane. The control plane is essentially a set of shared services that sits around the outside of the core application, providing administration-type services.

Think about things like onboarding tenants, managing metrics, handling identity, and all of the monitoring and so on. Now, whilst we'll touch on some of these services today, we're actually going to focus on the application itself, and we're going to build the application using serverless.

So we'll start with an introduction to serverless: what we have available to us and some common architecture patterns. Then we'll build up the architecture, look at data, integration and updates, and finish with migration.

So what's available to us in the serverless toolbox? What services are serverless on AWS? This is not an exhaustive list.

Amazon CloudFront - so Amazon CloudFront provides us with static content distribution. This uses the 550+ edge locations around the world to cache static content, but also to get us into the AWS backbone network faster than going back to the region.

For building dynamic APIs - we have a couple of services. One is API Gateway, which provides us with RESTful APIs or WebSocket APIs. AppSync provides us with GraphQL APIs.

For authentication - there's Amazon Cognito, which gives us the ability to create user pools and manage identities, authentication and tokens.

AWS Lambda was one of the first services to fall under the serverless umbrella; it pioneered serverless, and it provides us with the core compute capability. This is where we write our custom code, our business logic, and we can use one of the managed runtimes for Java, Node.js, Python and others, or bring our own runtime.

For storage - we've got DynamoDB to provide us with a key-value store (this is a NoSQL database), and Amazon S3 is where we store objects. These are our files, and S3 provides us with 11 nines of data durability.

In the middle here, we've got a whole bunch of integration services. So these are our glue to stick our building blocks together.

Services here provide us with message queues to do asynchronous processing. We can do publish/subscribe, we can do event routing and we can do orchestration. We're going to look at how these services can help us a little bit later.

Now, all of the services you see on the screen are highly available across multiple Availability Zones out of the box; this is not something you have to build yourselves. They also scale automatically based on the requests that come through the system. And in most cases you don't pay for idle: when you're not using them, you're not paying for them.

So if we put these building blocks together, we can build virtually any application we like. But a common starting architecture is the serverless web application, and it looks like this.

So users come in through the entry point, Amazon CloudFront, which, remember, provides us with a cache. It's going to cache files from an S3 bucket; in this case HTML, CSS, JavaScript and media.

Now, if we use a modern web framework like React, Vue.js or Angular, the static files get downloaded to the user's browser, and then it's the JavaScript that calls back to retrieve the data, the dynamic content. In this example, those dynamic content requests go to API Gateway.

Behind API Gateway, we have AWS Lambda providing us with our business logic, our custom code and then this will read and write data from DynamoDB.

Now this is a common starting point of a serverless web application. It's simple and it's easy to build.

If we were to take this architecture and stamp it out in the same way that we have our managed hosting offering today on EC2, it would look something like this, right? We'd create a copy of the resources for every tenant and this is something we would call the silo model.

Now, this is perfectly valid as a model. And because we're now using serverless, we don't necessarily pay more the more times we stamp this out. The challenge comes on the management side: as we scale this to hundreds or thousands of tenants, things start to look a bit like this.

Monitoring and management become decentralized. We have to make sure that updates are applied consistently. There's a temptation to customize something for one particular tenant, which is something of an anti-pattern for SaaS. We may also end up with duplicate data if we've got common shared services, and we have to copy the infrastructure many times over.

So for today, we're going to focus on the pooled model, where all tenants share the same set of resources. This gives us greater agility and a simplified management approach as well. But it does come with some trade-offs, some challenges; everything is a trade-off.

For example, how do we stop one tenant impacting another tenant's experience (the noisy neighbor problem)? How do we isolate data to make sure one tenant doesn't access another tenant's data unintentionally? And how do we minimize the scope of impact? If we do updates, there's a risk that we impact all tenants. We'll explore how to solve some of these challenges today.

So let's move on to talk about the entry point for our restaurant management software. Users will normally come into the system in one of two ways. They can come in via direct API calls; this is good for programmatic access, if they want to build their own integrations or do automation. But this is restaurant management software, and we're going to assume that in most cases the experience is through a web UI.

So how will the users access the web UI? They'll browse to a URL, so we could have a generic URL that's used by all tenants. Now, this is fairly simple to implement, but it's hard to get the tenant context: how do we understand who the tenant is if we want to customize the front-door experience? We could bake the tenant ID into the path, and this might be something like a GUID or a UUID. We could also give each tenant their own subdomain, or we could allow the tenants to bring their own custom domain, which offers the highest level of customizability if we want the solution to look like it's actually produced by the end customer.

So how would you implement each of these? Well, the first two are fairly straightforward: we create a DNS entry inside Route 53, an alias record that points to a CloudFront distribution. We then use a service called AWS Certificate Manager (ACM) to issue a free public certificate and associate it with the CloudFront distribution. For the third option, things are a little more complicated. Each tenant now needs their own DNS record, so inside our Route 53 hosted zone we'll create a record per tenant, and in ACM we're probably going to have to issue a wildcard certificate to make this scale, which is not ideal. Our last option is the most complex to implement, because now we either have to host a DNS zone on behalf of the tenant, or we get each tenant to point an alias record at our CloudFront distribution.
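
To make the subdomain option concrete, here's a minimal boto3 sketch of creating a per-tenant alias record that points at a CloudFront distribution. It assumes a hosted zone and distribution already exist; the zone ID, domain names and function name are illustrative, not from the talk. Note that Z2FDTNDATAQYW2 is the fixed hosted zone ID used for all CloudFront alias targets.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"                   # our Route 53 hosted zone (hypothetical)
CLOUDFRONT_DOMAIN = "d111111abcdef8.cloudfront.net"  # our distribution (hypothetical)
CLOUDFRONT_ZONE_ID = "Z2FDTNDATAQYW2"                # fixed zone ID for all CloudFront alias targets

def create_tenant_subdomain(tenant_subdomain: str) -> None:
    """Create e.g. tenant1.anycompany.example as an alias to CloudFront."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{tenant_subdomain}.anycompany.example",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": CLOUDFRONT_ZONE_ID,
                        "DNSName": CLOUDFRONT_DOMAIN,
                        "EvaluateTargetHealth": False,
                    },
                },
            }]
        },
    )
```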

On the certificate side, we're going to have to add multiple Subject Alternative Names (SANs) to the certificate, because there's a one-to-one mapping between an ACM certificate and a CloudFront distribution. Now, there's a limit on the number of SANs you can add to a certificate, so this is ultimately going to lead to us adding more CloudFront distributions pointing at the same back end.

So for today, we're going to stick with the pooled model, sharing resources, and we're going to go with the generic URL. But we said we wanted to understand who the tenant is, so let's look at the login flow to see how this works.

When users browse to the URL and log into the system, they'll first be prompted for a username. Note that there's no password field here, and you may see this on other SaaS solutions: we're first identifying which tenant the user belongs to. The reason for this is to allow our customers, our tenants, to bring their own identity provider if they want to. For example, our large enterprise tenants may have their own identity provider for single sign-on that they use internally, and they want to integrate it into our system.

So if we find that this user belongs to a tenant who's decided to use our built-in identity provider, we can then go ahead and prompt them for the password, and we'll use an industry standard like OpenID Connect to authenticate them against our identity provider, which might be something like Amazon Cognito, for example. What's returned from the identity provider using OpenID Connect is something called a JSON Web Token, or JWT (sometimes pronounced "jot"). This is essentially a signed JSON object: a set of key-value pairs called claims that tell you information about the identity. It's signed, so it can't be tampered with, and you can verify that it came from the identity provider. It also includes things like an expiry timestamp and tenant context such as the tenant ID.
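
As an illustration, a decoded JWT payload might look something like the sketch below. The claim names, in particular the custom tenant claim, are hypothetical; what you actually get depends on how your identity provider is configured.

```python
# A decoded JWT payload (illustrative; the custom claim names are not a fixed standard).
example_claims = {
    "sub": "a1b2c3d4-user-id",                # unique identifier for the user
    "iss": "https://idp.anycompany.example",  # which identity provider issued the token
    "aud": "restaurant-app",                  # the client the token was issued for
    "exp": 1700000000,                        # expiry timestamp (seconds since epoch)
    "custom:tenant_id": "tenant-42",          # tenant context carried in the token
    "custom:role": "restaurant-admin",        # example role claim
}
```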

Now, on that first lookup, if we find that this is actually one of our enterprise tenants, someone who's brought their own identity provider, we can redirect them to their OpenID Connect login page. They'll authenticate using whatever mechanism they've defined (their password policy, multi-factor authentication, et cetera), and at the end of it we're still returned a JWT.

Now, this JWT is essentially our token to access our data, our dynamic calls to API Gateway. So let's look at how we use this with our serverless API.

Once the user is logged in and has the token, the token is passed to API Gateway in an authorization header; this is something called a bearer token. So the header is an HTTP header, and you pass it to API Gateway. But we actually need to add another component to our architecture to decide whether this user is allowed access or not, and this is something called a Lambda authorizer, or custom authorizer. It's essentially a Lambda function like any other: a small piece of custom code that you write, where you define the logic to say whether the request is allowed or denied.

Now, inside that Lambda code, you'll do things like check that the token hasn't expired, and validate that it hasn't been tampered with and is signed correctly. Then you'll look at some of the claims, some of the attributes, to see whether this user belongs to a tenant which has access, or has the correct role. What you can also do with a custom authorizer is return this thing called a context object. The context object is a set of key-value pairs, anything you like, essentially. In our example, we want to make sure the tenant context, the tenant ID, is passed throughout the system, so we're going to include the tenant ID, extracting it either directly from the JWT or doing a lookup to translate it. We'll also include any sub-context information we're interested in; for example, if this is a larger tenant with multiple sites, we might want to include the site ID for our restaurant. API Gateway can then pass these context variables on as HTTP headers, query strings or part of the path, or, if we're using Lambda, they'll automatically get passed through as part of the event payload.
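
Here's a minimal sketch of what such a Lambda authorizer could look like in Python, using the PyJWT library for verification. The signing key, audience and claim names are assumptions for illustration; in practice you'd fetch the key from the identity provider's JWKS endpoint.

```python
import jwt  # PyJWT

# In practice, fetch this from the identity provider's JWKS endpoint and cache it.
SIGNING_KEY = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"

def handler(event, context):
    token = event["authorizationToken"].removeprefix("Bearer ")

    # Verifies the signature and expiry; raises if invalid, which denies access.
    claims = jwt.decode(token, SIGNING_KEY, algorithms=["RS256"],
                        audience="restaurant-app")

    return {
        "principalId": claims["sub"],
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow",
                "Resource": event["methodArn"],
            }],
        },
        # The context object: API Gateway passes these values downstream.
        "context": {
            "tenantId": claims["custom:tenant_id"],      # hypothetical claim name
            "siteId": claims.get("custom:site_id", ""),  # optional sub-context
        },
    }
```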

So now we have our tenant context inside our system. Next, let's think about storing and accessing data; this is going to be important to make sure that one tenant doesn't access another tenant's data in this multi-tenanted pooled model.

Our core services here include DynamoDB and S3, and we need to ensure that the tenant context is present when we're storing our data. In the DynamoDB example, here's a look at our table for managing our inventory of stock in the kitchen: we've included the tenant ID as the partition key, the primary key. Now, in reality you would add some randomness to the end of this to ensure you don't create hot partitions for large tenants, but this shows the idea. For S3, we could create a bucket for each tenant, but that means adding new infrastructure each time we onboard a new tenant, and it's not something we really want to manage. We'd also have to think about things like service quotas for the number of buckets we're allowed in our account.

So to make things easier, we're going to share one bucket between all tenants, and we're going to separate the data using a prefix per tenant. So now we know how the data is stored; let's think about how we access it. Well, this is a serverless architecture, so we're going to use Lambda to access the data, and when we access services from Lambda, we have this thing called an execution role, which has an IAM policy assigned to it.
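
Here's a short sketch of both storage patterns, with hypothetical table, bucket and attribute names: the DynamoDB partition key leads with the tenant ID (plus a random suffix to spread large tenants across partitions), and S3 objects live under a per-tenant prefix in one shared bucket.

```python
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def save_stock_item(tenant_id: str, item_name: str, quantity: int) -> None:
    # Partition key leads with the tenant ID; the random suffix spreads a
    # large tenant's items across partitions to avoid hot keys.
    table = dynamodb.Table("Stock")  # hypothetical table name
    table.put_item(Item={
        "pk": f"{tenant_id}#{uuid.uuid4().hex[:8]}",
        "item_name": item_name,
        "quantity": quantity,
    })

def save_menu_pdf(tenant_id: str, pdf_bytes: bytes) -> None:
    # One shared bucket; each tenant's objects live under their own prefix.
    s3.put_object(
        Bucket="anycompany-menus",  # hypothetical shared bucket
        Key=f"{tenant_id}/menus/latest.pdf",
        Body=pdf_bytes,
    )
```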

Now, in this example, the policy allows us to get items from the DynamoDB table, the stock table, and it looks something like this. Inside our code, we would essentially write a filter, a WHERE clause, to say "where tenant ID equals X". But if there's a bug or some accidental change to the code, we could unintentionally access the wrong tenant's data, or all of the tenants' data.

So how do we solve this? How do we stop one tenant accessing another tenant's data and causing what would essentially be a catastrophic impact to Any Company's reputation? Well, we can do this by using the tenant context that was passed to us from API Gateway. We can assume a role, some temporary credentials, at runtime inside the Lambda function. So we're not using the Lambda execution role to grant permissions to DynamoDB. What we're doing is extracting that tenant ID, adding it to a dynamically generated IAM policy, and then assuming a role using an STS AssumeRole call to IAM, which returns us temporary credentials. Now we can access DynamoDB, and even if we try to access tenant two's data, we can't.

This uses a specific IAM condition key for DynamoDB called dynamodb:LeadingKeys, which essentially says you can only access items in the table whose partition (primary) key starts with a given value. We can do the same for Amazon S3: this time, we assume temporary credentials with a dynamically generated IAM policy that scopes access down to a particular prefix. Now, if we tried to access another tenant's menus, we would get access denied, even if we tried to inside the code or there was some unintentional bug. We're applying security in layers; think about how you can do this in your own organizations.
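
Here's a minimal sketch of that runtime scoping, assuming the key layout from the earlier storage example; the role ARN and resource names are hypothetical. The session policy passed to AssumeRole intersects with the role's own permissions, so the returned credentials can only reach the one tenant's partition keys and S3 prefix.

```python
import json
import boto3

sts = boto3.client("sts")

def tenant_scoped_session(tenant_id: str) -> boto3.Session:
    """Return a boto3 session whose credentials only reach this tenant's data."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": "arn:aws:dynamodb:*:*:table/Stock",  # hypothetical table
                "Condition": {
                    # Only items whose partition key starts with this tenant's ID.
                    "ForAllValues:StringLike": {
                        "dynamodb:LeadingKeys": [f"{tenant_id}#*"]
                    }
                },
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Only objects under this tenant's prefix in the shared bucket.
                "Resource": f"arn:aws:s3:::anycompany-menus/{tenant_id}/*",
            },
        ],
    }
    resp = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/TenantAccessRole",  # hypothetical
        RoleSessionName=f"tenant-{tenant_id}",
        Policy=json.dumps(policy),  # session policy: intersected with the role's permissions
    )
    creds = resp["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```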

So now let's look beyond the core API. So far we've talked about Lambda and API Gateway, but we mentioned there were those integration services available to us in the middle of our earlier diagram.

So far, we've talked about synchronous integration patterns: the sender makes a request and waits for a response. This is simple, it fails fast and it's low latency. But what if the receiver takes a long time to process the request? The sender has to wait for that to happen, and it could take minutes or hours. What if the receiver fails? The sender has to retry, and what if they don't do that properly? How do they store the data to retry at a much later time? And what if the receiver can't cope with the rate of downstream requests because there's a very low requests-per-second throttle?

Well, this is where asynchronous patterns can help us. Rather than sending directly to the receiver, we have an intermediary: the queue. The sender sends to the queue and gets an instant acknowledgement back, and the receiver can then pick messages off the queue and process them in its own time. This is great because if the receiver fails, the message is still persisted and can be picked up by another receiver, and if the processing takes a long time, it can happen in the background.

The final pattern is publish/subscribe. This time, the sender is sending to multiple receivers. Rather than being a pull, it's a push: it fans out the same message to multiple interested parties.

So let's see how we can use these patterns in our restaurant example, starting with asynchronous processing. I mentioned at the start that the restaurant management software has the ability to create, publish and update menus. So if a user goes into the system to change something (the price, a delicious new meal, the calories, whatever it is), when they click update, the request goes to API Gateway, through to a Lambda function, and updates DynamoDB. But at the same time, we can also send a message to SQS.

Now, why would we do this? Because although having all the menu data inside DynamoDB is great for the software experience, the humans coming into the restaurants want to see something nice and visual, with nice graphics, pictures and fonts, and this might take some time to generate. So we send a message to SQS, and we have a Lambda function which polls the queue for messages (this is a built-in integration) and generates the PDFs in the background. This might take a minute or two, and once the menus are generated, it stores them in an S3 bucket.

Now, this is just one example, one way to do it; there are others. Let's look at another asynchronous processing example.

This time we're thinking about our third-party integrations. Our food and drink suppliers might send us a nightly CSV file containing a list of items available to purchase and their prices. It's not necessarily a modern or fancy integration mechanism, but it's one I've seen from working with software vendors. Once the file is uploaded to an S3 bucket, we can use a feature of S3 called event notifications to say: when an object is stored in this bucket, send a message to SQS. Again, we have a Lambda function processing messages from the queue; it takes the CSV file, extracts it, and stores the updated data in our DynamoDB table, which gives us the latest stock and prices.

Now, we could also invoke the Lambda function directly from S3, but using the queue gives us the ability to allow for retries and to use things like dead-letter queues for unprocessed messages.
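
As a rough sketch of the consumer side, the Lambda function below unwraps the S3 event notification from each SQS record, fetches the CSV and writes rows to DynamoDB. The table name, CSV column names and the tenant-ID-as-prefix convention are assumptions for illustration.

```python
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("SupplierPrices")  # hypothetical table

def handler(event, context):
    for record in event["Records"]:            # SQS records
        s3_event = json.loads(record["body"])  # each body wraps an S3 event notification
        for entry in s3_event.get("Records", []):
            bucket = entry["s3"]["bucket"]["name"]
            key = entry["s3"]["object"]["key"]  # note: URL-encoded for special characters

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            for row in csv.DictReader(io.StringIO(body.decode("utf-8"))):
                table.put_item(Item={
                    "pk": key.split("/")[0],   # assumes the tenant ID prefixes the key
                    "item_name": row["item"],  # hypothetical CSV columns
                    "price": row["price"],
                })
```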

Now, in both of those examples, if we think about our multi-tenant pooled model, all of the tenants' messages are going into the same queue. And this is OK; this is what we expect. But remember, from everything we've talked about so far, it's important that we have the tenant context, the tenant ID, throughout the system.

So one thing we can do is use a feature of SQS called message attributes, and set the tenant ID as a message attribute, which sits outside of the message body. This ensures there's tenant context as those messages are consumed, for example to assume temporary credentials.
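
On the producer side, attaching the tenant ID might look like this minimal sketch; the queue URL and attribute name are illustrative.

```python
import json
import boto3

sqs = boto3.client("sqs")

def publish_menu_update(queue_url: str, tenant_id: str, menu: dict) -> None:
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(menu),
        # Tenant context travels outside the message body, so consumers can
        # read it without having to parse the payload.
        MessageAttributes={
            "tenantId": {"DataType": "String", "StringValue": tenant_id}
        },
    )
```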

Now, because this is a serverless architecture, we have Lambda processing messages from the SQS queue. When we set up the integration between SQS and Lambda, we have this thing called a batch size: how many messages the Lambda function receives from the queue in one invocation. By default this is set to 10, which means up to 10 messages can be received in one invocation, and inside your Lambda code you loop through and process each one of those messages.
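
On the consuming side, a handler might loop through the batch like this sketch, reading the tenant ID attribute per message and reusing the tenant_scoped_session helper from earlier; process_message stands in for your business logic.

```python
def handler(event, context):
    # With the default batch size, up to 10 messages arrive per invocation,
    # and in a pooled model they may belong to different tenants.
    for record in event["Records"]:
        tenant_id = record["messageAttributes"]["tenantId"]["stringValue"]
        session = tenant_scoped_session(tenant_id)  # scoped credentials per message
        process_message(session, record["body"])    # hypothetical business logic
```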

In our pooled model example, this means we could receive messages from 10 different tenants in one invocation. So, thinking back to our example of assuming temporary credentials, we'd have to assume 10 different sets of credentials. That's fine to do, but again, if there's a mistake in the code where you get your wires crossed, you may end up accessing a different tenant's data.

So one thing you can do, if the throughput to the queue (the rate at which messages arrive) is fairly low, is set the batch size to one; then we only ever process one tenant's message at a time. Another option is to have a queue per tenant. This is great for ensuring a strong level of isolation, and it also helps us manage things like noisy neighbor scenarios, where one tenant's messages may be blocking another tenant's.

But again, think about onboarding and the management side of things for us as a software company: every time we onboard a new tenant, we have to create a new queue, manage it, and apply configuration updates to it. So think about the implementation; generally, for most workloads, the shared queue is sufficient.

We're still talking about integration now, but focusing more on events. When we talk about SaaS, we often talk about event-driven architectures. This is because with serverless, we're paying for the number of events or requests that come through the system, and building event-driven architectures helps us build something that's scalable and cost effective.

So let's look at another example. This one is for actually taking orders on site inside the restaurant. The waiting staff go up to the table with their mobile device and take the table's order: what type of fries would you like, would you like salad? Once they tap OK, the order is submitted to API Gateway, through to Lambda, and stored in the DynamoDB table.

What we can also do is emit an event: something has happened, an order has been created, and we can send this event to EventBridge. EventBridge gives us the ability to create custom event buses. This is where you send your events, and you then create a series of rules. These rules are essentially a pattern match on the event, saying: I'm interested in these types of events, and I want to perform some action.

In our example, we're going to trigger a Step Functions workflow to actually process the order, and a Lambda function to go and update the bill. Now, the great thing about this architecture is that if we add a new feature in the future, all we need to do is add a new rule. We don't have to modify any of the code in the surrounding services: not the submit-food-order Lambda, nor any of the other workflows or Lambda functions.

Now, this is what the event schema might look like. EventBridge has some required fields, such as the source: this is the service that produced the event, and it often uses reverse-domain (Java-style) naming, though it doesn't have to. The detail type is the name of the event, and it follows a noun-verb format, like OrderCreated. Then, inside the detail object, we can have any key-value attributes we like, up to the maximum message payload size. And here is where we should think about our metadata, making sure we have that tenant ID, that tenant context, to pass through to downstream services in case they want to use it, for example for logging, metrics, assuming temporary credentials, and so on.
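
Emitting such an event from the submit-order Lambda might look like the sketch below; the bus name, source and detail fields are illustrative, not the talk's exact schema.

```python
import json
from datetime import datetime, timezone

import boto3

events = boto3.client("events")

def emit_order_created(tenant_id: str, order_id: str) -> None:
    events.put_events(Entries=[{
        "EventBusName": "restaurant-app",   # hypothetical custom event bus
        "Source": "com.anycompany.orders",  # reverse-domain style producer name
        "DetailType": "OrderCreated",       # noun-verb event name
        "Time": datetime.now(timezone.utc),
        "Detail": json.dumps({
            "orderId": order_id,
            "metadata": {
                "tenantId": tenant_id,      # tenant context for downstream consumers
            },
        }),
    }])
```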

So one of the rules we looked at in the previous example was to process the order using Step Functions, and this is an example of orchestration with serverless. When our OrderCreated event happens, we have a rule that triggers the Step Functions workflow, and the workflow might look something like this. With Step Functions, we can chain multiple actions, multiple service calls, together to create a workflow, and there are various integration types we can use, such as running things in parallel.

The great thing about Step Functions is that with standard workflows, we pay for the number of state transitions rather than the time the workflow runs for. So in our restaurant example, all the time the order is being prepared in the kitchen, which may take 20 or 30 minutes, we're not paying for anything; we only pay when the process resumes and the next transition happens.

So here's our workflow. In the first step, we get the order details from DynamoDB; this uses a direct integration, no Lambda function required. We then run a Lambda function to update the displays in the kitchen, telling the cooks to prepare the order. When they've finished preparing the meal, they can press a button on the touchscreen, for example, and this notifies the waiting staff, for example using SNS, to collect the order and deliver it to the right table.

In the final step, we emit another event back to EventBridge, again using a direct integration from Step Functions. And although we might not be using this OrderComplete event right now, it's good practice to have it there for any future integrations, developments or new features.
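
A heavily trimmed Amazon States Language sketch of such a workflow is below, registered via boto3. It shows the two direct integrations the talk mentions (a DynamoDB getItem at the start and an EventBridge putEvents at the end) with one Lambda task in between; all ARNs, names and paths are hypothetical, and the real workflow would have more states.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "GetOrderDetails",
    "States": {
        "GetOrderDetails": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",  # direct integration, no Lambda
            "Parameters": {
                "TableName": "Orders",
                "Key": {"pk": {"S.$": "$.detail.orderId"}},
            },
            "ResultPath": "$.order",
            "Next": "NotifyKitchen",
        },
        "NotifyKitchen": {
            "Type": "Task",  # a Lambda task to update the kitchen displays
            "Resource": "arn:aws:lambda:eu-west-2:123456789012:function:UpdateKitchenDisplay",
            "ResultPath": "$.display",
            "Next": "EmitOrderComplete",
        },
        "EmitOrderComplete": {
            "Type": "Task",
            "Resource": "arn:aws:states:::events:putEvents",  # direct integration back to EventBridge
            "Parameters": {
                "Entries": [{
                    "EventBusName": "restaurant-app",
                    "Source": "com.anycompany.orders",
                    "DetailType": "OrderComplete",
                    "Detail.$": "$.detail",  # carries the tenant context forward
                }]
            },
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="ProcessOrder",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/ProcessOrderRole",  # hypothetical
)
```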

Now, here's a reminder of what our event payload looks like; this is what gets passed as the input to the Step Functions workflow. And again, it's important that the tenant context is there, because in any one of these actions we may want to use it for logging purposes or to assume temporary credentials.

So next up is managing software updates. When we think about rolling out changes to our architecture (doing updates, adding new features, shipping bug fixes), we want things to be as smooth as possible. We're operating a software-as-a-service platform, and we want the experience to be good for the end users. So how do we get from A to B?

We may have updated some Lambda code, or added an SQS queue. Before, we had a stack per tenant, so we could roll things out gradually. Now all tenants share the same set of resources, so how do we stop every tenant's experience being impacted by a failed change?

Well, we want to follow best practice: infrastructure as code, version controlled, with automated continuous integration and continuous delivery pipelines, making frequent small changes. And with serverless, the application code and the infrastructure code come closer together. So we use a framework such as SAM or CDK (or others), this infrastructure as code becomes our artifact, and we roll that artifact through a path to production.
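
For a flavor of what that artifact can look like, here's a minimal CDK sketch in Python that defines a Lambda function fronted by API Gateway; the stack, asset path and names are illustrative, not Any Company's actual stack.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as lambda_
from constructs import Construct

class OrderServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Business logic lives in ./src; the framework packages and deploys it.
        handler = lambda_.Function(
            self, "SubmitOrder",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="app.handler",
            code=lambda_.Code.from_asset("src"),
        )

        # REST API in front of the function; this whole file is the artifact
        # we promote through test, staging and production.
        apigw.LambdaRestApi(self, "OrdersApi", handler=handler)

app = App()
OrderServiceStack(app, "OrderServiceStack")
app.synth()
```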

So we take it through multiple environments, for example for performance testing, integration testing and user acceptance testing, all the way through to production. For best practice, we'd want these environments in different AWS accounts. But when we actually get to production, even with as much automated testing as we can afford, there's still really nothing like rolling out to production.

So how do we reduce the possibility of impact? Well, we can use the concept of waves, and region by region is a good example of using waves. Because serverless services are regional constructs, we'll have to create the stack in multiple Regions anyway if we want the solution to be available worldwide. So we could start with the Region with the smallest number of tenants and work through to the largest, and we might do this at a period when usage is low, following a follow-the-sun model, for example.

Another example of waves is by tier. For example, we could separate our free tier tenants into their own stack. This might be good because those tenants may be more open to receiving new features earlier, and we can also control throttling a bit more: we might want to throttle those particular users to ensure they don't consume too much infrastructure resource.

We could also just shard our tenants. If we have one particular Region with a large number of tenants, we could group them into multiple shards, for example group A and group B. This is nowhere near as complex to manage as the siloed model, where we had a stack per tenant; we may just have two or three groups. But now we can roll out to one group at a time, and make sure we have the correct monitoring and metrics in place to confirm the rollout was successful before moving on to the next portion of tenants. And if something goes wrong, we only impact that percentage of tenants.

So let's finish up today by talking about migration. Remember, we have those customers of the restaurant management software currently on the EC2-based managed hosting offering. How do we get them onto our shiny new SaaS platform? The last thing you want is to have to operate both of these solutions in parallel for a long time to come.

Now, Any Company decided to build the new serverless solution in parallel to the legacy platform, which means the least disruption. But now that things are up and running, we want to get customers off the old world and into the new world so we can decommission it. So how do we do that?

Well, one option is to achieve complete feature parity. This is quite difficult to do, but it ensures a seamless transition: if you can get it right, you can essentially lift and shift customers across without them really being aware, though we may have some changes to the billing model, for example.

We could also provide an attractive new offering: new features, new services, better performance in the new world, which makes it attractive for customers to move of their own accord. And because we have greater operational efficiency, and we're using serverless with its automatic scaling, our margins might increase, so we may be able to offer discounted pricing to encourage them to move across.

Another option is managed deprecation: we simply set an end-of-life date for the old solution and, in effect, force customers to move across. We could also look to target smaller customers first, because they may be easier to move: they may have only just started using the system, and the smaller customers may be less complex, for example without the identity provider integration we talked about before.

As I said earlier, just keep in mind that you want to have this plan from the start because the last thing you want is to have both of these solutions running in parallel for a long time to come. This will only increase your operational burden and your costs.

So let's summarize what we've talked about. The challenges we discussed at the start included idle resources for small customers; remember, even with good auto scaling, we were still having to run EC2 instances continuously. With the serverless SaaS solution, we now have automatic scaling with pay-for-usage billing: we pay for events or requests.

New customer onboarding was hard. With the serverless solution, we now have a pooled model where we can just add a new tenant to the system, without provisioning any infrastructure and without doing any right-sizing.

Third-party integrations were complex before. Now we have a number of low-code or no-code options using the serverless integration services: using EventBridge, for example, to just add new rules, or Step Functions workflows to send requests and build quick microservices.

It was slow to deliver new features. Well, now we have a faster time to value with serverless, because we don't have to think about managing servers or maintaining and updating operating systems. We get straight to writing our application code, which means straight to writing the features.

So what are the key takeaways for today? Well, serverless and SaaS go together quite well. Auto scaling matches tenant usage, tenant consumption; we never pay for idle, so it helps to increase our margins in some cases, and this is one of the benefits of moving to SaaS.

We also remove the worry of managing servers, which allows us to get to market faster. Tenant context is fundamental to multi-tenanted pooled models. Remember, it's important to think about who your tenant is in your business case, because this decides how we partition the data, how we manage the experience, and how we manage identity and authentication.

And finally, serverless is more than just RESTful APIs. There's a whole range of services, some of which weren't covered today; think about event-driven architectures, asynchronous patterns, and building workflows with Step Functions. These will help you create loosely coupled, scalable solutions.

Hopefully you can apply what you've learned today to your own business and delight your end customers. If you want to learn more, there's a SaaS reference solution available on GitHub. There's also a hands-on workshop which takes you through deploying it; you can run it inside your own accounts, and it covers some things we covered today, such as assuming temporary credentials inside the Lambda function at runtime, as well as some things we haven't covered, such as the monitoring and metrics aspects and the shared services.

There are also plenty more sessions throughout re:Invent this week; this is only day one. I think there are over 50 sessions on SaaS and more than 100 on serverless, so please have a look in the session catalog. There are more breakout sessions, talks and workshops to help you go deeper into building a serverless SaaS solution.

But otherwise, thank you so much for listening to me today. I appreciate you taking the time to come and hear me speak.
