Azure Web Apps

Azure Web Apps is an Azure product that runs an application as a multi-instance deployment without you having to manage the deployment yourself.

Running a 99.999% Uptime Application

While the promise of Azure web apps is to manage deployments for the application, there is a lot that has to be done to make this a reality.

Swap with Preview

Deploying directly to an application’s production slot restarts the application while the new code is deployed and started up. This can lead to 8 to 15 minutes of downtime per deployment. You will know you have run into this when your application returns 503 status codes.

To get around this problem, Azure has introduced “Swap with Preview,” which takes advantage of deployment slots. Deployment slots are tied to the production application and have a slot name, which is appended to the main production app name.

Example: staging -> your-website-staging.

The swap process is exactly what it sounds like: it swaps two slots and puts the main slot’s application settings on the other slot.

A deployment slot can be created based on another slot, so create one called “staging” and base it off the production slot. This creates a slot with a copy of the production application settings. This deployment slot gets its own Git deployment URL and is in essence a completely separate application. It has its own network storage, Kudu, and application settings. It runs under the same resource group and on the same exact instances as just another app pool and sub application on IIS.

To use the staging slot, update the Git deployment location in Jenkins (your CI) to point to the staging slot’s Git URL. Also, have the repository approved in the Jenkins credentials file for Git deployment. After a deployment, the new application code is on the staging slot but not yet in production. To get the code to production, navigate to the staging slot, click “Swap”, and select “with Preview”. Choose the production slot to swap with. This locks down the application settings in the portal. Azure then applies the production slot’s application settings to the staging slot application, which restarts staging. Application settings that are marked as slot-specific are not moved over during this step.

At this point you must hit all the instances to make sure that each one has restarted successfully. Once the restart has been completed you may complete the swap.

KEY NOTE: Make sure you complete the swap from the staging slot page, not the production page. Also, if you decide that you do not want to swap and need to cancel, MAKE SURE that you cancel from the staging slot or else application settings could be lost.

Do not complete the swap until all instances are successfully running and warmed up. To verify this, add a health page that reports whether the application is warm. Warmup scripts can load this page to determine whether the app is warm and healthy. It should report:

• local cache ready
• can connect to database
• static caches loaded and warm
• hash of instanceId and port
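A minimal sketch of what such a health page could return, written in Python for illustration. The field names (`healthy`, `checks`, `instance`), the choice of SHA-256, and the truncated token length are assumptions, not something the original setup prescribes; the three boolean inputs stand in for whatever readiness checks your application actually performs.

```python
import hashlib
from typing import Dict


def instance_token(instance_id: str, port: int) -> str:
    """Hash instance ID and port so each process is identifiable
    without exposing either value directly."""
    raw = f"{instance_id}:{port}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def health_report(local_cache_ready: bool, database_ok: bool,
                  static_caches_warm: bool,
                  instance_id: str, port: int) -> Dict:
    """Build the body of the health page from the individual checks."""
    checks = {
        "localCacheReady": local_cache_ready,
        "databaseConnected": database_ok,
        "staticCachesWarm": static_caches_warm,
    }
    return {
        # The app counts as warm only when every check passes.
        "healthy": all(checks.values()),
        "checks": checks,
        "instance": instance_token(instance_id, port),
    }
```

The token is stable for a given instance and port, so a script can count distinct tokens to know how many processes have answered.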

As mentioned, this URL should be loaded to warm up the application. Notice that the instance ID and port are hashed: a warmup script uses this value to make sure it has received a successful warm-up response from every instance and process (if you run more than one process per instance). Note that the health URL must be public and must sit in front of your AD authentication logic.
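The bookkeeping a warmup script needs can be sketched as follows. This is an illustration under stated assumptions: `fetch_health` is a hypothetical callable that performs one request to the health URL and returns the parsed response, and the response is assumed to carry a `healthy` flag and an `instance` token (the hashed instance ID and port). How requests get spread across instances is left out, since the source does not specify it.

```python
from typing import Callable, Dict, Set


def warm_up(fetch_health: Callable[[], Dict],
            expected_processes: int,
            max_requests: int = 200) -> Set[str]:
    """Request the health page until `expected_processes` distinct
    processes have each reported healthy at least once."""
    warm: Set[str] = set()
    for _ in range(max_requests):
        report = fetch_health()
        if report.get("healthy"):
            warm.add(report["instance"])
        if len(warm) >= expected_processes:
            return warm
    raise TimeoutError(
        f"only {len(warm)} of {expected_processes} processes reported warm"
    )
```

Counting distinct tokens, rather than successful responses, is what prevents one fast instance from making the whole deployment look warm.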

NOTES

  • Don’t do this manually; set up Jenkins jobs to do this.
    • There are PowerShell commands for this.
    • Doing it manually does not scale.
  • Swaps can hang. Do not try to press the button again.
    • Pressing swap twice does one swap and then another swap, putting the original application code back out.
    • This causes a restart on the production slot and will take down your production application for 1-5 minutes.
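One way to script the three phases (start the preview, complete the swap, or cancel it) is sketched below. Note that it swaps in the Azure CLI rather than the PowerShell commands mentioned above, and it only builds the command line; the resource group and app names are placeholders. Treat it as a starting point for a Jenkins job, not as the exact commands used by the authors.

```python
from typing import List

# "preview" applies the target slot's settings to the source slot,
# "swap" completes a pending swap, "reset" cancels it.
ACTIONS = {"preview", "swap", "reset"}


def swap_command(resource_group: str, app_name: str, action: str,
                 slot: str = "staging",
                 target_slot: str = "production") -> List[str]:
    """Build an `az webapp deployment slot swap` invocation.

    The source slot is always the staging slot, matching the advice
    to complete or cancel from staging rather than production."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return [
        "az", "webapp", "deployment", "slot", "swap",
        "--resource-group", resource_group,
        "--name", app_name,
        "--slot", slot,
        "--target-slot", target_slot,
        "--action", action,
    ]
```

A job would run the `preview` command once, wait for every instance to report warm, and then run `swap` once; given the double-swap behavior described above, it should not blindly retry a swap that appears to hang.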

Local Cache

Azure restarts the application sometimes for a few reasons: errors, file changes, app setting updates, etc. As an application that wants to have high availability you never want the application to restart, but it will happen.

“Local Cache” is an Azure process that copies your application from the shared network drive onto the actual instance’s hard disk. Why would you want this?

• Reading from a network drive takes additional time.
• Running script-based applications that read many files on startup over a network drive is slow.
• Speeds up restarts from 5-10 minutes to 5-8 seconds… BIG WINNER!

Recommended read: Azure documentation.

Local Cache is a great thing, but it has caveats, especially when also doing swap with preview.

To enable Local Cache you must add an application setting: WEBSITE_LOCAL_CACHE_OPTION set to exactly “Always”. When using this with “Swap with Preview” you must mark this setting as a slot setting and set the staging value for WEBSITE_LOCAL_CACHE_OPTION to “Never”. If you do not do this, then when you deploy to the staging slot it will not actually load the new code because the code is not local. Making it a slot setting also lets you truly know when your application is ready to have its “Swap with Preview” completed.
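The effect of marking a setting as a slot setting can be pictured with a small model of a swap. This is a simplified illustration of the end state only, not Azure's implementation: ordinary settings travel with the code, while slot settings stay with their slot. `API_URL` is a hypothetical ordinary setting added for contrast.

```python
from typing import Dict, Set, Tuple

Settings = Dict[str, str]


def swap_settings(production: Settings, staging: Settings,
                  slot_settings: Set[str]) -> Tuple[Settings, Settings]:
    """Return (production, staging) settings as they look after a swap."""
    def after_swap(stays: Settings, arrives: Settings) -> Settings:
        # Ordinary settings arrive with the swapped-in application...
        result = {k: v for k, v in arrives.items() if k not in slot_settings}
        # ...while slot settings remain pinned to the slot.
        result.update({k: v for k, v in stays.items() if k in slot_settings})
        return result

    return after_swap(production, staging), after_swap(staging, production)


prod, stage = swap_settings(
    {"WEBSITE_LOCAL_CACHE_OPTION": "Always", "API_URL": "v1"},
    {"WEBSITE_LOCAL_CACHE_OPTION": "Never", "API_URL": "v2"},
    slot_settings={"WEBSITE_LOCAL_CACHE_OPTION"},
)
```

In this model production keeps “Always” and staging keeps “Never” no matter how many swaps occur, which is the behavior the slot setting is meant to guarantee.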

Local cache sets an environment variable when the code has been copied locally. Once the code has been copied locally the application must be restarted so that it is read from the local copy, not the network copy. This environment variable is: WEBSITE_LOCALCACHE_READY. This environment variable will be set to TRUE when your application is running from the local copy.

You can look in Kudu to see if this environment variable is set, but that is manual and bound to be forgotten. Instead, add a check to your health page that reports whether the application is running from Local Cache; if your application does not have a health page, you MUST add one. Change the warmup script used for swaps so that it waits for Local Cache to be ready on an instance before marking that instance as warm. You can also make your health endpoint aware of whether it is running in production mode and have it return a non-200 status code when it is in production and Local Cache is not ready.
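The status-code decision can be sketched as a small function over the environment. It assumes that the slot where WEBSITE_LOCAL_CACHE_OPTION is “Always” is the production slot (which holds when the setting is a slot setting as described above), and the choice of 503 as the non-200 code is an assumption.

```python
from typing import Mapping


def health_status(env: Mapping[str, str]) -> int:
    """Return 200 when the app may take traffic, 503 while a slot with
    Local Cache enabled is still running from the network copy."""
    cache_enabled = env.get("WEBSITE_LOCAL_CACHE_OPTION", "").lower() == "always"
    cache_ready = env.get("WEBSITE_LOCALCACHE_READY", "").lower() == "true"
    if cache_enabled and not cache_ready:
        return 503
    return 200
```

In a real handler you would pass `os.environ` and combine the result with your other readiness checks.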

Since the code from the network drive is copied locally, a background process runs on some interval to check whether the local code is the same as the code on the network drive. This is important because if something is updated locally and differs from the network drive, the process will modify the local code and restart your application! Not ideal for 100% uptime. What does this mean? Don’t log to your code folder location, make sure New Relic logging is turned off, and don’t have any code that updates files in the local code folder.
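A cheap guard against accidental writes is to check configured log or output paths at startup. The sketch below assumes the code folder is `D:\home\site\wwwroot`; adjust the root to wherever your application is actually deployed. It compares paths lexically and does not resolve `..` segments or symlinks.

```python
from pathlib import PureWindowsPath

# Assumed location of the deployed code; not taken from the original post.
CODE_ROOT = PureWindowsPath(r"D:\home\site\wwwroot")


def writes_into_code_folder(target: str,
                            code_root: PureWindowsPath = CODE_ROOT) -> bool:
    """Return True if `target` lies inside the code folder."""
    try:
        PureWindowsPath(target).relative_to(code_root)
        return True
    except ValueError:
        # relative_to raises ValueError when target is outside code_root.
        return False
```

Failing fast on a `True` result at startup is far cheaper than discovering the problem through an unexplained production restart.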

NOTES

  • Swap with Preview will not complete unless all instances are Local Cache ready.
    • This means if you try and complete before cache is ready, your swap will take a long time and could keep you from rolling back.
    • This note is especially for services that run more than one application and deploy them at the same time. One could deploy before the other.
  • You must check Local Cache on all instances, not just one via Kudu.
  • Local Cache can take anywhere from 1 minute to 15 minutes to be copied over.
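Because the copy can take up to 15 minutes and has to finish on every instance, a deployment script needs a bounded wait rather than a single check. A sketch, assuming a hypothetical `is_ready(instance_id)` callable that reports whether Local Cache is ready on one instance; the clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_local_cache(is_ready: Callable[[str], bool],
                         instance_ids: Iterable[str],
                         timeout_s: float = 15 * 60,
                         interval_s: float = 10.0,
                         clock: Callable[[], float] = time.monotonic,
                         sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until every instance reports Local Cache ready, or raise
    TimeoutError naming the instances that never became ready."""
    pending: Set[str] = set(instance_ids)
    deadline = clock() + timeout_s
    while True:
        # Re-check only the instances that were not ready last time.
        pending = {i for i in pending if not is_ready(i)}
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(f"Local Cache not ready on: {sorted(pending)}")
        sleep(interval_s)
```

Only after this returns should the script go on to complete the swap.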

Application Initialization Module

Azure will restart your applications when it does maintenance on the underlying machines. Azure will also take instances out of rotation and spin up new instances to replace them. Both scenarios are a normal part of running on a cloud provider, and your application should not have downtime because of them.

Azure’s answer is the Application Initialization module, a web.config definition that tells Azure how to check whether your application is warmed up before bringing it into rotation. If it is not configured, Azure only waits for the underlying instance to be ready, not your application code.

Without this module, if you have a static cache that is loaded on startup or if you are using Local Cache, Azure will bring your application into rotation before it is actually ready to take traffic. If you turn this module on, make sure your health URL does not return 200 while Local Cache is not ready. The logic should be: if the instance is in PRODUCTION and WEBSITE_LOCAL_CACHE_OPTION is set, return not ready until Local Cache is ready.

You enable the Application Initialization module by adding it to your web.config:

<system.webServer>
    <applicationInitialization>
        <add initializationPage="/health" />
    </applicationInitialization>
</system.webServer>

The initialization page is the URL that Azure requests; Azure waits for it to return 200 before sending traffic to the instance.

Comments from Microsoft:

There are a few scenarios to consider:

Process coming up on a new instance due to a scale-up operation (auto / manual)

Here, the controller makes a call to start the worker process and monitors if the app has been initialized. If Local Cache is also enabled, the controller will wait for this to be populated as well. Since this is a scale operation, it is safe to assume that there are additional instances that are already serving the application, and users’ requests will be responded to by those instances while this new instance is getting ready. Once Local Cache is ready and the app is initialized, this new instance will be added to rotation. As a result, users will hit an already warmed-up application with its Local Cache already set up.

Swap with Preview

When a Swap with Preview with AlwaysOn is initiated, the new process that comes up will run its AppInit module and initialize the application. Since you are performing Swap with Preview via PowerShell, you may skip using AlwaysOn and instead make a request via PowerShell, which will start the worker process and in turn kick off the AppInit module as well. You can then either wait for the period it takes for the site to warm up or keep making dummy requests to the site (i.e. poll the site) to ensure it is warmed up before you actually commit the swap. This way, external users will only hit the app once it has been completely initialized.

Process coming up for any other reason e.g. website restart / worker process crash etc.

AppInit does not initiate a start of the worker process; we have AlwaysOn to do this. In the absence of AlwaysOn, it will be an actual user’s request that triggers a process start, and that user will hence also see a delay. As soon as the worker process starts, AppInit kicks in and makes requests to the initialization URLs. AppInit will, however, not block any other incoming requests. If it did, the users would essentially see a delay, and this would defeat the purpose of AppInit. The controller also has no role to play here: there is no change in the instance itself, and the controller is not involved when a process crashes and restarts. Instead, for such a scenario, AppInit can be configured (via the remapManagedRequestsTo attribute) to redirect managed requests to a static HTML page, which can then show some message to the users. This way, users do not see a delay but get a static message indicating that the app is initializing, and once the app is initialized subsequent requests will hit the app.

The difference in behavior between Classic and ASE is that Classic has a much wider pool of VMs that it can use to provision new instances of the application before swapping it with the box to be patched.

For scenarios where a worker instance is removed due to OS patching OR new WebApps code/patch being deployed (i.e. planned infrastructure maintenance activity): In this case, for a multi-tenant scenario that is always overprovisioned, a call is made to check if there are any free instances that can take the place of the existing instance that is slated for removal. Once such an instance is found, your site is hosted on it and added to the pool. For a brief period, you actually run with a few more instances than what you have configured for. As soon as the worker instance is online responding to requests, the planned instance is taken out of rotation for maintenance.

In case of ASE, if you have standby instances, you will see the same behavior and will be unaffected by planned infrastructure maintenance activity. In the absence of such standby instances, the planned instance will be taken down for maintenance and the site will see an impact.

Similar behavior can be achieved on ASE if corresponding Worker pools have additional spare VMs provisioned that would have no apps running on them normally.

NOTES:

  • This URL is only used on initialization.
    • This is unlike load balancers, which continue to hit the health URL to know if the instance is healthy.
    • Azure will continue to send traffic to the instance even if the health URL is returning unhealthy!
  • This initialization is not always used; in particular, it is not used when doing deployments.
    • Initialization is only used when Azure is bringing in a new instance or doing maintenance.
  • The health path cannot require AD or return a redirect response, because Azure will then treat it as always unhealthy.

Other Things That Cause Restarts

Below is a collection of settings that help to avoid other restart scenarios.

Auto Heal

When using application initialization, you do not want Auto Heal on. Auto Heal does not cooperate with application initialization; it simply restarts the application. You can make sure that it is not enabled by setting WEBSITE_PROACTIVE_AUTOHEAL_ENABLED to false.

File Change Notifications

Blog post
Disable via the web.config setting.

Conclusion

These are all the things that the Frontend team at Jet.com has learned over the past two years. We have worked closely with Azure to fix issues around how these three availability modules work together. We will try to keep this post updated as situations change or more information becomes available.

Does This Sound Fun to Work on?

Consider working with us! We’re looking for Android Developers, iOS Engineers, and Front-End Engineers.
