import useBaseUrl from '@docusaurus/useBaseUrl';

## Nodes

### System Pool

AKS requires the configuration of a system node pool when creating a cluster. This system node pool is not like the other additional node pools: it is tightly coupled to the AKS cluster, and it is not possible to change the instance type or taints on this node pool without recreating the cluster, short of manual intervention. Additionally, the system node pool cannot scale down to zero. For AKS to work there has to be at least one instance present, because critical system pods like Tunnelfront or Konnectivity and CoreDNS will by default run on the system node pool. For more information about the AKS system node pool refer to the [official documentation](https://docs.microsoft.com/en-us/azure/aks/use-system-pools#system-and-user-node-pools).
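
Which pools are system pools can be checked directly with the Azure CLI. A minimal sketch, reusing the example cluster and resource group names from the commands further down:

```shell
# List the node pools in the cluster and show which of them are system pools
az aks nodepool list --cluster-name aks-dev-we-aks1 --resource-group rg-dev-we-aks --query "[].{name:name, mode:mode}" -o table
```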

XKS follows the Azure recommendation and runs only system-critical applications on the system node pool. Doing this protects services like CoreDNS from starvation or memory issues caused by user applications running on the same nodes. This is achieved by adding the taint `CriticalAddonsOnly` to all of the system nodes.
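
To verify that the taint is in place on the system nodes, something like the following can be used. This is a sketch that assumes the `kubernetes.azure.com/mode=system` label that AKS applies to system pool nodes:

```shell
# Print each system node together with the taint keys applied to it
kubectl get nodes -l kubernetes.azure.com/mode=system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```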

#### Sizing

Smaller AKS clusters can survive with a single node as the load on the system applications will be moderately low. In larger clusters and production clusters it is recommended to run at least three system nodes that may be larger in size. This section aims to describe how to properly size the system nodes.
More work has to be done in this area regarding sizing and scaling of the system node pools to achieve a standardized solution.

#### Modifying

There may come times when Terraform wants to recreate the AKS cluster after the system node pool has been updated. This happens when updating certain properties of the system node pool. It is still possible to do these updates without recreating the cluster, but it requires some manual intervention. AKS requires at least one system node pool but does not have an upper limit, which makes it possible to run a second, temporary system node pool while the original one is replaced.

> It may not be possible to create a new node pool with the current Kubernetes version if the cluster has not been updated in a while. Azure will remove minor versions as new versions are released. In that case you will need to upgrade the cluster to the latest minor version before making changes to the system pool, as AKS will not allow a node with a newer version than the control plane.

Delete the system node pool created by Terraform.

```shell
az aks nodepool delete --cluster-name aks-dev-we-aks1 --resource-group rg-dev-we-aks --name default
```

Create a new node pool with the new configuration. In this case it sets a new instance type and adds a taint.

```shell
az aks nodepool add --cluster-name aks-dev-we-aks1 --resource-group rg-dev-we-aks --name default --mode "System" --zones 1 2 3 --node-vm-size "Standard_D2as_v4" --node-taints "CriticalAddonsOnly=true:NoSchedule" --node-count 1
```

Delete the temporary pool.

```shell
az aks nodepool delete --cluster-name aks-dev-we-aks1 --resource-group rg-dev-we-aks --name temp
```

For additional information about updating the system nodes refer to [this blog post](https://pumpingco.de/blog/modify-aks-default-node-pool-in-terraform-without-redeploying-the-cluster/).

### Worker Pool

Worker node pools are all the other node pools in the cluster. The main purpose of the worker node pools is to run application workloads; they do not run any system-critical Pods. They will however run system Pods that are deployed from a DaemonSet, which includes applications like kube-proxy and CSI drivers.

All node pools created within XKF will have autoscaling enabled and be set to scale across all availability zones in the region. These settings cannot be changed; it is however possible to set a static number of instances by specifying the same value for the min and max count. XKF exposes few settings to configure the node instances, the main ones being the instance type, the min and max count, and the Kubernetes version. Other non-default node pool settings will not be exposed, as XKF is an opinionated solution. This also means that default settings may change in the future.

## Disk Type

XKF makes an opinionated choice with regards to the disk type. AKS has the option of using either managed disks or ephemeral storage. Managed disks offer the simplest solution: they can be sized according to requirements and are persisted across the node's whole life cycle. The downside of managed disks is that performance is limited, as the disks are not local to the VM; disk performance is instead based on the size of the disk. The standard size used by AKS for the managed OS disk is 128 GB, which makes it a [P10](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) disk that will max out at 500 IOPS. It is important to remember that the OS disk is used by all processes: pulled OCI images, container logs, and ephemeral Kubernetes volumes all share the same disk performance. An application that, for example, serves a large number of requests and logs every HTTP request can consume large amounts of IOPS, as logs written to STDOUT will be written to disk. Another smaller downside with managed disks is that the disks are billed per GB on top of the VM cost, although this represents a very small percentage of the total AKS cost.

Ephemeral storage on the other hand offers higher IOPS out of the box, at the cost of not persisting data and an increased dependency on the VM type. This storage type uses the cache disk on the VM as storage for the OS and other kubelet-related resources. The size of the cache will vary based on the VM type and size, meaning that different node pools may have different amounts of storage available for, for example, ephemeral volumes. A general rule is however that the [cache disk has to be at least 30GB](https://docs.microsoft.com/en-us/azure/aks/cluster-configuration#use-ephemeral-os-on-existing-clusters), which removes some of the smallest VM sizes from the pool of possibilities. Remember that a cache disk of 30GB does not mean 30GB of free space, as the OS will consume some of that space. It may be wise to lean towards fewer, larger VMs instead of more, smaller VMs to increase the amount of disk available.

> On top of the cache disk, VMs also come with a temporary disk. This is an additional disk that is also local to the VM and shares IOPS with the cache. A preview feature in AKS is to use the temporary disk as the storage volume for the kubelet. This feature can be enabled with [kubelet_disk_type](http://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster_node_pool#kubelet_disk_type) and will most likely be used as soon as it is out of preview in AKS.

Instance type availability is currently not properly documented, partly because the feature is relatively new. Regional differences have been observed where ephemeral VMs may be available in one region but not in another for the same VM type and size. There is currently no proper way to determine which regions are available; instead this has to be done through trial and error. The same can be said about the cache disk size: some instance types have the cache size documented, others do not but will still work. Check the [VM sizes](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes) documentation for availability information first. The cache size is given as the value in the parentheses in the "Max cached and temp storage throughput" column.

The following VM sizes have been verified to work with ephemeral disks in the West Europe region. Observe that this may not be true in other regions.

| VM | Cache Size |
| --- | --- |
| Standard_D4ds_v4 | 100GB |
| Standard_E2ds_v4 | 50GB |
| Standard_E4ds_v4 | 100GB |
| Standard_F8s_v2 | 128GB |

Being aware of the cache size is important because the OS disk size has to be specified for each node pool. The default value of 128 GB may be larger than the available cache, in which case the VM creation will fail. The OS disk size should be set to the same value as the cache size, as there is no other use for the cache disk than the OS. An alternative method of figuring out the max cache size is to use [this solution](https://www.danielstechblog.io/identify-the-max-capacity-of-ephemeral-os-disks-for-azure-vm-sizes/), which adds an API to query. Some testing of this API has however shown that the data is not valid for all VM types, and some VM types that do support ephemeral disks do not show up.
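
Another option is to query the resource SKUs directly with the Azure CLI. This is a sketch that assumes the `CachedDiskBytes` and `EphemeralOSDiskSupported` capabilities are reported for the VM size in question:

```shell
# Show the cache size in bytes and whether ephemeral OS disks are supported for a given VM size
az vm list-skus --location westeurope --size Standard_D4ds_v4 --resource-type virtualMachines \
  --query "[].{name:name, capabilities:capabilities[?name=='CachedDiskBytes' || name=='EphemeralOSDiskSupported']}" -o json
```
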
### Sizing

Choosing a starting point for the worker node pool can be difficult. There are a lot of factors that affect the choice of instance type, and they are not limited to memory or CPU consumption. An optimal setup may even include multiple node pools of different types to serve all needs. Unless there is prior analysis, the best starting point is a single node pool with a general-purpose instance type.

```hcl
additional_node_pools = [
  {
    name                 = "standard1"
    orchestrator_version = "<kubernetes-version>"
    vm_size              = "Standard_D2ds_v4"
    min_count            = 1
    max_count            = 3
    node_labels          = {}
    node_taints          = []
    os_disk_type         = "Ephemeral"
    os_disk_size_gb      = 50
    spot_enabled         = false
    spot_max_price       = null
  },
]
```

### Modifying

### Spot Instances
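
Spot node pools are configured through the same `additional_node_pools` schema shown under Sizing. The following is a rough sketch; verify the exact behaviour against the XKF module documentation before relying on it:

```hcl
additional_node_pools = [
  {
    name                 = "spot1"
    orchestrator_version = "<kubernetes-version>"
    vm_size              = "Standard_D2ds_v4"
    min_count            = 1
    max_count            = 3
    node_labels          = {}
    # AKS taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule,
    # so only workloads with a matching toleration will be scheduled on this pool.
    node_taints          = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
    os_disk_type         = "Ephemeral"
    os_disk_size_gb      = 50
    spot_enabled         = true
    # -1 caps the price at the current on-demand price
    spot_max_price       = -1
  },
]
```
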
## FAQ

### How can I watch my resources when patching an AKS cluster or upgrading nodes?

When patching an AKS cluster or just upgrading nodes it can be useful to watch your resources in Kubernetes.

```shell
# Show the node versions and watch the nodes during the upgrade
watch kubectl get nodes
# Check the status of all pods in the cluster
kubectl get pods -A
```

### What AKS versions can I pick in this Azure location?

```shell
az aks get-versions --location $AZURE_LOCATION -o table
```