First steps in the Azure configuration using terraform, defining the provider.
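A minimal sketch of what that provider block might look like, assuming the classic (Service Management) Azure provider of the time and a `.publishsettings` file downloaded from the portal (the file path is a placeholder):

```hcl
# Classic Azure provider, configured from a publish-settings file.
provider "azure" {
  publish_settings = "${file("credentials.publishsettings")}"
}
```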
First definition of the default network and the bastion subnet.
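A sketch of how this could look with the classic provider's `azure_virtual_network` resource; names and address ranges are placeholders:

```hcl
resource "azure_virtual_network" "default" {
  name          = "default-network"
  address_space = ["10.0.0.0/16"]
  location      = "West Europe"

  # Dedicated subnet for the bastion host.
  subnet {
    name           = "bastion"
    address_prefix = "10.0.1.0/24"
  }
}
```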
The VMs need a storage provider to set up the storage. We add one for all the VMs, [using Locally Redundant Storage (LRS)](http://blogs.msdn.com/b/windowsazurestorage/archive/2013/12/11/introducing-read-access-geo-replicated-storage-ra-grs-for-windows-azure-storage.aspx) for the time being. There is a restriction on the name, which must be lowercase alphanumeric only.
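A sketch, assuming the classic provider's `azure_storage_service` resource; the name below is a placeholder that respects the lowercase-alphanumeric restriction:

```hcl
resource "azure_storage_service" "vms" {
  name         = "platformvmstorage"  # lowercase alphanumeric only
  location     = "West Europe"
  account_type = "Standard_LRS"       # Locally Redundant Storage
}
```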
We add the first configuration of a host in azure. To add a host there are two requirements:
* A hosted service, which we decided to declare explicitly per host in terraform.
* An SSH fingerprint. The SSH certificate is associated with the hosted service created above, and needs to be uploaded to the console. So far we have not found any way to set it up in terraform. This is an issue, as currently you need to run terraform once (which will create the service and fail due to the missing SSH key), upload the certificate manually, and run terraform again. We will investigate alternative approaches.
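A sketch of the pair of resources involved, assuming the classic provider's `azure_hosted_service` and `azure_instance`; names, image and size are placeholders:

```hcl
# One hosted service declared explicitly per host.
resource "azure_hosted_service" "bastion" {
  name               = "bastion-service"
  location           = "West Europe"
  ephemeral_contents = false
}

resource "azure_instance" "bastion" {
  name                = "bastion"
  hosted_service_name = "${azure_hosted_service.bastion.name}"
  image               = "Ubuntu Server 14.04 LTS"
  size                = "Basic_A1"
  location            = "West Europe"
  username            = "admin"

  # Thumbprint of the SSH certificate uploaded (manually, for now)
  # to the hosted service above.
  ssh_key_thumbprint = "${var.ssh_thumbprint}"
}
```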
Add the Network Security Group (NSG) and associated rules to only allow access from the office IPs. As we can only specify one CIDR per NSG rule, we "programmatically" split the list of CIDRs on the `,` character, counting the elements for the `count` attribute. Note: there is a bug in terraform which prevents the changes from succeeding. Changes to the rules of the same NSG must be done one by one, but terraform tries to execute them in parallel, causing an error due to locked resources: `Code: ConflictError, Message: Another operation that requires exclusive access to Network Security Group xxxx is ongoing. Please try again later.`
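The split/count pattern could look like the sketch below. The variable and resource names are assumptions, and the `priority` arithmetic needs a newer terraform than the 0.6.x used here:

```hcl
variable "office_cidrs" {
  # Comma-separated list, as defined in globals.tf (example values).
  default = "203.0.113.0/24,198.51.100.0/24"
}

resource "azure_security_group_rule" "office_ssh" {
  # One rule per CIDR in the comma-separated list.
  count = "${length(split(",", var.office_cidrs))}"

  name                       = "office-ssh-${count.index}"
  security_group_names       = ["${azure_security_group.bastion.name}"]
  type                       = "Inbound"
  action                     = "Allow"
  priority                   = "${100 + count.index}"
  source_address_prefix      = "${element(split(",", var.office_cidrs), count.index)}"
  source_port_range          = "*"
  destination_address_prefix = "*"
  destination_port_range     = "22"
  protocol                   = "TCP"
}
```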
In order to [automate the process to generate and upload the SSH certificate](https://media-glass.es/2015/07/22/adding-a-ssh-key-to-azure/) to azure, we add a `local-exec` command with all the required steps, which calls the azure client to perform the upload. This adds an explicit dependency on the azure client. The uploaded SSH certificate thumbprint is saved in a temporary file called `ssh_thumbprint`, which terraform reads (the file is committed empty and ignored in the repo).
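A sketch of the wiring; the script arguments are assumptions:

```hcl
resource "azure_hosted_service" "bastion" {
  name               = "bastion-service"
  location           = "West Europe"
  ephemeral_contents = false

  # Generate and upload the SSH certificate via the azure client;
  # the script writes the thumbprint to the ssh_thumbprint file.
  provisioner "local-exec" {
    command = "./azure-upload-certificate.sh ${self.name} ssh_thumbprint"
  }
}

# Elsewhere, the instance reads the thumbprint back. The file must
# exist (even empty) when terraform evaluates file(), hence the
# empty committed placeholder:
# ssh_key_thumbprint = "${file("ssh_thumbprint")}"
```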
To keep consistency with other new scripts, we prefix them with "azure"
We had multiple issues trying to set up NSG rules to restrict access to the SSH endpoint: failures applying multiple rules in parallel, rules not working as expected, etc. We will remove this implementation so we can try [Endpoint ACLs instead](https://azure.microsoft.com/en-gb/documentation/articles/virtual-networks-acl/).
We implement [Azure's Endpoint ACLs](https://azure.microsoft.com/en-gb/documentation/articles/virtual-networks-acl/) to restrict access to the bastion SSH port, allowing only GDS office IPs. From the documentation, the ACL rules will block everything but the traffic that is explicitly allowed:

> **Permit** - When you add one or more "permit" ranges, you are denying all other ranges by default. Only packets from the permitted IP range will be able to communicate with the virtual machine endpoint.

Endpoint ACLs are not currently supported by terraform (<= 0.6.3), so we use a `local-exec` provisioner in the `azure-instance` resource which calls the azure-cli command `azure vm endpoint acl-rule create ...`. In order to use the comma-separated list of office IPs defined in `globals.tf`, we created a wrapper script which splits this list and adds a sequence of rules, increasing their priority by one each time. This way it is easier to modify the list of allowed CIDRs.
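A sketch of the hook; the wrapper script name and its arguments are hypothetical:

```hcl
resource "azure_instance" "bastion" {
  # ... instance arguments as before ...

  # Terraform (<= 0.6.3) cannot manage endpoint ACLs, so call the
  # azure client through a wrapper that splits the CIDR list and
  # runs `azure vm endpoint acl-rule create` once per entry with an
  # increasing priority.
  provisioner "local-exec" {
    command = "./azure-add-endpoint-acl-rules.sh ${self.name} ssh ${var.office_cidrs}"
  }
}
```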
We need more resources to be able to run bosh-init in a timely manner:
* More CPU and memory to start and build the bosh release
* More bandwidth (restricted by instance type) to upload the images

We will revert this change in the future.
We will generate several temporary files to prepare the azure environment, so we will prefix them with `generated`
Terraform does not support creating azure storage accounts, so we need to use the `azure-cli` for the time being. We provide an external script, `azure/azure-create-storage-service.sh`, which can be called by a `local-exec` provisioner. This script creates the storage account associated with a given service name (terraform can create those), then retrieves the account key and stores it in a file which we can read from terraform later. If the account already exists, it just downloads the key.
The storage accounts, which the BOSH CPI needs in order to create storage resources in azure, are associated with an `azure_hosted_service`. The `azure_hosted_service` is supported by terraform, but the storage account is not, which is why we have the script `azure/azure-create-storage-service.sh`. We add a dedicated service for storage with a `local-exec` provisioner to create the storage account, as sketched below. Eventually, if storage accounts are implemented in terraform, we can move the logic to terraform itself.
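A sketch of that dedicated service; the service name is a placeholder:

```hcl
resource "azure_hosted_service" "storage" {
  name               = "platform-storage"
  location           = "West Europe"
  ephemeral_contents = false

  # Create (or, if it already exists, fetch the key of) the storage
  # account associated with this service, leaving the account key in
  # a file that terraform can read later.
  provisioner "local-exec" {
    command = "./azure/azure-create-storage-service.sh ${self.name}"
  }
}
```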
The Azure command line client is required, and it has some special requirements when logging in.
The BOSH Azure CPI requires a new client ID with a password, created in the `active directory` of the Azure account with a Contributor role, to log in to azure and create the different objects. In order to create these credentials, we must:
1. create an application with a URL and a given password
2. create a service principal associated with that application
3. add it to the `Contributor` role

Terraform does not support this, so we created a script implementing this logic, which we can call manually or from terraform, as sketched below. This script creates the client ID used in the BOSH manifest. Note: the created user cannot be deleted from the command line (at least I did not figure out how). To delete it: Azure console > Active Directory > select the existing AD > Applications > Select "apps owned" > Click on the given app > delete (icon in bottom bar).
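A sketch of the steps such a script performs, wrapped here in a `local-exec` provisioner on a global hosted service. The application name, URLs and environment variables are placeholders, and the exact xplat-cli flags varied between releases, so treat the commands as assumptions:

```hcl
resource "azure_hosted_service" "global" {
  name               = "platform-global"
  location           = "West Europe"
  ephemeral_contents = false

  provisioner "local-exec" {
    command = <<EOF
azure config mode arm
# 1. Create an application with a URL and a password (placeholders).
azure ad app create --name "bosh-cpi" --home-page "http://bosh-cpi" \
  --identifier-uris "http://bosh-cpi" --password "$CLIENT_SECRET"
# 2. Create a service principal associated with that application.
azure ad sp create "$APP_ID"
# 3. Grant it the Contributor role.
azure role assignment create --spn "http://bosh-cpi" -o "Contributor"
EOF
  }
}
```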
The Azure client has [two modes of operation: asm and arm](https://github.com/Azure/azure-content/blob/master/articles/virtual-machines/xplat-cli-azure-manage-vm-asm-arm.md), with different commands and options. Switching between them is a stateful operation. We need to be in `asm` mode to be able to upload the SSH certificate, but other scripts might have changed the mode to arm.
The `azure storage account keys list` command seems to fail for a long time after account creation. Account creation is probably asynchronous and takes a really long time. This is an ugly hack.
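The hack might look something like this retry loop, shown wrapped in a `local-exec` heredoc to match the other sketches; the account variable and output file name are assumptions, and the real script would also need to parse the command output:

```hcl
resource "azure_hosted_service" "storage" {
  name               = "platform-storage"
  location           = "West Europe"
  ephemeral_contents = false

  # Retry until the asynchronously-created account finally returns
  # its keys, then leave the output in a file terraform can read.
  provisioner "local-exec" {
    command = <<EOF
until azure storage account keys list "$STORAGE_ACCOUNT" > generated.storage-account-key; do
  echo "storage account not ready yet, retrying..."
  sleep 30
done
EOF
  }
}
```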
We will use the generated temporary x509 certificates in other steps, for instance for the BOSH manifest when passing the `ssh_certificate` key. Because of that, we keep the temporary files as `generated.insecure-deployer.pem` and `generated.insecure-deployer.pfx`.
Hardcode the environment name and skip creation if the app already exists.
We need to get and create:
* account ID, subscription ID...
* the application and password for the BOSH service provider
* the permissions granted to the service provider

This script runs the first steps and creates a bash script with variables to be consumed by terraform.
We add a manifest file [based on the example from the azure_bosh_cpi_release](http://cloudfoundry.blob.core.windows.net/misc/bosh.yml). We will use several variables for this, which must be provided externally [as environment variables in terraform](https://www.terraform.io/docs/configuration/variables.html). Some variables are predefined by convention based on the names of other resources in terraform, like `azure_resource_group_name`, `azure_vnet_name`, `azure_subnet_name`, `azure_storage_account_name`. Other variables come from files generated by external scripts (we use these scripts to work around missing features in terraform), like `azure_storage_access_key` and `azure_ssh_certificate`. `azure_ssh_certificate` must be a single line joined by `\n`, and we do so programmatically in terraform with `join` and `split`. Note: this manifest is not 100% functional and still requires some tuning.
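The join/split trick could look like this sketch, using the old `template_file` resource; the template path is an assumption:

```hcl
resource "template_file" "bosh_manifest" {
  filename = "templates/bosh.yml"

  vars {
    # Collapse the multi-line PEM into a single line whose parts are
    # joined by a literal "\n", as the manifest expects.
    azure_ssh_certificate = "${join("\\n", split("\n", file("azure/generated.insecure-deployer.pem")))}"
  }
}
```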
In BOSH we need an Azure resource group to use for all the created assets (VMs, disks, etc). The terraform `azure_hosted_service` resource creates an Azure resource group. We create a global `azure_hosted_service` and use it to create all the objects. We change the storage account to be created within this new global group.
In order to keep consistency and make it more generic and clear.
The BOSH azure CPI requires the network it uses to be created within the resource group assigned to BOSH. But the [azure network terraform resource](https://terraform.io/docs/providers/azure/r/virtual_network.html) does not allow specifying the resource group. Because of that, we avoid using the terraform resource and call the azure command line directly to create the network and subnet in the right resource group for BOSH. We will need to work out how to connect these two networks.
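A sketch of that call; the vnet/subnet argument layout of the xplat-cli of the time, and the names used, are assumptions:

```hcl
resource "azure_hosted_service" "bosh" {
  name               = "bosh-service"
  location           = "West Europe"
  ephemeral_contents = false

  # terraform's virtual_network resource cannot target a resource
  # group, so create the BOSH network with the azure client instead.
  provisioner "local-exec" {
    command = <<EOF
azure config mode arm
azure network vnet create "$RESOURCE_GROUP" bosh-network --location "West Europe"
azure network vnet subnet create "$RESOURCE_GROUP" bosh-network bosh-subnet
EOF
  }
}
```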
`azure-upload-certificate.sh` was not properly handling the case where a certificate had already been uploaded.
Use terraform provisioners to upload the SSH keys, the manifest and a provision.sh script, as we do on other platforms.
Only install packages and download software if necessary.
So it does not point to the wrong network name and wrong SSH key.
Terraform does not support creating public IPs, so we need to use the azure client for the time being. We will need a public IP to be able to contact microbosh while we create it in a different network than the bastion host, [due to the limitations in terraform](hashicorp/terraform#3089).
Due to having the bastion and microbosh in different networks ([see reported bug in terraform](hashicorp/terraform#3089)), we need to be able to specify a public IP for microbosh. This commit expects the user to manually create the IP with `azure/azure-create-public-ip.sh` and pass it to terraform as an argument: `-var bosh_public_ip=65.52.132.211`
We must change the azure command line mode before running the required commands, as the mode may have been changed by other scripts.
Also sleep before querying the IP in case Azure did not finish creating it.
Add logic in terraform to create the public IP by calling the external command `azure-create-public-ip.sh`.
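A sketch of that logic; the script arguments follow the `generated` file convention above but are assumptions:

```hcl
resource "azure_hosted_service" "bosh" {
  name               = "bosh-service"
  location           = "West Europe"
  ephemeral_contents = false

  # Reserve the public IP with the azure client; the script writes
  # the allocated address to a generated file that other steps (and
  # the user) can read back.
  provisioner "local-exec" {
    command = "./azure/azure-create-public-ip.sh ${self.name} azure/generated.bosh-public-ip"
  }
}
```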
So we can easily troubleshoot and create new manifests.
In order to be able to connect to the microbosh services on Azure via the public IP, we need to define the required ports as endpoints. We start by opening only SSH; bosh-init will probably be able to provision using the `ssh_tunnel` defined later. NOTE: this opens the ports to the public; we need to fix it.
So we can optionally define a faster machine for testing, for instance: `-var bastion_instance_size=Standard_D3`
When reading the externally generated azure storage key, we must get rid of any newline character so the manifest is valid.
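In terraform this can be a one-liner with `replace` inside the template's `vars`; the generated file name is an assumption:

```hcl
vars {
  # Strip newline characters from the generated key so the YAML
  # manifest stays valid.
  azure_storage_access_key = "${replace(file("azure/generated.storage-account-key"), "\n", "")}"
}
```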
We need to explicitly put the public network in the job definition if we want the bosh machine to listen on the public interface.
As the bastion host where we run bosh-init is currently running in a different network than the bosh machine, we need to map port 6868, used by the bosh agent in the stemcell, so bosh-init can connect to it and set up the microbosh machine. The `cloud_provider.ssh_tunnel` does not work for this case :( An alternative solution could be to point the variable `cloud_provider.mbus` to localhost:

```
cloud_provider:
  mbus: https://mbus-user:mbus-password@127.0.0.1:6868
```

and then manually create a tunnel with SSH: `ssh vcap@23.97.216.207 -i .ssh/id_rsa -L 6868:localhost:6868`. But given that the current situation (bastion and bosh in different networks) is in general temporary and suboptimal, we will just configure the port on the public IP for the time being.
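Mapping the ports as endpoints on the instance could look like this sketch, assuming the classic provider's `endpoint` block:

```hcl
resource "azure_instance" "bosh" {
  # ... instance arguments ...

  # SSH access to the machine.
  endpoint {
    name         = "ssh"
    protocol     = "tcp"
    public_port  = 22
    private_port = 22
  }

  # The bosh agent's mbus port, so bosh-init on the bastion can
  # reach the stemcell's agent through the public IP.
  endpoint {
    name         = "bosh-agent"
    protocol     = "tcp"
    public_port  = 6868
    private_port = 6868
  }
}
```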
We create several resources outside terraform with several scripts. This script allows us to delete the objects created for a given environment: VMs, networks, IPs, storage, etc...
Also delete all the storage containers on azure when running `azure-delete-environment.sh`
Parameter checks, sensible variable names, etc.
It is required for the Makefile tasks to provision
Add all the required steps to build the azure environment:
* set up the bosh credentials
* create temporary files
* delete azure objects when destroying
Sometimes `azure service cert list` does not return the key thumbprint because the certificate has not been created in time.
keymon changed the title from Feature/101030380 bosh on azure to [#101030380] MicroBosh on Azure on Aug 28, 2015
#101030380 MicroBosh on Azure
What
We currently have MicroBOSH running in AWS and GCE - this PR aims to do the same in Azure, so that we are able to deploy the CF platform to any of the three platforms.
Existing tooling
We will write this feature based on:
Scope of the story:
This story must be reviewed in this scope:
Restrictions of this implementation
The tools we used, and azure itself, are quite new, and we lack several features, especially in the case of terraform, where we had to implement several parts using shell scripting. We list the missing features and bugs below.
How to review this story
In `Makefile` there are several tasks which will set up the environment:
* `prepare-azure` will create the required credentials for microbosh.
* `apply-azure` will call terraform, which will: …
* `provision-azure` will use `bosh-init` to bootstrap a new node.

Currently we set up microbosh with a dedicated public IP, with SSH exposed publicly. This is due to a limitation in terraform that prevents us from reusing the networks created by terraform in the bosh CPI, so we cannot create the bastion host in the same network as microbosh.
How to login to the provisioned microbosh using the public IP
To test the newly provisioned microbosh:

1. Get the public IP (stored in `azure/generated.bosh-public-ip`)
2. Open a SSH tunnel forwarding the BOSH ports:

```
ssh 23.97.216.207 -l vcap -i ssh/insecure-deployer -fN -L 4222:localhost:4222 -L 25250:localhost:25250 -L 25555:localhost:25555 -L 6868:localhost:6868 -L 25255:localhost:25255 -L 25777:localhost:25777
```

3. Install the BOSH CLI and target the director through the tunnel:

```
rvm use 2.2.2@bosh --create
gem install bosh_cli
bosh target localhost # admin:admin
```
Limitations and workarounds implemented
Limitations on terraform
It does not allow defining endpoint ACLs

Reported here: hashicorp/terraform#3187

We need instance endpoint ACLs to be able to restrict access to the bastion. Terraform does not support that; instead we had to add a script to create the rules with a `local-exec` provisioner, as in this commit.

There are no resources to create application credentials and service principals
hashicorp/terraform#3096

The Azure BOSH CPI requires application credentials and service principals to add resources to azure, but Terraform does not provide a way to create them.

We work around the issue with an external script, `azure-create-service-principal.sh`, which we call before running terraform. The script `azure-generate-account-settings.sh` creates a set of environment variables which can be consumed later by terraform.

There is no explicit way to create a resource group
hashicorp/terraform#3097

There is no explicit way to create a resource group in terraform (on the command line: `azure config mode arm && azure group create ...`). But one can create an `azure_hosted_service`, which creates the group as a side effect. I did so in `hosted_service.tf`, which we refer to later in different places.

There is no way to create a storage account
hashicorp/terraform#3098

The Azure BOSH CPI requires an azure storage account and credentials, but there is no way to create this resource in terraform.

To work around it we created the script `azure-create-storage-service.sh`, which will:
* create the account with `azure resource create ...`
* query the generated keys with `azure storage account keys`

A resource in terraform should be able to create this account and provide a way to query the created keys.
Terraform does not allow specifying the resource group for virtual networks

Reported in hashicorp/terraform#3089

It is not possible to specify the resource group of a network created in terraform: it gets created in a group `Default_Networking`. But the Azure BOSH CPI expects a dedicated group for all the elements, including the network, so the BOSH CPI cannot see the networks created by terraform.

If we create the network externally with the azure cli, but try to associate it with a host created in terraform, then it is terraform that does not see the network, as it expects it to be in `Default Networking`.

Because of that, we cannot create a bastion host using terraform to bootstrap and secure a microbosh host in the same network, and they can only communicate if we use a public IP for our BOSH instance, which is suboptimal.
Terraform does not provide a way to create/upload SSH keys

hashicorp/terraform#3099

When creating an azure instance in terraform we can pass either a password or an SSH thumbprint. This SSH thumbprint comes from an SSH key uploaded to the same `azure_hosted_service` defined in the instance, and uploaded using a method like this. But terraform does not provide any way to upload the SSH key, so we are forced to upload it manually using a script, as in this commit.

It would be nice if terraform provided a resource to generate and upload the certificate and query the SSH thumbprint.
Terraform does not provide a way to reserve public IPs on Azure

hashicorp/terraform#3101

We need to allocate public IPs in Azure to be able to expose our microBosh service, but terraform does not provide an explicit way to allocate IPs. We work around it with the script `azure-create-public-ip.sh`, which we call from `network.tf`.
Bugs hit/found
While developing this PR we hit several bugs, which might affect the review:
Unable to delete the environment in the first run. Reported in terraform #2416
Azure hangs starting VM
Reprovision (Destroy+Create) in terraform fails. Reported in terraform #3109
Update multiple NSG rules in parallel fails. Terraform issue #3111
Azure client does not support valid CIDR like x.x.x.x/32
Azure client might not be initialised properly. Use `azure account clear && azure login`.