This repository has been archived by the owner on Apr 4, 2018. It is now read-only.

[#101030380] MicroBosh on Azure #4

Closed
wants to merge 60 commits into from

Conversation


@keymon keymon commented Aug 28, 2015

#101030380 MicroBosh on Azure

What

We currently have MicroBOSH running on AWS and GCE. This PR adds Azure support, so that we are able to deploy the CF platform to any of the three platforms.

Existing tooling

We will write this feature based on:

Scope of the story:

This story should be reviewed within the following scope:

  • I want to be able to start a bastion machine on Azure listening on port 22 (SSH) and restricted to the office IPs, allowing login using the insecure key.
  • I want to be able to provision a microbosh instance using bosh-init on azure.
  • I want to be able to login to the new microbosh instance.

Restrictions of this implementation

The tools we use and Azure itself are quite new, and several features are missing. This is especially true of terraform, where we had to implement several parts with shell scripting. The missing features and bugs are listed below.

How to review this story

The Makefile contains several tasks which will set up the environment:

  • prepare-azure will create the required credentials for microbosh.
  • apply-azure will call terraform, which will:
    1. Set up several additional resources: storage accounts, networks, etc.
    2. Create the bastion host
  • provision-azure will use bosh-init to bootstrap a new node.

Currently we set up MicroBOSH on a dedicated public IP, with SSH exposed publicly. This is due to a limitation in terraform that prevents the BOSH CPI from reusing the networks created by terraform, so we cannot create the bastion host in the same network as MicroBOSH.

How to log in to the provisioned MicroBOSH using the public IP

To test the newly provisioned MicroBOSH:

  1. Learn the IP (see azure/generated.bosh-public-ip)
  2. Create a tunnel via SSH:
    ssh 23.97.216.207 -l vcap -i ssh/insecure-deployer -fN -L 4222:localhost:4222 -L 25250:localhost:25250 -L 25555:localhost:25555 -L 6868:localhost:6868 -L 25255:localhost:25255 -L 25777:localhost:25777
  3. Connect using this tunnel:
    rvm use 2.2.2@bosh --create
    gem install bosh_cli
    bosh target localhost # admin:admin

Limitations and workarounds implemented

Limitations on terraform

It does not allow defining endpoint ACLs

Reported here hashicorp/terraform#3187

We need instance endpoint ACLs to be able to restrict access to the bastion. Terraform does not support them; instead we had to add a script that creates the rules with a local-exec provisioner, as in this commit
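The workaround can be sketched in shell. This is not the exact script from the PR: the VM and endpoint names are illustrative, the argument order of the azure command is assumed, and a `DRY_RUN` switch (on by default) is added here so the sketch runs without the azure CLI:

```shell
# Sketch of the local-exec workaround: split a comma-separated CIDR list
# and add one "permit" ACL rule per range, incrementing the priority.
# VM/endpoint names and the azure argument order are illustrative.
OFFICE_CIDRS="10.0.0.0/24,192.168.1.0/24"  # normally taken from globals.tf
VM_NAME="bastion"                          # illustrative
ENDPOINT="ssh"                             # illustrative
DRY_RUN=${DRY_RUN:-1}                      # default: only print the commands
run() { [ "$DRY_RUN" = 1 ] && echo "$@" || "$@"; }
priority=100
IFS=','
for cidr in $OFFICE_CIDRS; do
  run azure vm endpoint acl-rule create \
      "$VM_NAME" "$ENDPOINT" "$priority" permit "$cidr"
  priority=$((priority + 1))
done
unset IFS
```

Incrementing the priority per rule keeps the rules non-conflicting, and editing `OFFICE_CIDRS` is all that is needed to change the allowed ranges.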

There are no resources to create application credentials and service principals

hashicorp/terraform#3096

The Azure BOSH CPI requires application credentials and service principals to add resources to Azure.

But Terraform does not provide a way to:

  • Create Azure Applications
  • Create service principals
  • Assign roles to applications

We work around the issue with an external script, azure-create-service-principal.sh, which we call before running terraform.

The script azure-generate-account-settings.sh creates a set of environment variables which can be consumed later by terraform.
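The mechanism can be sketched as follows: terraform picks up any environment variable named `TF_VAR_<name>` as the variable `<name>`, so the script only has to emit a file of exports. The variable names and placeholder values below are illustrative, not the script's real output:

```shell
# Sketch of azure-generate-account-settings.sh: write a file of exports
# that terraform consumes via its TF_VAR_<name> convention. Variable
# names and values here are placeholders, not the script's real output.
cat > generated.account-settings.sh <<'EOF'
export TF_VAR_azure_subscription_id="00000000-0000-0000-0000-000000000000"
export TF_VAR_azure_client_id="11111111-1111-1111-1111-111111111111"
export TF_VAR_azure_client_secret="example-password"
export TF_VAR_azure_tenant_id="22222222-2222-2222-2222-222222222222"
EOF

# Source the file before running terraform so the variables are in scope:
. ./generated.account-settings.sh
```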

There is no explicit way to create a resource group.

hashicorp/terraform#3097

There is no explicit way to create a resource group in terraform (on the command line: `azure config mode arm && azure group create ...`).

But one can create an azure_hosted_service, which creates the group as a side effect.

I did so in hosted_service.tf, which we refer to later in several places.

There is no way to create a storage account

hashicorp/terraform#3098

The Azure BOSH CPI requires an Azure storage account and credentials, but there is no way to create this resource in terraform.

To work around it we created the script azure-create-storage-service.sh, which will:

  • create the account: azure resource create ...
  • query the keys: azure storage account keys

Ideally, a terraform resource would create this account and provide a way to query the created keys.

Terraform does not allow specifying the resource group for virtual networks

Reported in hashicorp/terraform#3089

It is not possible to specify the resource group of a network created in terraform: it gets created in a group called Default_Networking. But the Azure BOSH CPI expects a dedicated group for all the elements, including the network, so the BOSH CPI cannot see the networks created by terraform.

If we create the network externally with the azure CLI but try to associate it with a host created in terraform, then it is terraform that does not see the network, as it expects it to be in Default_Networking.

Because of that, we cannot use terraform to create a bastion host that bootstraps and secures a MicroBOSH host in the same network; the two can only communicate if we use a public IP for our BOSH instance, which is suboptimal.

Terraform does not provide a way to create/upload ssh keys

hashicorp/terraform#3099

When creating an azure instance in terraform we can pass either a password or an SSH thumbprint.

This SSH thumbprint comes from an SSH key uploaded to the same azure_hosted_service defined for the instance, uploaded using a method like this.

But terraform does not provide any way to upload the SSH key, so we are forced to upload it manually using a script, as in this commit

It would be nice if terraform provided a resource to generate and upload the certificate and to query the SSH thumbprint.
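For illustration, the thumbprint can be derived locally from the certificate itself. This sketch assumes the thumbprint Azure expects is the certificate's SHA-1 fingerprint in upper-case hex with the colons removed; the file names are illustrative:

```shell
# Sketch: generate a self-signed certificate and derive its thumbprint.
# Assumption: the thumbprint Azure expects is the certificate's SHA-1
# fingerprint, upper-case hex, colons stripped. File names illustrative.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj '/CN=insecure-deployer' \
  -keyout insecure-deployer.key -out insecure-deployer.pem 2>/dev/null
thumbprint=$(openssl x509 -in insecure-deployer.pem -sha1 -fingerprint -noout \
  | cut -d= -f2 | tr -d ':')
echo "$thumbprint" > ssh_thumbprint   # the file terraform reads back
```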

Terraform does not provide a way to reserve public IPs on Azure

hashicorp/terraform#3101

We need to allocate public IPs in Azure to be able to expose our MicroBOSH service. But terraform does not provide an explicit way to allocate IPs.

We work around it with the script azure-create-public-ip.sh, which we call from network.tf

Bugs hit/found

While developing this PR we hit several bugs, which might affect the review:

keymon and others added 30 commits August 24, 2015 10:18
First steps in the azure configuration using terraform, defining the provider.
First definition of the default network and the bastion subnet.
The VMs need a storage provider to set up the storage.
We add one for all the VMs, [using Locally Redundant Storage (LRS)](http://blogs.msdn.com/b/windowsazurestorage/archive/2013/12/11/introducing-read-access-geo-replicated-storage-ra-grs-for-windows-azure-storage.aspx) for the time being.

There is a restriction in the name, which must be lowercase alphanumeric only.
We add the first configuration of a host in azure. To add
a host there are two requirements: 
 * A hosted service, which we decided to declare explicitly per host in terraform.
 * A SSH fingerprint.

The SSH certificate is associated with the hosted service created above, and needs to be uploaded to the console. So far we did not find any way to set it up in terraform. This is an issue, as currently you need to run terraform once (which will create the service and fail due to the missing SSH key), upload the certificate manually, and run terraform again.

We will investigate alternative approaches.
Add the Network Security Group (NSG) and associated rules to only allow
access from the office IPs. 

As we can only specify one CIDR per NSG rule, we "programmatically" split the list of CIDRs
by the `,` char, counting the elements for the `count` attribute.

Note: There is a bug in terraform which prevents the changes from succeeding. Changes to the rules of the
same NSG must be done one by one, but terraform tries to execute them in parallel, causing an error due to
locked resources: `Code: ConflictError, Message: Another operation that requires exclusive access to Network Security Group xxxx is ongoing. Please try again later.`
In order to [automate the process to generate and upload the 
SSH certificate](https://media-glass.es/2015/07/22/adding-a-ssh-key-to-azure/) 
to azure, we add a `local-exec` command with all
the required steps and that calls the azure client to perform the 
upload.

This adds an explicit dependency on the azure client.

The uploaded ssh certificate thumbprint is saved in a temporary file
called `ssh_thumbprint`, which terraform reads 
(added empty and ignored in the repo)
To keep consistency with other new scripts, we prefix them with "azure"
We had multiple issues trying to setup NSG rules to restrict access to
the SSH endpoint: failures applying multiple rules in parallel, not working
as expected, etc.

We will remove this implementation so we can try [Endpoint ACLs instead](https://azure.microsoft.com/en-gb/documentation/articles/virtual-networks-acl/)
We implement [Azure's Endpoint ACLs](https://azure.microsoft.com/en-gb/documentation/articles/virtual-networks-acl/) to
restrict the access to the bastion SSH port to allow only GDS office IPs.

From the documentation, the ACL rules will block everything but the traffic that is explicitly allowed:

> **Permit** - When you add one or more "permit" ranges, you are denying all other ranges by default. Only packets from the permitted IP range will be able to communicate with the virtual machine endpoint.

Endpoint ACLs are not currently supported by terraform (<= 0.6.3), so we will use a `local-exec` provisioner in the `azure-instance` resource which will call the azure-cli command `azure vm endpoint acl-rule create ...`.

In order to be able to use the comma separated list of office IPs defined in `globals.tf`, we created a wrapper script which will split this list and add a sequence of rules, increasing their priority by one. This way it is easier to modify the list of allowed CIDR.
We need more resources to be able to run bosh-init in a timely fashion:
 * More CPU and memory to start and build the bosh-release
 * More bandwidth (restricted by instance type) to upload the images

We will revert this change in the future.
We will generate several temporary files to prepare the azure
environment, so we will prefix them with `generated`
Terraform does not support creating azure storage accounts, so
we need to use the `azure-cli` for the time being.

We provide a external script `azure/azure-create-storage-service.sh`
which can be called by `local-exec` provider.

This script creates the storage account associated to a given
service name (terraform can create those) and retrieves and
stores the account key in a file which we can read in terraform later.

If the account already exists, it just downloads the key.
The storage accounts, needed to be able to create storage resources in
azure by the BOSH cpi, are associated to a `azure_hosted_service`.

The `azure_hosted_service` is supported by terraform, but the account
not, for which we have the script `azure/azure-create-storage-service.sh`.

We add a dedicated service for storage with a `local_exec` provisioner
to create the storage account.

Eventually, if storage accounts are implemented in terraform, we can
move the logic to terraform itself.
The Azure command line is required and has some special requirements when
logging in.
The BOSH Azure CPI requires a new client ID with a password, created in the
`active directory` of the Azure account with a Contributor role,
to log in to Azure and create the different objects.

In order to create these credentials, we must:
 1. create an application with a url and a given password
 2. create a service principal associated to that application
 3. add it to the `Contributor` role

Terraform does not support this, so we created a script to implement this
logic which we will be able to call manually or from terraform.

This script creates the client id used in the Bosh manifest.

Note: The created user cannot be deleted from the command line (at least
I did not figure out how). To delete it: Azure console > Active Directory >
select the existing AD > Applications > Select "apps owned" > Click on
the given app > delete (icon in bottom bar)
Azure client has [two modes of operation: asm and
arm](https://github.com/Azure/azure-content/blob/master/articles/virtual-machines/xplat-cli-azure-manage-vm-asm-arm.md), with
different commands and options. Switching between each is a stateful
operation.

We need to be in `asm mode` to be able to upload the ssh certificate,
but other scripts might have changed the mode to arm.
The `azure storage account keys list` command seems
to fail for a long time. Probably account creation is
async and takes a really long time. This is an ugly hack.
We will use the generated x509 temporary certificates in other steps,
for instance for the BOSH manifest when passing the
`ssh_certificate` key. Because of that, we will keep the temporary files
as `generated.insecure-deployer.pem` and
`generated.insecure-deployer.pfx`
Hardcoded environment name and skip if app is already created.
We need to get and create:
 * account id, subscription id...
 * Application, pass for BOSH service provider
 * grant the permissions to the service provider

This script runs the first steps and creates a bash script with
variables to be consumed by terraform.
We add a manifest file [based on the example from the azure_bosh_cpi_release](http://cloudfoundry.blob.core.windows.net/misc/bosh.yml). 

We will use several variables for this, which must be provided externally 
[as environment variables in terraform](https://www.terraform.io/docs/configuration/variables.html).

Other variables are predefined by convention based on the name of other resources in terraform, like `azure_resource_group_name`, `azure_vnet_name`, `azure_subnet_name`, `azure_storage_account_name`.

Other variables, come from files generated by external scripts (we used this scripts for missing features in terraform), like `azure_storage_access_key` and `azure_ssh_certificate`.

`azure_ssh_certificate` must be one unique line joined by `\n`, and we do so programmatically in terraform with `join` and `split`
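The same transformation can be sketched in shell; the certificate body here is a placeholder, not a real key:

```shell
# Sketch of the join: collapse a multi-line PEM certificate into a single
# line with literal "\n" separators, as the manifest value expects.
# The certificate body is a placeholder, not a real certificate.
printf '%s\n' '-----BEGIN CERTIFICATE-----' 'MIIB...' '-----END CERTIFICATE-----' \
  > generated.insecure-deployer.pem
one_line=$(awk 'BEGIN { ORS = "\\n" } { print }' generated.insecure-deployer.pem)
echo "$one_line"
```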

Note: This manifest is not 100% functional and still requires some tuning.
keymon and others added 28 commits August 27, 2015 15:08
In BOSH we need an Azure resource group to use for all the created assets
(vms, disks, etc). The terraform `azure_hosted_service` resource creates an
Azure resource group.

We will create a global `azure_hosted_service` and use it to create all the
objects.

We change the storage account to be created within this new global
group.
In order to keep consistency and make it more generic and clear.
The BOSH azure CPI requires the network it uses to be created within the
resource group assigned to BOSH. But the [azure network terraform
resource](https://terraform.io/docs/providers/azure/r/virtual_network.html) does
not allow specifying the resource group.

Because of that, we avoid using the terraform resource and call the
azure command line directly to create the network and subnet in the
right resource group for BOSH.

We will need to work out how to connect these two networks.
`azure-upload-certificate.sh` was not properly handling the case
where a certificate is already uploaded.
Use terraform provisioners to upload the SSH keys, manifest and
a provision.sh script, as we do on other platforms.
Only install packages and download software if necessary
So it does not point to the wrong network name and wrong SSH key.
Terraform does not support creating public IPs, so we need to use the
azure client for the time being.

We will need a public IP to be able to contact microbosh while it lives
in a different network than the bastion host [due to the limitations in
terraform](hashicorp/terraform#3089)
Because the bastion and microbosh are in different networks ([see reported bug in terraform](hashicorp/terraform#3089)), we need to be able to specify a public IP for microbosh.

This commit expects the user to manually create the ip with `azure/azure-create-public-ip.sh` and pass it to terraform as argument: `-var bosh_public_ip=65.52.132.211`
We must change the azure command line mode before running the
required commands. Mode is changed in other scripts.
Also sleep before querying the IP in case Azure did not finish creating it.
Add logic in terraform to create the public ip calling
the external command `azure-create-public-ip.sh`.
So we can easily troubleshoot and create new manifests.
In order to be able to connect to the service of microbosh on Azure via
the Public IP, we need to define the required ports as endpoints.

We start opening only SSH, and probably bosh-init would be able to
provision using the `ssh_tunnel` defined later.

NOTE: This is opening the ports to the public, we need to fix it.
So we can optionally define a faster machine for testing, for instance: `-var bastion_instance_size=Standard_D3`
When reading the externally generated azure storage key, we must strip any newline characters so the manifest remains valid.
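A minimal sketch of that clean-up; the key value and file name are illustrative:

```shell
# Sketch: strip newline characters from the generated storage key before
# interpolating it into the manifest. Key value and file name illustrative.
printf 'abc123==\n' > generated.storage-account-key
storage_key=$(tr -d '\n' < generated.storage-account-key)
echo "$storage_key"
```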
We need to explicitly put the public network in the job definition if
we want the bosh machine to listen on the public interface.
As the bastion host where we run bosh-init is currently running in a different 
network than the bosh machine, we need to map the port 6868, used by the bosh agent
in the stem cell, so bosh-init can connect to it and setup the microbosh machine.

The `cloud_provider.ssh_tunnel` does not work for this case :(

An alternative solution could be to point the variable `cloud_provider.mbus` to localhost:

```
cloud_provider:
  mbus: https://mbus-user:mbus-password@127.0.0.1:6868
```

and then manually create a tunnel with SSH with `ssh vcap@23.97.216.207  -i .ssh/id_rsa -L 6868:localhost:6868`

But given that the current situation (bastion and bosh in different networks) is in general temporary and suboptimal, we will just configure the port in the public ip for the time being.
We create several resources out of terraform with several scripts.

This script allows us to delete several objects created for the given
environment: vms, networks, ips, storage, etc...
Also delete all the storage containers on azure when running
`azure-delete-environment.sh`
Parameter checks, sensible variable names, etc.
It is required for the Makefile tasks to provision
Add all the required steps to build the azure environment:
 * Setup the bosh credentials
 * create temporary files
 * delete azure objects when destroying
Sometimes the `azure service cert list` command does not return the
key thumbprint because it was not created in time.
@keymon keymon closed this Aug 28, 2015
@keymon keymon changed the title Feature/101030380 bosh on azure [#101030380] MicroBosh on Azure Aug 28, 2015