Infrastructure simplifying engineer
Date: 2025-02-16
Managing a large number of virtual machines can be challenging. As our infrastructure grew, we needed a better way to handle the complexity. That's when we turned to Terraform to help us manage our VMware environment more efficiently. I'm going to cover the use cases for IaC and share the lessons we learned along the way.
The goal is to provide practical insights and tips for anyone looking to improve their VM management with Terraform for VMware.
I'm going to cover some very basic VM lifecycle scenarios.
There are many technologies available for different levels of infrastructure management. As DevOps engineers, we had to make specific choices. In our case that means Packer for building images, Terraform for provisioning, and Ansible for configuration. Let's take a look at the specific parts and then at the whole solution.
Behind the Packer build command:
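The workflow, as a minimal sketch (the template and var-file names are assumptions):
packer init .                                   # install the required builder plugins
packer validate .                               # check the template syntax
packer build -var-file=vsphere.pkrvars.hcl .    # build the image and upload it to vCenter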
Terraform is a stateful application with some important concepts: the state file (what Terraform believes exists), the plan (the diff between the code and the state), and the apply (the actions that reconcile them).
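In day-to-day use those concepts map onto a handful of commands; a quick sketch:
terraform init          # download providers & configure the state backend
terraform plan          # show the diff between the code and the state
terraform apply         # execute the planned changes
terraform state list    # inspect the resources tracked in the state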
Ansible automates the configuration of the VMs once Terraform has created them.
Everything is stored in git, with several linked processes around it.
The beginning of the story is described in "Ansible: CoreOS to CentOS, 18 months long journey". In short, it covers the migration from a custom configuration management solution to Ansible + Packer for managing the environment. Along the way, I created the vmware_content_deploy_ovf_template Ansible module for the VMware collection.
But we had a problem: multiple sources of truth about VM configurations, VMware itself and the git repository. The goal was to have only one source of truth to keep the IaC consistent.
I’m going to share some chronological lessons learned from implementing Terraform.
Steps to reproduce: run terraform with a local state from more than one machine, or lose the local state file.
Result: the infrastructure is not in a consistent state. How to avoid: use remote state.
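The repository layout below has a remote_state.tf for exactly this; a minimal sketch, assuming an S3-compatible backend (bucket and table names are placeholders):
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # assumed bucket name
    key            = "vmware/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # optional: state locking
  }
}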
Steps to reproduce:
Result: the original VM is not reachable. How to avoid: double-check the changes before applying.
Steps to reproduce:
# module.InfraServices.vsphere_virtual_machine.example-com is tainted, so must be replaced
-/+ resource "vsphere_virtual_machine" "example-com" {
- boot_delay = 0 -> null
- boot_retry_enabled = false -> null
~ change_version = "2023-07-28T11:41:47.016148Z" -> (known after apply)
Result: the data is lost, but luckily nobody was using the VM. How to avoid: don't keep VMs in a bad state; fix them or recreate them.
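If the VM is actually healthy and only the state marks it as broken, the taint can be cleared instead of letting Terraform replace the VM (the resource address comes from the plan above):
terraform untaint module.InfraServices.vsphere_virtual_machine.example-com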
Steps to reproduce: set datastore_id on a VM that lives on a datastore cluster, then let Storage DRS move the VM.
~ resource "vsphere_virtual_machine" "worker23" {
~ datastore_id = "datastore-2032" -> "datastore-88"
id = "423845e7-xxxx-xxxx-xxxx-77240ff97399"
name = "worker23"
Result: unneeded changes on the VMware side.
How to avoid: use datastore_cluster_id.
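A minimal sketch of pointing the VM at the cluster instead of a single datastore; the cluster name and the data.vsphere_datacenter.dc reference are assumptions:
data "vsphere_datastore_cluster" "storage_cluster_01" {
  name          = "Storage Cluster 01"            # assumed cluster name
  datacenter_id = data.vsphere_datacenter.dc.id   # assumed datacenter data source
}

resource "vsphere_virtual_machine" "worker23" {
  name                 = "worker23"
  datastore_cluster_id = data.vsphere_datastore_cluster.storage_cluster_01.id
  # ... the rest of the VM arguments stay unchanged
}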
Steps to reproduce: change datastore_id to datastore_cluster_id together with other changes.
Result: multiple disks can be moved across storage at the same time. How to avoid: do not make many changes at once.
Steps to reproduce: create a VM from a Content Library source with datastore_cluster_id set.
│ Error: Cannot use datastore_cluster_id with Content Library source
│
│ with module.TrainingsAndDemo.vsphere_virtual_machine.example-06,
│ on modules/TrainingsAndDemo/example-06.example.local.tf line 2, in resource "vsphere_virtual_machine" "example-06":
│ 2: resource "vsphere_virtual_machine" "example-06" {
Result: the VM is not created. How to avoid: first create the VM with datastore_id and run terraform, then switch datastore_id to datastore_cluster_id and run terraform again.
Steps to reproduce: create the VM with datastore_id and run terraform, then change datastore_id to datastore_cluster_id and run terraform again.
Result: the VM is created, but there are far too many actions on the VMware side. It is not a bug but an API feature. How to avoid: use storage_policy_id from the start and run terraform:
data "vsphere_storage_policy" "storage_cluster_01" {
  name = "Storage Cluster 01"
}

resource "vsphere_virtual_machine" "customnagpr" {
  name              = "customnagpr"
  storage_policy_id = var.storage_policy_id
  ...
}
Here var.storage_policy_id is presumably fed with data.vsphere_storage_policy.storage_cluster_01.id.
Steps to reproduce:
│ Error: error detaching tags to object ID "vm-108158": POST https://vcenter01.example.local/rest/com/vmware/cis/tagging/tag-association/id:urn:vmomi:InventoryServiceTag:2c12366a-xxxx-xxxx-xxxx-59ab8e157645:GLOBAL?~action=detach: 404 Not Found
│
│ with module.ExamplePersonal.vsphere_virtual_machine.ld-example,
│ on modules/ExamplePersonal/ld-example.example.local.tf line 2, in resource "vsphere_virtual_machine" "ld-example":
│ 2: resource "vsphere_virtual_machine" "ld-example" {
Result: terraform exits with a non-zero code. How to avoid: no good ideas so far; just run it twice.
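A blunt retry for CI, assuming non-interactive runs are acceptable:
# retry once on transient vSphere tagging errors
terraform apply -auto-approve || terraform apply -auto-approve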
Steps to reproduce:
~ resource "vsphere_virtual_machine" "example-com" {
~ custom_attributes = {
- "304" = "OK" -> null
- "305" = "31-07-2023 02:31:26" -> null
Result: there is an unneeded change. How to avoid: add this to the VM config:
lifecycle {
  ignore_changes = [
    custom_attributes["304"],
    custom_attributes["305"],
  ]
}
Steps to reproduce: move a VM resource from one module to another.
Result: the VM is recreated. How to avoid: add temporary code for the migration:
moved {
from = module.ExamplePersonal.vsphere_virtual_machine.ld-example
to = module.AnoverPersonal.vsphere_virtual_machine.ld-example
}
Links: https://developer.hashicorp.com/terraform/tutorials/configuration-language/move-config
Steps to reproduce:
Result: the VM is not reachable by its DNS name, and you must restart the network to fix it. How to avoid: there is no solution.
Steps to reproduce:
Result: the VM is deleted. How to avoid:
...
lifecycle {
  prevent_destroy = true
}
...
Steps to reproduce: add a second disk block without unit_number, then run:
terraform apply
Planning failed. Terraform encountered an error while generating this plan.
│ Error: disk: duplicate SCSI unit_number 0
│
│ with module.InfraServices.vsphere_virtual_machine.commondata,
│ on modules/InfraServices/commondata.example.local.tf line 2, in resource "vsphere_virtual_machine" "commondata":
│ 2: resource "vsphere_virtual_machine" "commondata" {
How to avoid: set unit_number explicitly:
disk {
  label            = "disk1"
  size             = 120
  controller_type  = "scsi"
  thin_provisioned = true
  unit_number      = 1
}
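For clarity, a sketch with both disks inside the same vsphere_virtual_machine resource, numbered explicitly (labels and sizes are placeholders):
disk {
  label       = "disk0"
  size        = 60
  unit_number = 0   # the boot disk keeps unit 0
}
disk {
  label       = "disk1"
  size        = 120
  unit_number = 1   # the second disk must not reuse unit 0
}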
terraform import module.ExampleGuild.vsphere_virtual_machine.ans-jenkins01 /My\ Datacenter/vm/ExampleGuild/ans-jenkins01
- enable_logging = true -> null
~ imported = true -> false
+ storage_policy_id = "eaf0b1ab-xxxx-xxxx-xxxx-348473a5727b"
- sync_time_with_host = true -> null
- cdrom {
- client_device = false -> null
- device_address = "ide:0:0" -> null
- key = 3000 -> null
}
~ disk {
~ keep_on_remove = true -> false
TIP: locator format: <DC NAME>/vm/<Folder>/<VM Name>
Steps to reproduce:
Result: The VM is rebooted.
How to avoid: Review the Terraform output and align the VM parameters accordingly.
diff -C0 created-by-terraform.example.local.tf created-manually.example.local.tf
...
*** 9,10 ****
- sync_time_with_host = true
- enable_logging = true
--- 8 ----
***************
*** 31,32 ****
! cdrom {
!   client_device = false
--- 27,35 ----
! clone {
!   template_uuid = data.vsphere_content_library_item.packer_rocky8.id
!   customize {
!     linux_options {
!       host_name = "jenkins-dockerbuild-02"
!       domain    = var.domain
!     }
!     network_interface {}
!   }
***************
*** 34,38 ****
-
- cdrom {
-   client_device = false
- }
data "vsphere_folder" "folder" { path = "/My Datacenter/vm/example" }
resource "vsphere_vapp_container" "VAPP_demo" {
name = "VAPP-demo"
parent_folder_id = data.vsphere_folder.folder.id
terraform import module.Imagemaster.vsphere_vapp_container.VAPP_demo /My\ Datacenter/host/My\ Cluster\ 01/Resources/Example/VAPP-demo
resource "vsphere_virtual_machine" "fdemo-docker" {
name = "demo-docker"
resource_pool_id = vsphere_vapp_container.VAPP_demo.id
Steps to reproduce: change the CPU and memory settings via the vCenter web UI, then run terraform plan.
# module.ExamplePersonal.vsphere_virtual_machine.ld-example will be updated in-place
~ resource "vsphere_virtual_machine" "ld-example" {
- memory_reservation = 16384 -> null
~ num_cores_per_socket = 2 -> 1
~ num_cpus = 6 -> 2
Result: Terraform wants to revert the changes. How to avoid: make changes via Terraform, or sync them into the code:
num_cpus = 6
num_cores_per_socket = 2
memory_reservation = 16384
Steps to reproduce:
terraform apply
# module.ExamplePersonal.vsphere_virtual_machine.ld-example will be updated in-place
~ resource "vsphere_virtual_machine" "ld-example" {
- memory_reservation = 16384 -> null
~ num_cores_per_socket = 2 -> 1
~ num_cpus
Result: only the RAM changes are detected. How to avoid: no real fix; just take a note of it…
Steps to reproduce: create a VM with two NICs from a Content Library source, then run:
terraform apply
│ Error: 400 Bad Request: {"type":"com.vmware.vapi.std.errors.invalid_argument","value":{"error_type":"INVALID_ARGUMENT","messages":[{"args":["network_mappings","com.vmware.vcenter.ovf.library_item.resource_pool_deployment_spec"],"default_message":"Could not convert field 'network_mappings' of structure 'com.vmware.vcenter.ovf.library_item.resource_pool_deployment_spec'","id":"vapi.bindings.typeconverter.fromvalue.struct.field.error"},{"args":[],"default_message":"Element already present in the map.","id":"vapi.bindings.typeconverter.map.duplicate.element"}]}}
Result: the VM fails to create. How to avoid: create the VM with one NIC and add the second one afterwards.
The tag category must be created with cardinality = "MULTIPLE" so that several tags from the category can be attached to one object; the full tags.tf is shown below.
terraform state pull > state.json
terraform state push state.json
It is not ideal because there is a lot of copy-pasting, but it is quite easy to find a specific VM and tune it:
├── < FOLDER NAME >.tf
├── connections.tf
├── main.tf
├── modules
│ ├── < FOLDER NAME >
│ │ ├── < VM NAME >.example.local.tf
│ │ ├── _main.tf
│ │ └── _variables.tf
├── outputs.tf
├── README.md
├── remote_state.tf
├── tags.tf
├── variables.tf
└── versions.tf
< FOLDER NAME > is the same as the Network/Folder/Pool name in VMware.
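A sketch of how one folder module is wired up in main.tf; the inputs shown are assumptions based on the variables used in the module files:
# main.tf -- one module instance per VMware folder
module "InfraServices" {
  source            = "./modules/InfraServices"
  storage_policy_id = data.vsphere_storage_policy.storage_cluster_01.id
  domain            = var.domain
}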
Steps to reproduce:
Result: the VM must be moved to the specific OU manually. How to avoid: create a PR with the required functionality upstream and wait.
Create tags.tf:
resource "vsphere_tag_category" "terraform_managed_category" {
name = "terraform_managed_category"
cardinality = "MULTIPLE"
description = "Managed by Terraform"
associable_types = [
"VirtualMachine",
"Datastore",
"VirtualApp",
]
}
resource "vsphere_tag" "Sandbox" {
name = "Sandbox"
category_id = vsphere_tag_category.terraform_managed_category.id
description = "Ansible group default_db_postgres. Managed by Terraform"
}
Then create a VM .tf file and assign the tags:
resource "vsphere_virtual_machine" "demo-db-pg" {
name = "demo-db-pg"
storage_policy_id = var.storage_policy_id
resource_pool_id = vsphere_vapp_container.demo.id
wait_for_guest_net_timeout = 5
sync_time_with_host = true
enable_logging = true
firmware = "efi"
annotation = <<-EOT
!!! Do not edit properties via the vCenter web UI !!!
Managed by ansible & terraform
Some important notes
EOT
tags = [
vsphere_tag.Sandbox.id,
]
...
Install the community.vmware collection and create hosts.vmware.yml:
plugin: vmware_vm_inventory
hostname: vcenter01.example.com
username: xxx
with_tags: True
keyed_groups:
  - key: tag_category.terraform_managed_category
    prefix: ""
    separator: ""
hostnames:
  - 'config.name+".example.com"'
properties:
  - 'config.name'
  - 'summary.runtime.powerState'
  - 'guest.guestFamily'
with_nested_properties: true
filters:
  - tag_category.terraform_managed_category is defined
  - "'Sandbox' in tag_category.terraform_managed_category"
  - guest.guestFamily is defined
  - guest.guestFamily == 'linuxGuest'
  - summary.runtime.powerState == 'poweredOn'
resources:
  - datacenter:
      - My Datacenter
    resources:
      - compute_resource:
          - My Cluster 01
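A quick way to check that the inventory resolves as expected:
ansible-inventory -i hosts.vmware.yml --graph   # list the discovered groups & hosts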
Two takeaways worth repeating: with a Content Library source, datastore_cluster_id is not supported and we must use storage_policy_id instead; and set prevent_destroy to avoid accidental VM deletion.