
Add infra for GCP #112

Open · wants to merge 1 commit into base: v2

Conversation
Conversation

GJS2162 (Collaborator) commented Oct 30, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a .gitignore file to manage ignored files in the infra-gcp directory.
    • Added a .terraform.lock.hcl file to ensure consistent provider versions across environments.
    • Updated README.md with a guide for updating infrastructure using GCP Kubernetes Engine.
    • Created a configuration template for dynamic Ansible inventory management for Kubernetes nodes.
    • Added Terraform configurations for managing GCP virtual machines and networking resources, including VPC and subnetworks.
    • Introduced output variables for retrieving critical information related to Kubernetes infrastructure.
  • Documentation

    • Enhanced README.md with step-by-step instructions for infrastructure updates.
  • Chores

    • Added a new variables.tf file to define configuration variables for GCP infrastructure.


coderabbitai bot commented Oct 30, 2024

📝 Walkthrough

This pull request introduces several new files and configurations for managing infrastructure on Google Cloud Platform (GCP) using Terraform. Key changes include the addition of a .gitignore file to exclude sensitive and unnecessary files, a .terraform.lock.hcl file for provider version management, and a variables.tf file for defining various configuration parameters. New Terraform configurations for networking, virtual machines, and outputs are also included, facilitating the deployment and management of Kubernetes clusters on GCP.

Changes

File Change Summary
infra-gcp/.gitignore Added a .gitignore file to exclude Terraform state files, local directories, sensitive .tfvars files, and more.
infra-gcp/.terraform.lock.hcl Created a .terraform.lock.hcl file to specify provider versions and integrity hashes for AWS, Google, Local, and TLS providers.
infra-gcp/README.md Updated README.md to include a guide for updating infrastructure using GCP Kubernetes Engine with Terraform commands.
infra-gcp/inventory.ini.tftpl Added a template for an Ansible inventory file to manage Kubernetes nodes dynamically based on node types.
infra-gcp/module/vm/vm.tf Introduced a Terraform configuration for GCP VMs, defining variables and resources for creating VM instances and public IPs.
infra-gcp/networking.tf Created a configuration for GCP networking, including a VPC, subnets, and firewall rules.
infra-gcp/output.tf Added output variables to expose IP addresses and private keys from various Kubernetes modules.
infra-gcp/state.tf Introduced a backend configuration for storing Terraform state in Google Cloud Storage and defined project metadata.
infra-gcp/variables.tf Created a variables.tf file to define various configuration parameters for the GCP infrastructure setup.
infra-gcp/vm.tf Added modules for different Kubernetes node types, including load balancer, master, worker, and admin nodes.

Possibly related PRs

  • add documentation to scale kubernetes infrastructure #19: That PR documents scaling Kubernetes infrastructure, while this PR introduces a .gitignore file for the infra-gcp directory; both involve managing configurations and files for Kubernetes deployments.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (7)
infra-gcp/state.tf (1)

1-6: Enhance state storage security and configuration management.

Consider the following security and maintainability improvements for the backend configuration:

  1. Enable encryption for sensitive state data
  2. Enable versioning for state files to prevent accidental loss
  3. Parameterize the bucket name to support different environments (backend blocks cannot interpolate variables, so supply it at init time via terraform init -backend-config=...)

Apply these improvements (note that the encryption key must likewise be a literal value or passed via -backend-config, since var.* references are not allowed inside backend blocks):

 terraform {
   backend "gcs" {
     bucket  = "bhasaideploymentstorage"
     prefix  = "terraform/bhasai/live/dev-k8s"
+    encryption_key = "<base64-encoded-key>"  # or supply via -backend-config
   }
 }

Also, consider adding these bucket configurations in a separate terraform configuration:

resource "google_storage_bucket" "terraform_state" {
  name          = var.state_bucket_name
  location      = var.region
  force_destroy = false
  
  versioning {
    enabled = true
  }

  uniform_bucket_level_access = true
}
infra-gcp/inventory.ini.tftpl (2)

3-4: Remove or update the example entries

The commented example entries contain hardcoded IP addresses that might be misleading. Consider either removing them or updating them with placeholder values that clearly indicate they are examples.

-#node1 ansible_host=35.202.160.50
-#node2 ansible_host=35.202.162.100 
+# Example format:
+# node1 ansible_host=<external_ip> ansible_ssh_private_key_file=<path_to_key>

1-36: Add documentation for template variables

The template uses several variables (master_nodes, worker_nodes, worker_gpu_nodes) but lacks documentation about their expected structure and required attributes.

Add a header comment section explaining the required variables:

+# Required variables:
+# - master_nodes: List of objects with attributes: name, external_ip
+# - worker_nodes: List of objects with attributes: name, external_ip
+# - worker_gpu_nodes: List of objects with attributes: name, external_ip
+
 ## Configure 'ip' variable to bind kubernetes services on a
infra-gcp/vm.tf (1)

1-11: Consider network segmentation for load balancer nodes.

The load balancer nodes are currently placed in the master subnet. For better security and network segmentation, consider placing them in a dedicated DMZ or frontend subnet.
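A dedicated subnet is a small addition. As a minimal sketch (the subnet name, CIDR, and module label are illustrative; the VPC reference and variable names follow this PR):

resource "google_compute_subnetwork" "frontend" {
  name          = "${var.resource_group_name}-frontend-subnet" # illustrative name
  ip_cidr_range = "10.1.0.0/24"                                # placeholder CIDR
  region        = var.resource_group_location                  # assumes this holds a region, as in the module's public IP resource
  network       = google_compute_network.vpc.id
  project       = var.resource_group_name
}

# Then point the load balancer module at it instead of the master subnet:
#   subnet_id = google_compute_subnetwork.frontend.id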

infra-gcp/module/vm/vm.tf (3)

35-41: Consolidate duplicate 'network_interface' blocks

There are two network_interface blocks in the instance configuration. Unless multiple network interfaces are required, consider consolidating them into a single block to simplify the configuration.

Combine the configurations as needed to maintain only one network_interface block.

Also applies to: 51-57
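A minimal sketch of the consolidated block (assuming the custom subnetwork is the one to keep; the dynamic block makes the public IP genuinely optional rather than falling back to an ephemeral address):

  network_interface {
    subnetwork = var.subnet_id

    # Only request an external address when the node needs one
    dynamic "access_config" {
      for_each = var.node_public_ip ? [1] : []
      content {
        nat_ip = google_compute_address.public_ip[count.index].address
      }
    }
  }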


45-47: Parameterize the VM image for flexibility

The VM image is hardcoded. To allow flexibility in selecting different images, consider making it a variable.

Add a variable for the image:

variable "node_image" {
  type    = string
  default = "projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20240419"
}

Update the boot_disk configuration:

 initialize_params {
-  image = "projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20240419"
+  image = var.node_image
   size  = var.node_disk_size
   type  = "pd-standard"
 }

70-77: Review and complete the commented-out 'vm_ips' output block

The vm_ips output block is commented out and appears incomplete. If you intend to output VM IP information, review and complete this block.

Uncomment and adjust the output as needed:

-# output "vm_ips" {
-#   value = { for vm in google_compute_instance.vm : vm.name => {
-#     id         = vm.id
-#     public_ip  = vm.
-#     private_ip = google_compute_instance.vm[count.index].network_interface[0].network_ip
-#     }
-#   }
-# }
+output "vm_ips" {
+  value = { for vm in google_compute_instance.vm : vm.name => {
+    id         = vm.id
+    public_ip  = vm.network_interface[0].access_config[0].nat_ip
+    private_ip = vm.network_interface[0].network_ip
+    }
+  }
+}

Ensure that the public_ip and private_ip fields correctly reference the appropriate attributes.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR, between commits d32c823 and 2e245eb.

📒 Files selected for processing (10)
  • infra-gcp/.gitignore (1 hunks)
  • infra-gcp/.terraform.lock.hcl (1 hunks)
  • infra-gcp/README.md (1 hunks)
  • infra-gcp/inventory.ini.tftpl (1 hunks)
  • infra-gcp/module/vm/vm.tf (1 hunks)
  • infra-gcp/networking.tf (1 hunks)
  • infra-gcp/output.tf (1 hunks)
  • infra-gcp/state.tf (1 hunks)
  • infra-gcp/variables.tf (1 hunks)
  • infra-gcp/vm.tf (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • infra-gcp/.gitignore
  • infra-gcp/.terraform.lock.hcl
  • infra-gcp/README.md
  • infra-gcp/variables.tf
🧰 Additional context used
🪛 checkov
infra-gcp/module/vm/vm.tf

[HIGH] 28-61: Ensure 'Block Project-wide SSH keys' is enabled for VM instances

(CKV_GCP_32)


[HIGH] 28-61: Ensure VM disks for critical VMs are encrypted with Customer Supplied Encryption Keys (CSEK)

(CKV_GCP_38)

infra-gcp/networking.tf

[HIGH] 39-51: Ensure Google compute firewall ingress does not allow unrestricted ssh access

(CKV_GCP_2)

infra-gcp/state.tf

[HIGH] 15-20: Ensure oslogin is enabled for a Project

(CKV_GCP_33)

🔇 Additional comments (5)
infra-gcp/state.tf (1)

15-20: ⚠️ Potential issue

Enable OS Login and follow GCP naming conventions.

Several security and consistency improvements are needed:

  1. OS Login should be enabled for enhanced security (CKV_GCP_33)
  2. Variable naming should follow GCP conventions (project_id instead of resource_group_name)
  3. The placeholder comment suggests incomplete configuration

Apply these security improvements:

 resource "google_compute_project_metadata" "default" {
-  project       = var.resource_group_name
+  project       = var.project_id
   metadata = {
     delete_os_disk_on_deletion = "true"
+    enable-oslogin = "TRUE"
   }
 }

Let's verify if OS Login is consistently configured across the project:

🧰 Tools
🪛 checkov

[HIGH] 15-20: Ensure oslogin is enabled for a Project

(CKV_GCP_33)

infra-gcp/inventory.ini.tftpl (1)

1-36: Verify integration with Terraform configurations

The template assumes specific node structures and SSH key paths. Let's verify these assumptions against the Terraform configurations.

✅ Verification successful

✓ Template aligns with Terraform configuration

The inventory template is correctly structured and matches the Terraform implementation:

  • The template expects master_nodes, worker_nodes, and worker_gpu_nodes, which align with the Terraform modules k8s_master, k8s_worker, and k8s_worker_gpu
  • SSH key paths in the template (/home/ubuntu/.ssh/master.pem, worker.pem, worker_gpu.pem) match the paths in the commented provisioner block
  • Each node type generates its own SSH key via the tls_private_key resource in the VM module
  • The node naming convention in the template will match the instance names generated by the VM module (${var.resource_group_name}-${var.node_type}-${count.index})
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the required node structures and SSH configurations exist in Terraform

# Check for node definitions
echo "Checking for node definitions..."
rg -A 5 'resource\s+"google_compute_instance"' --type tf

# Check for SSH key generation or management
echo "Checking for SSH key configurations..."
rg -l 'ssh|private_key' --type tf

# Check for output variables that would feed into this template
echo "Checking for relevant output variables..."
rg 'output\s+".*_nodes"' --type tf

Length of output: 967


Script:

#!/bin/bash
# Let's examine the specific node configurations and SSH key setup

# Check VM module implementation
echo "=== VM Module Implementation ==="
cat infra-gcp/module/vm/vm.tf

# Check main VM configuration
echo -e "\n=== Main VM Configuration ==="
cat infra-gcp/vm.tf

# Check output configurations
echo -e "\n=== Output Configurations ==="
cat infra-gcp/output.tf

# Look for variables that match our template's expectations
echo -e "\n=== Variables Related to Node Types ==="
rg "variable.*(\bmaster\b|\bworker\b|\bnode_type\b)" --type tf

Length of output: 8621

infra-gcp/output.tf (2)

1-19: Verify the intention behind commented IP outputs

All VM IP outputs are commented out, which might make it difficult to interact with the infrastructure. If these are intentionally disabled for security reasons, consider:

  1. Using sensitive = true instead of commenting them out
  2. Implementing proper access controls or VPN if direct IP access is needed
  3. Using DNS names instead of direct IPs

Could you clarify if these outputs should remain commented out or if they need to be enabled with appropriate security controls?
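For option 1 above, a minimal sketch (the module reference matches this PR; it assumes the module-level vm_ips output suggested earlier is enabled):

output "master_vm_ips" {
  description = "IP details of the Kubernetes master nodes"
  value       = module.k8s_master.vm_ips
  sensitive   = true
}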


21-49: Security: Reconsider exposing private keys as outputs

While the private keys are marked as sensitive, exposing them through Terraform outputs poses several security risks:

  1. Keys are stored in the Terraform state file
  2. No apparent key rotation mechanism
  3. Increased attack surface through state file access

Consider these security improvements:

  1. Use Google Secret Manager or HashiCorp Vault for key management
  2. Implement automated key rotation
  3. If outputs are necessary, consider outputting key references instead of actual keys

Example using Secret Manager:

resource "google_secret_manager_secret" "vm_keys" {
  secret_id = "vm-private-keys"
  
  replication {
    automatic = true
  }
}

resource "google_secret_manager_secret_version" "vm_keys" {
  secret = google_secret_manager_secret.vm_keys.id
  secret_data = jsonencode({
    master = module.k8s_master.private_key,
    worker = module.k8s_worker.private_key,
    # ... other keys
  })
}

output "secret_name" {
  value = google_secret_manager_secret.vm_keys.name
}

Let's check if there are any existing secret management solutions in the codebase:

✅ Verification successful

Let me gather more context about how these private keys are being used and if there are any existing patterns for key management in the infrastructure code.


Let me check how these private keys are being used in the VM modules to understand the complete picture.


Security: Private key outputs are required for VM SSH access but need better management

Based on the verification, these private keys are legitimately used for SSH access to the VMs, where:

  1. Each VM module generates an RSA key pair using tls_private_key
  2. The public key is attached to the VM's metadata for SSH authentication
  3. The private key is exposed as an output for SSH client configuration

While the current implementation follows a common pattern for VM SSH access, here are specific, actionable improvements:

  • Add a lifecycle block to the tls_private_key resource with create_before_destroy = true to ensure smooth key rotation (see the sketch after this list)
  • Consider implementing automated key rotation using external tooling or scheduled workflows
  • Document the security implications and proper handling of these keys in the README
  • Store the state file in an encrypted backend with strict access controls
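A sketch of the lifecycle addition (the resource name tls_private_key.rsa_key matches the module code; the algorithm and key size are assumed, so keep whatever the module currently uses):

resource "tls_private_key" "rsa_key" {
  algorithm = "RSA"
  rsa_bits  = 4096 # assumed key size; match the module's existing setting

  lifecycle {
    create_before_destroy = true # generate the replacement key before the old one is destroyed
  }
}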
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Look for secret management configurations
echo "Checking for secret management patterns..."
rg -l "google_secret_manager|vault_secret" || echo "No secret management found"

# Check for potential security guidelines
echo -e "\nChecking for security documentation..."
fd -g "*security*.md" -g "*README*.md" -X grep -l "key|secret|sensitive" || echo "No security documentation found"

Length of output: 542


Script:

#!/bin/bash
# Check the k8s module implementations to understand key usage
echo "Checking k8s module implementations..."
fd -g "*.tf" -X rg -l "private_key"

# Look for any existing key management patterns
echo -e "\nChecking key management patterns..."
fd -g "*.tf" -X rg -A 3 "resource\s+\"tls_private_key\""

# Check if these keys are used in other places
echo -e "\nChecking references to private keys..."
rg "private_key\s*=" -A 2

Length of output: 1263


Script:

#!/bin/bash
# Check the VM module implementation details
echo "Checking VM module private key usage..."
rg -A 10 "tls_private_key" ./infra-gcp/module/vm/vm.tf

# Check if these keys are used for SSH access
echo -e "\nChecking for SSH-related configurations..."
rg -A 5 "ssh|connection" ./infra-gcp/module/vm/vm.tf

# Check if there are any key lifecycle management configurations
echo -e "\nChecking for key lifecycle configurations..."
rg "lifecycle|create_before_destroy" ./infra-gcp/module/vm/vm.tf

Length of output: 1147

infra-gcp/networking.tf (1)

63-63: Update 'source_ranges' in 'internal_firewall' with your internal IP range

The source_ranges is currently set to a placeholder value "10.0.0.0/8" with a comment indicating that it should be replaced. Ensure that you update it to reflect your actual internal IP range to enforce the correct firewall rules.

Please verify and replace the placeholder with your specific internal IP range.
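One way to make the intent explicit is to drive the rule from a variable (the name and default are illustrative):

variable "internal_cidr_ranges" {
  description = "CIDR ranges allowed for internal cluster traffic"
  type        = list(string)
  default     = ["10.0.0.0/8"] # replace with your actual internal range
}

# in internal_firewall:
#   source_ranges = var.internal_cidr_ranges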

Comment on lines +7 to +13
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 3.0.0"
    }
  }
}

⚠️ Potential issue

Tighten version constraints and add provider configuration.

The current provider setup has two issues:

  1. The version constraint >= 3.0.0 is too permissive and could lead to unexpected behavior with major version updates
  2. Missing provider configuration block with essential GCP settings

Add a more specific version constraint and provider configuration:

 terraform {
   required_providers {
     google = {
       source  = "hashicorp/google"
-      version = ">= 3.0.0"
+      version = "~> 6.9.0"
     }
   }
 }

+provider "google" {
+  project = var.project_id
+  region  = var.region
+  zone    = var.zone
+}

Committable suggestion was skipped due to low confidence.

Comment on lines +34 to +36
[k8s_cluster:children]
kube_node
kube_control_plane

🛠️ Refactor suggestion

Consider adding Ansible variables section

The inventory could benefit from a variables section to define cluster-wide configurations.

Add a variables section at the end of the file:

[k8s_cluster:children]
kube_node
kube_control_plane

+[all:vars]
+ansible_python_interpreter=/usr/bin/python3
+ansible_ssh_common_args='-o StrictHostKeyChecking=no'

Comment on lines +6 to +14
%{ for nodes in master_nodes ~}
${nodes.name} ansible_host=${nodes.external_ip} ansible_ssh_private_key_file=/home/ubuntu/.ssh/master.pem
%{ endfor ~}
%{ for nodes in worker_nodes ~}
${nodes.name} ansible_host=${nodes.external_ip} ansible_ssh_private_key_file=/home/ubuntu/.ssh/worker.pem
%{ endfor ~}
%{ for nodes in worker_gpu_nodes ~}
${nodes.name} ansible_host=${nodes.external_ip} ansible_ssh_private_key_file=/home/ubuntu/.ssh/worker_gpu.pem
%{ endfor ~}

⚠️ Potential issue

Improve SSH key management security

The current implementation has several security considerations:

  1. Hardcoded paths to SSH keys in /home/ubuntu/.ssh/
  2. Different key files for different node types increases management complexity
  3. No SSH user specification which defaults to the current user

Consider these improvements:

  1. Use variables for SSH paths to make them configurable
  2. Consider using a single SSH key for all nodes
  3. Explicitly specify the SSH user
%{ for nodes in master_nodes ~}
-${nodes.name} ansible_host=${nodes.external_ip} ansible_ssh_private_key_file=/home/ubuntu/.ssh/master.pem
+${nodes.name} ansible_host=${nodes.external_ip} ansible_user=${ssh_user} ansible_ssh_private_key_file=${ssh_private_key_path}
%{ endfor ~}

Committable suggestion was skipped due to low confidence.

Comment on lines +16 to +32
[kube_control_plane]
%{ for nodes in master_nodes ~}
${nodes.name}
%{ endfor ~}

[etcd]
%{ for nodes in master_nodes ~}
${nodes.name}
%{ endfor ~}

[kube_node]
%{ for nodes in worker_nodes ~}
${nodes.name}
%{ endfor ~}
%{ for nodes in worker_gpu_nodes ~}
${nodes.name}
%{ endfor ~}

💡 Codebase verification

Control plane configuration requires attention

Based on the infrastructure code, the current setup has only 1 master node (k8s_master_node_count = "1"), which means there is no high availability for the control plane and etcd. For production environments, it's recommended to have at least 3 master nodes to ensure high availability and fault tolerance; a sketch follows the key findings below.

Key findings:

  • Single master node configuration (k8s_master_node_count = "1")
  • Same node serves both control plane and etcd roles
  • No redundancy for critical cluster components
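If HA is required, the fix is a one-line change in the tfvars (a sketch; the PR currently pins the count as the string "1"):

k8s_master_node_count = 3 # at least 3, and odd, to preserve control plane / etcd quorum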
🔗 Analysis chain

Verify Kubernetes control plane configuration

The current setup places all master nodes in both kube_control_plane and etcd groups. While this is a common pattern, ensure this aligns with your high-availability requirements.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are related Terraform configurations that specify the number of control plane nodes
# and their high availability setup.

# Look for control plane node count variables or locals
rg -l 'master_nodes|control_plane' --type tf

# Look for etcd-related configurations
rg -l 'etcd' --type tf

Length of output: 97


Script:

#!/bin/bash
# Let's examine the content of the VM configuration files to understand the control plane setup

# Check infra-gcp/vm.tf for master node configuration
echo "=== Content of infra-gcp/vm.tf ==="
cat infra-gcp/vm.tf

# Check infra/vm.tf as well
echo -e "\n=== Content of infra/vm.tf ==="
cat infra/vm.tf

# Also check for any variables files that might define the number of nodes
echo -e "\n=== Checking variables files ==="
fd "variables.tf" --type f

Length of output: 9995


Script:

#!/bin/bash
# Let's check the variables files to see the control plane node count configuration

echo "=== Content of infra-gcp/variables.tf ==="
cat infra-gcp/variables.tf

echo -e "\n=== Content of infra/variables.tf ==="
cat infra/variables.tf

Length of output: 2137


Script:

#!/bin/bash
# Let's check for any terraform.tfvars files that might define the actual values

echo "=== Looking for tfvars files ==="
fd ".*\.tfvars$" --type f

echo -e "\n=== Content of any found tfvars files ==="
fd ".*\.tfvars$" --type f --exec cat {}

Length of output: 1310

  node_disk_size          = var.k8s_worker_disk_size
  resource_group_name     = var.resource_group_name
  resource_group_location = var.resource_group_location
  target_tags             = ["k8s-master"]

⚠️ Potential issue

Fix incorrect target tags for worker nodes.

All worker nodes are incorrectly using the "k8s-master" target tag. This could lead to security issues as firewall rules and network policies intended for master nodes would affect workers. Each node type should have its appropriate tag:

Apply these changes:

-  target_tags             = ["k8s-master"] 
+  target_tags             = ["k8s-worker"]
-  target_tags             = ["k8s-master"]  
+  target_tags             = ["k8s-worker-gpu"]
-  target_tags             = ["k8s-master"] 
+  target_tags             = ["k8s-stateful"]

Also applies to: 46-46, 58-58

Comment on lines +61 to +72
module "k8s_admin" {
source = "./module/vm"
node_type = "admin"
number_of_nodes = 1
subnet_id = google_compute_subnetwork.master.id
node_vm_size = var.k8s_admin_node_size
node_disk_size = var.k8s_admin_disk_size
resource_group_name = var.resource_group_name
resource_group_location = var.resource_group_location
node_public_ip = true
target_tags = ["k8s-admin"] # Add tag for admin
}

💡 Codebase verification

Security risk: Admin node exposed to unrestricted SSH access

The verification confirms critical security concerns:

  • The admin firewall rule (admin_firewall) allows SSH access (port 22) from any IP (0.0.0.0/0)
  • The admin node has a public IP and is placed in the master subnet
  • These configurations create an unnecessarily broad attack surface

Required changes:

  • Replace source_ranges = ["0.0.0.0/0"] with a variable for allowed CIDR ranges
  • Consider moving the admin node to a dedicated management subnet for better network segmentation
🔗 Analysis chain

Review security implications of admin node configuration.

The admin node:

  1. Has a public IP in the master subnet, which could increase the attack surface
  2. Should have additional security measures like restricted IP ranges for SSH access

Consider:

  • Moving to a dedicated management subnet
  • Adding var.allowed_ssh_ranges to restrict access (sketched below)
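A sketch of the proposed variable (the name allowed_ssh_ranges is illustrative; wire it into admin_firewall's source_ranges):

variable "allowed_ssh_ranges" {
  description = "CIDR ranges permitted SSH access to the admin node"
  type        = list(string)
  # intentionally no default, so callers must supply an explicit allowlist
}

# in networking.tf:
#   source_ranges = var.allowed_ssh_ranges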

Let's check if any firewall rules are defined for the admin node:


Let me gather more information about the firewall rules to understand the security configuration better.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for firewall rules related to admin nodes
rg -A 5 'k8s-admin' --type tf

Length of output: 674


Script:

#!/bin/bash
# Get the full firewall rule definitions
rg -B 5 -A 15 'google_compute_firewall.*admin' --type tf

Length of output: 1182

Comment on lines +79 to +82
output "private_key" {
sensitive = true
value = tls_private_key.rsa_key.private_key_openssh
}

⚠️ Potential issue

Avoid outputting the private SSH key

Outputting the private key via Terraform outputs, even if marked as sensitive, poses a security risk.

Consider removing the private_key output and securely managing SSH keys outside of Terraform outputs.
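If the key must leave Terraform, one hedged alternative is writing it to a permission-restricted local file via the hashicorp/local provider (already pinned in this PR's lock file). Note the key still lands in state either way, so encrypted state storage with strict access controls remains essential:

resource "local_file" "ssh_private_key" {
  content         = tls_private_key.rsa_key.private_key_openssh
  filename        = "${path.module}/.keys/${var.node_type}.pem" # illustrative path
  file_permission = "0600"                                      # owner read/write only
}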

Comment on lines +9 to +10
variable "resource_group_name" {}
variable "resource_group_location" {}

🛠️ Refactor suggestion

Use GCP-specific terminology for variables and resources

The variables resource_group_name and resource_group_location are more commonly associated with Azure. In GCP, consider using project_id and zone to align with standard terminology.

Apply this diff to rename variables and references:

-variable "resource_group_name" {}
-variable "resource_group_location" {}
+variable "project_id" {}
+variable "zone" {}

---

-  name         = "${var.resource_group_name}-${var.node_type}-${count.index}-ip"
-  region       = var.resource_group_location
-  project      = var.resource_group_name
+  name         = "${var.project_id}-${var.node_type}-${count.index}-ip"
+  region       = var.region
+  project      = var.project_id

---

-  name                = "${var.resource_group_name}-${var.node_type}-${count.index}"
-  zone                = var.resource_group_location
-  project             = var.resource_group_name
+  name                = "${var.project_id}-${var.node_type}-${count.index}"
+  zone                = var.zone
+  project             = var.project_id

Ensure to update all references to these variables throughout the configuration.

Also applies to: 31-32, 34-34

Comment on lines +28 to +61
resource "google_compute_instance" "vm" {
count = var.number_of_nodes
name = "${var.resource_group_name}-${var.node_type}-${count.index}"
zone = var.resource_group_location
machine_type = var.node_vm_size
can_ip_forward = true
project = var.resource_group_name
network_interface {
subnetwork = var.subnet_id
access_config {
# Assign a public IP if required
nat_ip = var.node_public_ip ? google_compute_address.public_ip[count.index].address : null
}
}

boot_disk {
initialize_params {
image = "projects/ubuntu-os-cloud/global/images/ubuntu-2204-jammy-v20240419"
size = var.node_disk_size
type = "pd-standard"
}
}

network_interface {
network = "default" # Replace with your network name or variable if you have a custom network
access_config {
nat_ip = var.node_public_ip ? google_compute_address.public_ip[count.index].address : null
}
}

metadata = {
ssh-keys = "ubuntu:${tls_private_key.rsa_key.public_key_openssh}"
}
}

🛠️ Refactor suggestion

Consider encrypting VM disks with Customer Supplied Encryption Keys

For enhanced data security, consider encrypting VM disks using Customer Supplied Encryption Keys (CSEK).

Introduce a variable for the KMS key:

variable "kms_key_self_link" {
  type    = string
  default = null
}

Update the boot_disk configuration:

   boot_disk {
     initialize_params {
       image = var.node_image
       size  = var.node_disk_size
       type  = "pd-standard"
     }
+    disk_encryption_key {
+      kms_key_self_link = var.kms_key_self_link
+    }
   }

Ensure to provide the appropriate KMS key when deploying the infrastructure.

🧰 Tools
🪛 checkov

[HIGH] 28-61: Ensure 'Block Project-wide SSH keys' is enabled for VM instances

(CKV_GCP_32)


[HIGH] 28-61: Ensure VM disks for critical VMs are encrypted with Customer Supplied Encryption Keys (CSEK)

(CKV_GCP_38)


⚠️ Potential issue

Add 'block-project-ssh-keys' to enhance instance security

To prevent the use of project-wide SSH keys and enhance security, add block-project-ssh-keys = "true" to the instance metadata.

Apply this diff to update the metadata:

 metadata = {
   ssh-keys = "ubuntu:${tls_private_key.rsa_key.public_key_openssh}"
+  block-project-ssh-keys = "true"
 }

Comment on lines +39 to +51
resource "google_compute_firewall" "admin_firewall" {
name = "${var.resource_group_name}-admin-fw"
network = google_compute_network.vpc.name
project = var.resource_group_name

allow {
protocol = "tcp"
ports = ["22"] # Allow SSH
}

source_ranges = ["0.0.0.0/0"] # Allow from anywhere
target_tags = ["k8s-admin"] # Tag for admin nodes
}

⚠️ Potential issue

Restrict SSH access in 'admin_firewall' to trusted IP ranges

Allowing SSH access from any IP address (0.0.0.0/0) poses a significant security risk. It's recommended to restrict the source_ranges to specific trusted IP addresses or networks to enhance security.

Consider updating the source_ranges to include only the IP ranges that require SSH access. For example:

 resource "google_compute_firewall" "admin_firewall" {
   name    = "${var.resource_group_name}-admin-fw"
   network = google_compute_network.vpc.name
   project = var.resource_group_name

   allow {
     protocol = "tcp"
     ports    = ["22"]  # Allow SSH
   }

-  source_ranges = ["0.0.0.0/0"]  # Allow from anywhere
+  source_ranges = ["<trusted_ip_range>"]  # Restrict to trusted IPs
   target_tags   = ["k8s-admin"]  # Tag for admin nodes
 }

Committable suggestion was skipped due to low confidence.

🧰 Tools
🪛 checkov

[HIGH] 39-51: Ensure Google compute firewall ingress does not allow unrestricted ssh access

(CKV_GCP_2)
