[DO NOT MERGE] - PoC for orchestrating remote tests from inside the package using terraform #4832

Closed
wants to merge 10 commits into from

@pchila (Member) commented May 31, 2024

What does this PR do?

This PR contains a simple PoC to show what a TestMain for remote tests looks like when provisioning is offloaded to a third-party tool like Terraform.

Note: the code is really rough and nowhere near production quality. It has been tested only on Linux, although it should work on macOS with minimal changes. No Windows was harmed during the writing of this code, but Windows compatibility has not been taken into account either. This PR exists only to explore what is possible without getting distracted by things like support for all OSes.

Goals:

  • Test if it's possible to orchestrate remote tests (both provisioning and running) without using mage or other tools
  • Support and pass along any go test flags the user may specify when launching the tests to be run on a remote machine
  • Limit the usage of environment variables as much as possible
  • Test if Terraform can simplify the test code and abstract away provisioning complexity
  • Check compatibility with define.* constructs already used for elastic-agent integration tests
  • Provision an environment where a go test can be readily executed without setup getting too much in the way of the tests to be run
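The flag pass-through goal can be sketched roughly as follows. This is a minimal illustration and not the PoC's actual code: the function name `remoteTestCommand` and its parameters are hypothetical, and the sketch assumes that forwarding every `-test.*` argument received by the local test binary is enough. The resulting command shape follows the "full command to run on remote host" line in the log output further down.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// remoteTestCommand builds the command line to run on the remote host.
// All names here are hypothetical; values are NOT shell-quoted, which a
// real implementation would need to handle.
func remoteTestCommand(workDir string, env map[string]string, pkg string, localArgs []string) string {
	// Keep only the -test.* flags that `go test` passed to the local test binary.
	var testFlags []string
	for _, a := range localArgs {
		if strings.HasPrefix(a, "-test.") {
			testFlags = append(testFlags, a)
		}
	}

	// Sort the variable names so the generated command is deterministic.
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	fmt.Fprintf(&b, "cd %s &&", workDir)
	for _, k := range keys {
		fmt.Fprintf(&b, " %s=%s", k, env[k])
	}
	fmt.Fprintf(&b, " go test %s -tags integration", pkg)
	for _, f := range testFlags {
		b.WriteString(" " + f)
	}
	// The remote run must not provision again, so it skips provisioning.
	b.WriteString(" -args -integration.skip-provisioning")
	return b.String()
}

func main() {
	cmd := remoteTestCommand(
		"/src/elastic-agent",
		map[string]string{"TEST_DEFINE_PREFIX": "aaaaaa"},
		"github.com/elastic/elastic-agent/testing/newexp",
		os.Args[1:],
	)
	fmt.Println(cmd)
}
```

Custom flags such as `-integration.skip-destroy` are deliberately dropped in this sketch, since they only make sense on the local (orchestrating) side.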

Non-Goals:

  • Implement a drop-in replacement for the current integration framework
  • Support multiple OSes on either the local or the remote side

What has been implemented:

  • Provisioning of a single Ubuntu 22.04 Linux VM using the Elastic Ubuntu image
  • Spin up a small ESS deployment
  • Sync local code with remote host "automagically" using terraform and rsync
  • Collect the information about the provisioning and expose it to a TestMain in an agnostic way
  • Support running local and remote tests with a simple go test command that supports custom command line arguments
  • Add small tests to be run remotely to verify that we can run them and connect to an ESS deployment
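One way to collect provisioning information in a tool-agnostic form is to decode the JSON printed by `terraform show -json` (the command visible in the log output below) into a plain map. The sketch below is an assumption about how this could look, not the PoC's implementation: the function name and the output names `vm_ip` and `es_password` are made up, and it relies on the documented `values.outputs` section of Terraform's JSON state representation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfState mirrors only the part of `terraform show -json` used here.
type tfState struct {
	Values struct {
		Outputs map[string]struct {
			Sensitive bool            `json:"sensitive"`
			Value     json.RawMessage `json:"value"`
		} `json:"outputs"`
	} `json:"values"`
}

// parseStringOutputs returns the string-valued root outputs keyed by name.
func parseStringOutputs(raw []byte) (map[string]string, error) {
	var st tfState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, fmt.Errorf("decoding terraform state: %w", err)
	}
	result := make(map[string]string, len(st.Values.Outputs))
	for name, o := range st.Values.Outputs {
		var s string
		if err := json.Unmarshal(o.Value, &s); err != nil {
			continue // skip outputs that do not decode as a string
		}
		result[name] = s
	}
	return result, nil
}

func main() {
	// Hypothetical sample; real output names depend on the Terraform files.
	sample := []byte(`{"values": {"outputs": {
		"vm_ip":       {"sensitive": false, "type": "string", "value": "203.0.113.10"},
		"es_password": {"sensitive": true,  "type": "string", "value": "example"}
	}}}`)
	outs, err := parseStringOutputs(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("vm_ip:", outs["vm_ip"])
}
```

A TestMain could then read this map without knowing which tool produced it, which is what would make a different provisioner swappable later.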

What is still missing (and would probably be the next steps)

  • A bunch of command line options and terraform variables
  • Properly abstract Provisioner, Runner via interfaces so that we can swap different implementations (the ones needed by the current code are at least TerraformProvisioner and SSHRunner)
  • Implement a pluggable system for various provisioners, each with their own flags (handled within the specific provisioner type), so that we can implement Docker, elastic-package (and other) provisioning
  • Spin up multiple VMs in terraform based on arguments passed as variables (by the user or by the TestMain itself)
  • Implement some smart batching and parallel test execution for the remote tests on different hosts.
  • Debug support by launching tests with dlv on the remote while port forwarding the debug port on the local machine
  • I am sure I am forgetting a lot of other things, but the ones above are what spring to mind...
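The Provisioner/Runner abstraction mentioned in the list above might look something like the following. This is only a sketch under assumptions: the method sets, the `Environment` type, and `runSuite` are hypothetical and do not come from the PR. The fakes stand in for what would be `TerraformProvisioner` and `SSHRunner` in a real implementation.

```go
package main

import (
	"context"
	"fmt"
)

// Environment is a hypothetical, provisioner-agnostic description of
// what has been provisioned.
type Environment struct {
	Host string            // address of the remote test host
	Vars map[string]string // settings to export for the tests
}

// Provisioner creates and destroys the test infrastructure.
type Provisioner interface {
	Provision(ctx context.Context) (Environment, error)
	Destroy(ctx context.Context) error
}

// Runner executes a command in a provisioned environment and returns its output.
type Runner interface {
	Run(ctx context.Context, env Environment, command string) (string, error)
}

// runSuite provisions, runs the command, and destroys unless skipDestroy is set.
func runSuite(ctx context.Context, p Provisioner, r Runner, command string, skipDestroy bool) (out string, err error) {
	env, err := p.Provision(ctx)
	if err != nil {
		return "", fmt.Errorf("provisioning: %w", err)
	}
	if !skipDestroy {
		defer func() {
			// Report a cleanup failure only if the run itself succeeded.
			if derr := p.Destroy(ctx); derr != nil && err == nil {
				err = fmt.Errorf("destroying: %w", derr)
			}
		}()
	}
	return r.Run(ctx, env, command)
}

// fakeProvisioner and echoRunner are in-memory stand-ins for this sketch.
type fakeProvisioner struct{ destroyed bool }

func (f *fakeProvisioner) Provision(context.Context) (Environment, error) {
	return Environment{Host: "203.0.113.10"}, nil
}

func (f *fakeProvisioner) Destroy(context.Context) error {
	f.destroyed = true
	return nil
}

type echoRunner struct{}

func (echoRunner) Run(_ context.Context, env Environment, command string) (string, error) {
	return "ran on " + env.Host + ": " + command, nil
}

func main() {
	p := &fakeProvisioner{}
	out, err := runSuite(context.Background(), p, echoRunner{}, "go test ./...", false)
	fmt.Println(out, err, p.destroyed)
}
```

Keeping the two interfaces separate would let a Docker or elastic-package provisioner reuse the same runner, or the reverse.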

I would also like to extend special thanks to @alexsapran and @pierrehilbert, who helped me get through some of the Terraform stuff.

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

To test this PR, launch the tests of the newexp package like this:

go test -tags integration -test.count 1 -test.v github.com/elastic/elastic-agent/testing/newexp

The output should look like this:

2024/05/31 18:34:17 go test args: [/tmp/go-build1065280305/b001/newexp.test -test.paniconexit0 -test.timeout=10m0s -test.count=1 -test.v=true]
2024/05/31 18:34:21 Installed terraform, exec path: /tmp/terraform_4172190103/terraform
2024/05/31 18:34:21 working dir: /home/paolo/dev/elastic-agent/testing/newexp/terraform
2024/05/31 18:34:21 [INFO] running Terraform command: /tmp/terraform_4172190103/terraform version -json

...
A LOT of Terraform logging
...

2024/05/31 18:37:23 [INFO] running Terraform command: /tmp/terraform_4172190103/terraform show -json -no-color
2024/05/31 18:37:24 Connected via SSH to the machine as buildkite-agent
2024/05/31 18:37:24 full command to run on remote host: "cd /src/elastic-agent && TEST_DEFINE_PREFIX=aaaaaa  ELASTICSEARCH_HOST=https://1ede842a84bc4dc68ce6783de05d9b81.us-west2.gcp.elastic-cloud.com:443 KIBANA_HOST=https://b304c45e6cdc4590ab2582b553c933f3.us-west2.gcp.elastic-cloud.com:9243 ELASTICSEARCH_USERNAME=elastic ELASTICSEARCH_PASSWORD=<redacted> KIBANA_USERNAME=elastic KIBANA_PASSWORD=<redacted> go test  github.com/elastic/elastic-agent/testing/newexp -tags integration -test.paniconexit0 -test.timeout=10m0s -test.count=1 -test.v=true -args -integration.skip-provisioning"
2024/05/31 18:41:05 Test run output:
go: downloading github.com/hashicorp/go-version v1.6.0
...
more go modules download goodness here
...
go: downloading github.com/armon/go-radix v1.0.0
2024/05/31 16:41:04 go test args: [/tmp/go-build3603923429/b001/newexp.test -test.paniconexit0 -test.paniconexit0 -test.timeout=10m0s -test.count=1 -test.v=true -integration.skip-provisioning]
=== RUN   TestDrill
--- PASS: TestDrill (0.00s)
=== RUN   TestDeployment
    foo_test.go:36: ES info:
        {
          "name" : "instance-0000000000",
          "cluster_name" : "1ede842a84bc4dc68ce6783de05d9b81",
          "cluster_uuid" : "JD-igHfMSBqo95qGtj4iJg",
          "version" : {
            "number" : "8.15.0-SNAPSHOT",
            "build_flavor" : "default",
            "build_type" : "docker",
            "build_hash" : "f416688e1eb2e021263421dd01cdbae3fe4e8535",
            "build_date" : "2024-05-31T13:42:28.275922874Z",
            "build_snapshot" : true,
            "lucene_version" : "9.10.0",
            "minimum_wire_compatibility_version" : "7.17.0",
            "minimum_index_compatibility_version" : "7.0.0"
          },
          "tagline" : "You Know, for Search"
        }

    foo_test.go:43: Fleet Agents:
        {
        ... lots of JSON here
        }
--- PASS: TestDeployment (1.29s)
PASS
ok      github.com/elastic/elastic-agent/testing/newexp 1.309s

2024/05/31 18:41:05 [INFO] running Terraform command: /tmp/terraform_4172190103/terraform destroy -no-color -auto-approve -input=false -lock-timeout=0s -var-file=/home/paolo/dev/elastic-agent/testing/newexp/terraform/main.tfvars -lock=true -parallelism=10 -refresh=true
data.external.golist_dump: Reading...
tls_private_key.ssh_key: Refreshing state... [id=b775a606fda55e480ff9d67499485c713fee8d3b]
ec_deployment.integration-testing: Refreshing state... [id=576dd949856fc58571e03682dd10b9f7]
local_sensitive_file.private_ssh_key: Refreshing state... [id=06c9b649bfff4204ad00919125ce752ddbf77316]
local_file.public_ssh_key: Refreshing state... [id=a170215459a7b184a3ba4d183aa5080d163cd56c]
data.external.golist_dump: Read complete after 1s [id=-]
google_compute_instance.ubuntu_2204_instance: Refreshing state... [id=projects/elastic-platform-ingest/zones/europe-west12-a/instances/tf-test-instance]
terraform_data.sync_repo: Refreshing state... [id=ff7d60e2-d81d-48cc-5405-df1a2db21ecd]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  - destroy

...
ANOTHER BUNCH of Terraform logs here
...
ec_deployment.integration-testing: Destruction complete after 1m28s

Destroy complete! Resources: 6 destroyed.
ok      github.com/elastic/elastic-agent/testing/newexp 500.869s

Congratulations: you have just provisioned everything, run the tests, and cleaned everything up! 🎉
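For reference, a check like the TestDeployment step in the output above could be approximated as follows. This is a guess at the general approach and not the contents of foo_test.go: it reads the `ELASTICSEARCH_*` variables that appear in the remote command line and fetches the cluster info from the Elasticsearch root endpoint with basic auth. The type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// esConfig holds the connection settings passed through the environment.
type esConfig struct {
	Host, Username, Password string
}

// configFromEnv reads the settings through getenv so it can be tested
// without touching the real process environment.
func configFromEnv(getenv func(string) string) (esConfig, error) {
	cfg := esConfig{
		Host:     getenv("ELASTICSEARCH_HOST"),
		Username: getenv("ELASTICSEARCH_USERNAME"),
		Password: getenv("ELASTICSEARCH_PASSWORD"),
	}
	if cfg.Host == "" {
		return esConfig{}, errors.New("ELASTICSEARCH_HOST is not set")
	}
	return cfg, nil
}

// clusterInfo returns the raw JSON served by the Elasticsearch root endpoint.
func clusterInfo(client *http.Client, cfg esConfig) (string, error) {
	req, err := http.NewRequest(http.MethodGet, cfg.Host+"/", nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(cfg.Username, cfg.Password)
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s: %s", resp.Status, body)
	}
	return string(body), nil
}

func main() {
	cfg, err := configFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("skipping:", err)
		return
	}
	info, err := clusterInfo(&http.Client{Timeout: 30 * time.Second}, cfg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ES info:", info)
}
```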

If you want to iterate on the code, you can keep the provisioned resources around by adding an extra test parameter (note the use of -args before the custom flags):

 go test -tags integration -test.timeout 30m  -test.count 1 -test.v github.com/elastic/elastic-agent/testing/newexp -args -integration.skip-destroy
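As a rough idea of how the custom flags could drive the flow, here is a minimal sketch. The flag names and help texts match the `--help` output in this PR; everything else (the `options` type, `parseOptions`, `plan`, and the relative default for the Terraform directory) is hypothetical. A real TestMain would more likely register the flags on the default `flag.CommandLine` set and call `flag.Parse()`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the custom -integration.* flags.
type options struct {
	SkipProvisioning bool
	SkipDestroy      bool
	TerraformDir     string
}

// parseOptions parses args with a private FlagSet so the sketch is testable.
func parseOptions(args []string) (*options, error) {
	o := &options{}
	fs := flag.NewFlagSet("newexp.test", flag.ContinueOnError)
	fs.BoolVar(&o.SkipDestroy, "integration.skip-destroy", false,
		"Set this flag to skip destroying resources")
	fs.BoolVar(&o.SkipProvisioning, "integration.skip-provisioning", false,
		"Set this flag to run directly the tests by skipping the provisioning")
	fs.StringVar(&o.TerraformDir, "integration.terraform-dir", "terraform",
		"Directory containing terraform files") // default is an assumption
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return o, nil
}

// plan lists the steps a TestMain would take for the given options.
func plan(o *options) []string {
	if o.SkipProvisioning {
		// Assumed to be the remote-side case: just run the tests in place.
		return []string{"run tests locally"}
	}
	steps := []string{"terraform apply", "run tests on remote host"}
	if !o.SkipDestroy {
		steps = append(steps, "terraform destroy")
	}
	return steps
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	for i, step := range plan(o) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

With `-integration.skip-destroy` the plan stops after the remote test run, leaving the VM and the ESS deployment in place for the next iteration.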

To see the command line help, pass --help after -args. The flags of interest are the -integration.* ones (there are not many yet, and weirdly some default agent CLI flags show up in there too 🤔):

➜  elastic-agent git:(terraform_experiment) go test -tags integration -test.timeout 30m  -test.count 1 -test.v github.com/elastic/elastic-agent/testing/newexp -args --help
Usage of /tmp/go-build888068571/b001/newexp.test:
  -c string
        Configuration file, relative to path.config (default "elastic-agent.yml")
  -config string
        Configuration file, relative to path.config (default "elastic-agent.yml")
  -d value
        Enable certain debug selectors
  -e    Log to stderr and disable syslog/file output
  -environment value
        set environment being ran in
  -integration.skip-destroy
        Set this flag to skip destroying resources
  -integration.skip-provisioning
        Set this flag to run directly the tests by skipping the provisioning
  -integration.terraform-dir string
        Directory containing terraform files (default "/home/paolo/dev/elastic-agent/testing/newexp/terraform")
  -path.config string
        Config path is the directory Agent looks for its config file (default "/tmp/go-build888068571/b001")
  -path.downloads string
        Downloads path contains binaries Agent downloads (default "/tmp/go-build888068571/b001/data/elastic-agent-unknow/downloads")
  -path.home string
        Agent root path (default "/tmp/go-build888068571/b001")
  -path.home.unversioned
        Agent root path is not versioned based on build
  -path.install string
        DEPRECATED, setting this flag has no effect since v8.6.0
  -path.logs string
        Logs path contains Agent log output (default "/tmp/go-build888068571/b001")
  -path.socket string
        Control protocol socket path for the Agent (default "unix:///tmp/go-build888068571/b001/elastic-agent.sock")
  -test.bench regexp
        run only benchmarks matching regexp
 ...
 lots more standard go test flags
 ...
  -test.v
        verbose: print additional output
  -v    Log at INFO level
ok      github.com/elastic/elastic-agent/testing/newexp 0.362s
➜

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila added the Team:Elastic-Agent-Control-Plane and Team:Elastic-Agent labels May 31, 2024
@pchila self-assigned this May 31, 2024
@pchila requested a review from a team as a code owner May 31, 2024 16:32
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Contributor

mergify bot commented May 31, 2024

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@blakerouse
Contributor

Just want to be clear that the integration testing framework already supports different providers, such as the ability to switch to multipass on macOS. Implementing a terraform provisioner and having it output the SSH credentials would be enough to switch to using terraform.

@pchila
Member Author

pchila commented Jun 11, 2024

Closing this, as it was just meant to show what has been done as a PoC.

@pchila pchila closed this Jun 11, 2024
Labels
backport-skip, skip-changelog, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane
4 participants