# Benchmarking

## Client Side Metrics

* `response time` (percentiles): the time between a client's initial request and the last byte of a validator's response
* `requests per second (RPS)`: number of requests per second
  * `transactions per second (TPS)`: number of write requests (txns) per second
* `number of clients`: number of concurrent clients the ledger serves
* (optional) `throughput` (in/out): number of KB per second. Marked as optional since txn payloads are relatively small, so in/out data volume is not a major concern.
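
As a rough illustration only, a sketch of how these client-side metrics could be derived from raw samples collected during a run (the sample tuple format and percentile choices are assumptions, not part of any existing tooling):

```python
from statistics import quantiles

def summarize(samples, duration_s):
    """samples: list of (is_write, response_time_ms) tuples collected over duration_s seconds."""
    times = sorted(t for _, t in samples)
    q = quantiles(times, n=100)  # 99 percentile cut points
    rps = len(samples) / duration_s                                    # all requests per second
    tps = sum(1 for is_write, _ in samples if is_write) / duration_s   # write (txn) requests per second
    return {
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
        "rps": rps,
        "tps": tps,
    }
```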

## Server Side Metrics

Starting from `v0.40.0`, the Cosmos SDK provides the [telemetry](https://docs.cosmos.network/master/core/telemetry.html) package as server-side support for exploring application performance and behavior.

The following [metrics](https://docs.cosmos.network/master/core/telemetry.html#supported-metrics) are worth considering:

* `tx_count`: Total number of txs processed via DeliverTx (tx)
* `tx_successful`: Total number of successful txs processed via DeliverTx (tx)
* `tx_failed`: Total number of failed txs processed via DeliverTx (tx)
* `abci_deliver_tx`: Duration of ABCI DeliverTx (ms)
* `abci_commit`: Duration of ABCI Commit (ms)
* `abci_query`: Duration of ABCI Query (ms)
* `abci_begin_block`: Duration of ABCI BeginBlock (ms)
* `abci_end_block`: Duration of ABCI EndBlock (ms)
* `begin_blocker`: Duration of BeginBlock for a given module (ms)
* `end_blocker`: Duration of EndBlock for a given module (ms)
* `store_iavl_get`: Duration of an IAVL `Store#Get` call (ms)
* `store_iavl_set`: Duration of an IAVL `Store#Set` call (ms)
* `store_iavl_has`: Duration of an IAVL `Store#Has` call (ms)
* `store_iavl_delete`: Duration of an IAVL `Store#Delete` call (ms)
* `store_iavl_commit`: Duration of an IAVL `Store#Commit` call (ms)
* `store_iavl_query`: Duration of an IAVL `Store#Query` call (ms)
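
A minimal sketch of how these server-side metrics could be sampled during a benchmark run, assuming telemetry and the API server are enabled in the node's `app.toml` and the `/metrics` endpoint is exposed on the default API port (adjust host/port to the actual setup):

```python
import time
import requests

API = "http://localhost:1317"  # assumed API server address

def sample_metrics(interval_s=10, iterations=6):
    """Collect periodic Prometheus-formatted snapshots of the telemetry sink."""
    snapshots = []
    for _ in range(iterations):
        resp = requests.get(f"{API}/metrics", params={"format": "prometheus"}, timeout=5)
        resp.raise_for_status()
        snapshots.append((time.time(), resp.text))
        time.sleep(interval_s)
    return snapshots
```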

## Environment

**Note**. It is not yet clear what the production setup will look like, in particular:

* number of vendor companies (i.e. number of validators)
* type of external endpoints, the options being the [Cosmos SDK / Tendermint endpoints](https://docs.cosmos.network/master/core/grpc_rest.html)
* type and number of proxies for validator-validator and client-validator connections

Current assumptions:

* multiple companies (vendors) will each manage one or more validators
* while some common requirements and recommendations will be provided, each vendor will deploy its infrastructure independently, with some freedom regarding internal architecture
* there will be a set of external endpoints (for clients) and internal endpoints (for validators to support txn flows)
  * most likely, observer nodes along with REST HTTP servers providing client authentication will sit in front of the client endpoints

## Workloads

### Transaction Types

* write txns:
  * `tx/modelinfo/add-model` (see the payload-generation sketch after this list)
    * with `vid` constant for a particular client
    * with a variable (incremented) `pid`
  * **ToDo** consider other request types (e.g. `update-model`)
* read txns:
  * `query/modelinfo/model`
  * **ToDo** consider other request types (e.g. `all-models`)
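
A minimal sketch of parameterizing the write workload described above: each simulated client keeps its `vid` constant and increments `pid`. The field names are illustrative assumptions; the real `add-model` message schema and the signing/broadcast flow are intentionally left out:

```python
import itertools

def model_payloads(vid, start_pid=1):
    """Yield add-model payloads with a constant vid and an incremented pid."""
    for pid in itertools.count(start_pid):
        # illustrative fields only; the actual message has more attributes
        yield {"vid": vid, "pid": pid, "name": f"Device {vid}-{pid}"}

# usage: each simulated client gets its own vid
gen = model_payloads(vid=4434)
first_three = [next(gen) for _ in range(3)]
```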

### Clients

**ToDo** define which client endpoints are considered in production

Since Cosmos SDK (Tendermint) provides multiple client [endpoints](https://docs.cosmos.network/master/core/grpc_rest.html), it makes sense to benchmark all of them (separately and in combination), in particular (a minimal request sketch follows the list):

* http RPC
* websocket RPC
* http REST
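
A rough sketch of simple read requests against the three endpoint types, assuming the default local ports (RPC 26657, REST 1317); the REST path for model queries is an assumption and has to be aligned with the actual module routes:

```python
import json

import requests
import websocket  # pip install websocket-client

# http RPC (Tendermint)
status = requests.get("http://localhost:26657/status", timeout=5).json()

# websocket RPC (Tendermint)
ws = websocket.create_connection("ws://localhost:26657/websocket", timeout=5)
ws.send(json.dumps({"jsonrpc": "2.0", "method": "status", "params": {}, "id": 1}))
ws_status = json.loads(ws.recv())
ws.close()

# http REST (application API); the path below is a placeholder
models = requests.get("http://localhost:1317/modelinfo/models", timeout=5).json()
```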

### Load Types

* per txn type:
  * write txns only: to measure server-side (consensus related) bottlenecks and limitations
  * read txns only: to measure client-side (setup related) limitations
  * combined loads with the read/write ratio as a parameter
  * **ToDo** define anticipated real-world loads
* per scenario:
  * stepping load: to identify the point where performance degrades significantly (see the `LoadTestShape` sketch after this list)
  * waves: to emulate peaks and troughs in client behavior
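
For the stepping-load scenario, a sketch of how it could be expressed with Locust's `LoadTestShape` (the step size, step duration and user cap are arbitrary placeholders); a waves scenario could be built the same way by returning a user count that follows a sine curve:

```python
from locust import LoadTestShape

class SteppingLoad(LoadTestShape):
    """Increase the number of clients in steps until performance degrades."""
    step_users = 50      # clients added per step (placeholder)
    step_duration = 120  # seconds per step (placeholder)
    max_users = 1000     # placeholder cap

    def tick(self):
        run_time = self.get_run_time()
        step = int(run_time) // self.step_duration + 1
        users = min(self.max_users, step * self.step_users)
        return users, self.step_users  # (user count, spawn rate)
```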

## Load Generation Framework

Since DCledger is based on Cosmos SDK and Tendermint, which provide standard HTTP/WebSocket RPC and REST [endpoints](https://docs.cosmos.network/master/core/grpc_rest.html) for both read and write txns, generic production-ready tools such as [jMeter](https://jmeter.apache.org/), [Locust](https://locust.io/) and [K6](https://k6.io/) may be considered.

[Locust](https://locust.io/) looks like the easiest option to get started with (a minimal user-scenario sketch follows the list):

* tests can be configured using simple Python scripts (version control, CI/CD friendly); in comparison:
  * the JS based configuration of [K6](https://k6.io/) will likely require more effort
  * [jMeter](https://jmeter.apache.org/) configuration is mostly UI-driven rather than code-based
* [distributed testing](https://docs.locust.io/en/stable/running-locust-distributed.html) with results aggregation is supported (if we decide to use it)
* there are some [concerns](https://k6.io/blog/comparing-best-open-source-load-testing-tools/) regarding its performance and accuracy, but the current vision is that it should be acceptable for our case
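
A minimal Locust user sketch for a combined read/write load; the REST path is a placeholder, the read/write ratio is set via task weights, and a real write task would need to sign an `add-model` txn and POST it to a broadcast endpoint, which is out of scope here:

```python
from locust import HttpUser, task, between

class LedgerUser(HttpUser):
    """Simulated ledger client; run with: locust -f locustfile.py --host http://<rest-endpoint>:1317"""
    wait_time = between(1, 3)  # think time between requests, placeholder values

    @task(9)
    def read_model(self):
        # placeholder path; must match the actual modelinfo query route
        self.client.get("/modelinfo/models")

    @task(1)
    def write_model(self):
        # placeholder: signing and broadcasting an add-model txn is not shown
        pass
```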

## Testing Environment Provisioning Automation

General considerations:

* since the target production deployment architecture is not yet defined, automated provisioning of the testing environment would simplify comparison of the options
* a single cloud provider as a starting point
* multiple cloud providers as a very possible production case
* tools: [terraform](https://www.terraform.io/) and [pulumi](https://www.pulumi.com/) are the preferred options (a minimal provisioning sketch follows the list)
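
A minimal Pulumi (Python) sketch for bringing up a single benchmarking VM, assuming AWS as the single starting-point provider; the instance type, AMI id and the actual validator/observer provisioning are placeholders:

```python
import pulumi
import pulumi_aws as aws

node = aws.ec2.Instance(
    "dcl-bench-node",
    instance_type="t3.medium",    # placeholder size
    ami="ami-0123456789abcdef0",  # placeholder AMI id
    tags={"role": "validator", "env": "benchmark"},
)

pulumi.export("node_public_ip", node.public_ip)
```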

## ToDo

* define acceptance criteria (target metrics values)
* define target environment