DCL MainNet Deployment
This is WIP
DCL is in a better position regarding DDoS protection than public Cosmos networks.
DCL is a permissioned network consisting of a fairly limited set of trusted or semi-trusted nodes (validators and observers), and the nodes are not required to be public to anyone in the world (a company may make its nodes accessible to that company's applications only). Public Cosmos networks, by contrast, are permissionless: they may have any number of nodes, and those nodes need to be public to anyone in the world.
Moreover, DCL nodes do not compete to propose blocks; they don't play the "game of stake". There is no tokenomics in the permissioned DCL network (at least so far). So if one node dies or is unavailable for some time, it is not a catastrophe: a Node Admin can fix and repair it. A crashed or unavailable node can be a problem in a staking-based network (Cosmos), however, as the node cannot propose new blocks and may lose its "clients" (delegators) and their tokens.
In other words, DCL is more collaborative, while staking-based networks (like Cosmos) usually consist of competitive entities.
All options assume that the validator node is not public and accepts incoming connections from trusted validators and observers only (see Options for network protection)
- Option 1: Cloud, no HSM
- Option 1A: no Sentry, private keys and secrets at the Validator machine
- Option 1B: with Sentry, private keys and secrets at the Validator machine
- Option 1C: no Sentry, private keys and secrets are not at Validator machine (tmkms, HashiCorp Vault)
- Option 1D: with Sentry, private keys and secrets are not at Validator machine (tmkms, HashiCorp Vault)
- Option 2: Physical machine, HSM, with Sentries
Option 2 (Physical machine, HSM, with Sentries) or Option 1B (Cloud, no HSM, with Sentry, private keys and secrets at the Validator machine).
Why use Sentries:
- Harder to DDoS the real Validator node (if malicious Validators are present)
- Hides the real Validator node's IP, so it is harder to attack the real Validator
- Can support HSM and Validators on physical machines without Internet access (if not from the beginning, HSM support can be added in the future)
- Public Sentries are essentially Observers, so no need for more Observers
- Can potentially auto-scale Sentry nodes (create new Sentries when attack is detected)
- though it's not that simple, see https://kb.certus.one/peers.html#sentry-auto-scaling
Why use separate KMS for Validator Keys:
- Security best practice: do not keep secrets on the Validator machine, so that if the Validator is compromised, the secrets are not exposed
- In particular, helps to prevent double-signing by Validators (see https://kb.certus.one/hsm.html#double-signing)
- Please note, though, that double signing is not as critical for DCL as for permissionless proof-of-stake networks (Cosmos). In DCL, nodes don't have any tokens and don't manage public reputation and clients. So if a node tries to double sign, it will just be slashed (removed from the network). Later on, Node Admins and Trustees can investigate the reason.
Why use HSM for Validator Keys:
- The most secure key management
- Not as critical for DCL as for permissionless proof-of-stake networks (Cosmos); see the previous item.
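To illustrate what double-signing protection in an external signer amounts to, here is a minimal sketch of a "high-water mark" check. It is an illustration only, not the tmkms implementation: the names `SignState`, `DoubleSignError` and `check_and_update` are hypothetical, and the rule (never sign below the last signed height/round/step, and never sign a different payload at the same one) is a simplification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SignState:
    """Last (height, round, step) signed by this key, plus the signed payload."""
    hrs: Tuple[int, int, int] = (0, 0, 0)
    payload: Optional[bytes] = None


class DoubleSignError(Exception):
    """Raised when a signing request conflicts with what was already signed."""


def check_and_update(state: SignState, height: int, round_: int, step: int,
                     payload: bytes) -> None:
    """Reject regressions and conflicting payloads, then advance the mark."""
    request = (height, round_, step)
    # Tuples compare lexicographically: height first, then round, then step.
    if request < state.hrs:
        raise DoubleSignError(f"regression: {request} < {state.hrs}")
    if request == state.hrs and payload != state.payload:
        raise DoubleSignError(f"conflicting payload at {request}")
    state.hrs = request
    state.payload = payload


if __name__ == "__main__":
    state = SignState()
    check_and_update(state, 10, 0, 1, b"vote-a")      # accepted
    try:
        check_and_update(state, 10, 0, 1, b"vote-b")  # same slot, other data
    except DoubleSignError as err:
        print("refused:", err)
```

Keeping this state outside the Validator machine is what lets the signer refuse a conflicting request even if the Validator itself is restarted or compromised.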
https://kb.certus.one/peers.html#private-nodes
- Option 1: no IPSec/VPN, just whitelist/blacklist via firewall rules
- pros:
- Seems enough and quite easy to do since
- We can expect/assume that all IPs are static
- We don't need encryption at IP level, as auth encryption will be done on Tendermint P2P level in any case
- Done in for example link, Sections 6.6 and 6.7
- no additional cost (cloud providers)
- no additional resources (gateways)
- cons:
- there might be some concerns about Tendermint P2P auth encryption (one may not trust Tendermint's implementation)
- Option 2: IPSec site-to-site VPN (Cloud providers)
- pros:
- managed VPN gateway resources (highly available)
- IPSec - old, trusted technology
- IPSec encryption best practices (certificates, their rotation, key exchange IKEv2)
- mature service (access control, configuration automation)
- cons:
- p2p cross-cloud connections will require VPN-to-VPN configuration, which is not straightforward even when the documentation is good (e.g. Google to AWS)
- A single connection routine with additional resources (gateways), even if they are managed for HA
- Plus multiple VPN connections per a cloud (gateway)
- Plus additional firewall rules
- Certificates management
- Additional costs:
- E.g. AWS pricing:
- Site-to-Site VPN connection fee (per hour)
- Data transfer fee (per Gb)
- (optionally) accelerated connection and data transfer fees
- Option 3: WireGuard
- pros:
- Encryption
- easy to install (part of Linux, FreeBSD, Android kernels, Windows ...) and configure, in short:
- each peer
- installs the package
- creates a network interface (`ip link add...`) and assigns an address with a mask (`ip address add...`)
- creates a public/private key pair (`wg genkey` and `wg pubkey`)
- peers share the pubkeys along with endpoint and private (local) IPs assigned to the created interfaces
- each peer:
- for each (other) peer, creates a record (pubkey, private IP, endpoint IP) in the configuration file and applies it using the command `wg setconf`
- brings the network interface up (`ip link set up`)
- configures nodes to send to the private IPs (WireGuard routes the encrypted packets to the corresponding public endpoints)
- fast (e.g. vs OpenVPN, some IPSec comparison)
- Real p2p (full mesh)
- additional protection:
- keys are coupled with IPs and this is acting as a firewall
- some other DDoS mitigation logic
- No additional cost
- not a big codebase (~4000 lines), so a security audit is more feasible than for other protocols
- e.g. A Cryptographic Analysis of the WireGuard Protocol by researchers Benjamin Dowling and Kenneth G. Paterson
- cons:
- Young technology
- but Linus Torvalds accepted it (preferring it to OpenVPN and IPSec), and other well-known developers build on the technology (e.g. the tailscale project)
- known (self-acknowledged) trade-offs (but likely no blockers)
- comparison to IPSec and OpenVPN
- Option 4: P2P VPN
- Mentioned as an option in https://docs.tendermint.com/master/spec/p2p/node.html#validator-node for validators that trust each other (actually our DCL case)
- pros:
- May handle IP changes better (?)
- no additional cost and resources
- cons:
- one more VPN protocol: it had some vulnerabilities in the past, releases are infrequent, and there has likely been no deep security audit
- requires additional SW that we need to trust (e.g. tinc)
- May be trickier to configure, especially in a heterogeneous environment (different cloud providers etc.)
- Additional layer of encryption can be beneficial if there are concerns in Tendermint P2P auth encryption
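The WireGuard steps listed under Option 3 can be sketched as a small generator that renders, for one peer, the configuration file later applied with `wg setconf`. This is a sketch under assumptions: the `Peer` record and `render_wg_config` helper are hypothetical, the keys and addresses are placeholders, and only the `[Interface]`/`[Peer]` keys `PrivateKey`, `ListenPort`, `PublicKey`, `AllowedIPs` and `Endpoint` are used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peer:
    name: str
    public_key: str   # output of `wg pubkey` (placeholder values below)
    private_ip: str   # address assigned to the peer's WireGuard interface
    endpoint: str     # public "host:port" the peer listens on


def render_wg_config(private_key: str, listen_port: int, me: str,
                     peers: List[Peer]) -> str:
    """Render a config for `me` with one [Peer] record per other peer."""
    lines = [
        "[Interface]",
        f"PrivateKey = {private_key}",
        f"ListenPort = {listen_port}",
    ]
    for peer in peers:
        if peer.name == me:
            continue  # a node does not list itself
        lines += [
            "",
            "[Peer]",
            f"PublicKey = {peer.public_key}",
            # Binding the key to a single private IP is what makes the
            # key/IP coupling act like a firewall rule.
            f"AllowedIPs = {peer.private_ip}/32",
            f"Endpoint = {peer.endpoint}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    roster = [
        Peer("val-a", "PUBKEY_A", "10.0.0.1", "198.51.100.1:51820"),
        Peer("val-b", "PUBKEY_B", "10.0.0.2", "198.51.100.2:51820"),
        Peer("val-c", "PUBKEY_C", "10.0.0.3", "198.51.100.3:51820"),
    ]
    print(render_wg_config("PRIVKEY_A", 51820, "val-a", roster))
```

Because every peer gets a record for every other peer, the result is the full mesh mentioned above; adding a Validator means regenerating and re-applying the file on each node.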
- Persistent peers between all Validators (or private Sentries if Validator is behind a Sentry Node)
- This is how our current TestNet is deployed
- May need to maintain and update the list of peers
- One or multiple Seed nodes that all nodes use for discovery. The node can be managed by CSA for example.
- All nodes have to trust and rely on that seed node
- Every Validator starts up its own Seed Node
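For the persistent-peers option, the list that has to be maintained is a comma-separated string of `nodeID@host:port` entries in each node's Tendermint configuration. A minimal sketch of deriving it from a shared roster follows; the `persistent_peers_for` helper, the node IDs and the addresses are hypothetical placeholders.

```python
from typing import Dict


def persistent_peers_for(me: str, roster: Dict[str, str]) -> str:
    """Build the persistent-peers string for node `me`.

    `roster` maps a node ID to its "host:port" p2p address. The node's own
    entry is skipped, and entries are sorted so the output is stable.
    """
    return ",".join(
        f"{node_id}@{address}"
        for node_id, address in sorted(roster.items())
        if node_id != me
    )


if __name__ == "__main__":
    roster = {
        "nodeid-a": "10.0.0.1:26656",
        "nodeid-b": "10.0.0.2:26656",
        "nodeid-c": "10.0.0.3:26656",
    }
    print(persistent_peers_for("nodeid-a", roster))
```

Generating the string from one roster keeps the per-node lists consistent when a Validator joins or leaves, which is the maintenance burden noted above.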
- See https://docs.cosmos.network/master/run-node/keyring.html#available-backends-for-the-keyring
- Ledger Nano is supported (though not tested):
- https://hub.cosmos.network/main/resources/ledger.html#gaia-cli-ledger-nano: replace `gaia` by `dcl` there.
- DDoS Protection
- Private Key and secrets security
- Trusted relationship (can trust query results, no MITM)
- Health and monitoring
- Stability and performance
- High Availability and scalability
- there are numerous types of attacks, e.g. at different OSI layers, long-lived, highly distributed (some references: wiki, cloudflare.com) ...
- cloud providers cope well with layer 3 and 4 attack mitigation and provide services for other, more sophisticated attacks
- e.g. AWS:
- AWS Shield: is active by default (at no additional charge) for all users and mitigates layer 3 and 4 attacks
- for some fee additional protection (AWS Shield Advanced) is provided to help with more sophisticated attacks including layer 7 attacks
- a list of additional techniques will help to make that even better
- Google Cloud:
- Google Cloud Armor: a dedicated service
- firewall rules are a good measure, but only one part, and not sufficient to mitigate all possible (known) types of attacks
- the same goes for the embedded CosmosSDK/Tendermint protection logic: it mostly prevents application-level attacks only
- Only valid txns are broadcast to other nodes
- Read requests are not broadcasted to other nodes
- Tendermint/Cosmos TPS is quite high
- Need to attack a lot of ONs
- Possible to not allow random ONs to be connected to your ON
- Cosmos SDK targets of attacks:
- p2p connections
- client connections (client service)
- Conclusions (mitigation directions):
- (in case of cloud) don't ignore cloud provider anti-DDoS services and consider non-free ones as well
- separate client and p2p channels (so p2p would continue to work even if all client services are down) and protect with different set of techniques
- hide validators from the public as the most important part of the system, so even if all public relays are down (or are being restarted to come back online), validators are still healthy and don't require recovery
- [MUST] Cloud-specific DDoS protection for Sentry and Validation nodes
- [MUST] Sentry Nodes
- Optional if Validators deployed at the Cloud
- Must-have for Validators deployed in a Data Center
- Not must-have for permissioned network (as DCL) unlike permissionless networks (as Cosmos)
- Can give additional protection for Validator nodes.
- Can hide Validator node's IP address.
- Two types of Sentries
- Private: not publicly available (like Validators without Sentries), connected to other Sentries/Validators only
- Public - Observers
- [MUST] Network protection (Trust Link between Validators and/or Private Sentries)
- No Public Validator nodes and Sentries (Validators/Sentry nodes allow incoming connections from other Validators/Sentries nodes only)
- https://kb.certus.one/peers.html#private-nodes
- VPN/IPSec/WireGuard between Sentries/Validators. See https://docs.tendermint.com/master/spec/p2p/node.html#sentry-node
- Firewall rules to blacklist/whitelist validators and observers
- [SHOULD] Stateful firewall and Network (TCP) Load Balancers - move DDoS closer to the providers edge
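The whitelist idea behind the firewall rules can be expressed as a simple membership check. The sketch below uses Python's standard `ipaddress` module; the `is_allowed` helper and the addresses are hypothetical, and a real deployment would enforce this in the firewall or cloud security group rather than in application code.

```python
import ipaddress
from typing import Iterable


def is_allowed(source_ip: str, allowlist: Iterable[str]) -> bool:
    """Return True if `source_ip` falls inside any allowlisted IP or CIDR."""
    addr = ipaddress.ip_address(source_ip)
    return any(
        # strict=False accepts entries such as "198.51.100.7/24" as well.
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in allowlist
    )


if __name__ == "__main__":
    # Static IPs of trusted Validators and Sentries (placeholder values).
    trusted = ["203.0.113.10", "198.51.100.0/24"]
    for candidate in ("203.0.113.10", "198.51.100.42", "192.0.2.1"):
        print(candidate, is_allowed(candidate, trusted))
```

This relies on the assumption stated earlier that all peer IPs are static; with dynamic addresses the allowlist would have to be refreshed whenever a peer moves.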
- [MUST] Proper keyring backend for user account keys
- [SHOULD] Do not hold Validator private keys at the Validator Machine
- [SHOULD] HSM for Validator Keys
- YubiHSM2 for example
- It's possible to use software one, but not recommended for production
- AWS CloudHSM is not an option as it doesn't support ed25519 (which is default for Cosmos apps)
- [SHOULD] HashiCorp Vault for secrets
- [MUST] gRPC/REST over HTTPS (not HTTP)
- [MUST] Tendermint RPC over HTTPS (not HTTP)
- [MUST] Clients connect to trusted Observer nodes only. If there is no trusted Observer to connect to, clients should use Tendermint RPC queries and verify proofs via light client
- There is support for Light Client Proxy Node, so that clients can run a Proxy node, send all RPC queries to that Proxy, and the Proxy will verify the proofs automatically.
- TLS 1.3 is supported by Tendermint RPC and CLI client (it uses the endpoint)
- Note: there is no way to configure the CLI to work with a self-signed certificate (e.g. for testing purposes)
- TLS is not supported by gRPC/REST endpoints (see details)
Thus, we don't have full TLS support and need an HTTPS-to-HTTP proxy here. Options:
- Option 1:
- reverse proxy (e.g. nginx) in-front of the DCL edge node (either sentry/observer or validator)
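On the client side, the HTTPS-only requirement can be checked before any request is sent. The following sketch uses only the Python standard library; `require_https` and `tls13_context` are hypothetical helper names, and the endpoint URL is a placeholder.

```python
import ssl
from urllib.parse import urlparse


def require_https(endpoint: str) -> str:
    """Return the endpoint unchanged, or raise if it is not an HTTPS URL."""
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"refusing non-HTTPS endpoint: {endpoint}")
    return endpoint


def tls13_context() -> ssl.SSLContext:
    """Default client context (certificate and hostname checks on), TLS 1.3+."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


if __name__ == "__main__":
    url = require_https("https://observer.example:26657")
    context = tls13_context()
    print(url, context.minimum_version)
```

The context can be passed to `urllib.request.urlopen(url, context=context)` or a similar client, so a plain-HTTP or downgraded connection fails early instead of silently succeeding.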
- [SHOULD] Monitor performance: prometheus
- [SHOULD] Monitor logs: ELK stack
- what
- server metrics:
- CPU usage
- memory usage
- disk utilization
- network performance
- IO performance
- application metrics:
- metrics endpoints exposed by cosmos-sdk
- blocks:
- good ones / missed / missed rate (%)
- time perspective: all time / 1w / 1d / 1h
- current block height:
- sentries
- validators
- validator / KMS performance
- sync latency
- sign latency
- signatures per minute
- number of peers
- ??? RPC endpoints (to provide supplemental data regarding the networks)
- how
- tools:
- external monitor (e.g. cloud provided) (mostly for server metrics)
- Option 1:
- Prometheus, Grafana and Influxdb
- notes:
- HA setup should be considered
- monitor (Prometheus) should be monitored itself (e.g. by cloud service)
- alerts for:
- downtime (e.g. more than 1 min)
- ...
- TODO
- explore more metrics to consider
- explore and define metrics thresholds
- work on tools options
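The "good ones / missed / missed rate (%)" block metric above reduces to simple arithmetic over a window of blocks. A minimal sketch follows, assuming the monitor already knows, per block, whether the validator signed it; `missed_rate`, `should_alert` and the 5% threshold are hypothetical, since thresholds are still to be defined.

```python
from typing import Sequence


def missed_rate(signed: Sequence[bool]) -> float:
    """Percentage of blocks in the window that the validator did not sign."""
    if not signed:
        return 0.0  # empty window: nothing missed yet
    missed = sum(1 for was_signed in signed if not was_signed)
    return 100.0 * missed / len(signed)


def should_alert(signed: Sequence[bool], threshold_pct: float) -> bool:
    """True when the missed rate in the window exceeds the threshold."""
    return missed_rate(signed) > threshold_pct


if __name__ == "__main__":
    # One flag per block in the window; the same function applies to the
    # all-time, 1w, 1d and 1h windows, only the slice differs.
    last_blocks = [True, True, False, True]
    print(missed_rate(last_blocks))
    print(should_alert(last_blocks, threshold_pct=5.0))  # hypothetical threshold
```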
- what
- application logs
- system logs
- authentication logs
- how:
- tools:
- Option 1:
- Option 2:
- fluentd (instead of Logstash) with ELK
- some comparison links: by logz.io, by openlogic.com
- notes:
- logs are pushed to a high-availability queuing service
- log processor (Fluentd or Logstash) pops, parses, tokenizes, indexes and stores in a search engine (e.g. Elasticsearch)
- alerts may be triggered on specific phrases (e.g. `CONSENSUS FAILURE` or `Disk Full`)
- debugging facilities
- TODO
- explore and prepare a map: `phrase-event`
- list of events for alerts
- list of events for debugging
- work on tools options
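As a starting point for the phrase-to-event map, the matching step itself can be sketched in a few lines. Only the two phrases mentioned above are included; the event names, the `classify` helper and the sample log lines are hypothetical, and in the pipeline described here the log processor would apply this logic, not a standalone script.

```python
from typing import Dict, List, Tuple

# phrase -> event category; to be extended once the full map is defined.
PHRASE_EVENTS: Dict[str, str] = {
    "CONSENSUS FAILURE": "alert",
    "Disk Full": "alert",
}


def classify(lines: List[str],
             phrase_events: Dict[str, str] = PHRASE_EVENTS
             ) -> List[Tuple[str, str, str]]:
    """Return (event, phrase, line) for every log line containing a phrase."""
    hits = []
    for line in lines:
        lowered = line.lower()
        for phrase, event in phrase_events.items():
            if phrase.lower() in lowered:  # case-insensitive substring match
                hits.append((event, phrase, line))
    return hits


if __name__ == "__main__":
    sample = [
        "I[2021-01-01] Executed block height=42",
        "E[2021-01-01] CONSENSUS FAILURE!!! err=...",
        "kernel: write failed, disk full",
    ]
    for event, phrase, line in classify(sample):
        print(event, "|", phrase, "|", line)
```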
- [MUST] Recommended config
- disable PEX for private nodes
- adjust timeouts
- [SHOULD] State-Sync for new Nodes
- [SHOULD] Seed Nodes for peer discovery??
- [SHOULD] Multiple Observers (Sentries)
- [SHOULD] Load Balancers for Observers (Public Sentries)
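The "disable PEX for private nodes" recommendation maps to a few keys in the `[p2p]` section of Tendermint's `config.toml`. The sketch below renders that fragment for a private node (Validator or private Sentry) and for a public Sentry. It is an assumption-laden illustration: `render_p2p_section` is a hypothetical helper, the peer strings are placeholders, and only the `pex`, `persistent_peers` and `private_peer_ids` keys are shown.

```python
from typing import List


def render_p2p_section(private: bool, persistent_peers: List[str],
                       private_peer_ids: List[str]) -> str:
    """Render a minimal [p2p] fragment.

    Private nodes get pex = false, so they neither gossip nor discover
    peers. Public Sentries keep pex enabled and list the IDs of the nodes
    they front in private_peer_ids so those are not advertised to others.
    """
    pex = "false" if private else "true"
    peers = ",".join(persistent_peers)
    hidden = ",".join(private_peer_ids)
    return "\n".join([
        "[p2p]",
        f"pex = {pex}",
        f'persistent_peers = "{peers}"',
        f'private_peer_ids = "{hidden}"',
    ]) + "\n"


if __name__ == "__main__":
    # Validator: talks only to its Sentries, no peer exchange.
    print(render_p2p_section(True, ["sentryid-1@10.0.0.11:26656"], []))
    # Public Sentry: keeps PEX, hides the Validator it protects.
    print(render_p2p_section(False, ["validatorid@10.0.0.1:26656"],
                             ["validatorid"]))
```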
- https://docs.tendermint.com/master/nodes/
- https://docs.google.com/document/d/e/2PACX-1vQXb1kd0zqYT8K4B4XYb-lrlfRIuPDXsgiTjj94gDOjw3ezEUAtjvxR8yfbKJypmioKeGRrhkLCtZog/pub
- https://kb.certus.one/
- https://medium.com/@kidinamoto/tech-choices-for-cosmos-validators-27c7242061ea
- https://medium.com/@kidinamoto/key-management-choices-for-cosmos-validators-29b910af23c0
- https://medium.com/@kidinamoto/setup-cosmos-validator-relay-network-6b6e63661100