Allocating resources for distributed systems is complicated, hard to get right, and rarely secure. Distributed systems rely on a centralized service to allocate requests and keep track of resource utilization. Typically this involves setting up pools of hosts with specific allocations of storage and compute. Multiple resource pools are required to isolate services, to guarantee service levels, and to ensure one service doesn't take up all the resources. Setting up security is another challenge, because it requires configuration of both the clients and the services. Add in the need to configure request routing, logging, and fault domains, and the configuration the central services must manage grows out of control.

Taking a Page from Blockchains

There should be a better way to do all these things. Why don't we take a page from blockchains to provide more autonomous and secure computing platforms? If you want to use a blockchain, you drop a work request into a pool. The work request is picked up by a miner, and the result is a block added to the blockchain. Encryption at the edge means the client is responsible for managing security. Since encryption is at the edge, each client has its own unique set of encryption keys, which creates a diversity of encrypted material and leaves no single centralized account to hack into.
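
To make that pattern concrete, here is a minimal sketch: each client keeps its own key pair at the edge and signs work requests before dropping them into a shared pool, and a worker verifies the signature before doing anything. It assumes the third-party cryptography package, and the pool and field names are made up for the example.

    # Minimal sketch: each issuer signs work requests with its own key pair
    # before dropping them into a shared request pool (names are illustrative).
    import json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    request_pool = []  # stand-in for a shared, network-visible pool

    # Every client generates and keeps its own key pair at the edge.
    issuer_key = Ed25519PrivateKey.generate()
    issuer_pub = issuer_key.public_key()

    def drop_request(pool, body: dict) -> None:
        """Sign the request with the issuer's private key and publish it."""
        payload = json.dumps(body, sort_keys=True).encode()
        pool.append({"payload": payload, "signature": issuer_key.sign(payload)})

    def pick_up(pool, trusted_issuer_pub) -> dict:
        """A worker takes a request and verifies it came from a trusted issuer."""
        entry = pool.pop(0)
        trusted_issuer_pub.verify(entry["signature"], entry["payload"])  # raises if forged
        return json.loads(entry["payload"])

    drop_request(request_pool, {"action": "render_report", "cpu_secs": 5})
    print(pick_up(request_pool, issuer_pub))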

This is a better setup, as there isn't a centralized service that runs everything. The system is designed from the ground up to work through consensus, with the edge nodes taking over responsibility to correctly integrate and join up with the majority. In addition, blockchains provide incentives to align people and resources with the goals of the system. Examples of incentives are transaction fees, gas charges, and rewards for mining a block.

The idea is to move functionality to the edges of the network, and require more autonomous nodes to collaborate and communicate amongst themselves. In addition, encryption keys and encryption endpoints should be pushed to the edges where possible. End-to-end security from the client would allow multiple clients to share resources without fear of being compromised. This is analogous to blockchains, where multiple wallets can be on the same block, each wallet encrypted with its own private key.

Autonomous Compute Important Stuff

Based on what we have described, let's sketch out an autonomous compute network. Below, worker means a compute resource performing the requested actions, and issuer means the client requesting the work. First, let's identify the very high level important stuff.

  • We ask for workers through a request pool.
  • We need a chain of trust to verify both the worker and the issuer. Workers need to see the inner details of our requested actions in order to do the work. So we need to send work only to the workers we trust. Random issuers can send malicious instructions, so workers only accept trusted issuers.
  • Move to negotiation between issuers and workers, and get rid of pre-exchanged setup (at least as much as we can).
  • Our system will need a way to effectively allocate resources among a diverse set of workloads. This would require an aggregate view of resources available, resources used, and resources requested, and leveraging that information to grow and shrink resource allocations based on the needs of the organization. Simply said: do the high priority stuff and resource-starve the low priority stuff (a toy allocator is sketched just after this list).
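
Here is that toy allocator: hand capacity to requests in priority order and starve whatever is left when capacity runs out. The field names, units, and numbers are assumptions made up for the sketch.

    # A toy allocator over an aggregate view of capacity: grant resources in
    # priority order (0 = highest) and starve what doesn't fit.
    def allocate(requests, total_cpu_secs):
        granted, starved = [], []
        for req in sorted(requests, key=lambda r: r["priority"]):
            if req["cpu_secs"] <= total_cpu_secs:
                total_cpu_secs -= req["cpu_secs"]
                granted.append(req["id"])
            else:
                starved.append(req["id"])
        return granted, starved

    requests = [
        {"id": "billing", "priority": 0, "cpu_secs": 60},
        {"id": "batch-report", "priority": 2, "cpu_secs": 90},
        {"id": "search-index", "priority": 1, "cpu_secs": 40},
    ]
    print(allocate(requests, total_cpu_secs=100))
    # -> (['billing', 'search-index'], ['batch-report'])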

As a picture, that leads to the six steps below. This is the happy path.

Whiteboard high level autonomous compute

Some notes on this design. The Request Pool holds requests for resources. For example, a request may say it needs 5 secs of CPU time, 5 KB of memory, and a 1 GB network connection. Perhaps the type of work or execution environment requirements (e.g. Java 8 with PostgreSQL) need to be provided as well. In the interest of simplicity I'm skipping over that for now.
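
To make that concrete, here is one possible shape for a pool entry. The field names and units are assumptions for the sake of illustration, not a defined schema.

    # Illustrative shape of a request-pool entry; fields and units are made up.
    from dataclasses import dataclass, field

    @dataclass
    class WorkRequest:
        issuer_id: str        # who is asking (identity verified via the Registry)
        cpu_secs: float       # e.g. 5 seconds of CPU time
        memory_kb: int        # e.g. 5 KB of memory
        network: str          # e.g. a "1GB" connection
        runtime: list = field(default_factory=list)  # e.g. ["Java8", "PostgreSQL"]

    req = WorkRequest(issuer_id="issuer-42", cpu_secs=5, memory_kb=5,
                      network="1GB", runtime=["Java8", "PostgreSQL"])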

The Issuer puts requests into the request pool. The Worker finds appropriate requests in the pool. Matching an issuer and a worker requires that they exchange their public keys to facilitate end-to-end encryption.
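
A minimal sketch of that exchange, again assuming the cryptography package: each side shares only its public key, both derive the same session key, and the work payload travels encrypted end to end. Key choices (X25519, HKDF, AES-GCM) are my assumptions for the sketch, not part of the design.

    # Issuer/worker key exchange: only public keys cross the wire; both sides
    # derive the same symmetric key and use it to protect the work payload.
    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.hkdf import HKDF

    issuer_key, worker_key = X25519PrivateKey.generate(), X25519PrivateKey.generate()

    def session_key(my_key, peer_public_key) -> bytes:
        shared = my_key.exchange(peer_public_key)
        return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                    info=b"issuer-worker-session").derive(shared)

    # Both sides end up with the same key after exchanging only public material.
    k_issuer = session_key(issuer_key, worker_key.public_key())
    k_worker = session_key(worker_key, issuer_key.public_key())
    assert k_issuer == k_worker

    nonce = os.urandom(12)
    ciphertext = AESGCM(k_issuer).encrypt(nonce, b"run job 1234", None)
    print(AESGCM(k_worker).decrypt(nonce, ciphertext, None))  # b'run job 1234'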

The Registry holds verified identities of the issuer and the worker. It does this by having the public key of the issuer or worker signed by a trusted authority. The Registry is needed to better manage the explosion of public/private key pairs, and to provide a facility for distributed key management (DKM). As an example, DKM would include things like switching to a new key pair, recovering a lost key, and requiring multiple encryption keys (n-of-m keys).
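
Sketched below is the chain-of-trust check the Registry enables: a trusted authority signs a member's public key, and anyone holding the authority's public key can verify the entry. The entry format is an assumption; a real registry would carry much more metadata.

    # A registry entry is a member's raw public key plus the authority's signature.
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    authority_key = Ed25519PrivateKey.generate()

    def register(member_public_key) -> dict:
        raw = member_public_key.public_bytes(
            encoding=serialization.Encoding.Raw,
            format=serialization.PublicFormat.Raw)
        return {"public_key": raw, "attestation": authority_key.sign(raw)}

    def verify_entry(entry: dict, authority_public_key) -> None:
        """Raises InvalidSignature if the entry was not signed by the authority."""
        authority_public_key.verify(entry["attestation"], entry["public_key"])

    worker_key = Ed25519PrivateKey.generate()
    entry = register(worker_key.public_key())
    verify_entry(entry, authority_key.public_key())  # passes; a tampered key would raise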

You'll notice there is no centralized data store to provide an aggregate view of resources used or resources available. Step 6 shows that workers will need to talk among themselves to understand the fuller picture of resource use and resource availability. Broadcast or gossip communication isn't very efficient, so aggregate nodes that hold an aggregate view of a set of workers may be used. This is a performance optimization that can be left till later; the use of aggregate nodes is not intended to change the logical design.
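
As a rough sketch of what workers would exchange, here is a toy gossip merge that keeps the freshest snapshot per node, so every participant converges on the same view without a central store. The field names and timestamps are made up for the example.

    # Toy gossip round: merge two views, keeping the freshest snapshot per node.
    def merge(view_a: dict, view_b: dict) -> dict:
        merged = dict(view_a)
        for node, snapshot in view_b.items():
            if node not in merged or snapshot["ts"] > merged[node]["ts"]:
                merged[node] = snapshot
        return merged

    w1 = {"worker-1": {"ts": 10, "cpu_free": 2.0}}
    w2 = {"worker-2": {"ts": 12, "cpu_free": 0.5},
          "worker-1": {"ts": 8, "cpu_free": 3.0}}
    print(merge(w1, w2))  # worker-1 keeps its ts=10 snapshot; worker-2 is learned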

Challenges

What are the challenges with autonomous computing, and if it is better, why isn't anyone doing it now? First, I would say teams are too busy trying to bring services to market, and they don't have time to innovate away from the existing cloud computing solutions. It is important to note that there are some open source projects, like Apache Brooklyn, that are working on autonomic computing.

Here are the challenges:

Pre-provisioned Certs. HTTPS security relies on TLS, which authenticates with certificates that are provisioned ahead of time. This means all the nodes need to be identified in advance and pre-populated with certs. As an alternative, nodes holding public/private key pairs can exchange public keys, or use privately-signed and publicly-verified messages, to validate each other in a lazy (just-in-time) manner. Therefore, if you want nodes to freely join and leave the pool of workers, you need to skip HTTPS/TLS, or extend it, and roll your own public/private key encrypted communications.
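
Here is a sketch of that lazy, just-in-time validation: a worker pulls and caches an issuer's public key from the Registry on first contact, then verifies each signed message. It assumes the cryptography package, and the registry dict stands in for a real Registry lookup.

    # Lazy validation instead of pre-provisioned certs: look up keys on first contact.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    issuer_key = Ed25519PrivateKey.generate()
    registry = {"issuer-42": issuer_key.public_key()}   # authority-verified entries
    known_issuers = {}                                   # worker-side cache

    def accept(message: dict) -> bool:
        issuer_id = message["issuer_id"]
        if issuer_id not in known_issuers:               # first contact: fetch on demand
            known_issuers[issuer_id] = registry[issuer_id]
        try:
            known_issuers[issuer_id].verify(message["signature"], message["payload"])
            return True
        except InvalidSignature:
            return False

    payload = b"run job 1234"
    print(accept({"issuer_id": "issuer-42", "payload": payload,
                  "signature": issuer_key.sign(payload)}))  # True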

Slow Response Time. Putting requests into a queue and ramping up to meet demand is slow. If you want very fast response times in the millisecond range, you need to build a specific, customized solution. Imagine starting a site to report results for a presidential election. As traffic ramps up, new work requests to process or render results would be put into a queue, and as the volume of requests grew, more and more compute nodes would be recruited. This would take time, and resources would always lag behind the volume of requests, so capacity would likely be insufficient and election updates would be slow. It is possible to build smarter systems that ramp up in response to predictable scaling patterns, but that takes work to build, and predictive capacity estimation cannot be applied in all cases (it is hard to make it work as a generalized solution).
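
A toy simulation of that lag: capacity is only added after the backlog builds, so it trails a sudden traffic spike. The numbers are arbitrary, chosen to show the lag rather than to model a real system.

    # Reactive scaling: nodes are recruited only after the queue is already deep.
    def simulate(arrivals, per_node_rate=10, start_nodes=1):
        nodes, backlog = start_nodes, 0
        for tick, arriving in enumerate(arrivals):
            backlog += arriving
            backlog = max(0, backlog - nodes * per_node_rate)  # work done this tick
            if backlog > 50:                                   # react only after a backlog builds
                nodes += 1                                     # recruiting takes a tick
            print(f"tick={tick} arriving={arriving} nodes={nodes} backlog={backlog}")

    simulate([10, 10, 200, 200, 200, 50, 10])  # backlog keeps growing behind the spike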

Standardizing Control Plane Work. Adding new nodes would require lots of work, including access to storage, loading software, updating load balancers, possible CDN/DNS updates, container setup, and more. This is only practical if all of these services are standardized. Applications could then leverage conventions and standards to build out a whole application stack.
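
As an illustration of what such a convention could look like, here is a hypothetical "join the fleet" spec where every control-plane step is a standard field rather than bespoke work. Every field name and value is made up.

    # Hypothetical standardized node spec; a convention, not a real API.
    node_spec = {
        "storage":   {"mount": "/data", "size_gb": 100},
        "software":  {"image": "registry.example/worker:1.4"},
        "lb":        {"pool": "web-workers", "health_check": "/healthz"},
        "dns":       {"record": "workers.example.com"},
        "container": {"cpu": 2, "memory_mb": 4096},
    }

    def bootstrap(spec: dict) -> None:
        # In a standardized world each step is the same call for every application.
        for step, config in spec.items():
            print(f"provision {step}: {config}")

    bootstrap(node_spec)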

Encryption Key Management. Clients need helpful services and applications to aid them in managing their public/private key pairs. This requires services like key generation, key rotation, and key recovery, in addition to integration with existing key custody and key storage solutions.
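
A sketch of the client-side shape of such a helper: generate a key pair, rotate it, and keep retired keys around for recovery. A real service would add custody, escrow, and n-of-m policies; this only shows the outline, and the class name is my own.

    # Minimal client-side key ring: generate, rotate, and retain old keys.
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    class KeyRing:
        def __init__(self):
            self.current = Ed25519PrivateKey.generate()
            self.retired = []                      # kept for recovery / old signatures

        def rotate(self):
            self.retired.append(self.current)
            self.current = Ed25519PrivateKey.generate()
            return self.current.public_key()       # publish the new public key

    ring = KeyRing()
    new_public_key = ring.rotate()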

Incentives. Balancing a system means incentivizing the right behavior. This means having feedback loops to ensure clients issue legitimate work requests and do their best to minimize wasted work requests. Incentives often take the form of compensation for resources used or for complexity. The more resources and the greater the complexity, the higher the compensation. An incentive system will have unintended results, and it will take time to put the right incentives in place to keep the system in balance.
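
As a toy illustration of that feedback loop, a fee that grows with resources consumed and with request complexity nudges issuers to keep requests small and simple. The weights are arbitrary placeholders, not a proposed pricing model.

    # Toy fee function: compensation scales with resources used and complexity.
    def fee(cpu_secs: float, memory_kb: int, complexity: int) -> float:
        return 0.01 * cpu_secs + 0.001 * memory_kb + 0.05 * complexity

    print(fee(cpu_secs=5, memory_kb=5, complexity=2))  # 0.05 + 0.005 + 0.1 = 0.155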