Skip to content

Service Discovery

Jasper Nalbach edited this page Aug 26, 2016 · 5 revisions

Service Discovery

This article goes into the implementation details of the service discovery mechanism in las2peer. For understanding this section, you should know about these facts:

  • A service is specified by a name and a version.
  • A service is executed by multiple service agents. A service agent runs on one node. A service agent only executes one service (remember: a service is specified by name and version!).

The following challenges have to be addressed:

  • In a p2p network, services start, stop frequently, crash or are under heavy load at any time.
  • It is not practicable to have full knowledge of all services in the network. A node should only keep track of services that it is interested in.
  • Service dependencies: APIs change from version to version. RMI calls should have guarantees about invoked versions.
  • Performance vs Stability/Availability: An RMI call should be executed or rejected in predictable time. On the other hand, if there's a service available, but not detected (due to slow network connection), it should not be ignored.
  • Service agents should be distinctable from outside to allow distrusting a specific service agent or react on reliability losses.

The old mechanism

las2peer had two approaches implemented (in parallel):

  • There was one service agent per service in the network, not allowing any flexibility when looking at mutliple service versions and load balancing. It was possible to start multiple instances of this instance if running nodes had the passphrase for this agent. RMI calls were realized by sending the request via a broadcast to all instances (thus each instance executed the call) and taking the first answer received.

  • The WebConnector needed full knowledge of all running services to build a mapping tree of all methods available. For this, a ServiceInfoAgent was implemented managing a large Envelope containing a list of all services. This did not scale very well and was unreliable (whenever the envelope got corrupted, especially after network restarts). Also, services started at the same time lead to lost updates, some services were not available over the network.

These mechanism were too simple to meet the requirements listed above.

Topics and Agent Messaging

This feature was implemented in order to allow efficient broadcasts to services of the same name. Read more about it here.

ServiceDiscoveryContent message

  • All services listen to the topic of the service name.
  • To discover available service instances in the network, a service discovery message is sent to this topic, containg the requirements the service should fit, currently its the version and a boolean if an exact version match is needed.
  • If a service fits these requirements, it sends a response to the sending node.
  • The sending node collects these answers (using the default mechanism) and chooses a service instance. Currently, these criteria are the service version and the order in which the answers has been received. This may be extended in future, see below for more thoughts on this.

NodeServiceCache

The cache is responsible for keeping track of known services. It has a simple interface getServiceInstance returning a remote or local service instance meeting the requirements given by the parameters (version (exact match or fit) and local only). The cache looks up for known services and sends the service discovery message if needed, collects the results and manages them. Also, the cache has a second list of locally running services. Each service instance has a timeout after which it gets invalidated. Results are cached the improve the performance on multiple service invocations. The timeout should not be too high in order to not loose dynamic properties: We still want to have an estimate of the node load for load balancing etc. Also, if a service was shut down in meantime, we don't want to risk a failed invocation since this may take more time than performing another broadcast. Please note that not the newest fitting service instance will be choosen. The method supports a weaker semantic: Any service instance fitting the requirements can be returned. Although it is tried to return the latest version, this cannot be guaranteed. This leaves room for future extensions of service selection criteria.

Invocations

When invoking a service, the node gets an instance from the cache and performs the invocation locally or remotely. Failed remote invocations are handled as follows:

  • the service instance is removed from the cache
  • the cache is asked for another service
  • after the 3rd failed invocation, the operation fails

WebConnector integration

The challenge is to integrate with the REST principles.

A service alias can be reigstered for a service. This is an Envelope singed by the service agent, having the service alias as key and the service name as key. This way, the service who claims a service alias first gets it, other services cannot use this alias anymore.

The service alias is the first part of the url and can be resolved by the WebConnector to a service name. Then, the WebConnector gets the mapping tree from the service, parses the url and invokes the service. The version can be specified in the URL, too.

URL scheme: http(s)://ip:port/serviceAlias/v1.0/methodName, the version may be left out: http(s)://ip:port/serviceAlias/methodName

A matching version will be chosen. The developer has to make sure that all services matching the specified version have the same REST mapping - so make sure to use semantic versioning!

Drawbacks

There are some minor drawbacks of this approach.

  • We have a weaker semantic for our version selection (which is also an advantage).
  • Also, the caching of known services may reduce the ability to react on events in the network, a tradeof between dynamic and service call performance has to be figured out.

Future Work / Optimization

As mentioned above, there is some space for service selection. Load balancing could be refined and extended to some more parameters. Also web of trust can easily be integrated into this mechanism.