
A Critique of the Remote Procedure Call Paradigm - 30 years later*

* Almost 30 years since the paper was published in 1988.

I recently read a paper written by Andrew S. Tanenbaum and Robbert van Renesse where they discuss the problems of Remote Procedure Calls (RPC). The paper is titled “A Critique of the Remote Procedure Call Paradigm” and it was published in 1988. After reading it I thought it would be interesting to see what has changed regarding RPC since the paper's publication. The result is this post.

Please be aware that this is not intended to defend RPC as our Savior. It clearly has its problems but as you will see below some projects are trying to overcome them. At the end of the day, use your own judgment to choose the best solution for your problem.


Introduction

RPC is a communication mechanism used to call subroutines in a different address space. It is typically used to communicate over a network, but the programmer doesn't need to write the code that deals with the communication aspects[1] of it. The programmer invokes the remote subroutine the same way a local subroutine would be invoked and the RPC framework deals with the rest.

Looking like a local call might give the programmer the illusion that no new problems are being introduced, but one thing you must know is that once you go over the wire nothing is fine. This is one of the main criticisms of RPC: the idea that the difference between remote and local calls can be made transparent. It is interesting that the authors of one of the first RPC implementations were aware of these problems and mentioned them in their “Implementing Remote Procedure Calls” paper:

The existing communication mechanisms appeared to be a major factor constraining further development of distributed computing. Our hope is that by providing communication with almost as much ease as local procedure calls, people will be encouraged to build and experiment with distributed applications. RPC will, we hope, remove unnecessary difficulties, leaving only the fundamental difficulties of building distributed systems: timing, independent failure of components, and the coexistence of independent execution environments.

History shows that people focused too much on this false promise of transparency and forgot that they had new problems to handle.

Anyway, let's get to the problems that Tanenbaum and van Renesse mentioned.

Problems with RPC

Who is the Server and Who is the Client?

The first problem mentioned in the paper is that RPC is not appropriate for all computations. The authors question the role of each party in an RPC environment and give the following example:

sort <infile | uniq | wc -l >outfile

The pipeline above sorts infile, a file with a single word per line, removes the duplicated lines and writes the number of remaining lines (the count of distinct words) to outfile.

If the pipeline were executed via RPC who would be the clients and who would be the servers?

uniq could have a client interface that requests data from sort and a server interface that provides data to the next part, wc in this case. But what about sort and wc? sort could have a client interface that requests data from the file server for infile and a server interface that responds with data for requests from uniq. And wc could have a client interface that requests data from uniq, but what does it do with its output? How does it pass the output over to the file server so it gets written to outfile?

You can't simply turn the file server into a file client[2] which then starts requesting data from wc. That is a very odd thing to do.

There is also the possibility of having two clients for sort, one requesting data from the file server and another sending data to the uniq server. In this case the uniq client sends data to the wc server and the wc client sends data to the file server. The problem the authors see with this approach is that now there is an asymmetric situation where the first part has two clients and the next parts have one client and one server. For them this is a clear indication that the RPC model does not fit.

Observation

I don't know how RPC was marketed at the time (or even today). Selling it as something that fits any kind of problem is irresponsible and believing in this promise is silly. I prefer to believe that people today are not doing it anymore[3].

Unexpected Messages

The second problem pointed out by the authors arises when one party wants to send some important information that the other party is not expecting.

RPC works in a client-server manner where once the server receives a request from the client it starts working on it and only returns when the response is ready. In this way if the client has some important message to send, let's say a cancellation message, the server won't be able to react to that message and cancel whatever it is doing. The authors question this shortcoming.

One of their ideas for an alternative model is a full-duplex virtual circuit where non-blocking SEND and RECEIVE primitives are used with the possibility of signaling interrupts. In this alternative model the problem above could be solved by sending a high priority message telling the server to cancel.

Another example they give is where you have a central file server that receives requests from several clients and when a client updates a file the file server must tell the clients that have cached versions of that file to invalidate their caches. In this scenario the clients are solely clients and are not expecting a message that would require them to act as a server.

Observation

gRPC is an example of an RPC framework that allows canceling a request[4]. Its Go library uses the context package, which allows signaling a cancellation, for example. It seems that the same thing is possible in Java.
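To make the idea of canceling an in-flight request more concrete, here is a minimal sketch in Python. It does not use gRPC at all; handle_request and call_with_cancellation are made-up names, and a threading.Event stands in for the out-of-band signal that the client sends while the server is still working.

```python
import threading

def handle_request(cancel, started, result):
    """Hypothetical server-side handler that checks for cancellation between work units."""
    started.set()
    steps = 0
    # cancel.wait() returns True as soon as the flag is set, False on timeout,
    # so each iteration is one small unit of work followed by a cancellation check.
    while steps < 1000 and not cancel.wait(timeout=0.01):
        steps += 1
    result["status"] = "cancelled" if cancel.is_set() else "completed"

def call_with_cancellation():
    """Hypothetical client: starts the request, then cancels it while it is running."""
    cancel, started, result = threading.Event(), threading.Event(), {}
    worker = threading.Thread(target=handle_request, args=(cancel, started, result))
    worker.start()
    started.wait()   # make sure the handler is already working on the request
    cancel.set()     # the "unexpected message": please stop
    worker.join()
    return result["status"]

status = call_with_cancellation()
print(status)  # prints: cancelled
```

The important detail is that the handler has to cooperate: it only notices the cancellation because it checks the flag between units of work.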

What about other languages? Do they have the same features?

I tried to find the same thing for gRPC's Ruby library but couldn't find anything. This clearly constitutes another problem for RPC: features might be available for one language but not for another. As you will see later in this post the authors pointed out a similar problem regarding different machines.

Single Threaded Servers

The next problem pointed out by the authors is that RPC "forces" the choice of a multi-threaded server instead of a single-threaded one. This happens because the RPC model doesn't allow the server to return without serving a response to the client. If the data requested by the client is not immediately available the server has to wait and can't start serving new requests. The most obvious choice then is to make the server multi-threaded.

The authors explain that they are not against multi-threading. The problem they see is that the RPC model forces a big design decision when that kind of decision should be left for the programmers to make.

Observation

Nowadays it is hard, although not impossible, to imagine single-threaded servers being built. And this is reflected by the RPC frameworks out there. Finagle is built around services, which are asynchronous functions, and gRPC offers both synchronous and asynchronous function calls.

This doesn't exempt the model from forcing the design decision. The critique is still valid.

The Two Army Problem

This is also known as the “Two Generals' Problem” and it states that it is impossible for two processes to agree on a decision over an unreliable network[5].

The authors point out that neither side can know for sure that the RPC is over. In their virtual circuit model the same problem would happen, but only when trying to gracefully shut down the connection. In the RPC model this might happen for every call.

Observation

This might be the main reason for timeouts when using RPC. The system can't know whether the other party received the message, and it could wait indefinitely for a response. So it chooses to time out and then act accordingly: canceling the request, as seen above, retrying after some time, or taking some other approach suitable for the application in question.
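A timeout followed by a bounded number of retries can be sketched roughly like this. Everything here is hypothetical: call_with_retries is not taken from any framework, and flaky_rpc simulates a remote call that gets no answer twice before succeeding.

```python
import time

def call_with_retries(rpc, attempts=3, backoff_seconds=0.0):
    """Call rpc(); on TimeoutError retry, up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except TimeoutError:
            if attempt == attempts:
                raise  # give up and let the application decide what to do
            time.sleep(backoff_seconds * attempt)  # simple linear backoff

calls = {"count": 0}

def flaky_rpc():
    """Hypothetical remote call that times out twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("no response from the other party")
    return "ok"

result = call_with_retries(flaky_rpc)
print(result, calls["count"])  # prints: ok 3
```

Note that the caller still doesn't know whether the timed-out attempts were executed on the other side, which is why blind retries are only safe for some operations (see Repeated Execution Semantics below).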

Multicast

One process might want to send a message to all other known processes in the system. For example, a file server could send such a message to other parties holding a now invalid file to tell them to purge their caches. The hardware can do that via broadcast or multicast. The problem the authors see is that since RPC is a two-party interaction you cannot take advantage of this hardware facility. Multiple messages sent directly from the server would be required instead.

Observation

I don't know if there is a framework out there that takes advantage of the hardware for this problem[6].

Parameter Marshalling

RPC frameworks have a stub for the functions involved. Before sending the message the stub has to marshal the parameters so they can be sent across the network. The authors point out that before marshalling, the stub must know how many parameters the function is expecting and what their types are. They then note that if printf were the function to be called remotely, it would be difficult to determine the characteristics of the parameters.

Observation

Nowadays the frameworks use some interface definition language (IDL) to define the messages that will be exchanged between the parties, including the functions that can be called remotely, the types, etc. This solves the problem of knowing exactly what the other end is expecting, but IDLs have problems too. The authors mention a related problem, “Parameter Representation”, that you will see later. And although the IDL helps by providing a single structure, it doesn't help when two very different languages are used. It might impose restrictions and/or odd patterns if it's difficult to map the IDL types to the languages that are being used.
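As a rough illustration of why the stub needs to know the parameters up front, here is a toy sketch in Python. The SIGNATURES table is a made-up stand-in for an IDL (it is not how any real framework does it), and the struct module packs the arguments according to the declared types.

```python
import struct

# Hypothetical "interface definition": procedure name -> wire format of its parameters.
# "!" means network byte order, "i" a 32-bit integer, "d" a 64-bit float.
SIGNATURES = {
    "add": "!ii",
    "scale": "!id",
}

def marshal(proc, *args):
    """Pack the arguments of a declared procedure into bytes."""
    fmt = SIGNATURES[proc]  # KeyError for anything undeclared, e.g. a variadic printf
    return struct.pack(fmt, *args)

def unmarshal(proc, payload):
    """Unpack bytes back into a tuple of arguments using the same declaration."""
    return struct.unpack(SIGNATURES[proc], payload)

round_trip = unmarshal("add", marshal("add", 2, 3))
print(round_trip)  # prints: (2, 3)
```

A function like printf has no entry that could be written in such a table, because the number and types of its parameters depend on the format string of each individual call.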

Parameter Passing

Related to the problem above is how the parameters are passed to the other side. Values like integers and booleans are easy since they can be copied into the message and sent without problems. But what happens when pointers are involved? Should the pointer be sent or the value it is pointing to? What happens if the pointer is pointing to something in the middle of a complex structure with other pointers inside? Should the entire structure be sent as well? Or should the parties interact with each other, asking for the value behind each pointer the remote function uses?

These are some of the questions raised by the authors and you can easily see that this breaks the transparency that RPC tries to provide. Remote calls will behave very differently from local calls.
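The difference in behavior is easy to reproduce. In the sketch below, remote_call is a hypothetical helper that only simulates a remote call by copying the argument through pickle, which is roughly what happens when a value is sent over the wire instead of a reference.

```python
import pickle

def append_item(items):
    """Mutates the list it receives, as a local procedure taking a reference might."""
    items.append("new")
    return len(items)

def remote_call(func, arg):
    """Simulates a remote call: the callee only ever sees a serialized copy of arg."""
    copied = pickle.loads(pickle.dumps(arg))
    return func(copied)

local_items = ["a"]
append_item(local_items)                 # local call: the caller sees the mutation

remote_items = ["a"]
remote_call(append_item, remote_items)   # "remote" call: the mutation is lost

print(local_items)   # prints: ['a', 'new']
print(remote_items)  # prints: ['a']
```

Both calls return 2, yet only the local one changes the caller's list.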

Observation

Again, IDLs help minimize this problem but, as mentioned before, they bring their own problems as well. And the transparency of RPC will still be gone.

Global Variables

If a procedure that was designed to be executed locally uses a global variable, what should happen when the same procedure is forced to be executed remotely? This is similar to the problem above with pointers and is difficult in the same way.

Observation

One might say: “don't use global variables”. This is one way to deal with the problem but, just as with pointers, it breaks the transparency promise of RPC.

Timing Problems

The next problem mentioned is when the execution speed of some procedure changes because of remote execution and as a result the whole task at hand fails. They give an example regarding I/O devices that, when commands are issued to them, require the driver to write words back within a specified interval. If after writing the first word there is a small procedure that runs remotely, the delay could be long enough to cause a timeout in the controller, which results in the entire operation failing.

Observation

I don't know much about hardware specifics like this but I can imagine it being hard to deal with from the RPC framework perspective[7].

Exception Handling

When a procedure is executed locally it either completes or fails entirely. Remote procedures introduce new errors regarding the communication over the network and also when one party fails. Some systems might decide to hang forever and wait for the other party to come back. Such an approach is not very attractive. Another approach, already mentioned, would be using timeouts to maybe retry after some time. The system could also mark the computation as failed and return an exception if it retried n times with no luck. You can imagine that this violates the transparency because when a remote procedure is called, the programmer must check for certain errors that would not need checking if the procedure were executed locally. And this is exactly what the authors question.

Observation

The problem here is the fact that RPC was sold with the transparency promise between local and remote calls. It should be clear that the programmers must deal with new errors. If from day 1 remote calls were sold as very different from local calls maybe things could have evolved differently. Especially the mindset around RPC usage. What intrigues me, as seen earlier in this post, is that Birrell and Nelson don't try to hide the fact that even using RPC the programmer will still need to deal with such difficulties when building distributed systems. It's probably a mystery why this mindset hasn't been carried over the years.

Repeated Execution Semantics

A local procedure is called and executed exactly one time; it does not get executed again if a second call is not made. The same guarantee is impossible in a remote environment because, as seen earlier, the system can't know for sure if the other end received the message. The sending party has two choices: send again or not send. Let's say process A is sending messages to process B. If B didn't receive the message and A sends it again, B would receive it and execute the procedure one time. If B received the message and A sends it again we have two calls being made. If B didn't receive it and A doesn't send it again we have zero calls. For idempotent operations receiving duplicated messages and executing them more than once causes no harm, but for non-idempotent operations it will certainly be a problem. Again, this is another peculiarity that should be taken into account with RPC. If the programmer fools himself/herself with the transparency promise he/she will have a bad time.

Observation

In this case I don't think there's much to be done by RPC frameworks besides not selling the promise of transparency between remote and local calls. The framework can't decide what to do in case one message is lost. Facilities to deal with such problems are always welcome but in any case, it should be decided by the programmer developing the system.
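One facility of that kind, which the programmer can build on top of any transport, is to tag each request with an identifier and have the receiver remember which ones it has already executed. The sketch below is only an illustration: the Account class and the request_id parameter are made up, and a real system would also need to persist and expire the table of seen requests.

```python
class Account:
    """Hypothetical server-side object with a non-idempotent operation (deposit)."""

    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> result already returned for that request

    def deposit(self, request_id, amount):
        # A duplicate of a request we already executed: return the stored
        # result instead of executing the operation a second time.
        if request_id in self.seen:
            return self.seen[request_id]
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance

account = Account()
first = account.deposit("req-1", 10)
retry = account.deposit("req-1", 10)   # A resends because it never saw the reply
print(first, retry, account.balance)   # prints: 10 10 10
```

This turns "send again" into a safe choice for this particular operation, but it is a decision made by the application, not something the RPC layer could guess.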

Loss of State

This problem deals with the fact that sometimes one of the parties might hold important information about the operations, and if that party crashes the information might be lost. The programmer will very likely have to change the design to account for this problem so the system operates normally even when a process dies.

Observation

Just like the previous problem, I don't think there is much to do from the framework perspective. The programmers involved in the system should consider the possible failures and design the system to act accordingly in cases where a problem occurs. It might lead them to avoid using RPC and that is fine; there is no single way of building distributed systems. Other options should always be considered.

Orphans

Most of the mentioned problems were regarding server crashes. The authors mention another problem that might occur when a client crashes. If a client crashes while the server is working on the response, the result of that computation will become an orphan. What then should be done with this result? They mention that their proposed virtual circuit model can deal with such a problem and detect a failure in the client, which then allows the server to stop all computations started by that client.

Observation

The same way that the previously mentioned frameworks detect errors with the server, they could detect client crashes. I'm not sure if they have this mechanism available. Anyway, this is another problem that can't be hidden from the programmers because it's their job to decide what to do in such cases.

Parameter Representation, Byte Ordering, Structure Alignment

In the paper the authors mention each of these problems in separate sections but I decided to put them together because they share the same solution.

The questions are: how does the communication deal with differences in parameter representation, especially with floats? How should it deal with a client machine that uses little-endian order and a server that uses big-endian order, and vice versa? The last one is about structure alignment when servers and clients have different word sizes.

Observation

Having a common ground to communicate with each party is the most obvious way to go. And IDLs can certainly help here, abstracting this distinction.
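The byte ordering part is simple to demonstrate with Python's struct module. This is just a small sketch of the mismatch and of the usual fix, which is to agree on one wire format (here network byte order, "!" in struct notation) no matter what each machine uses natively.

```python
import struct

value = 1

little = struct.pack("<I", value)  # 32-bit unsigned, little-endian
big = struct.pack(">I", value)     # 32-bit unsigned, big-endian

print(little)  # prints: b'\x01\x00\x00\x00'
print(big)     # prints: b'\x00\x00\x00\x01'

# A receiver that assumes the wrong byte order silently reads a different number.
misread = struct.unpack("<I", big)[0]
print(misread)  # prints: 16777216

# Agreeing on a single wire format avoids the ambiguity on both ends.
wire = struct.pack("!I", value)
decoded = struct.unpack("!I", wire)[0]
print(decoded)  # prints: 1
```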

Lack of Parallelism

In a client-server model one of the parties is idle when the other is working. The client may be idle when expecting a response from the server and the server might be idle waiting for client requests. The authors mention the problem that the client waiting is a loss of performance since it could be doing some work instead. And they also mention the problem when the server is waiting for a disk operation and since it can't return to serve new requests all clients have to wait.

Observation

As mentioned in the observation about Single Threaded Servers, the possibility offered by RPC frameworks to use asynchronous functions can help in this case.
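As a rough sketch of what asynchronous calls buy the client, the example below issues two simulated remote calls with asyncio. fake_rpc is a made-up function and asyncio.sleep stands in for network and server time; no real RPC framework is involved.

```python
import asyncio

async def fake_rpc(name, delay, log):
    """Hypothetical remote call; the sleep stands in for network and server time."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)
    log.append(f"end {name}")
    return name

async def main(log):
    # Both calls are in flight at the same time, so the client is not idle
    # between issuing the first request and receiving its response.
    return await asyncio.gather(fake_rpc("a", 0.02, log), fake_rpc("b", 0.01, log))

log = []
results = asyncio.run(main(log))
print(results)   # prints: ['a', 'b']
print(log[:2])   # prints: ['start a', 'start b']
```

The second call starts before the first one has finished, and the results still come back in the order the calls were issued.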

Lack of Streaming

The last problem deals with the fact that when the server is building the response it must build it entirely and the client is then idle for that period. The authors mention that in the virtual circuit model the server could stream the results as soon as they are found.

Observation

It seems that progress has been made in this area. Both gRPC and Finagle offer a way for building streaming clients and servers.
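The difference between building the whole response and streaming it can be sketched with a plain Python generator. This is not gRPC or Finagle code; search_all and search_stream are hypothetical handlers, and the log only records the order in which things happen.

```python
def search_all(items, log):
    """Builds the entire response before the client sees anything."""
    results = []
    for item in items:
        log.append(f"found {item}")
        results.append(item)
    return results

def search_stream(items, log):
    """Hands each result to the client as soon as it is found."""
    for item in items:
        log.append(f"found {item}")
        yield item

batch_log = []
for result in search_all(["a", "b"], batch_log):
    batch_log.append(f"client got {result}")

stream_log = []
for result in search_stream(["a", "b"], stream_log):
    stream_log.append(f"client got {result}")

print(batch_log)   # prints: ['found a', 'found b', 'client got a', 'client got b']
print(stream_log)  # prints: ['found a', 'client got a', 'found b', 'client got b']
```

In the streaming version the client can start working on the first result while the server is still looking for the second one.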

Conclusion

I really liked reading this paper because it shared the perspective on RPC from 30 years ago and I find it interesting to learn the history behind the ideas. Progress has been made over the years and many of the problems mentioned in the paper have a solution available in RPC frameworks today. I like to think that having such criticisms available will always help to improve the tools we have and also encourage others to create new tools or models.

RPC is not the only answer for communication between nodes. There are many message queuing systems out there, languages like Erlang, and frameworks like Akka and Orleans, for example. We should be thankful for having many options at hand and be diligent when choosing.

As Waldo et al. said in their excellent “A Note on Distributed Computing”[8]:

Distributed objects are different from local objects, and keeping that difference visible will keep the programmer from forgetting the difference and making mistakes.

I hope that we don't make the same mistake as happened with Birrell and Nelson's quote and forget this one too.

Let me know if there are any errors or comments, or if you would like to point out some new alternative that has been created to overcome the problems mentioned above.

References

  • Andrew S. Tanenbaum, and Robbert van Renesse. A Critique of the Remote Procedure Call Paradigm. Proc. European Teleinformatics Conf. (EUTECO 88), North-Holland, Amsterdam, 1988, pp. 775-783.
  • Andrew D. Birrell, and Bruce Jay Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, vol. 2, no. 1, Feb. 1984, pp. 39–59.
  • Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. A Note on Distributed Computing. Technical Report SMLI TR-94-29, Sun Microsystems Laboratories, Inc., Nov. 1994.

Notes


  1. Such aspects may include: marshalling the parameters, sending the message to the other end, unmarshalling the parameters at the other end, calling the correct procedure, etc. 

  2. Well, you certainly can but it would be very unusual, bringing its own set of problems. 

  3. And that might be silly of me. :) 

  4. Finagle is another RPC framework that allows canceling requests: http://twitter.github.io/finagle/guide/FAQ.html#what-are-cancelledrequestexception-and-cancelledconnectionexception 

  5. More information can be found at: https://en.wikipedia.org/wiki/Two_Generals%27_Problem 

  6. Let me know if you know. 

  7. Again, if you know something about it, I'd love to know. 

  8. Another excellent paper discussing the problems and differences that programmers must be aware of when building distributed systems. 

Carlos Galdino
@carlosgaldino
github.com/carlosgaldino