Vertical Scalability Investigation of ROS (Robot Operating System)

My level 4 university project was an investigation into the scalability of ROS. This post looks at how I evaluated ROS' ability to vertically scale.

Introduction

As described on its homepage, ROS "is a set of software libraries and tools that help you build robot applications". It is one of the most popular open-source robot middlewares, with thousands of user-created packages and extensions available across many robot domains (self-driving cars, robot arms, boats, pure simulation, etc.). This popularity makes ROS a prime candidate for academic research, as results are likely to remain useful well into the future.

My final year university project was an investigation into the communication scalability of ROS in multi-robot systems (meaning systems which contain multiple individual host machines). The results of my investigation and a detailed discussion are available in my final dissertation, and the full code and data are available in this public repository.

Experiment Design

ROS splits distinct computational blocks into separately running modules called nodes. Each node is self-sufficient (it handles its own resources), but generally communicates with several other nodes using messages in order to receive input data and hand off output data. Nodes can run on the same host machine, or they can run on separate hosts. There must always be a Master node, which allows the other nodes to find each other in the system. This concept is outlined in the following image: ROS_Explained.png (I have had trouble embedding the image for now).
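To make the node concept concrete, here is a minimal sketch of a publisher/subscriber pair in rospy, loosely modelled on the standard ROS tutorials. The topic name chatter, the 10 Hz rate, and the node names are purely illustrative and are not taken from my experiment.

```python
#!/usr/bin/env python
# A minimal sketch of two ROS nodes (in practice each would live in its own script).
# The topic name "chatter" and the 10 Hz rate are illustrative only.
import rospy
from std_msgs.msg import String

def talker():
    """Publisher node: sends a String message on /chatter at 10 Hz."""
    rospy.init_node('talker')                       # registers the node with the Master
    pub = rospy.Publisher('chatter', String, queue_size=10)
    rate = rospy.Rate(10)
    while not rospy.is_shutdown():
        pub.publish(String(data='hello'))
        rate.sleep()

def listener():
    """Subscriber node: logs every String message received on /chatter."""
    rospy.init_node('listener')
    rospy.Subscriber('chatter', String,
                     lambda msg: rospy.loginfo('heard: %s', msg.data))
    rospy.spin()                                    # hand control to ROS until shutdown

if __name__ == '__main__':
    talker()  # or listener(), depending on which node this script should run
```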

Now, vertical scaling involves adding more load to each host, so to vertically scale ROS we add more nodes to each host. My experiment used 1, 2, 4, 6, 8, 16, 32, 64, and 128 nodes per host, with 2 hosts in total. Each node on one host communicated with exactly one node on the other host. The communication rate of each node was also varied: by the end of my testing I had covered 1 Hz, 10 Hz, 20 Hz, 100 Hz, 200 Hz, and 300 Hz. A rough sketch of the idea behind each node pair is given below.
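The experiment nodes can be thought of as a parameterised version of the talker/listener pair above. The sketch below is not my actual experiment code (that lives in the repository); it only illustrates the idea, and the parameter names (~pair_id, ~freq), the topic naming scheme, and the use of wall-clock timestamps are assumptions of mine.

```python
#!/usr/bin/env python
# Illustrative sketch only -- not the actual experiment code from the repository.
# One "sender" node per pair publishes at a configurable rate; its partner on the
# other host subtracts the embedded send time from the receive time to get latency.
import time
import rospy
from std_msgs.msg import Float64

def sender():
    rospy.init_node('sender')
    pair_id = rospy.get_param('~pair_id', 0)        # which node pair this is
    freq = rospy.get_param('~freq', 10)             # message frequency in Hz
    pub = rospy.Publisher('pair_%d' % pair_id, Float64, queue_size=10)
    rate = rospy.Rate(freq)
    while not rospy.is_shutdown():
        pub.publish(Float64(data=time.time()))      # embed the send timestamp
        rate.sleep()

def receiver():
    rospy.init_node('receiver')
    pair_id = rospy.get_param('~pair_id', 0)

    def on_msg(msg):
        latency = time.time() - msg.data            # assumes the hosts' clocks are synced
        rospy.loginfo('pair %d latency: %.6f s', pair_id, latency)

    rospy.Subscriber('pair_%d' % pair_id, Float64, on_msg)
    rospy.spin()
```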

Set-Up

To conduct the experiment, two Raspberry Pis were connected to a dedicated router (so no other traffic would interfere with their network communication). ROS Kinetic was installed on both Pis. By far the easiest way to install ROS Kinetic was to take an existing installation (from another Pi) and use `cp` or `scp` to copy /opt/ros/kinetic/ from one Pi to the other. If this is not an option, ROS can instead be installed from a package manager such as APT; however, I found I had to tweak the package-manager installation to work with my experiments.

After moving the ROS code to the right location, some environment variables need to be set. This is covered in the ROS docs, but the basic ones are ROS_ROOT, PATH, and ROS_MASTER_URI, which can all be set by sourcing ROS' setup script: /opt/ros/kinetic/setup.bash. If any of them end up incorrect, feel free to set them manually. ROS_MASTER_URI in particular needs to be set to the hostname of the machine running the Master node, e.g. http://terrapi:11311.
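As a quick sanity check before launching anything, a small script along these lines can confirm that the key variables are visible; the expected values below simply mirror the example set-up in this post.

```python
#!/usr/bin/env python
# Quick sanity check that the ROS environment variables are set as expected.
# A minimal sketch -- the expected values match the example set-up in this post.
import os

EXPECTED_MASTER_URI = 'http://terrapi:11311'      # host running the Master node
EXPECTED_ROS_ROOT = '/opt/ros/kinetic/share/ros'

def check(name, expected=None):
    value = os.environ.get(name)
    status = 'MISSING' if value is None else 'ok'
    if expected is not None and value != expected:
        status = 'unexpected (wanted %s)' % expected
    print('%-16s %-10s %s' % (name, status, value))

if __name__ == '__main__':
    check('ROS_MASTER_URI', EXPECTED_MASTER_URI)
    check('ROS_ROOT', EXPECTED_ROS_ROOT)
    check('PATH')   # should include /opt/ros/kinetic/bin after sourcing setup.bash
```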

Running The Experiment

There are A LOT of different configurations to run for this experiment. Quick math: 9 node counts * 6 frequencies * 3 runs = 162 distinct runs! This had to be automated to finish in a reasonable time, so I wrote several scripts to help. The first, run_experiment.sh, ran on the master host; it set which frequencies and node counts should be used, and paused between runs. It called roslaunch_script.py, a somewhat complex script which hooked directly into the internal roslaunch API (the public APIs were not flexible enough to accomplish the task). roslaunch_script.py started the master node and launched all the required nodes in the system (both locally on the master host, and remotely on the other host).
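roslaunch_script.py itself is in the linked repository. As a rough illustration of the core idea, launching many local node pairs through roslaunch's Python API looks something like the sketch below; the package and executable names are placeholders, and the remote-machine configuration my real script performed through roslaunch's internal objects is omitted.

```python
#!/usr/bin/env python
# Rough sketch of programmatically launching many nodes with the roslaunch Python API.
# The real roslaunch_script.py also registers the second Pi as a remote machine via
# roslaunch's internal config objects, which is omitted here. Package and executable
# names ('ros_scaling', 'sender.py', 'receiver.py') are placeholders.
import sys
import roslaunch

def launch_pairs(node_count, freq):
    launch = roslaunch.scriptapi.ROSLaunch()
    launch.start()                          # bring up roslaunch (my script started the Master around here)
    processes = []
    for i in range(node_count):
        for executable in ('sender.py', 'receiver.py'):
            node = roslaunch.core.Node(
                package='ros_scaling', node_type=executable,
                name='%s_%d' % (executable.split('.')[0], i),
                args='_pair_id:=%d _freq:=%d' % (i, freq), output='log')
            processes.append(launch.launch(node))
    return launch, processes

if __name__ == '__main__':
    node_count, freq = int(sys.argv[1]), int(sys.argv[2])
    launch, procs = launch_pairs(node_count, freq)
    try:
        launch.spin()                       # keep running until Ctrl-C
    finally:
        launch.stop()
```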

The nodes running on the master host then dumped their timing results to CSV files, indexed by node count, message frequency, and run number. Another Python script (process_results.py) was then used to aggregate the 162 CSV files into 6 CSV files for plotting as graphs.
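process_results.py is likewise in the repository. As a flavour of the aggregation step, a simplified version might look like the sketch below; the file naming scheme, the latency column, and the grouping of the output files by frequency are my assumptions for illustration.

```python
#!/usr/bin/env python
# Simplified sketch of aggregating per-run CSVs into one file per frequency.
# The naming scheme (results_<nodes>_<freq>_<run>.csv) and the 'latency' column
# are assumptions; the real layout is in the repository.
import csv
import glob
import os
from collections import defaultdict

def aggregate(results_dir='results'):
    by_freq = defaultdict(list)              # frequency -> [(nodes, run, mean latency)]
    for path in glob.glob(os.path.join(results_dir, 'results_*_*_*.csv')):
        name = os.path.splitext(os.path.basename(path))[0]
        _, nodes, freq, run = name.split('_')
        with open(path) as f:
            latencies = [float(row['latency']) for row in csv.DictReader(f)]
        mean = sum(latencies) / len(latencies)
        by_freq[int(freq)].append((int(nodes), int(run), mean))

    for freq, rows in by_freq.items():       # one output CSV per frequency
        with open(os.path.join(results_dir, 'aggregated_%dhz.csv' % freq), 'w') as f:
            writer = csv.writer(f)
            writer.writerow(['nodes', 'run', 'mean_latency'])
            writer.writerows(sorted(rows))

if __name__ == '__main__':
    aggregate()
```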

Results

The results of this experiment, and of the others, are discussed in much more detail in my dissertation. However, the most important result was the proposal of a constant limit on the maximum amount of communication a particular host can support, where a host's communication load is equal to (number of nodes on the host * frequency of messages * size of messages). If this limit is exceeded for a particular host, communication performance drops significantly and the system clogs up with messages. In that case, I observed message latencies of up to several minutes.
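Expressed as code, the proposed limit amounts to a check along the following lines; the threshold value here is a pure placeholder rather than a number from my results.

```python
# Sketch of the proposed per-host communication limit. The threshold is a
# placeholder -- the actual limit observed on the Pis is discussed in the dissertation.
def host_load(num_nodes, freq_hz, msg_size_bytes):
    """Total communication load generated on one host, in bytes per second."""
    return num_nodes * freq_hz * msg_size_bytes

HOST_LIMIT_BYTES_PER_SEC = 1000000      # placeholder, NOT a measured value

# Example: 64 nodes each sending 100-byte messages at 200 Hz
load = host_load(64, 200, 100)          # = 1,280,000 B/s
if load > HOST_LIMIT_BYTES_PER_SEC:
    print('Expect degraded performance: %d B/s exceeds the host limit' % load)
```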

I hope to write future blog posts about my other experiments, but feel free to comment if there is any aspect of this post you would like me to elaborate on.

If you enjoyed this post, please check out my other blog posts.