BASED ON PAPER: RL-IoT: Reinforcement Learning to Interact with IoT Devices
Automatically learn the semantics of a protocol of a generic IoT device in the shortest possible time, using Reinforcement Learning (RL) techniques.
This RL framework implements 4 RL algorithms:
- SARSA
- Q-learning
- SARSA(λ)
- Q(λ) (Watkins's version)
RL is used to automate the interaction with the IoT devices present in the local network. For these algorithms we assume there exists a dataset with valid protocol messages of different IoT devices. However, we have no further knowledge of the semantics of such command messages, nor of whether a particular device will accept them. This dataset is stored in a dictionary inside our framework.
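For illustration only, such a dictionary of protocol messages might be organized along these lines; the layout, the listed methods and the `build_message` helper are assumptions for this sketch, not the framework's actual dictionary module:

```python
import json

# Illustrative only: a possible layout for a dictionary of known protocol
# messages (methods and example parameter sets). The framework's real
# dictionary module may be organised differently.
YEELIGHT_DICTIONARY = {
    "set_power":  {"params": [["on", "smooth", 500], ["off", "smooth", 500]]},
    "toggle":     {"params": [[]]},
    "set_bright": {"params": [[50, "smooth", 500]]},
}

def build_message(method, params, msg_id=1):
    """Build the JSON string for one protocol message (sketch)."""
    return json.dumps({"id": msg_id, "method": method, "params": params}) + "\r\n"
```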
This repository contains a first component based on the Yeelight protocol.
- Introduction to the project
- Features
- How to use?
- Demo
- Tests
- Contribute
- Authors
- License
- Acknowledgments
In IT systems, the number of IoT devices is growing exponentially, and most of them are custom devices: they rely on proprietary protocols that are often closed or poorly documented. Our goal is to interact with such devices by learning their protocols in an autonomous manner.
- state of an IoT device: represented by some properties specific to that device.
- state-machine of a protocol: multiple series of states linked by one or more sequences of commands. These commands can be exchanged through that protocol to complete a predefined task.
- task: identified as a path inside the state-machine. The sequence of commands could change the state of the IoT device following a certain path - i.e., completing a task - inside the state-machine.
- RL algorithms iterate over 2 nested loops: the outer loop iterating over episodes and the inner loop iterating over time steps t.
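As a rough sketch of this two-loop structure, here is a minimal tabular Q-learning skeleton; the `env` interface (`reset`, `actions`, `step`) is assumed for illustration and is not the framework's actual code:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of the outer (episodes) / inner (time steps) loop structure."""
    Q = defaultdict(lambda: defaultdict(float))  # Q[state][action]

    for episode in range(n_episodes):            # outer loop: episodes
        state = env.reset()
        done = False
        while not done:                          # inner loop: time steps t
            # epsilon-greedy action selection
            if random.random() < epsilon or not Q[state]:
                action = random.choice(env.actions(state))
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # Q-learning update rule
            best_next = max(Q[next_state].values(), default=0.0)
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```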
This work mimics the behaviour of an attacker that tries to explore the state-machine of the IoT device it is communicating with.
We start developing our framework:
- Targeting an actual IoT protocol: Yeelight protocol.
- Implementing 4 RL algorithms: SARSA, Q-learning, SARSA(λ) and Q(λ).
Note: the Yeelight protocol enforces a maximum rate on the commands sent to Yeelight devices, so our framework can take about 50 minutes to complete one learning process of 200 episodes for a single RL algorithm.
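One simple way to respect such a rate limit is to throttle outgoing commands, as in the illustrative sketch below; the one-second interval is only a placeholder, not the value mandated by the Yeelight specification:

```python
import time

class Throttler:
    """Illustrative rate limiter: enforces a minimum interval between commands."""

    def __init__(self, min_interval=1.0):  # placeholder interval in seconds
        self.min_interval = min_interval
        self._last_sent = 0.0

    def wait(self):
        """Sleep just enough so that commands are not sent faster than allowed."""
        elapsed = time.time() - self._last_sent
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_sent = time.time()
```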
Main features include:
- Support for 4 RL algorithms, selectable inside the `config.py` file.
- Collection of all data needed to generate plots comparing performance among different configurations.
- Ability to stop the learning process and restart it from a previously computed Q matrix, by passing the date of the previous execution as its id.
- All parameters configurable in a single file, `config.py`: algorithm parameters, framework settings and debug options.
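A hypothetical excerpt of what such a `config.py` could look like (names and values are illustrative, not the actual options of the framework):

```python
# Illustrative excerpt of a possible config.py; the real file may use
# different names and additional options.
ALGORITHM = "qlearning"   # one of: "sarsa", "qlearning", "sarsa_lambda", "qlearning_lambda"
TOTAL_EPISODES = 200
ALPHA = 0.1               # learning rate
GAMMA = 0.9               # discount factor
EPSILON = 0.2             # exploration rate
LAMBDA = 0.8              # eligibility-trace decay, used by SARSA(λ) and Q(λ)
PLOT_RESULTS_AFTER_LEARNING = False
FOLLOW_BEST_POLICY_AFTER_LEARNING = False
```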
This project has been developed with Python 3.7. To use it, first install all necessary Python packages with the command:
pip install .
After installing all needed dependencies, the project can be executed by directly running the `__main__.py` script.
If some modules are still missing, install them with the `pip install <module>` command.
Note: nmap must be installed on your machine for the python-nmap package to work.
General structure of directories:
- `learning` contains the learning module, with the RL algorithms, and a run script that follows the best policy found by the algorithms.
- `discovery` contains scripts for finding IoT devices in the LAN.
- `dictionary` contains dictionaries for IoT protocols.
- `request_builder` accesses dictionaries and builds requests to be sent to IoT devices.
- `device_communication` contains APIs for communicating directly with a specific IoT device.
- `state_machine` contains methods defining state machines for protocols and methods for the computation of the reward.
- `plotter` contains scripts for plotting results.
- `sample` contains some toy scripts to communicate with individual devices (Yeelight and Hue devices).
- `images` contains images for README purposes.
The project can be run from `__main__.py`, starting with a discovery phase for IoT devices in the local network.
Throughout the entire learning process, the Learning module collects data into external files, inside the `output` directory.
All files for one execution of the learning process are identified by the current date and the id of the current thread, in the format `%Y_%m_%d_%H_%M_%S_<thread_id>`.
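For reference, an id in that format could be produced as follows (a sketch, not necessarily the framework's own code):

```python
import threading
from datetime import datetime

def make_execution_id():
    """Return an id like 2021_05_04_16_30_00_<thread_id> (sketch)."""
    date_part = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
    return f"{date_part}_{threading.get_ident()}"
```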
The structure of the `output` directory is the following:
```
output
|
|__ log
|   |__ log_<date1>_<thread_id>.log
|   |__ log_<date2>_<thread_id>.log
|
|__ output_csv
|   |__ output_<algorithm1>_<date1>_<thread_id>.csv
|   |__ output_<algorithm2>_<date2>_<thread_id>.csv
|   |__ partial_output_<algorithm1>_<date1>_<thread_id>.csv
|   |__ partial_output_<algorithm2>_<date2>_<thread_id>.csv
|
|__ output_Q_parameters
|   |__ output_parameters_<date1>_<thread_id>.csv
|   |__ output_parameters_<date2>_<thread_id>.csv
|   |__ output_Q_<date1>_<thread_id>.csv
|   |__ output_E_<date1>_<thread_id>.csv
|
|__ log_date.log
```
In more detail, inside the `output` directory:
- `output_Q_parameters`: contains data collected before and after the learning process. Before the process starts, all values of the configurable parameters are saved into the file `output_parameters_<date>_<thread_id>.csv`: information about the path to learn, the optimal policy, the chosen algorithm, the number of episodes, and the values of α, γ, λ and ε. The parameters saved inside this file allow repeating the learning process with the exact same configuration. Then, at the end of each episode, the Q matrix is written and updated inside the file `output_Q_<date>_<thread_id>.csv`. The E matrix, if required by the chosen RL algorithm, is written into `output_E_<date>_<thread_id>.csv`.
- `output_csv`: contains `output_<algorithm>_<date>_<thread_id>.csv` and `partial_output_<algorithm>_<date>_<thread_id>.csv` files. The former contains, for each episode, the obtained reward, the number of time steps and the cumulative reward. The latter contains the same values obtained by stopping the learning process at a certain episode and following the best policy found up to that episode. `partial_output_<algorithm>_<date>_<thread_id>.csv` files are present only if the proper flag is activated inside the `learning_yeelight.py` script, specifying the number of episodes at which the learning process should be stopped.
- `log`: contains log data for each execution. After the learning process has started, for each step t performed by the RL agent, `log_<date>_<thread_id>.log` is updated with information about the current state s_t, the performed action a_t, the new state s_t+1 and the reward r_t+1.
- `log_dates.log`: saves the id of each execution. It can be used to collect the ids of all executions and use them inside the Plotter module.
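As an example of how the per-episode files could be consumed, here is a sketch that assumes the three columns described above; the real CSV header may use different names:

```python
import csv

def read_episode_results(path):
    """Read reward, time steps and cumulative reward per episode (sketch).

    The column names below are assumptions for this example.
    """
    episodes = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            episodes.append({
                "reward": float(row["reward"]),
                "timesteps": int(row["timesteps"]),
                "cumulative_reward": float(row["cumulative_reward"]),
            })
    return episodes
```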
The complete workflow is modelled in the following way:
Here is an in-depth description of the previous figure:
- The framework starts through the `__main__.py` script, which first activates the Discoverer. Before starting, the `config.py` file provides all information needed to configure the framework: general information such as the root directory in which output files are saved, the state-machine, the goal that the RL agent should learn, and all values of the parameters for the chosen RL algorithm. Possible paths arbitrarily defined for the Yeelight protocol are shown inside the `images` directory.
- The Discoverer analyzes the local network and returns to the main script the Discovery Reports describing the IoT devices found. Here you can choose whether to use the nmap version of the Discoverer or only the Yeelight-specific discoverer (see the discovery sketch right after this list). The nmap version of the Discoverer supports 2 protocols: Yeelight and Shelly. Support for additional protocols still needs to be added.
- The main script receives these reports and spawns multiple threads running the Learning module, passing to each of them the Discovery Report for one distinct Yeelight device found inside the LAN.
- The Learning module is the RL agent, iterating over episodes.
  - It receives multiple parameters as input from the `config.py` file: the chosen RL algorithm, the values of ε, α, γ and λ if needed, the total number of episodes, etc. Some flags are also present to decide whether, after the learning process, the user wants to directly plot some results or to run the RL agent following the best policy found, using the Plotter module or the Run Policy Found script respectively.
  - During each episode, the agent asks for commands from the Request Builder, which accesses the data of the Yeelight Dictionary and returns a JSON string with the command requested by the agent. This string can be sent to the Yeelight device.
  - The JSON string is passed to the API Yeelight script inside the Device Communication module, which sends commands to the Yeelight bulb and handles its responses (a minimal stand-alone sketch of such a command is shown below, after the workflow description).
  - Moreover, at each time step t the Learning module retrieves the reward r_t and the current state s_t from the State Machine module. To retrieve information about the state of the Yeelight device, this module asks the Dictionary module for the command that fetches all necessary information from the bulb, and sends this command to the API script, which actually delivers it to the bulb and returns the response to the State Machine Yeelight module.
  - At the end of the learning process, the Learning module generates some output files, described in the Output section.
- The main thread waits until the threads running the Learning module end.
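As a rough idea of what the nmap-based discovery step does, here is a minimal python-nmap ping-scan sketch; the subnet is a placeholder and this is not the framework's actual Discoverer:

```python
import nmap  # python-nmap package

def discover_hosts(subnet="192.168.1.0/24"):
    """Ping-scan the LAN and return the list of live hosts (sketch)."""
    scanner = nmap.PortScanner()
    scanner.scan(hosts=subnet, arguments="-sn")  # ping scan only, no port scan
    return [host for host in scanner.all_hosts()
            if scanner[host].state() == "up"]
```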
The generated output files can then be used by the Run Policy Found script, which retrieves data from these files and follows the best policy found, through the Q-matrix. While following the policy, the script retrieves complete commands from the Dictionary module and sends them to the Yeelight device through the API Yeelight script. The output files can also be used by the Plotter module to graphically present the results obtained in the learning process.
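For context, Yeelight bulbs accept JSON commands over a TCP connection in LAN-control mode; the stand-alone sketch below sends a single command (the IP address is a placeholder and error handling is omitted):

```python
import json
import socket

def send_yeelight_command(ip, method, params, msg_id=1, port=55443):
    """Send one JSON command to a Yeelight bulb and return its raw reply (sketch)."""
    message = json.dumps({"id": msg_id, "method": method, "params": params}) + "\r\n"
    with socket.create_connection((ip, port), timeout=5) as sock:
        sock.sendall(message.encode())
        return sock.recv(4096).decode()

# Example (placeholder IP): turn the bulb on with a smooth 500 ms transition
# print(send_yeelight_command("192.168.1.100", "set_power", ["on", "smooth", 500]))
```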
Since many different plots can be generated, here is a quick explanation of the graphs produced by the scripts of the Plotter module.
- `get_training_time_traffic.py` and `plot_training_time_traffic.py` retrieve the execution time and the traffic generated by each execution of the algorithm and generate the corresponding bar graphs.
- `plot_moving_avg.py` and `plot_moving_avg_for_params.py` show the corresponding results, respectively for different algorithms and for different values of the parameters.
- `plot_cdf_reward.py` plots the CDF (Cumulative Distribution Function) of the reward.
- `plot_reward_per_request.py` shows the cumulative reward over the number of commands sent.
- `plot_output_data.py` shows reward and time-step results for one single execution (it can be used to check that everything works correctly).
- `plot_heatmap.py` generates the heatmap reflecting the Q matrix of one run of the algorithm.
- `plot_animation.py` generates an animated plot in real time while the algorithm is running. Once the algorithm has started, the current date can be retrieved from the `log_date.log` file and copied into the `plot_animation.py` script. Once this script has started, it will generate a real-time plot like the one shown in the Demo section.
- `support_plotter.py` contains methods supporting the operation of the other scripts inside the Plotter module.
- `run_all_plots.py` generates all plots and saves them into a Plot directory created outside the Plotter module.
Note:
- all scripts use arrays of dates in the format `%Y_%m_%d_%H_%M_%S_<thread_id>` to identify executions of RL algorithms.
- most of the scripts save plots inside subdirectories of the Plot directory. The target directory can be chosen manually inside each script.
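As a hint of the kind of plot these scripts produce, a moving-average reward curve could be drawn like this (a sketch that assumes a plain list of per-episode rewards, not the actual Plotter code):

```python
import matplotlib.pyplot as plt

def plot_moving_average(rewards, window=10):
    """Plot the moving average of the per-episode rewards (sketch)."""
    averages = []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    plt.plot(averages)
    plt.xlabel("Episode")
    plt.ylabel(f"Reward (moving average, window={window})")
    plt.show()
```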
A short demo of the learning process, shown through the console, and an animated plot can be seen in demo.
Recall that this demo was created using the previously described `plot_animation.py` script, in order to produce the animated plot.
No tests are present for now.
Pull Requests are always welcome.
Ensure the PR description clearly describes the problem and solution. It should include:
- Name of the module modified
- Reasons for modification
- Giulia Milan - Initial work - giuliapuntoit
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details.
- Previous implementation of RL algorithms in TCP toycase scenario: RL-for-TCP
- SARSA implementation example: SARSA-example
- How to evaluate RL algorithms: RL-examples