
Load Balancer & IPC

2022-04-10

Introduction

This project focuses on building a multi-process system on top of the Unix operating system that calculates the md5 hash of a set of files in parallel.

The structure of this report presents not only the implemented design but also how we arrived at it, through the difficulties and ideas we encountered along the way. First, we present a general diagram to illustrate how the processes are connected. Then we discuss the design through its main parts (in the order of the program's execution), and finally we conclude the project and present references along with compilation and execution instructions.

Process Connection Diagram

Figure 1: Process Connection Diagram

The system involves three main processes:

  • APPLICATION: This is the main process that manages the entire operation. It takes as input a list of files that need their md5 hash calculated. It determines the optimal number of slave processes to be created, ensuring an efficient distribution of tasks.
  • SLAVE: Each slave process is tasked with performing the md5 hashing of a subset of files. They receive the list of files from the APPLICATION process via inter-process communication (pipes).
  • VIEW: This process receives information about the hash results from the APPLICATION process via shared memory. It's responsible for printing the results of the md5 hash in a specific format.
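To make the wiring concrete, here is a minimal sketch, assuming a ./SLAVE executable and a placeholder file name, of how APPLICATION could spawn one slave with a pipe in each direction (illustrative only, not the project's actual code):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int toSlave[2], fromSlave[2];           /* [0] = read end, [1] = write end */
    if (pipe(toSlave) == -1 || pipe(fromSlave) == -1) {
        perror("pipe");
        exit(EXIT_FAILURE);
    }

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (pid == 0) {                         /* SLAVE side */
        dup2(toSlave[0], STDIN_FILENO);     /* read file paths from stdin */
        dup2(fromSlave[1], STDOUT_FILENO);  /* write md5 results to stdout */
        close(toSlave[0]); close(toSlave[1]);
        close(fromSlave[0]); close(fromSlave[1]);
        execl("./SLAVE", "SLAVE", (char *)NULL);
        perror("execl");                    /* only reached if exec fails */
        exit(EXIT_FAILURE);
    }

    /* APPLICATION side: keep toSlave[1] to send tasks and
       fromSlave[0] to read results; close the unused ends. */
    close(toSlave[0]);
    close(fromSlave[1]);

    dprintf(toSlave[1], "%s\n", "somefile.txt");  /* placeholder task */
    char buf[256];
    ssize_t n = read(fromSlave[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("result: %s", buf);
    }
    return 0;
}

Closing the unused pipe ends in each process matters: a reader would otherwise block forever waiting on write ends that are never closed.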

Design and Challenges Found

Task Loads and Slaves

The first difficulty we encountered was determining how many slaves to create. Our implementation started from three premises: (a) no matter how many slaves we have, the whole run cannot be faster than hashing the largest file, since the md5 of a single file is not distributed; (b) the fewer slaves needed, the better, since each new slave carries its own overhead; (c) md5 has a time complexity of O(n), where n is the number of bits in the file.

Reasoning through these premises led us to conclude that the optimal number of slaves is essentially the minimum number such that each slave is assigned a set of files whose total size is at most that of the heaviest file. In this way, we use parallelism to maximize the value added by each slave and minimize the cost of managing them.

As for the algorithm itself, we first approached it as a sort of Knapsack problem, usually solved through dynamic programming, but modified to handle multiple knapsacks. After trying to solve it and doing some research, we found that this generalization is a well-known problem: the Bin packing problem. Since Bin packing is NP-hard, our implementation is based on an approximation, sorting the files by size and applying the First-fit algorithm.

Our implementation (see loadBalancer.h) not only determines the optimal number of slaves but also constructs the so-called Loads, which store the number of files and the ids of the files that each slave should process.
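A self-contained sketch of this approach follows: sort the files by size in descending order, set the bin capacity to the size of the heaviest file, and place each file in the first load where it fits. The names (FileInfo, Load, buildLoads) are illustrative and do not reflect the actual loadBalancer.h interface:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; size_t size; } FileInfo;

typedef struct {
    size_t totalSize;  /* sum of the sizes assigned to this slave */
    int fileCount;     /* number of files assigned */
    int *fileIds;      /* ids of the assigned files */
} Load;

static int bySizeDesc(const void *a, const void *b) {
    const FileInfo *fa = a, *fb = b;
    return (fa->size < fb->size) - (fa->size > fb->size);
}

/* Fills `loads` (caller provides room for n entries) and returns
   the number of slaves to create. */
int buildLoads(FileInfo *files, int n, Load *loads) {
    qsort(files, n, sizeof(FileInfo), bySizeDesc);
    size_t capacity = files[0].size;  /* premise (a): the heaviest file */
    int bins = 0;
    for (int i = 0; i < n; i++) {
        int b = 0;
        while (b < bins && loads[b].totalSize + files[i].size > capacity)
            b++;                      /* first fit */
        if (b == bins) {              /* no load fits: open a new one */
            loads[b].totalSize = 0;
            loads[b].fileCount = 0;
            loads[b].fileIds = malloc(n * sizeof(int));
            bins++;
        }
        loads[b].fileIds[loads[b].fileCount++] = files[i].id;
        loads[b].totalSize += files[i].size;
    }
    return bins;
}

int main(void) {
    FileInfo files[] = { {0, 4}, {1, 10}, {2, 3}, {3, 3} };
    Load loads[4];
    int slaves = buildLoads(files, 4, loads);
    printf("slaves needed: %d\n", slaves);  /* prints 2 */
    for (int s = 0; s < slaves; s++) {
        printf("slave %d: %d files, total %zu\n",
               s, loads[s].fileCount, loads[s].totalSize);
        free(loads[s].fileIds);
    }
    return 0;
}

For sizes 10, 4, 3 and 3 the capacity is 10, so two slaves suffice: the first takes the 10 alone and the second packs 4 + 3 + 3 exactly.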

Pipes and Slaves

The main difficulty we found was the connection between the different processes, which is why we paid special attention to the different mechanisms that exist for communicating between them.

For the communication between the application process and the slave processes we use two pipes per slave: one that the parent writes and the slave reads, and another going in the opposite direction. With a large number of pipes for the application process to read from, we needed a way to avoid blocking on a single pipe, and that is why we use select. This function tells us which pipes are ready to be read, so the application process does not have to sit waiting for any one slave to write. One of the main problems we had was figuring out how to use this function, as we found its man page quite confusing.
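A sketch of such a read loop, assuming fds holds the read ends of the slave-to-parent pipes (illustrative, not the project's actual code):

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

void readResults(int fds[], int nSlaves) {
    int remaining = nSlaves;
    while (remaining > 0) {
        fd_set readSet;
        FD_ZERO(&readSet);
        int maxFd = -1;
        for (int i = 0; i < nSlaves; i++) {
            if (fds[i] == -1)
                continue;              /* this slave already finished */
            FD_SET(fds[i], &readSet);
            if (fds[i] > maxFd)
                maxFd = fds[i];
        }
        /* Block until at least one slave has something to read. */
        if (select(maxFd + 1, &readSet, NULL, NULL, NULL) == -1) {
            perror("select");
            return;
        }
        for (int i = 0; i < nSlaves; i++) {
            if (fds[i] == -1 || !FD_ISSET(fds[i], &readSet))
                continue;
            char buf[512];
            ssize_t n = read(fds[i], buf, sizeof(buf) - 1);
            if (n <= 0) {              /* EOF: the slave closed its end */
                close(fds[i]);
                fds[i] = -1;
                remaining--;
            } else {
                buf[n] = '\0';
                printf("slave %d: %s", i, buf);
            }
        }
    }
}

Marking a descriptor as -1 once its slave closes its end lets the loop terminate cleanly when every slave has finished.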

Shared Memory and View

As the application process receives the md5 results from its slave processes, it must share them with the view process so that the view can print them to stdout as they arrive. For this purpose, a shared memory area between the view and application processes is used.
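One possible setup uses POSIX shared memory (shm_open plus mmap); this is a sketch under that assumption, with an illustrative object name and size, and it omits the synchronization (for example a semaphore) that would signal the view when new results arrive:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/md5_shm"   /* illustrative name */
#define SHM_SIZE 4096

int main(void) {
    /* APPLICATION side: create and size the shared memory object. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd == -1) { perror("shm_open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, SHM_SIZE) == -1) { perror("ftruncate"); exit(EXIT_FAILURE); }

    char *buf = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

    /* Write one result entry; the VIEW process would shm_open and
       mmap the same name and print entries as they appear. */
    snprintf(buf, SHM_SIZE, "%d %s %s\n", getpid(),
             "d41d8cd98f00b204e9800998ecf8427e", "somefile.txt");

    munmap(buf, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);  /* release the object once both sides are done */
    return 0;
}

On some systems this must be linked with -lrt.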

Implementation Limitations

Given the proposed solution, some minor limitations arise. First, in the face of a termination signal that our processes cannot handle (SIGKILL, for example), there will be resources that are never freed, such as the shared memory, among others.
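For signals that can be handled, the cleanup can be performed in a signal handler before exiting; here is a sketch of that mitigation, reusing the illustrative SHM_NAME from the previous sketch (it cannot help against SIGKILL, which is never delivered to a handler):

#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/md5_shm"   /* illustrative name */

static void cleanup(int sig) {
    (void)sig;
    shm_unlink(SHM_NAME);     /* free the shared memory object */
    _exit(EXIT_FAILURE);
}

int main(void) {
    struct sigaction sa;
    sa.sa_handler = cleanup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);   /* Ctrl-C */
    sigaction(SIGTERM, &sa, NULL);  /* default kill(1) signal */
    /* ... normal application work ... */
    return 0;
}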

Conclusion

In conclusion, this report has presented not only the application of inter-process communication (both shared memory and pipes), but also a method for balancing the file load among child processes.

For future versions, we believe that, if the use warrants it, we could not only modularize further but also improve the load-rebalancing functionality.

Appendix: Compilation and Execution Instructions

The project is built with a Makefile. To compile it, open a terminal at the root of the project and run the following two commands.

make clean
make all

Additionally, you can perform an extra validation by running PVS-Studio with the following command.

make check

Then, in the root directory of the project there will be three executables, which work together to perform the task.

  • APPLICATION: It receives the files to be processed as command-line parameters and manages the slave processes.
  • SLAVE: It performs the md5 hashing of a series of files received from the parent process (APPLICATION) via standard input.
  • VIEW: It receives the number of files to be processed either as a command-line parameter or via standard input (the former taking priority), and prints the md5 results in the format "(pid) (hash code) (filePath)".

The project can be run in one of two ways. The first is a pipeline on the command line, where the application process is invoked with its parameters and piped into the view process.

./APPLICATION <file1> <file2> ... <fileN> | ./VIEW

Alternatively, you can run it using two separate terminals, where the APPLICATION process is run first and then VIEW.

-- Terminal 1 --
 ./APPLICATION <file1> <file2> ... <fileN>
-- Terminal 2 --
 ./VIEW <numberOfFiles>