Very basic MapReduce in Java for word count, built with Java, Gradle, and IntelliJ.
MapReduce in Java, with a very basic driver and worker implementation.
This has the following structure:
- src: the main folder with all the code; it contains two folders:
  - main: contains the package folders:
    - driver:
      - MapReduceServer.java: the Java class for the driver
      - TaskServiceImpl.java: the gRPC service implementation for the driver
      - CircularList.java: a Java class implementing a circular list for task assignment
    - worker:
      - MapReduceWorker.java: the Java class for the gRPC worker; implements the Map and Reduce methods
  - test: contains a simple test for the driver and the worker; please make sure to update the tests as appropriate
- build.gradle: the Gradle build file
- settings.gradle: the Gradle settings file
- results: the folder with the generated result data when run with N=8 and N=6
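The contents of CircularList.java are not shown here; a minimal sketch of what such a circular list for round-robin task assignment might look like (names and methods are assumptions, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

/** Cycles endlessly over a fixed set of items, e.g. pending tasks. */
class CircularList<T> {
    private final List<T> items;
    private int index = 0;

    CircularList(List<T> items) {
        this.items = new ArrayList<>(items);
    }

    /** Returns the next item, wrapping around to the start. */
    T next() {
        T item = items.get(index);
        index = (index + 1) % items.size();
        return item;
    }

    /** Removes an item (e.g. a completed task) from the rotation. */
    boolean remove(T item) {
        boolean removed = items.remove(item);
        if (!items.isEmpty()) index %= items.size();
        return removed;
    }

    boolean isEmpty() {
        return items.isEmpty();
    }
}
```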
The project uses Gradle to manage the build.
The executables folder contains the .jar files for the driver and the worker; they require a folder /inputs with the data in the same directory.

java -jar MRDriver-1.0-SNAPSHOT.jar

This runs with default values for the directory and the number of map tasks (N) and reduce tasks (M); for help run:

java -jar MRDriver-1.0-SNAPSHOT.jar --help

By default it runs on localhost on port 50051.

java -jar MRWorker-1.0-SNAPSHOT.jar

This runs with default values for the files directory and the master address and port; for help run:

java -jar MRWorker-1.0-SNAPSHOT.jar --help

By default it connects to localhost on port 50051.
The program uses gRPC: the driver is the gRPC server and the workers are gRPC clients. When the server starts, it looks for the files directory and creates the intermediate and out folders if they don't exist. It reads the files to be processed, builds a task list from the number of map tasks and the number of reduce tasks, and marks each task as "TOASSIGN". It then waits for workers to connect. On each worker connection it assigns map tasks until all are marked as completed, then assigns the reduce tasks. When everything is completed, it sends an exit signal to the workers and exits itself. Functions:
- Schedule task
- Assign task
- Send exit
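The schedule/assign/exit flow described above can be sketched as follows; the Task, State, and Scheduler names here are illustrative, not the actual classes in TaskServiceImpl.java:

```java
import java.util.ArrayList;
import java.util.List;

/** Task states and types used by the driver (names illustrative). */
enum State { TO_ASSIGN, IN_PROGRESS, COMPLETED }
enum Type { MAP, REDUCE, EXIT }

class Task {
    final Type type;
    final int id;
    State state = State.TO_ASSIGN;
    Task(Type type, int id) { this.type = type; this.id = id; }
}

/** Hands out map tasks until all complete, then reduce tasks, then EXIT. */
class Scheduler {
    private final List<Task> maps = new ArrayList<>();
    private final List<Task> reduces = new ArrayList<>();

    Scheduler(int nMap, int mReduce) {
        for (int i = 0; i < nMap; i++) maps.add(new Task(Type.MAP, i));
        for (int i = 0; i < mReduce; i++) reduces.add(new Task(Type.REDUCE, i));
    }

    /** Called on each worker request; synchronized because gRPC handlers run concurrently. */
    synchronized Task assign() {
        Task t = firstAssignable(maps);
        if (t == null && allCompleted(maps)) t = firstAssignable(reduces);
        if (t == null && allCompleted(maps) && allCompleted(reduces))
            return new Task(Type.EXIT, -1); // tell the worker to exit
        if (t != null) t.state = State.IN_PROGRESS;
        return t; // null = nothing assignable yet, worker should retry
    }

    synchronized void complete(Task t) { t.state = State.COMPLETED; }

    private static Task firstAssignable(List<Task> tasks) {
        for (Task t : tasks) if (t.state == State.TO_ASSIGN) return t;
        return null;
    }

    private static boolean allCompleted(List<Task> tasks) {
        for (Task t : tasks) if (t.state != State.COMPLETED) return false;
        return true;
    }
}
```

Reduce tasks are deliberately held back until every map task is completed, since each reduce reads the intermediate output of all maps.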
The worker, on the other hand, receives the task to perform, either a map or a reduce task. When it is done, it sends a message to the driver asking for a new task and letting the driver know that the former task is completed; when there are no tasks left, it exits. Functions:
- Receive task
- Read in data
- Process data (map or reduce)
- Output result
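The map and reduce steps for word count might look like the sketch below; the real MapReduceWorker.java also reads and writes files and talks to the driver over gRPC, so this only shows the counting logic:

```java
import java.util.Map;
import java.util.TreeMap;

/** Word-count map and reduce, as a worker might implement them. */
class WordCount {
    /** Map phase: split the input text into words and count each one. */
    static Map<String, Integer> map(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Reduce phase: merge partial counts from several map outputs. */
    static Map<String, Integer> reduce(Iterable<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        return total;
    }
}
```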
Driver and worker, remaining work:
- worker failure recovery
- more tests
- local test: run the master and all workers on localhost
- The results folder has the output files containing the frequency of each word in the input, run with N = 6 and M = 4
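With N map tasks and M reduce tasks, MapReduce implementations typically route each word to one of the M reduce buckets by hashing it; this is an assumption about how the intermediate files are partitioned here, not something confirmed by the code:

```java
/** Routes a word to one of the M reduce buckets (assumed partitioning scheme). */
class Partitioner {
    /** floorMod keeps the bucket non-negative even for negative hash codes. */
    static int bucket(String word, int m) {
        return Math.floorMod(word.hashCode(), m);
    }
}
```

The same word always hashes to the same bucket, which is what guarantees that all counts for one word end up in a single reduce task's output file.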
Some basic tests are under the test folder; please be aware of the data directory.