COSC 6377 : Computer Networks

Project: Sharding and Replication for Online Storage

Milestone 1: 10/31/2016; Due: 12/1/2016

In this project, we will build a storage system that uses sharding and replication to improve the robustness of the storage service. The system shares key design ideas with some online storage and file-sharing systems, but it is simplified so that we can build it before the end of the semester. Your code must run on program.cs.uh.edu; otherwise we will not grade your project.

Shards

In an online storage system, the client may split the data it needs to upload into a number of partitions (three in our project) and upload each partition to a different server. In this project, we call these servers shards. There can be many reasons for sharding. A common one is to put different parts of the dataset on different servers to optimize download and upload speeds, since the network link to a single server may be a bottleneck.

Replication

In many distributed systems, such as storage systems in the cloud, the same data is copied to multiple servers in multiple geographical locations. Such copies of data are called replicas. Replicas provide redundancy and reliability to the storage system, i.e., if one copy of the data is lost, we still have another copy. In our project, we will replicate the data stored in the shards so that the system can survive the crash of one shard.

In this project, we will have each shard split the data it receives for storage into two pieces and copy them to the remaining two servers. That way, if a shard crashes, we have a copy of the data on the two remaining servers. The system will not be able to recover if more than one shard crashes.

System Components

Configuration file

The clients and the shards require several configuration parameters which are specified in a configuration file. The configuration file uses a JSON format and allows these keys:


      homedir: home directory for the shard and client. Each client
      and shard has a different home directory.

      listenport: port at which this shard should listen for
      incoming connections.

      metadatafile: the file that stores the metadata corresponding to
      the files and replicas stored by the shard.

      shard1ip: IP address of shard 1. Client and shard use this
      information.

      shard1port: Port of shard 1. Client and shard use this
      information.

      shard2ip: IP address of shard 2. Client and shard use this
      information.

      shard2port: Port of shard 2. Client and shard use this
      information.

      shard3ip: IP address of shard 3. Client uses this information.

      shard3port: Port of shard 3. Client uses this information.
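
As an illustration, a shard's configuration file and a loader for it might look like the sketch below. The key names come from the list above; the values, the choice of numeric ports, the helper name load_config, and the set of keys treated as required are assumptions made for this example.

```python
import json

# Hypothetical configuration for one shard. Keys are from the handout;
# all values are made up for illustration.
SAMPLE_CONFIG = """
{
  "homedir": "/home/student/shard3",
  "listenport": 9999,
  "metadatafile": "metadata.json",
  "shard1ip": "5.6.7.8",
  "shard1port": 123,
  "shard2ip": "6.7.8.9",
  "shard2port": 234
}
"""

# Assumption: a shard needs these keys; a client would also need
# shard3ip and shard3port.
SHARD_KEYS = ("homedir", "listenport", "metadatafile",
              "shard1ip", "shard1port", "shard2ip", "shard2port")


def load_config(text, required=SHARD_KEYS):
    """Parse JSON configuration text and check that required keys exist."""
    config = json.loads(text)
    missing = [key for key in required if key not in config]
    if missing:
        raise ValueError("missing configuration keys: " + ", ".join(missing))
    return config


if __name__ == "__main__":
    for key, value in load_config(SAMPLE_CONFIG).items():
        print(key, value)
```

Printing each key and value mirrors the "key1 value1" lines in the sample client runs.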

Client

The client is a socket program that interacts with the shards to determine how much data to upload to each shard. Then it uploads the appropriate amount of data to the shards. The client tries to balance the total amount of storage utilized on the shards. For example, if shard1 already stores more data than shard2 and shard3, the client will upload less data to shard1 than to shard2 and shard3. Ideally, shards 1, 2, and 3 would hold exactly the same amount of data after each upload. That will never be the case because of how the shards are replicated, but we will try to get as close to balanced storage as possible.
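
One possible way to compute the upload sizes is sketched below. The handout does not prescribe an algorithm; this version simply tries to bring every shard up to a common level and gives leftover bytes to the lowest-numbered shards. The function name upload_sizes is an assumption. For the reported usage of 100, 100, and 50 bytes and a 100-byte file, it yields the same split as the sample run that follows.

```python
def upload_sizes(used, file_size):
    """Split file_size bytes so that shard usage ends up as even as possible.

    used: bytes currently stored on each shard, as reported by the shard.
    Returns the number of bytes to upload to each shard, in the same order.
    """
    active = list(range(len(used)))
    while True:
        total = sum(used[i] for i in active) + file_size
        # A shard already above the common level total / len(active)
        # receives nothing; drop it and recompute the level.
        over = [i for i in active if used[i] * len(active) > total]
        if not over:
            break
        active = [i for i in active if i not in over]

    level, leftover = divmod(total, len(active))
    sizes = [0] * len(used)
    for rank, i in enumerate(active):
        # The first `leftover` active shards take one extra byte so that
        # the sizes add up to file_size exactly.
        sizes[i] = level - used[i] + (1 if rank < leftover else 0)
    return sizes


if __name__ == "__main__":
    print(upload_sizes([100, 100, 50], 100))
```

This only balances the primary uploads. It ignores the backup bytes that the shards exchange afterwards, which is one reason the storage will not end up perfectly even.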

Here is an example execution when uploading a file:

./client -config configfile.json -upload myfile.jpg

reading configuration information from configfile.json
key1 value1
key2 value2
key3 value3

connected to shard 1 at IP address: 1.2.3.4 port: 123
connected
asking currently used storage
reply was 100 bytes

connected to shard 2 at IP address: 5.6.7.8 port: 234
connected
asking currently used storage
reply was 100 bytes

connected to shard 3 at IP address: 5.6.7.8 port: 345
connected
asking currently used storage
reply was 50 bytes

size of upload file 100 bytes
upload sizes are
shard1 17 bytes
shard2 17 bytes
shard3 66 bytes

uploading 17 bytes of myfile.jpg to shard1
done
closing connection
closed connection

uploading 17 bytes of myfile.jpg to shard2
done
closing connection
closed connection

uploading 66 bytes of myfile.jpg to shard3
done
closing connection
closed connection
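
The shard logs later in this document report inclusive byte ranges such as [40,59]. A minimal sketch of turning the upload sizes into such ranges follows, assuming the file is cut in shard order starting at byte 0. The helper name byte_ranges is hypothetical.

```python
def byte_ranges(sizes):
    """Convert per-shard upload sizes into inclusive (first, last) byte ranges.

    A shard that receives zero bytes gets None instead of a range.
    """
    ranges = []
    start = 0
    for size in sizes:
        if size == 0:
            ranges.append(None)
        else:
            ranges.append((start, start + size - 1))
            start += size
    return ranges


if __name__ == "__main__":
    # Sizes from the sample upload run above.
    print(byte_ranges([17, 17, 66]))
```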

Here is an example execution when downloading a file:

./client -config configfile.json -download yourfile.jpg

reading configuration information from configfile.json
key1 value1
key2 value2
key3 value3

connected to shard 1 at IP address: 1.2.3.4 port: 123
connected
asking if shard1 has yourfile.jpg
reply was:
primary bytes [0,99] out of 300 bytes of yourfile.jpg
backup bytes [100,149] of yourfile.jpg
backup bytes [200,249] of yourfile.jpg

connected to shard 2 at IP address: 5.6.7.8 port: 234
connected
asking if shard2 has yourfile.jpg
reply was:
primary bytes [100,199] out of 300 bytes of yourfile.jpg
backup bytes [0,49] of yourfile.jpg
backup bytes [250,299] of yourfile.jpg


connected to shard 3 at IP address: 6.7.8.9 port: 234
connected
asking if shard3 has yourfile.jpg
reply was:
primary bytes [200,299] out of 300 bytes of yourfile.jpg
backup bytes [50,99] of yourfile.jpg
backup bytes [150,199] of yourfile.jpg

downloading [0,99] from shard1
closing connection to shard1

downloading [100,199] from shard2
closing connection to shard2

downloading [200,299] from shard3
closing connection to shard3

saving yourfile.jpg

The client terminates after uploading or downloading the file. If one of the shards is down, the client should download that shard's portion of the file from the backup copies held by the other two shards.
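
The sketch below shows one way the client could decide what to download from whom, given the replies of the shards it managed to reach. The structure of the replies dictionary and the name plan_download are assumptions for this example; the real reply format depends on the protocol standardized in class.

```python
def plan_download(replies, total_bytes):
    """Plan which inclusive byte range to fetch from which shard.

    replies maps a reachable shard's name to a dictionary of the form
    {"primary": (lo, hi), "backup": [(lo, hi), ...]}.
    Returns a list of (lo, hi, shard) tuples covering bytes 0..total_bytes-1.
    """
    # List primaries first so they win ties against backups.
    pieces = [(r["primary"][0], r["primary"][1], name)
              for name, r in replies.items()]
    for name, r in replies.items():
        pieces.extend((lo, hi, name) for lo, hi in r["backup"])

    plan = []
    position = 0
    while position < total_bytes:
        best = None
        for lo, hi, name in pieces:
            # Among pieces containing `position`, keep the one reaching furthest.
            if lo <= position <= hi and (best is None or hi > best[1]):
                best = (lo, hi, name)
        if best is None:
            raise RuntimeError("no reachable shard holds byte %d" % position)
        plan.append((position, best[1], best[2]))
        position = best[1] + 1
    return plan


if __name__ == "__main__":
    # Replies from the sample download run above, with shard2 unreachable.
    replies = {
        "shard1": {"primary": (0, 99), "backup": [(100, 149), (200, 249)]},
        "shard3": {"primary": (200, 299), "backup": [(50, 99), (150, 199)]},
    }
    for lo, hi, shard in plan_download(replies, 300):
        print("downloading [%d,%d] from %s" % (lo, hi, shard))
```

With all three shards reachable, the plan reduces to the three primary ranges shown in the sample run.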

Shard

A shard is responsible for storing files. To simplify this project, the shards also serve as the backups, and we will develop the system so that it can tolerate the failure of one shard. Further, the system will support exactly three shards, but the IP addresses and ports should not be hard-coded. Each shard connects to the other two shards, which its configuration file lists as shard1 and shard2.

When a shard receives an uploaded chunk of a file, it splits the chunk into two equal parts and uploads one part as a "backup" to each of the other two shards. Thus each stored chunk is tagged as primary or backup on the shard. When a shard reports its total storage used, it reports the sum of its primary and backup storage.
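
The split can be sketched as follows. Giving the extra byte of an odd-length chunk to the first half is an assumption, as is the function name split_for_backup.

```python
def split_for_backup(lo, hi):
    """Split the inclusive byte range [lo, hi] into two halves.

    Returns two inclusive (first, last) ranges. When the length is odd,
    the first half is one byte longer. For a one-byte chunk the second
    range is empty (its first byte is past its last byte).
    """
    length = hi - lo + 1
    middle = lo + (length + 1) // 2
    return (lo, middle - 1), (middle, hi)


if __name__ == "__main__":
    # The primary range from the sample shard run below.
    print(split_for_backup(40, 59))
```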

The shard listens for incoming connections from clients. It also connects to the other shards in the network. It keeps metadata about all the files uploaded to it and all the data replicated to it as backup.

Here is an example execution of a shard:

./shard -config configfile.json

listening on port: 9999

*** waiting for requests ***

received status query from client at IP address 1.2.3.4
reply is 100 bytes

*** waiting for requests ***

received upload request of 20 bytes for testfile.jpg
received primary bytes [40,59] for testfile.jpg
saving testfile.jpg

connecting to shard 1 at IP address: 5.6.7.8 port: 123
upload backup bytes [40,49] of testfile.jpg
done
closing connection
closed connection

connecting to shard 2 at IP address: 6.7.8.9 port: 234
upload backup bytes [50,59] of testfile.jpg
done
closing connection
closed connection

updating metadata

*** waiting for requests ***

received download request for testfile.jpg
we have primary bytes [40,59] for testfile.jpg
sending primary bytes [40,59] to the client
done

*** waiting for requests ***

received backup bytes request for hello.jpg
received backup bytes [20,45] for hello.jpg
saving backup bytes [20,45] for hello.jpg
updating metadata

*** waiting for requests ***

Shard metadata

The home directory for the shard contains a metadata file, whose name is specified in the configuration file. The metadata has information about the primary and backup data stored by the shard. This file must be in JSON format, but you are welcome to come up with your own list of fields as needed by your shards to operate correctly and efficiently.
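
Since the fields are up to you, the layout below is only one hypothetical possibility: each file name maps to lists of inclusive primary and backup byte ranges. The sketch also shows how a shard could derive the storage figure it reports, which is the sum of its primary and backup bytes.

```python
import json

# Hypothetical metadata layout; the field names are not prescribed
# by the handout.
SAMPLE_METADATA = """
{
  "files": {
    "testfile.jpg": {"primary": [[40, 59]], "backup": []},
    "hello.jpg": {"primary": [], "backup": [[20, 45]]}
  }
}
"""


def total_used(metadata):
    """Return the total bytes stored, counting primary and backup ranges."""
    used = 0
    for entry in metadata["files"].values():
        for lo, hi in entry["primary"] + entry["backup"]:
            used += hi - lo + 1  # ranges are inclusive
    return used


if __name__ == "__main__":
    print("reply is %d bytes" % total_used(json.loads(SAMPLE_METADATA)))
```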

Protocols

The shards communicate with the other shards. The client communicates with the shards. These protocols will use the JSON format and will be standardized in class.
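
Until the protocol is standardized, the sketch below only illustrates a general issue with JSON over TCP: the receiver needs a way to tell where one message ends. It assumes one JSON object per line, and the message fields shown are invented. Neither is the class protocol.

```python
import json


def encode_message(message):
    """Serialize a dictionary as one line of JSON, ready to send on a socket."""
    # json.dumps escapes newlines inside strings, so the only raw
    # newline in the output is the terminator added here.
    return (json.dumps(message) + "\n").encode("utf-8")


def decode_messages(buffer):
    """Split received bytes into complete messages and a leftover tail.

    Returns (messages, rest), where rest holds the bytes of a message
    that has not fully arrived yet.
    """
    *lines, rest = buffer.split(b"\n")
    messages = [json.loads(line) for line in lines if line]
    return messages, rest


if __name__ == "__main__":
    # Hypothetical message; the field names are made up.
    data = encode_message({"type": "status"}) + b'{"type": "up'
    print(decode_messages(data))
```

A receiver would append each chunk returned by recv to the leftover tail and call decode_messages again.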

Additional services

Design and implement at least one additional service provided by the shard. The functionality could be as simple as making the shard support a new type of query or asking the shard to compress a file, or as complex as a new type of upload or replication strategy. You are basically implementing a new API call on the shard service. Your shard must still work as expected for the standardized services described in this project.

JSON

JSON is a widely used format for exchanging data over the Internet. This project uses a JSON-formatted configuration file and JSON-formatted protocols. You are required to use a JSON library in the language you choose for implementing this project. In other words, you are not allowed to write your own JSON reader and writer. This will allow you to take advantage of the many good JSON libraries that are already available and focus your effort on the distributed systems and networking aspects of the project.

Documentation

The documentation should be written in the format used by online services that document their API calls. Document each API call your shard provides, and give the documentation for each call a section called "Known Limitations".

Measurement

For this part of the project, please start your three shards with different amounts of storage used. You should be able to configure these initial conditions using the metadata file for each shard. Then, plot the amount of storage used as a function of time as you upload multiple files. To obtain the current amount of storage used at each shard, you need to write a simple client called monitor. You should be able to reuse code from your main client to implement the monitor program with very little new code. Your monitor client has the same configuration as the standard client. The monitor client sends a query to each shard every 10 seconds and logs the value returned by each shard to a log file in CSV format. You should be able to plot this file directly in your favorite spreadsheet or visualization program. Please include this graph and an explanation of what you observed and learned in a separate section of doc.html.
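
The logging loop of the monitor could look like the sketch below. It assumes a function query_shard, supplied by your own client code, that returns the number of bytes a shard reports. That function, the column layout, and the parameter names are assumptions for this example.

```python
import csv
import time


def monitor(query_shard, shard_names, out, samples,
            interval=10.0, clock=time.time, sleep=time.sleep):
    """Poll every shard `samples` times and write one CSV row per poll.

    query_shard(name) must return the bytes used on that shard; it stands
    in for the status query of the real client. `out` is an open text file.
    """
    writer = csv.writer(out)
    writer.writerow(["time"] + list(shard_names))
    for n in range(samples):
        row = [int(clock())] + [query_shard(name) for name in shard_names]
        writer.writerow(row)
        out.flush()  # keep the log usable even if the monitor is killed
        if n + 1 < samples:
            sleep(interval)


if __name__ == "__main__":
    import sys
    fake_usage = {"shard1": 100, "shard2": 100, "shard3": 50}
    # Two quick samples with a fake query function, printed to the terminal.
    monitor(fake_usage.get, ["shard1", "shard2", "shard3"],
            sys.stdout, samples=2, interval=0.1)
```

The first column holds the timestamp in seconds, so a spreadsheet can plot the remaining columns against it directly.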

Milestones

Milestone 1 (10/31/16): Complete a setup where a client can upload files to and download files from a single shard. The shard should use the JSON configuration file. The client should interact with the shard using the JSON protocol. Milestone 1 will be graded at 50% of the project grade. Please upload report.pdf to Moodle with no more than one paragraph each describing what is working, what is not working, and your plans for the rest of the project.

Final report: Please write a one-page report describing what your system does as if you were preparing a marketing brochure for this product. You can look at how companies marketing similar products write their brochures. It should say the product is currently in beta and have a list of features that are not yet production ready. That is how you can list the limitations of your project. Your submission should be a single report.pdf. This is in addition to the API documentation and README you have prepared for the project, which should be in your GitHub repository. The report.pdf file should be submitted through Moodle.

Submission

Please submit your project code to GitHub in a folder called p1. There should be a Makefile in the folder.

Your p1 folder should also have shardapi.html, which is the documentation of the API calls provided by your sharding service. Please format it nicely like what you see online. It is ok to copy formatting styles from the web.

Your p1 folder should also have doc.html, which contains the documentation of the protocol used by your client to interact with the shards and by the shards to interact with other shards. It should also include a link to the documentation of the API provided by the sharding service.

To grade your submission, we will clone your repo on program.cs.uh.edu, go to your p1 folder and run "make". It should generate three executables "shard", "client", and "monitor".