Usually data is hot when it is fresh and is accessed very often. SeaweedFS normal volumes try hard to minimize disk operations, but at the cost of loading indexes into memory.
However, data becomes warm or cold after a period of time and is accessed much less often. Keeping indexes in memory is not cost-efficient for warm storage. To store such data more efficiently, you can "seal" it and enable erasure coding (EC).
- Storage Efficiency: SeaweedFS implements RS(10,4), which tolerates the loss of any 4 shards while using only 1.4x the original data size (10 data shards + 4 parity shards = 14/10 = 1.4x). Compared to keeping 5 replicas to achieve the same robustness, it saves about 3.6x disk space (5 / 1.4 ≈ 3.6).
- Fast Read Speed: SeaweedFS uses a contiguous 1GB block layout, with 1MB blocks for edge cases, optimized for both small file reads and storage efficiency.
- Optimized for Small Files: there is no file size requirement for EC to be effective.
- High Availability: If up to 4 shards are down, the data is still accessible with reasonable speed.
- Memory Efficiency: minimal memory usage. The volume server does not load index data into memory.
- Fast Startup: startup time is much shorter because index data is not loaded into memory.
- Rack-Aware Placement: data placement is rack-aware to minimize the impact of volume server and rack failures.
The downsides:
- If some EC shards are missing, fetching data that lives on those shards is slower.
- Reconstructing missing EC shards requires transmitting the whole volume's data.
- Only deletion is supported. Updates are not supported.
- Compaction requires converting back to normal volumes first (see the sketch after this list).
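For reference, converting an erasure-coded volume back to a normal volume uses the ec.decode command in weed shell. A minimal sketch, assuming a master at localhost:9333 and volume id 3 (both placeholders):

```
# convert EC volume 3 back to a normal volume so it can be compacted or updated again
# (master address and volume id are placeholders)
printf 'lock\nec.decode -volumeId=3\nunlock\n' | weed shell -master=localhost:9333
```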
Side Note:
- The 10+4 split can be adjusted via DataShardsCount and ParityShardsCount in https://github.com/chrislusf/seaweedfs/blob/master/weed/storage/erasure_coding/ec_encoder.go#L17
- If you are considering these enterprise-level customizations, please consider supporting SeaweedFS first.
Architecture
SeaweedFS implements RS(10,4) Reed-Solomon Erasure Coding (EC). A large volume is split into 1GB blocks, and every 10 data blocks are encoded into 4 additional parity blocks. So a 30 GB data volume is encoded into 14 EC shards; each shard is 3 GB in size and contains 3 EC blocks.
Since the data is split into 1GB blocks, one small file is usually contained in one shard, or possibly two shards in edge cases. So most reads still cost only O(1) disk reads.
For smaller volumes of less than 10GB, and for edge cases, the volume is split into smaller 1MB blocks.
The 14 EC shards should be spread across disks, volume servers, and racks as evenly as possible, to protect against data loss caused by hardware failures.
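For example, you can erasure encode a specific volume and spread its shards from weed shell. A minimal sketch, assuming a master at localhost:9333 and volume id 3 (both placeholders):

```
# encode volume 3 into 10+4 EC shards, spread the shards evenly across
# servers and racks, then list the resulting layout
# (master address and volume id are placeholders)
printf 'lock\nec.encode -volumeId=3\nec.balance -force\nvolume.list\nunlock\n' | weed shell -master=localhost:9333
```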
Run weed scaffold -conf=master to generate a master.toml file, and put it in the current directory, ~/.seaweedfs/, or /etc/seaweedfs/.
The scripts in the master.toml file are executed on the master. If you have a large number of EC volumes, processing all of them on the master can cost noticeable CPU. It is better to run the same commands with weed shell via a cron job on a separate machine.
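As an illustration, the EC maintenance commands described below can be piped into weed shell from cron on another machine. A sketch, with the master address and schedule as placeholders:

```
# crontab entry: run the EC maintenance commands hourly on a separate machine
# (master address and schedule are placeholders)
0 * * * *  printf 'lock\nec.encode -fullPercent=95 -quietFor=1h\nec.rebuild -force\nec.balance -force\nunlock\n' | weed shell -master=master1:9333
```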
How do the scripts work?
The scripts have 3 steps related to erasure coding.
The ec.encode command finds volumes that are almost full and have been quiet for a period of time.
The default command is ec.encode -fullPercent=95 -quietFor=1h. It selects volumes that have reached at least 95% of the maximum volume size (usually 30GB) and have had no updates for 1 hour.
Note: One collection can contain both normal volumes and erasure coded volumes, with write requests going to normal volumes.
If disks fail or servers fail, some data shards are lost. With erasure coding, we can recover the lost data shards from the remaining data shards.
The default command is ec.rebuild -force.
The repair happens for the whole volume, instead of one small file at a time, which makes reconstructing the missing shards much faster and more efficient than processing each file individually.
As servers are added or removed, some shards may no longer be laid out optimally. For example, 5 of one volume's shards could end up on the same server; if that server goes down, the volume becomes unrepairable and part of the data is lost permanently. The ec.balance command redistributes the EC shards evenly across servers and racks to avoid this.
When all shards are online, a read for one file key is assigned to one volume server (A) that has at least one shard of the volume. Server A reads its copy of the index file, locates the volume server (B) that holds the needed data, and reads the file content from server B.
For example, one read request for 3,0144448888 will:
- Ask the master server to locate the EC shards for volume 3; the answer is usually a list of volume servers.
- Server A reads its local index file and finds the right volume server B that has the file content. Sometimes it may need to contact additional servers if the file spans multiple blocks, but since a data shard's block size is usually 1GB, this does not happen often.

In normal operation there is usually one extra network hop compared to normal volume reads.
In case of missing data shards or read failures from server B, server A will try to collect as many pieces of data as possible from the remaining servers, and reconstruct the requested data.
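To make the lookup step concrete, the shard locations for a volume can be queried over HTTP. A hedged sketch, assuming the master runs on localhost:9333 and a volume server on localhost:8080 (both placeholders):

```
# step 1: ask the master which volume servers hold volume 3 (or its EC shards)
curl "http://localhost:9333/dir/lookup?volumeId=3"

# step 2: read the file key from one of the returned volume servers (server A);
# for an EC volume, that server resolves the right shard(s) internally
curl -o out.bin "http://localhost:8080/3,0144448888"
```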
Read Performance
For this test, I started 4 volume servers. With each volume divided into 14 shards, each server hosts 3 or 4 shards of each volume. If one volume server is down, the data is still readable and repairable.
Here are the benchmark numbers by weed benchmark -master localhost:9334 -n 102040 -collection=t -write=true
Then I forced the volumes to be erasure encoded.
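One way to force encoding regardless of volume fullness is to lower the ec.encode thresholds from weed shell. A sketch with hypothetical threshold values and master address:

```
# force-encode collection "t" by making every volume qualify
# (threshold values and master address are hypothetical)
printf 'lock\nec.encode -collection=t -fullPercent=0.001 -quietFor=1s\nunlock\n' | weed shell -master=localhost:9334
```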
Here is the EC read performance, measured with weed benchmark -master localhost:9334 -n 102040 -collection=t -write=false.
You may need to run it twice because of a one-time read of the volume version. The EC read performance is about half of the normal volume read performance, due to the extra network hop.
Now let's stop one of the 4 servers. There will be 3 or 4 shards missing, out of the total 14 shards. It is still readable. But the read speed is slower because the volume server needs to contact all other servers to reconstruct the missing data.