A few days ago I built https://github.com/overflowy/parallel-rsync to scratch my own itch: I realized I could just launch multiple rsync instances in parallel to speed things up.
The claim of being 7x faster than rsync is very dubious. I would like to know the test conditions for such a result.
I use rsync over SSH every day, and even between computers that are 7 to 10 years old it reaches the maximum link speed over 2.5 Gb/s Ethernet.
So, in order to need something faster than rsync and to be able to test it, one must use at least 10 Gb/s Ethernet, and I do not know how fast a CPU must be to reach link speed there.
For a 7x speedup, one would need at least 25 Gb/s Ethernet, and that is assuming the worst case for rsync, i.e. that it is no faster on higher-speed Ethernet than what I see on cheap 2.5 Gb/s Ethernet.
If, on higher-speed Ethernet, the link speed cannot be reached because an ancient CPU is too slow for AES-GCM or for AES with UMAC, then using multiple connections would not improve the speed either. If the speed is not limited by encryption, then tuning TCP parameters, such as window sizes, would probably have the same effect as using multiple connections, even with plain rsync over SSH.
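The window-size point can be sanity-checked with the bandwidth-delay product: a single TCP connection stalls unless its window covers link speed times round-trip time. A minimal sketch (the 10 Gb/s and 0.5 ms figures are illustrative, not from the thread):

```shell
# Bandwidth-delay product: the TCP window needed to keep a link full.
bdp_bytes() {
    # $1 = link speed in bits/s, $2 = round-trip time in seconds
    awk -v bps="$1" -v rtt="$2" 'BEGIN { print bps / 8 * rtt }'
}

bdp_bytes 10e9 0.5e-3   # -> 625000 (bytes), i.e. ~610 KiB of window
```

If the kernel's default maximum window is smaller than this, raising it (or opening several connections, each with its own window) gives the same aggregate effect.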
If the transfers go over the Internet, then the speed is throttled by some ISP and is not determined by your computers. There are cases where a small number of connections, e.g. 2 or 3, may achieve higher aggregate throughput than 1, but in most cases I have seen, ISPs limit the aggregate throughput of traffic going to a single IP address, so opening more connections yields the same throughput as fewer.
> I use rsync over SSH every day, and even between computers that are 7 to 10 years old it reaches the maximum link speed over 2.5 Gb/s Ethernet.
What are you rsyncing? Maildirs for 5000 users? Or a multi-TB music and movie archive? The former might benefit greatly if the filesystem and its flash backing store are bottlenecking on metadata lookups, not bandwidth. The latter, not so much.
I too would like to know the test conditions. This is probably one of those tools that is lovely for the right use case, useless for the wrong one.
Anecdote: I have rsync'd maildirs, and I recall managing a ~7x perf improvement by combining rsync with GNU parallel (it is trivial to fan out on each maildir)
It used to be possible in OpenSSH to use -c none and skip the overhead of encryption for the transport (while retaining the protection of RSA keys for authentication). Even the deprecated blowfish-cbc was often faster than AES for bulk transfers. I remember cutting hours of wait time off backup jobs using these options.
Sadly it appears those days are gone now. 3des is still supported, probably for some governmental environments, but it was always a slower algorithm. Unless there are undocumented hacks I think we're stuck with using proper crypto. Oh darn.
It is a bottleneck for multiple files, but will it speed up a single file? This is how we sent files for decades: archive, transfer, unarchive. So I'm wondering what the point is.
It depends on the size of the file, of course. For copying your 90 line .bashrc, probably not noticeable in the noise. For copying an 800GB database? Um, yeah. :-)
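The archive/transfer/unarchive pattern mentioned above is just a tar pipeline. Demonstrated locally here so the pipe itself is testable; over a network, ssh sits in the middle (the host name is hypothetical):

```shell
# Network form of the classic pipeline:
#   tar -C src -cf - . | ssh somehost 'tar -C dst -xf -'
# Local demonstration of the same pipe:
mkdir -p src dst
echo "payload" > src/big.db

tar -C src -cf - . | tar -C dst -xf -
```

Because both ends stream, this hides per-file round trips entirely, which is exactly why it holds up for one big file where parallel fan-out has nothing to split.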
I see this project's main value in turning multiple cores loose on a filesystem full of many directories, backed by flash-based storage that only runs optimally at queue depth >1 (which is most of it). On spinning rust this will probably just thrash the heads.
Hmmm. I wonder how 2 or 3 threads perform with zfs and a reasonable sized ARC?
When I think of those obscenely ugly scripting hacks I used to do back in the day....
"Well, trust me, this way's easier." -- Bill Weasley