Tuesday, January 25, 2011

Parallel File Copy

I have a list of files I need to copy on a Linux system - each file ranges from 10 to 100GB in size.

I only want to copy to the local filesystem. Is there a way to do this in parallel - with multiple processes each responsible for copying a file - in a simple manner?

I can easily write a multithreaded program to do this, but I'm interested in finding out if there's a low level Linux method for doing this.

  • There is no low-level mechanism for this for a very simple reason: doing this will destroy your system performance. With platter drives each write will contend for placement of the head, leading to massive I/O wait. With SSDs, this will end up saturating one or more of your system buses, causing other problems.

    Jon : Err that doesn't seem to be the case with a single cp at present, I'm sure there's a happy medium for multiple parallel "cp's" at which your I/O channel doesn't become completely saturated...
  • As mentioned, this is a terrible idea. But I believe everyone should be able to implement their own horrible plans, sooo...

    for FILE in *; do cp "$FILE" <destination> & done; wait

    The asterisk can be replaced with a glob pattern matching your files, or $(cat <listfile>) if you've got them all in a text document (as long as the names contain no whitespace). The ampersand kicks off each command in the background, so the loop continues, spawning off more copies; the final wait blocks until all of them have finished.

    As mentioned, this will completely annihilate your IO. So...I really wouldn't recommend doing it.

    --Christopher Karel

  • The only answer that will not trash your machine's responsiveness isn't exactly a 'copy', but it is very fast. If you won't be editing the files in the new or old location, then a hard link is effectively like a copy, and (only if you're on the same filesystem) hard links are created very, very fast.

    Check out cp -l and see if it will work for you.
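
    A minimal sketch of what cp -l does, assuming GNU coreutils (for cp -l and stat -c) and throwaway files in a mktemp directory; the names src, dst and big.img are made up for illustration. Both names end up on the same inode, which is why no data is copied and why an edit made through one name shows up under the other.

    ```shell
    #!/bin/sh
    # Hypothetical scratch layout; src/, dst/ and big.img are illustration names only.
    work=$(mktemp -d)
    mkdir "$work/src" "$work/dst"
    printf 'payload\n' > "$work/src/big.img"

    # -l creates a hard link instead of copying data (same filesystem only).
    cp -l "$work/src/big.img" "$work/dst/big.img"

    # Equal inode numbers and a link count of 2 mean one set of data blocks, two names.
    src_inode=$(stat -c %i "$work/src/big.img")
    dst_inode=$(stat -c %i "$work/dst/big.img")
    links=$(stat -c %h "$work/src/big.img")
    echo "inodes: $src_inode $dst_inode, links: $links"
    ```

    If the two paths are on different filesystems, cp -l fails rather than silently copying, so it is safe to try.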

  • If your system is not trashed by it (e.g. maybe the files are in cache) then GNU Parallel http://www.gnu.org/software/parallel/ may work for you:

    find . -type f -print0 | parallel -0 cp {} destdir

    This will run 9 concurrent cp's by default; the -j option sets a different number of jobs.

    Pro: It is simple to read.

    Con: GNU Parallel is not standard on most systems - so you probably have to install it.

    Watch the intro video for more info: http://www.youtube.com/watch?v=OpaiGYxkSuQ

    From Ole Tange
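
For the "happy medium" Jon asks about, one option that needs no extra install is xargs with -P, which caps how many cp processes run at once. This is a sketch under assumptions, not something any of the answers above proposed: it assumes GNU findutils and coreutils (xargs -P, cp -t), the scratch directories and file names are hypothetical, and the limit of 3 is arbitrary and would need tuning against the real disks.

```shell
#!/bin/sh
# Hypothetical scratch layout standing in for the real source and destination.
work=$(mktemp -d)
mkdir "$work/src" "$work/dest"
for n in 1 2 3 4 5 6; do
    printf 'file %s\n' "$n" > "$work/src/file$n.dat"
done

# -print0 / -0 keep odd file names intact; -n 1 hands each cp a single file;
# -P 3 caps the number of cp processes running at the same time.
find "$work/src" -type f -print0 |
    xargs -0 -n 1 -P 3 cp -t "$work/dest"

# xargs returns only after every cp it started has exited.
copied=$(find "$work/dest" -type f | wc -l)
echo "copied $copied files"
```

Starting with -P 2 and raising it while watching iostat or similar is one way to find the point where the extra processes stop helping.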
