Recently I needed to move users from a Lustre parallel filesystem to a new Lustre filesystem. This task reminded me of a common problem.
Don't create many small files.
A common mistake that can kill application performance is writing many small files. Most HPC systems use some sort of networked file system. NFS is probably the most common, but more and more use Lustre. Let us quickly think about what happens in the NFS case.
With NFS, to create a file the client node must talk over the network to the NFS server, ask for the file to be created, wait for a reply, and then write something to it. When the file is then closed, the file server will almost always wait for the data to hit disk before telling the client the write has finished.
This process is slow, very slow; compared to using a local hard drive it is significantly slower. It is not uncommon to find users' applications spending more time in open() and close() than in write() or read().
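To make that concrete, here is a minimal C sketch of the slow pattern. The loop count, record size, and file names are made up for the illustration, but every pass through the loop pays a full open/close round trip to the file server, and creates yet another small file:

/* Anti-pattern: one file, and one open()/close() round trip, per record. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char record[4096] = {0};                 /* dummy 4 KB record */
    for (int i = 0; i < 100000; i++) {
        char name[64];
        snprintf(name, sizeof(name), "record-%06d.dat", i);
        int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, record, sizeof(record)) < 0) { perror("write"); return 1; }
        close(fd);   /* the server typically commits the data before acknowledging */
    }
    return 0;
}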
On the other hand, if a user leaves files open and keeps appending to them, the open()/close() overhead is eliminated. This does introduce a risk of data loss, because without a close() or sync()/flush() the file server may not have committed the data to disk, but if the application crashes we need to restart anyway. Obviously checkpoint files need to be close()'d to be useful.
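Here is a rough sketch of the friendlier pattern, using the same made-up sizes as above: open the file once, keep appending, and only force the data to disk when a checkpoint genuinely has to survive a crash:

/* Open once, append many records, fsync() only at checkpoint boundaries. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char record[4096] = {0};                 /* dummy 4 KB record */
    int fd = open("results.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 100000; i++) {
        if (write(fd, record, sizeof(record)) < 0) { perror("write"); return 1; }
        if (i > 0 && i % 10000 == 0)
            fsync(fd);                       /* checkpoint: force the data to disk */
    }

    close(fd);                               /* one open/close pair for the whole run */
    return 0;
}

The second version pays the open/close overhead once for the whole run instead of once per record.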
Now let us look at Lustre. Lustre metadata (the existence of the file) lives on its own server, and there is only one for an entire file system. That huge tens-of-petabytes filesystem? Only one metadata server. This can be a bottleneck. To open a file, the client first talks to this MDS (metadata server), which tells the client which OSS (object storage server) to write data to; Lustre will have many OSSes. If the client keeps creating new files, or keeps opening and closing the same file, it keeps making that trip back to the single MDS. If the client creates one file, doesn't close it, and keeps writing to it, the client never speaks to the MDS again, only to the many OSS nodes.
Not only does the client avoid making that extra network trip to the MDS and back multiple times, it also avoids a single-server bottleneck.
In a future post we will look at the performance of manipulating many small files versus fewer, larger files.