Monday, February 7, 2011

Asynchronous file IO in .Net

I'm building a toy database in C# to learn more about compiler, optimizer, and indexing technology.

I want to maintain maximum parallelism between requests (at least read requests) that bring pages into the buffer pool, but I am confused about how best to accomplish this in .NET.

Here are some options and the problems I've come across with each:

  1. Use System.IO.FileStream and the BeginRead method

    But, the position in the file isn't an argument to BeginRead, it is a property of the FileStream (set via the Seek method), so I can only issue one request at a time and have to lock the stream for the duration. (Or do I? The documentation is unclear on what would happen if I held the lock only between the Seek and BeginRead calls but released it before calling EndRead. Does anyone know?) I know how to do this, I'm just not sure it is the best way.

  2. There seems to be another way, centered around the System.Threading.Overlapped structure and P/Invoke to the ReadFileEx function in kernel32.dll.

    Unfortunately, there is a dearth of samples, especially in managed languages. This route (if it can be made to work at all) apparently also involves the ThreadPool.BindHandle method and the IO completion threads in the thread pool. I get the impression that this is the sanctioned way of dealing with this scenario under Windows, but I don't understand it and I can't find an entry point to the documentation that is helpful to the uninitiated.

  3. Something else?

  4. In a comment, Jacob suggests creating a new FileStream for each read in flight.

  5. Read the whole file into memory.

    This would work if the database were small. The codebase is small, and there are plenty of other inefficiencies, but the database itself isn't. I also want to be sure I am doing all the bookkeeping needed to deal with a large database (which turns out to be a huge part of the complexity: paging, external sorting, ...) and I'm worried it might be too easy to accidentally cheat.
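For reference, option 4 can be sketched roughly like this (an illustrative example, not code from the post; the `PageReader` name and the 4 KB page size are assumptions). Opening a short-lived FileStream per request with FileOptions.Asynchronous gives each read its own file position, so no lock is needed across Seek and BeginRead:

```csharp
using System;
using System.IO;

static class PageReader
{
    const int PageSize = 4096; // assumed page size, for illustration only

    // Begin an asynchronous read of one page. Each call gets its own
    // FileStream, so concurrent reads never fight over a shared Position.
    public static IAsyncResult BeginReadPage(string path, long pageNumber,
        byte[] buffer, AsyncCallback callback)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.Read, PageSize, FileOptions.Asynchronous);
        fs.Seek(pageNumber * PageSize, SeekOrigin.Begin);
        // Stash the stream in AsyncState so EndReadPage can close it.
        return fs.BeginRead(buffer, 0, PageSize, callback, fs);
    }

    public static int EndReadPage(IAsyncResult ar)
    {
        var fs = (FileStream)ar.AsyncState;
        int bytesRead = fs.EndRead(ar);
        fs.Dispose(); // the stream lived only for this one read
        return bytesRead;
    }
}
```

The cost is one handle open/close per read; whether that overhead matters relative to the disk seek is something to measure.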

Edit

Clarification of why I'm suspicious of solution 1: holding a single lock all the way from BeginRead to EndRead means I need to block anyone who wants to initiate a read just because another read is in progress. That feels wrong, because the thread initiating the new read might be able (in general) to do some more work before the results become available. (Actually, just writing this led me to a new solution, which I've posted as a new answer.)

  • I'm not sure I see why option 1 wouldn't work for you. Keep in mind that you can't have two different threads trying to use the same FileStream at the same time - doing so will definitely cause you problems. BeginRead/EndRead is meant to let your code continue executing while the potentially expensive IO operation takes place, not to enable some sort of multi-threaded access to a file.

    So I would suggest that you Seek and then do a BeginRead.

    Jacob : Agreed; you should use a new FileStream object for each asynchronous read in flight.
  • What if you loaded the resource (file data or whatever) into memory first and then shared it across threads? Since it is a small db, you won't have as many issues to deal with.

    Doug McClean : This works in some cases, but I meant "small" in the sense of "few features" rather than "not much data."
  • Use approach #1, but

    1. When a request comes in, take lock A, which protects a queue of pending read requests. Add the request to the queue and create a new async result to return. Release lock A; then, if this was the first addition to the queue, call step 2 before returning.

    2. When a read completes (or when called from step 1), take lock A and use it to protect popping a read request from the queue; release it once the request is popped. Take lock B and use it to protect the Seek -> BeginRead -> EndRead sequence, then release lock B. Update the async result created by step 1 for this read operation. (Since a read operation just completed, call this step again.)

    This solves the problem of not blocking any thread that begins a read just because another read is in progress, but still sequences reads so that the file stream's current position doesn't get messed up.
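A minimal sketch of this two-lock scheme follows. The class and member names are invented for the example, and for brevity it performs the read synchronously inside lock B, where the answer above would use BeginRead/EndRead and re-enter step 2 from the completion callback:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class SequencedReader
{
    readonly FileStream stream;           // shared stream; position guarded by lock B
    readonly object lockA = new object(); // protects the pending-request queue
    readonly object lockB = new object(); // protects the Seek -> Read sequence
    readonly Queue<Request> pending = new Queue<Request>();

    class Request
    {
        public long Offset;
        public byte[] Buffer;
        public Action<int> Done; // receives the byte count when finished
    }

    public SequencedReader(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    // Step 1: enqueue under lock A; if we were first, kick off step 2.
    public void QueueRead(long offset, byte[] buffer, Action<int> done)
    {
        bool first;
        lock (lockA)
        {
            pending.Enqueue(new Request { Offset = offset, Buffer = buffer, Done = done });
            first = pending.Count == 1;
        }
        if (first) Pump();
    }

    // Step 2: pop under lock A, do the positioned read under lock B,
    // and loop while more requests are waiting.
    void Pump()
    {
        while (true)
        {
            Request req;
            lock (lockA)
            {
                if (pending.Count == 0) return;
                req = pending.Dequeue();
            }
            int n;
            lock (lockB)
            {
                stream.Seek(req.Offset, SeekOrigin.Begin);
                n = stream.Read(req.Buffer, 0, req.Buffer.Length);
            }
            req.Done(n);
        }
    }
}
```

Note that lock A is held only long enough to touch the queue, so a thread initiating a new read is never blocked for the duration of someone else's IO.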

  • What we did was to write a small layer around I/O completion ports, ReadFile, and GetQueuedCompletionStatus in C++/CLI, and then call back into C# when the operation completed. We chose this route over BeginRead and the C# async operation pattern to provide more control over the buffers used to read from the file (or socket). This was a pretty big performance gain over the purely managed approach, which allocates a new byte[] on the heap with each read.

    Plus, there are a lot more complete C++ examples of using IO completion ports out on the interwebs.

    Doug McClean : This is a good idea. You can also avoid allocating new byte[]s (and thrashing the large object heap) by pre-allocating them in big chunks when you create (or grow) the buffer pool.
    Doug McClean : Also, I didn't know about GetQueuedCompletionStatus (or read past it somehow), which probably explains why my attempts at this failed. Time to go read some more.
    From Dave Moore
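The pre-allocation idea from that last exchange can be sketched as follows (a hypothetical BufferPool, not code from the post): make one large allocation up front and carve it into page-sized segments, so individual reads never allocate fresh byte[]s or churn the large object heap:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical buffer pool: one big byte[] carved into page-sized
// segments that are recycled instead of reallocated per read.
class BufferPool
{
    readonly byte[] storage;
    readonly Stack<ArraySegment<byte>> free = new Stack<ArraySegment<byte>>();

    public BufferPool(int pageSize, int pageCount)
    {
        storage = new byte[pageSize * pageCount]; // single large allocation
        for (int i = 0; i < pageCount; i++)
            free.Push(new ArraySegment<byte>(storage, i * pageSize, pageSize));
    }

    public ArraySegment<byte> Rent()
    {
        lock (free)
        {
            if (free.Count == 0)
                throw new InvalidOperationException("pool exhausted");
            return free.Pop();
        }
    }

    public void Return(ArraySegment<byte> segment)
    {
        lock (free) free.Push(segment);
    }
}
```

A real buffer pool would also want to grow in further large chunks rather than throw when exhausted, as the comment above suggests.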
