I'm building a toy database in C# to learn more about compiler, optimizer, and indexing technology.
I want to maintain maximum parallelism between (at least read) requests for bringing pages into the buffer pool, but I am confused about how best to accomplish this in .NET.
Here are some options and the problems I've come across with each:
1. Use `System.IO.FileStream` and the `BeginRead` method. But the position in the file isn't an argument to `BeginRead`; it is a property of the `FileStream` (set via the `Seek` method), so I can only issue one request at a time and have to lock the stream for the duration. (Or do I? The documentation is unclear on what would happen if I held the lock only between the `Seek` and `BeginRead` calls but released it before calling `EndRead`. Does anyone know?) I know how to do this, I'm just not sure it is the best way.

2. There seems to be another way, centered around the `System.Threading.Overlapped` structure and P/Invoke to the `ReadFileEx` function in kernel32.dll. Unfortunately, there is a dearth of samples, especially in managed languages. This route (if it can be made to work at all) apparently also involves the `ThreadPool.BindHandle` method and the I/O completion threads in the thread pool. I get the impression that this is the sanctioned way of dealing with this scenario under Windows, but I don't understand it, and I can't find an entry point to the documentation that is helpful to the uninitiated.

3. Something else? In a comment, jacob suggests creating a new `FileStream` for each read in flight.

4. Read the whole file into memory. This would work if the database were small. The codebase is small, and there are plenty of other inefficiencies, but the database itself isn't. I also want to be sure I am doing all the bookkeeping needed to deal with a large database (which turns out to be a huge part of the complexity: paging, external sorting, ...), and I'm worried it might be too easy to accidentally cheat.
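For concreteness, here is a minimal sketch of what I mean by option 1. The class name and page-read API are mine, not from any library; the lock is held only across `Seek` and `BeginRead` and released before `EndRead`, which is exactly the pattern whose safety I'm unsure about, so treat this as illustrative only.

```csharp
using System;
using System.IO;

// Sketch of option 1: a single shared FileStream whose position is
// protected by a lock across the Seek + BeginRead pair.
class PagedFile
{
    private readonly FileStream _stream;
    private readonly object _ioLock = new object();

    public PagedFile(string path)
    {
        // useAsync: true asks for overlapped I/O under the covers.
        _stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                 FileShare.Read, 4096, useAsync: true);
    }

    public IAsyncResult BeginReadPage(long offset, byte[] buffer,
                                      AsyncCallback callback, object state)
    {
        lock (_ioLock)
        {
            // Position and issue the read atomically with respect to
            // other readers; whether releasing the lock here (before
            // EndRead) is safe is the open question above.
            _stream.Seek(offset, SeekOrigin.Begin);
            return _stream.BeginRead(buffer, 0, buffer.Length, callback, state);
        }
    }

    public int EndReadPage(IAsyncResult ar)
    {
        return _stream.EndRead(ar);
    }
}
```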
Edit
Clarification of why I'm suspicious of solution 1: holding a single lock all the way from `BeginRead` to `EndRead` means I need to block anyone who wants to initiate a read just because another read is in progress. That feels wrong, because the thread initiating the new read might be able (in general) to do some more work before the results become available. (Actually, just writing this has led me to think up a new solution, which I've put in a new answer.)
-
I'm not sure I see why option 1 wouldn't work for you. Keep in mind that you can't have two different threads trying to use the same `FileStream` at the same time; doing so will definitely cause you problems. `BeginRead`/`EndRead` is meant to let your code continue executing while the potentially expensive I/O operation takes place, not to enable some sort of multi-threaded access to a file.

So I would suggest that you `Seek` and then call `BeginRead`.
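One way to make "seek then `BeginRead`" safe without any locking at all is the per-read `FileStream` idea from option 3 in the question: each in-flight read owns its own stream, so no position is shared. A rough sketch (the helper name is mine; the trade-off is one file open per read):

```csharp
using System;
using System.IO;

static class PerReadIo
{
    // Each call opens a fresh FileStream, so concurrent reads never
    // contend for a shared file position. FileShare.Read lets the
    // handles coexist.
    public static IAsyncResult BeginReadAt(string path, long offset,
                                           byte[] buffer, AsyncCallback callback)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                FileShare.Read, 4096, useAsync: true);
        fs.Seek(offset, SeekOrigin.Begin);
        // The stream is passed as the state object; the callback is
        // responsible for calling fs.EndRead(ar) and then fs.Dispose().
        return fs.BeginRead(buffer, 0, buffer.Length, callback, fs);
    }
}
```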
Jacob : Agreed; you should use a new `FileStream` object for each asynchronous read in flight.

From John Christensen -
What if you loaded the resource (file data or whatever) into memory first and then shared it across threads? Since it is a small db, you won't have as many issues to deal with.
Doug McClean : This works in some cases, but I meant "small" in the sense of "few features" rather than "not much data."

From typemismatch -
Use approach #1, but:

1. When a request comes in, take lock A. Use it to protect a queue of pending read requests. Add the request to the queue and return some new async result. If this results in the first addition to the queue, call step 2 before returning. Release lock A before returning.

2. When a read completes (or when called by step 1), take lock A. Use it to protect popping a read request from the queue. Take lock B. Use it to protect the `Seek` -> `BeginRead` -> `EndRead` sequence. Release lock B. Update the async result created by step 1 for this read operation. (Since a read operation completed, call this again.)
This solves the problem of not blocking any thread that begins a read just because another read is in progress, but still sequences reads so that the file stream's current position doesn't get messed up.
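The two-lock scheme above can be sketched roughly as follows. All names (`ReadScheduler`, `PendingRead`) are mine, and a simple callback stands in for the async result; I've also added a `_busy` flag, taken under lock A, to close the small race between dequeuing the last request and its completion, and I take lock B around `EndRead` in the completion callback as my reading of "protect the `Seek` -> `BeginRead` -> `EndRead` sequence".

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class ReadScheduler
{
    private readonly FileStream _stream;           // shared, opened with useAsync: true
    private readonly object _lockA = new object(); // protects _pending and _busy
    private readonly object _lockB = new object(); // protects the stream's position
    private readonly Queue<PendingRead> _pending = new Queue<PendingRead>();
    private bool _busy;

    private class PendingRead
    {
        public long Offset;
        public byte[] Buffer;
        public Action<int> OnDone; // stands in for the async result
    }

    public ReadScheduler(FileStream stream) { _stream = stream; }

    // Step 1: enqueue the request; if no read is in flight, start one.
    public void Request(long offset, byte[] buffer, Action<int> onDone)
    {
        bool start;
        lock (_lockA)
        {
            _pending.Enqueue(new PendingRead { Offset = offset, Buffer = buffer, OnDone = onDone });
            start = !_busy;
            if (start) _busy = true;
        }
        if (start) IssueNext();
    }

    // Step 2: pop one request and issue it; on completion, try the next.
    private void IssueNext()
    {
        PendingRead next;
        lock (_lockA)
        {
            if (_pending.Count == 0) { _busy = false; return; }
            next = _pending.Dequeue();
        }
        lock (_lockB)
        {
            _stream.Seek(next.Offset, SeekOrigin.Begin);
            _stream.BeginRead(next.Buffer, 0, next.Buffer.Length, ar =>
            {
                int n;
                lock (_lockB) { n = _stream.EndRead(ar); }
                next.OnDone(n);  // update the async result from step 1
                IssueNext();     // a read completed, so pump the queue again
            }, null);
        }
    }
}
```

Note that this sequences the reads (one in flight at a time), which is the point: callers are never blocked for the duration of someone else's read, but the stream's position is never corrupted.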
From Doug McClean -
What we did was to write a small layer around I/O completion ports, `ReadFile`, and `GetQueuedCompletionStatus` in C++/CLI, and then call back into C# when the operation completed. We chose this route over `BeginRead` and the C# async operation pattern to provide more control over the buffers used to read from the file (or socket). This was a pretty big performance gain over the purely managed approach, which allocates a new byte[] on the heap with each read.

Plus, there are a lot more complete C++ examples of using I/O completion ports out on the interwebs.
Doug McClean : This is a good idea. You can also avoid allocating new byte[]s (and thrashing the large object heap) by pre-allocating them in big chunks when you create (or grow) the buffer pool.

Doug McClean : Also, I didn't know about `GetQueuedCompletionStatus` (or read past it somehow), which probably explains why my attempts at this failed. Time to go read some more.

From Dave Moore -
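The pre-allocation idea from the comment above can be sketched like this: carve fixed-size page buffers out of one large array, so each chunk is a single large-object-heap allocation instead of one allocation per read. The class name and sizes are illustrative, not from any library.

```csharp
using System;
using System.Collections.Generic;

// A simple pooled-buffer sketch: pages are slices of big pre-allocated
// chunks, handed out and returned instead of being garbage-collected.
class BufferPool
{
    private const int PageSize = 8192;
    private readonly Stack<ArraySegment<byte>> _free = new Stack<ArraySegment<byte>>();

    public BufferPool(int pagesPerChunk) { Grow(pagesPerChunk); }

    private void Grow(int pages)
    {
        var chunk = new byte[pages * PageSize]; // one big allocation
        for (int i = 0; i < pages; i++)
            _free.Push(new ArraySegment<byte>(chunk, i * PageSize, PageSize));
    }

    public ArraySegment<byte> Rent()
    {
        lock (_free)
        {
            if (_free.Count == 0) Grow(16); // grow the pool on demand
            return _free.Pop();
        }
    }

    public void Return(ArraySegment<byte> page)
    {
        lock (_free) _free.Push(page);
    }
}
```

A rented segment's `Array`, `Offset`, and `Count` can be passed straight to `BeginRead`, so reads reuse pool memory rather than allocating.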