Sunday, January 16, 2011

Big Data: Which HD Parameters are Important?

I work with a lot of datasets that are in the tens of GBs, usually split into several files. Performing any dataset-wide operation (grep, sed, search, read/write to/from databases and Hadoop) on these files is of course very slow and time consuming. Until now, I have been using whatever HD I could get for a good deal - typically Seagates at 5400rpm or 7200rpm.

It is time for me to upgrade the HD. What parameters should I be looking at for the type of work I described? Spindle speed? Interface? Seek time and throughput? I have read in various places that some of them do not matter, so I am confused.

I can provide more info if this is not enough.
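
One rough way to see which of these numbers actually limits a workload like this is to measure them on a drive you already have. The sketch below is not from the original post; the file path is a placeholder for one of your multi-GB files. It times a sustained sequential scan and a batch of small random reads; run it with a cold page cache (e.g. after dropping caches on Linux) or the figures will be optimistic.

    # Rough benchmark sketch: sustained sequential throughput vs. random-read latency.
    # PATH is a placeholder - point it at one of your large dataset files.
    import os
    import random
    import time

    PATH = "/data/part-00000"    # placeholder: one of your multi-GB files
    BLOCK = 1024 * 1024          # 1 MiB blocks for the sequential pass
    SAMPLES = 200                # number of random 4 KiB reads to time

    def sequential_read_mb_s(path):
        """Read the whole file front to back and report MB/s."""
        start = time.time()
        total = 0
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                total += len(chunk)
        return (total / 1e6) / (time.time() - start)

    def random_read_ms(path):
        """Time SAMPLES random 4 KiB reads and report mean latency in ms."""
        size = os.path.getsize(path)
        start = time.time()
        with open(path, "rb", buffering=0) as f:
            for _ in range(SAMPLES):
                f.seek(random.randrange(0, max(size - 4096, 1)))
                f.read(4096)
        return (time.time() - start) / SAMPLES * 1000

    if __name__ == "__main__":
        print("sequential: %.1f MB/s" % sequential_read_mb_s(PATH))
        print("random 4K:  %.2f ms per read" % random_read_ms(PATH))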

  • Spindle speed is certainly important, as is seek time. But the most important thing for dealing with massive database files is the ability to do random reads/writes (that is, pulling a lot of data from different areas of the disk, as opposed to sequential read/write, where all the data is laid out in order on the disk).

    This is where SAS absolutely excels. With a plain IDE or SATA disk, if you have four non-sequential pieces of data and the requests arrive in a fixed order, the drive can end up doing an entire revolution to pick up each individual piece of data.

    With SAS, the controller will reorder the requests into the order in which they can best be serviced, so that multiple pieces of data can be picked up in a single revolution where possible. So even though the requests come in as A B C D, the SAS drive might serve them as A D C B, because that's the order they sit on the physical disk. A normal SATA/IDE drive can only serve them A B C D even though this is not the optimal order (a toy sketch of this reordering appears after these answers).

    James : Do you mean command queueing? Newer SATA drives do that as well, not just SAS...
    Farseeker : No, I mean re-ordering the commands into their optimal order, so that multiple chunks of random data can be retrieved within a single revolution.
    Farseeker : Unfortunately the only reference I can find to this on the intertron at the moment is a forum: http://www.tomshardware.com/forum/248547-32-raid-sata#t1760727 - hardly admissible in court, but I know I've read about this elsewhere.
    From Farseeker
  • Seek time is not that important for whole-dataset scanning/analysis operations (assuming you use flat files or a modern scalable database like Hypertable, instead of a traditional B-Tree-based database, which would require significant random seeks to scan large tables). If you rely on the random I/O of hard drives to deal with large data sets, you'd certainly be doing it wrong.

    The most important factors for this type of work are raw sustained (uncached) sequential read/write throughput and the ability to handle multiple scans at the same time without degrading to random I/O patterns (a rough way to test this yourself is sketched after these answers). There is a good benchmark from one of these benchmark sites for 1+ TB SATA drives. It showed that Seagate and Western Digital drives are pretty good at handling multiple scans, while Samsung drives degrade dramatically when more than one scan is going on.

    From obecalp
  • Solid state drives would really help you here, if you can afford them.

    From Massimo
  • Use more than one disk if you can - stripe them at the OS level or get Hadoop to distribute the data over multiple drives. Having more than one spindle seeking will massively improve performance, and it will be cheaper than SSDs (a parallel-scan sketch after these answers shows the idea).

    From James
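
A toy illustration of Farseeker's reordering point - everything here is made up for the example (one track, angular positions, a forward-only head), but it shows why serving the requests in arrival order costs about two revolutions while serving them in on-disk order A D C B costs about one:

    # Toy model of command queueing: requests arrive as A B C D, but serving them
    # in on-disk order needs fewer revolutions. Positions are invented angles on a
    # single track; the head only moves forward because the platter spins one way.

    def revolutions(order, positions, start=0.0):
        """Degrees travelled (in revolutions) to serve requests in the given order."""
        angle, travelled = start, 0.0
        for name in order:
            target = positions[name]
            travelled += (target - angle) % 360   # forward-only rotation
            angle = target
        return travelled / 360

    positions = {"A": 10, "B": 350, "C": 200, "D": 15}   # made-up angular positions

    fifo = ["A", "B", "C", "D"]                      # order the requests arrived in
    queued = sorted(positions, key=positions.get)    # order they pass under the head

    print("FIFO   :", round(revolutions(fifo, positions), 2), "revolutions")
    print("Queued :", round(revolutions(queued, positions), 2), "revolutions")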
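
To check obecalp's "multiple scans" point on your own hardware, a sketch along these lines (file names are placeholders; use a cold page cache for meaningful numbers) streams one and then two large files on the same disk and compares the combined throughput. A drive that handles concurrent scans well stays close to its single-scan rate; one that degrades to random I/O collapses:

    # Compare aggregate throughput of one sequential scan vs. two concurrent scans
    # on the same drive. FILES are placeholders for two large files on one disk.
    import threading
    import time

    FILES = ["/data/part-00000", "/data/part-00001"]   # placeholders
    BLOCK = 1024 * 1024

    def scan(path, totals, idx):
        """Sequentially read one file, recording how many bytes were read."""
        read = 0
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                read += len(chunk)
        totals[idx] = read

    def aggregate_mb_s(paths):
        totals = [0] * len(paths)
        threads = [threading.Thread(target=scan, args=(p, totals, i))
                   for i, p in enumerate(paths)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(totals) / 1e6 / (time.time() - start)

    print("1 scan :", round(aggregate_mb_s(FILES[:1]), 1), "MB/s")
    print("2 scans:", round(aggregate_mb_s(FILES), 1), "MB/s")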
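
And a minimal sketch of James's suggestion (the mount points and the ERROR pattern are placeholders): spread the files over several drives and scan them in parallel, one worker per spindle, so each disk streams sequentially instead of every scan fighting over one head:

    # Parallel scan across multiple drives: one worker process per mount point,
    # each streaming its own file. Paths and the search pattern are placeholders.
    import multiprocessing

    MOUNTS = ["/mnt/disk1/data.part", "/mnt/disk2/data.part", "/mnt/disk3/data.part"]

    def count_matches(path, needle=b"ERROR"):
        """Stream one file and count lines containing the needle (a stand-in for grep)."""
        hits = 0
        with open(path, "rb") as f:
            for line in f:
                if needle in line:
                    hits += 1
        return hits

    if __name__ == "__main__":
        with multiprocessing.Pool(len(MOUNTS)) as pool:
            results = pool.map(count_matches, MOUNTS)
        print(dict(zip(MOUNTS, results)))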
