I had not realized that the Sort benchmark was redefined after the disappearance of Jim Gray at sea in January 2007. Jim defined, sponsored, and administered these benchmarks by himself from his role as a researcher at Microsoft. For those who do not know Jim Gray, he was the father of transaction processing during the 1970s and 1980s, and he co-authored the seminal book on that subject.
The new benchmark is appropriately named GraySort in recognition of Jim’s contribution to computer science. You can see a description of this benchmark here. Given the explosive growth of data volumes at sites such as Google, Facebook, and Twitter, it is imperative that we understand the processing technology and improve on it.
Among the several categories of test, GraySort measures the amount of time it takes to sort a very large volume of data, currently set to a minimum of 100TB.
This year’s winner of that benchmark is Yahoo, which ran the GraySort in 173 minutes using Hadoop on a distributed system of 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). That’s a lot of data to sort through. Yahoo also won the “MinuteSort” benchmark (how much data you can sort in a minute), sorting 500GB with Hadoop on a system of 1406 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
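To put those two results in perspective, here is a rough back-of-the-envelope calculation of the throughput they imply. It is only a sketch under stated assumptions: the GraySort run is taken to be exactly the 100TB minimum, the MinuteSort run is taken to use the full 60 seconds, and TB/GB/MB are decimal units. The function name `sort_throughput` is made up for illustration.

```python
# Decimal units (an assumption; the benchmark figures above do not say which).
TB = 10**12
GB = 10**9
MB = 10**6


def sort_throughput(total_bytes, seconds, nodes):
    """Return (cluster-wide bytes/sec, per-node bytes/sec) for a sort run."""
    cluster_rate = total_bytes / seconds
    return cluster_rate, cluster_rate / nodes


# GraySort: assumed 100TB sorted in 173 minutes on 3452 nodes.
gray_cluster, gray_node = sort_throughput(100 * TB, 173 * 60, 3452)

# MinuteSort: 500GB on 1406 nodes, assumed to take the full 60 seconds.
minute_cluster, minute_node = sort_throughput(500 * GB, 60, 1406)

for name, cluster, node in [
    ("GraySort", gray_cluster, gray_node),
    ("MinuteSort", minute_cluster, minute_node),
]:
    print(f"{name}: {cluster / GB:.2f} GB/s cluster-wide, "
          f"{node / MB:.2f} MB/s per node")
```

Under these assumptions the GraySort run works out to somewhere around 9.6 GB/s across the whole cluster, which is less than 3 MB/s of sorted output per node. The headline number comes from the sheer node count rather than from any single machine being fast.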
Such performance is unheard of in the traditional RDBMS world, where a lot of processing power is required to navigate through complex code. These numbers are very significant for the new world of social networking, search, and microblogging services like Twitter. Hadoop is becoming the de facto standard framework for large-scale distributed computing, with MapReduce as its core technology.