At the Large Hadron Collider, two protons are accelerated to fantastic speeds and then smashed into each other. This piece of sub-atomic vandalism is not just for kicks. Scientists there are looking for the Higgs boson, a particle that, if found, could help explain why some things are heavier than others. It looks like they might have found one last July, too.
Now, these collisions are watched by all sorts of detectors, and each collision creates about 1MB of data (if you had the first minute of a classic song stored as an mp3, that would be about 1MB, to give you an idea). Which doesn't sound like much! Except, and here's where things get tricky, there are millions of collisions a second. Even if you're reading this blog post on a high-end computer, your hard drive would fill up within 30 seconds. And the LHC operates throughout the year, and will for years to come. Far too much data is generated for anyone to possibly store, never mind sift through.
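To put those figures together, here's a quick back-of-envelope sketch. The collision rate is an assumption on my part — the post only says "millions" a second:

```python
# Back-of-envelope sketch of the LHC's raw data rate, using the rough
# figures from the text; the collision rate is an assumption, since the
# post only says "millions of collisions a second".
mb_per_collision = 1.0               # ~1MB of detector data per collision
collisions_per_second = 2_000_000    # assumed: 2 million collisions a second

mb_per_second = mb_per_collision * collisions_per_second
tb_per_second = mb_per_second / 1_000_000  # 1 TB = 1,000,000 MB (decimal units)

print(f"Raw data rate: about {tb_per_second:.0f} TB per second")
```

Even at the low end of "millions", that's terabytes of raw data every second — which is why simply storing it all is a non-starter.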
So when they designed the LHC they had a huge problem to solve – just how do you store and process that much information?
The first thing that happens to LHC data is a cull. Around 199,999 out of every 200,000 collisions are discarded automatically, almost as soon as they happen; only those with something potentially interesting in the data survive. But even with this cull, the LHC still produces around 15 million gigabytes (15 petabytes) of data a year. If you burned all of that to CDs, you'd have a stack 14 kilometres tall. So how exactly do you store all of that information, and in such a way that scientists can actually get to it?
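The cull is done by "trigger" systems that decide, collision by collision, whether anything looks worth keeping. As a toy illustration of the idea — nothing like CERN's real trigger logic, and here "interesting" is just a random draw at the quoted odds:

```python
import random

KEEP_FRACTION = 1 / 200_000  # roughly 1 in 200,000 collisions survives

def looks_interesting(event):
    """Toy stand-in for real trigger criteria (energy thresholds,
    particle signatures, etc.) — here, a random draw at the quoted odds."""
    return random.random() < KEEP_FRACTION

def cull(events):
    """Discard the vast majority of events on the spot."""
    return [e for e in events if looks_interesting(e)]

kept = cull(range(1_000_000))  # a million simulated collisions
print(f"Kept {len(kept)} of 1,000,000 events")
```

Run on a million simulated collisions, you'd expect only a handful of survivors — the whole point of the cull.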
This is where the Grid comes in (and it warrants the capitalisation). The Grid is a network of scientific institutes that spans the globe. It’s divided into different ‘Tiers’:
Tier 0 is CERN and the LHC itself. From there, data is pushed at 5Gb a second to the Tier 1 sites (the very fastest broadband in the UK, in Willowfield in Telford and Wrekin, manages 70.9Mb a second – about 70 times slower than the LHC speed). There are eleven Tier 1 sites across the world – the UK's is at the Rutherford Appleton Laboratory in Oxfordshire – and they all hold copies of the raw data. The Tier 2 sites, around 140 universities and institutes across the world, can pull data from any Tier 1 site, process it, and then hand the processed, organised data back up the Tiers to be shared around.
This way, if you're a scientist who'd like a chunk of LHC data to analyse for something you're interested in, all you have to do is log into the Grid at your local institution and ask for it to be processed in whatever way you want. Your processing job is then usually shared out across many of the Tier 2 sites, all of which talk to the Tier 1 sites to fetch the raw data – turning what would have been a *massive* job for one computer into lots of small jobs done by lots of computers all at the same time.
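In spirit, this is a classic scatter/gather pattern: split the data into chunks, process each chunk independently, then merge the results. Here's a minimal single-machine sketch of the pattern — the function names are made up for illustration, and the Tier 2 "sites" are faked with a small pool of worker threads:

```python
# A minimal single-machine sketch of the Grid's scatter/gather pattern.
# In reality each chunk would go to a different Tier 2 site; here we fake
# the sites with a pool of worker threads. All names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def analyse_chunk(chunk):
    """Stand-in for one physics analysis job on one chunk of raw data."""
    return sum(chunk)  # pretend the "analysis" is just a sum

def split(data, n_chunks):
    """Scatter: carve the dataset into roughly equal chunks."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

raw_data = list(range(1_000_000))      # stand-in for raw collision data
chunks = split(raw_data, n_chunks=8)   # scatter across 8 "sites"

with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(analyse_chunk, chunks))  # "sites" work in parallel

result = sum(partials)                 # gather the processed results back
print(f"Combined result from {len(chunks)} sites: {result}")
```

The merged answer is identical to what one machine would have computed on its own — it just arrives much sooner, because the chunks were crunched in parallel.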
So what would have been an impossible task of searching through millions upon millions upon millions of collisions gets spread out and shared, and you, as a scientist, get your processed LHC data back within about a day.
But, you might wonder, what happens if the system goes wrong, and a possible Higgs boson gets stolen? Well, that's when you hire a private detective, of course…