
Playing with data - le Tour de France

13 years ago

I was born with big legs. My family always said that if I had been born in the 1930s in China, I'd have been a pedicab driver. I suspect 40-50% of my body weight is just legs, which makes buying pants difficult but riding a bike easier.

The teenaged me had a heavy 30 lb monster of a mountain bike that I rode along the dyke in my hometown, and I used to pretend I was racing against the top riders, imagining myself in a breakaway. Much like this video, actually. Though I've since graduated to a much lighter and faster road bike and swapped sneakers for carbon-soled riding shoes, the question has always remained: how good am I compared with today's best riders? Did elite riders of the past perform as well as our present-day heroes? Has the toughest bicycle race on the planet changed over time? How did it evolve?

Data can often tell compelling stories. That's the training I received during my 12-year continual stint in academia. Now that I'm stepping into the world of commercial data analytics, I would like to preserve, to some extent, the freedom of inquiry that I've enjoyed for so long. Of course, that kind of freedom is extremely rare outside of academia, and the closest thing I've come up with is analyzing data that interests me personally, in addition to the high-leverage data that pays the bills.

The Tour does have a long and distinguished history that smacks of madness. Who thought of riding steel bikes with one gear through France for 5745 km? It is no surprise that the race has evolved over time to be "saner" by elite standards, yet it is difficult to understand the Tour's evolution and how riders fared over the years without the help of data visualization.

I crawled Wikipedia, the Tour's official site, and bike race info for data and built a database that showcases various attributes of the Tour over time. I scored the data based on the three sources of information and framed the data with some questions in mind.

At first the data roughly looked like this:

Pairwise plots of all variables

Some plots show very obvious patterns (those with clear linear trends), and others are not so interesting. It was relatively simple to go through these and find the storytelling trends.
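The same screening can be done numerically. What follows is a minimal sketch, not the method used for the plots above: it computes a Pearson correlation for every pair of columns and ranks the pairs by strength, so the near-linear ones float to the top. The column names and values are invented toy data, and the helper names `pearson` and `rank_pairs` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_pairs(columns):
    """Return (name_a, name_b, r) for every column pair, strongest first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy values, invented purely for illustration (not Tour data).
toy = {
    "x": [1, 2, 3, 4, 5],
    "linear_in_x": [2, 4, 6, 8, 10],
    "unrelated": [3, 1, 4, 1, 5],
}

ranked = rank_pairs(toy)
for a, b, r in ranked:
    print(f"{a} vs {b}: r = {r:.2f}")
```

A ranking like this only flags linear relationships, so it complements the pairwise plots rather than replacing them.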

What I ended up visualizing was this time series:

Tour de France history

An inescapable fact is that the Tour was interrupted by the two World Wars. Despite these interruptions, the Tour has evolved from a race with fewer, longer stages to one with many shorter stages. This changes the dynamics of the race both tactically and strategically, favouring targeted attacks on key stages (queen stages).

One interesting facet of the race's change is the increase in the winner's average speed from around 25 km/h to nearly 40 km/h. Unlike in events such as sprinting, the performance gains of elite riders do not seem to have hit a plateau. This is probably due to continual improvements in bicycle technology, race tactics, and sports nutrition, along with shorter stages.
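To put those speeds in perspective, here is a small sketch of the arithmetic. It reuses only figures quoted in this post (5745 km, 25 km/h, 40 km/h) and does not claim that any particular edition combined them. The function names are hypothetical.

```python
def average_speed_kmh(distance_km, hours, minutes=0, seconds=0):
    """Average speed in km/h from a total distance and a finishing time."""
    total_hours = hours + minutes / 60 + seconds / 3600
    if total_hours <= 0:
        raise ValueError("finishing time must be positive")
    return distance_km / total_hours


def hours_at_speed(distance_km, speed_kmh):
    """Riding time in hours needed to cover a distance at a constant speed."""
    return distance_km / speed_kmh


# Purely illustrative: the same 5745 km ridden at the two quoted speeds.
slow = hours_at_speed(5745, 25)
fast = hours_at_speed(5745, 40)
print(f"At 25 km/h: {slow:.1f} h, at 40 km/h: {fast:.1f} h")
print(f"Time saved: {slow - fast:.1f} h")
```

Dividing each edition's total distance by the winner's finishing time in this way gives one point per year on the speed curve.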

The French are extremely proud of the Tour and it is rather galling that France has not been able to produce a winner in recent years. I drew a heat map of the nationality of winners through the Tour's history, and here's what it looks like:

Tour winning nations

France has dominated the race, along with nations with a long cycling heritage such as Italy, Belgium, Luxembourg, and Switzerland. France seems to have produced winners periodically (actually a rather predictable pattern) until the mid-1980s, when other countries began joining the club, notably the USA, Ireland, Germany, and most recently Great Britain.
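The grid behind a heat map like this is just a count of wins per nation per time bin. Here is a minimal sketch, assuming decade-wide bins (the actual chart may bin differently). The sample rows are a handful of well-known results rather than the full dataset, and the function names are hypothetical.

```python
from collections import Counter


def decade(year):
    """Map a year to the first year of its decade, e.g. 1987 -> 1980."""
    return year // 10 * 10


def wins_by_nation_and_decade(winners):
    """Build a nation-by-decade grid of win counts from (year, nation) pairs."""
    counts = Counter((nation, decade(year)) for year, nation in winners)
    nations = sorted({nation for _, nation in winners})
    decades = sorted({decade(year) for year, _ in winners})
    grid = [[counts[(n, d)] for d in decades] for n in nations]
    return nations, decades, grid


# A small sample of (year, winner's nation) pairs, not the complete history.
sample = [
    (1985, "France"),
    (1986, "USA"),
    (1987, "Ireland"),
    (1989, "USA"),
    (1990, "USA"),
    (1997, "Germany"),
    (2012, "Great Britain"),
]

nations, decades, grid = wins_by_nation_and_decade(sample)
print("decades:", decades)
for nation, row in zip(nations, grid):
    print(f"{nation:15s}", row)
```

The resulting `grid` is a plain list of rows, so it can be handed to any plotting tool that draws a 2D array as coloured cells.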

Of course, there are many more facets of the Tour de France data to explore, but I find these two visualizations the most compelling in illustrating the Tour's history and evolution.

Ok, onto the next project.