Shopping habits of hot(ter) singles 11 years ago

As a Data Scientist (which I found out could mean so many things), I often wonder: what sort of data do I want to work with next?

Big data? So in vogue but too vague.

Corporate data? Locked down with NDAs (though I did work on some interesting projects recently).

Economic trends? Well that'd be interesting, working on it.

But I thought a lot about data containing information about human interactions, fornications, and the like. Of course, this was in part inspired by the fact that I was dumped for quitting grad school (being Dr. Wong was apparently important) to pursue data science. Perhaps I should work on data about people getting dumped. Nah, too depressing, and also hard to come by. When was the last time the population of dumpsville volunteered data on their heartaches?

So I decided to work on data that showed how people connect with each other romantically. Perhaps a dating site? Well, I contacted the fine ladies @ Coffee Meets Bagel and they were super excited for some data analytics on their core users, and since they have a graphics department that turns my numbers into infographics, it was a, er... match made in heaven.

So here it is on Coffee Meets Bagel's blog.

I crawled through Coffee Meets Bagel's user interaction database to assess popularity and pickiness among users. I then parsed their Facebook mentions of shopping brands to figure out the 10 most popular brands for men and women. It turns out popularity on the network varied with favoured brands (i.e., folks who wear or like certain brands are more liked). Interestingly, both men and women like Victoria's Secret.
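A minimal sketch of how that kind of tally could be done, assuming each user record holds a gender, a list of mentioned brands, and a count of likes received. The record layout, the function names, and every value below are invented for illustration; they are not Coffee Meets Bagel data or its schema.

```python
from collections import Counter, defaultdict

# Hypothetical records: (gender, brands mentioned, likes received).
# All values are made up for illustration only.
users = [
    ("F", ["Zara", "Victoria's Secret"], 12),
    ("F", ["Zara"], 8),
    ("M", ["Nike", "Victoria's Secret"], 5),
    ("M", ["Nike"], 3),
]


def top_brands(records, gender, n=10):
    """Return the n most-mentioned brands for one gender."""
    counts = Counter()
    for g, brands, _likes in records:
        if g == gender:
            counts.update(set(brands))  # count each brand once per user
    return counts.most_common(n)


def mean_likes_by_brand(records):
    """Average likes received by the users who mention each brand."""
    totals = defaultdict(lambda: [0, 0])  # brand -> [sum of likes, user count]
    for _g, brands, likes in records:
        for brand in set(brands):
            totals[brand][0] += likes
            totals[brand][1] += 1
    return {brand: s / k for brand, (s, k) in totals.items()}


print(top_brands(users, "F"))
print(mean_likes_by_brand(users))
```

Comparing the per-brand averages against the overall average is one simple way to see whether popularity varies with brand preference.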

Shopping habits

I'm looking forward to diving deeper into the Coffee Meets Bagel data set. It's absolutely loaded with possibilities. Stay tuned!

Playing with data - le Tour de France 11 years ago

I was born with big legs. My family always said if I was born in the 1930s in China, I'd be a pedicab driver. I suspect 40-50% of my body weight is just legs, which makes pants buying difficult, but that also makes riding a bike easier.

The teenaged me had a heavy 30lb monster mountain bike that I rode along the dyke of my home town, and I used to pretend I was racing against the top riders, imagining myself on a breakaway. Much like this video actually. Though now I've graduated to a much lighter and faster road bike, and I've swapped sneakers for carbon soled riding shoes, the question always remained: how good am I vs today's best riders? Did elite riders in the past perform as well as our present day heroes? Did the toughest bicycle race on the planet change over time? How did it evolve?

Data can often tell compelling stories. That's the training I've received during my 12-year continual stint in academia. Now that I'm stepping into the world of commercial data analytics, I would like to preserve, to some extent, the freedom of inquiry that I've enjoyed for so long. Of course, that kind of freedom is extremely rare outside of academia, and the closest thing I came up with is analyzing data that interests me personally in addition to the high-leverage data that pays the bills.

The Tour does have a long and distinguished history that smacks of madness. Who thought of riding steel bikes with one gear through France for 5745 km? It is no surprise that the race has evolved over time to be "saner" by elite standards, yet it is difficult to understand the Tour's evolution and how riders fared over the years without the help of data visualization.

I crawled Wikipedia, the Tour's official site, and bike race info for data and built a database that showcases various attributes of the Tour as they changed over time. I scored the data based on the three sources of information and framed the data with some questions in mind.

At first the data roughly looked like this:

Pairwise plots of all variables

At first glance there are some very obvious patterns (those plots with obvious linear trends), and some not so interesting ones. It was relatively simple to go through these and find the storytelling trends.
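Eyeballing the pairwise plots can be backed up numerically. Here is a rough sketch, assuming the variables sit in a dictionary of equal-length columns, that ranks every pair by the strength of its linear (Pearson) correlation. The column names and numbers are placeholders made up for illustration, not values from the Tour database.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_pairs(columns):
    """Return (name_a, name_b, r) for every pair, strongest |r| first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Placeholder columns (invented numbers, for illustration only).
columns = {
    "edition": [1, 2, 3, 4],
    "stages": [6, 11, 15, 21],
    "distance_km": [5000, 4800, 4200, 3500],
}

for a, b, r in rank_pairs(columns):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Pairs with |r| close to 1 correspond to the panels that show an obvious linear trend; the rest are the "not so interesting" ones.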

What I ended up visualizing was this time series:

Tour de France history

An inescapable fact is that the Tour was interrupted by the two World Wars. Despite these interruptions, the Tour has evolved from a race with fewer, longer stages to one with many shorter stages. This changes the dynamics of the race both tactically and strategically, favouring targeted attacks on key stages in the race (queen stages).

One interesting facet of the race's change is the increase in the race winner's average speed from around 25 km/h to nearly 40 km/h. Unlike events such as sprinting, the performance gains of elite riders do not seem to have hit a plateau. This is probably due to continual improvements in bicycle technology, race tactics, and sports nutrition, along with shorter stages.
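For reference, the winner's average speed is simply total race distance divided by the winner's total riding time. A small sketch follows; the function name and the inputs in the usage lines are hypothetical, chosen only to land on the two ends of the range mentioned above.

```python
def average_speed_kmh(distance_km, hours, minutes=0, seconds=0):
    """Average speed in km/h from a distance and an elapsed time."""
    total_hours = hours + minutes / 60 + seconds / 3600
    return distance_km / total_hours


# Hypothetical inputs, not actual Tour results.
print(average_speed_kmh(100, 4))        # 100 km in 4 h
print(average_speed_kmh(3500, 87, 30))  # 3500 km in 87 h 30 min
```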

The French are extremely proud of the Tour and it is rather galling that France has not been able to produce a winner in recent years. I drew a heat map of the nationality of winners through the Tour's history, and here's what it looks like:

Tour winning nations

France has dominated the race along with nations with a long cycling heritage such as Italy, Belgium, Luxembourg, and Switzerland. France seems to have produced winners periodically (actually a rather predictable pattern) until the 1990s, when other countries started joining the club, notably the USA, Ireland, Germany, and most recently Great Britain.
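A heat map like this boils down to a nation-by-period grid of win counts. Below is a minimal sketch of building that grid, assuming the winners are available as (year, nation) pairs and grouping by decade. The decade grouping and the function name are my assumptions for illustration, and the list is only a handful of winners, not the full table.

```python
from collections import Counter

# A small sample of (year, winner's nation) pairs, for illustration only.
winners = [
    (1985, "France"),
    (1986, "USA"),
    (1987, "Ireland"),
    (1997, "Germany"),
    (2012, "Great Britain"),
]


def heatmap_grid(winners):
    """Count wins per (nation, decade) and lay them out as rows by nation."""
    counts = Counter((nation, year // 10 * 10) for year, nation in winners)
    nations = sorted({nation for _year, nation in winners})
    decades = sorted({year // 10 * 10 for year, _nation in winners})
    # Counter returns 0 for missing keys, so empty cells fill in naturally.
    grid = [[counts[(n, d)] for d in decades] for n in nations]
    return nations, decades, grid


nations, decades, grid = heatmap_grid(winners)
for nation, row in zip(nations, grid):
    print(f"{nation:15s}", row)
```

The resulting grid can be handed to any plotting library's image or heat map function, with nations on one axis and decades on the other.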

Of course there are many more facets of the Tour de France data that can be explored, but I find these two visualizations the most compelling in illustrating the Tour's history and evolution.

Ok, onto the next project.