Over on Numbers Rule Your World, Kaiser Fung has a nice take on Andrew Gelman’s analysis of the Facebook controversy (where Facebook apparently “played with people’s emotions” by manipulating their news feeds). The money quote from Fung’s piece is this:
Sadly, this type of thing happens in A/B testing a lot. On a website, it seems as if there is an inexhaustible supply of experimental units. If the test has not “reached” significance, most analysts just keep it running. This is silly in many ways but the key issue is that if you need that many samples to reach significance, it is guaranteed that the measured effect size is tiny, which also means that the business impact is tiny.
This refers to a common fallacy that I’ve often referred to on this blog, and in my writing elsewhere. Essentially, when you have really large sample sizes, even small changes in measured values can be statistically significant. The fact that they are statistically significant, however, does not mean that they have a business impact – sometimes the effect is so small that the only significance it has is statistical!
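A quick back-of-envelope calculation illustrates the point. The numbers below – a 0.1 percentage point lift in conversion across a million visitors per arm – are made up purely for illustration:

```python
import math

# Hypothetical A/B test: the conversion rates and sample sizes are
# invented purely to illustrate the large-sample fallacy.
n = 1_000_000                          # visitors per arm
p_control, p_variant = 0.050, 0.051    # a 0.1 percentage-point lift

# Two-proportion z-test with a pooled standard error (equal n per arm)
p_pool = (p_control + p_variant) / 2
se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_variant - p_control) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# The difference comes out "statistically significant" (p < 0.05),
# yet the lift is a mere 0.1 percentage point - possibly too small
# to matter for the business.
```

With a sample this large, even this tiny lift clears the significance bar comfortably – which is exactly why significance alone is a poor basis for a business decision.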
So before you blindly make business decisions based on statistical significance, you need to consider whether the measured difference is large enough to actually have an impact on your business! It may not strictly be “noise” – the significance test has shown that it isn’t – but it is essentially an effect that, for all business purposes, can be ignored.
PS: Fung and Gelman are my two favourite bloggers when it comes to statistics and quant. A lot of what I’ve learnt on this subject is down to these two gentlemen. If you’re interested in statistics, quant and visualisation, I recommend subscribing to both of Fung’s feeds and to Gelman’s feed.
About a month or so back I had a long telephonic conversation with this guy who runs an offshored analytics/data science company in Bangalore. Like most other companies that are being built in the field of analytics, this follows the software services model – a large team in an offshored location, providing long-term standardised data science solutions to a client in a different “geography”.
As is usual with conversations like this one, we talked about our respective areas of work and the kinds of projects we take on, and soon we got to the usual bit in such conversations where we were trying to “find synergies”. Things were going swimmingly when this guy remarked that it was the first time he was coming across a freelancer in this profession. “I’ve heard of freelance designers and writers, but never freelance data scientists or analytics professionals”, he mentioned.
In a separate conversation, I was talking to one old friend about another old friend who has set up a one-man company to provide what are basically freelance consulting services. We reasoned that this guy had set up a company rather than calling himself a freelancer because of the reputation that “freelancers” (irrespective of the work they do) have – if you say you are a freelancer, people think of someone smoking pot and working on a Mac in a coffee shop. If you say you are a partner or founder of a company, people imagine someone more corporate.
Now that the digression is out of the way, let us get back to my conversation with the guy who runs the offshored shop. During the conversation I didn’t say much beyond things like “what is wrong with being a freelancer in this profession?”. But now that I think more about it, it is simply a function of data science being a fundamentally creative profession.
For a large number of people, data science is simply about statistics, or “machine learning”, or predictive modelling – being given a problem expressed in statistical terms and finding the best possible model and model parameters for it. It is about being given a statistical problem and finding a statistical solution. I’m not saying, of course, that statistical modelling is not creative – there is a fair bit of creativity involved in figuring out what kind of model to build, and in picking the right model for the right data. But when you have a large team working on the problem, operating effectively like an assembly line (with different people handling different parts of the solution), what you get is effectively an “assembly line solution”.
Coming back, let us look at this “a day in the life” post I wrote about a year back, about a particular day in office for me. I’ve detailed there the various kinds of problems I had to solve that day – from hidden Markov models and Bayesian probability to writing code using dynamic programming, implementing it in R, and then translating the solution back to the business context. Notice that when I started working on the problem it was not known what domain the problem belonged to – it took some poking and prodding around to figure out the nature of the problem and the first step of the solution.
And from then on, it was one step leading to another, and there are two important facts to consider about each step. Firstly, at each step it wasn’t clear what the best class of technique was to get past it – it took exploration to figure that out. Secondly, at no point was it known what the next step would be until the current step was solved. You can see that it is hard to do this in an assembly line fashion!
Now, you can talk about it being like a game of chess where you aren’t sure what the opponent will do, but in chess the opponent is a rational human being, while here the “opponent” is basically the data and the patterns it shows, and there is no way to know how the data will react to something until you try it. So it is impossible to list out all the steps beforehand and solve them – the solution is an exploratory process.
And since solving a “data science problem” (as I define it, of course) is an exploratory, and thus creative, process, it is important to work in an atmosphere that fosters creativity and “thinking without thinking” (basically keeping a problem at the back of your mind, then taking your mind off it and distracting yourself while the problem works itself out). This is best done away from a traditional corporate environment – where you have to attend meetings and are liable to be disturbed by colleagues at all times – and this is why a freelance model is actually ideal! A small partnership also works – while you might find it hard to “assembly line” the problem, having someone to bounce ideas off can have a positive impact on the creative process. Anything more like a corporate structure and you remove the conditions necessary to foster creativity, making cookie-cutter solutions much more likely.
So unless your business model deals with doing repeatable and continuous analytical work for a client, you are better off organising yourselves in an environment that fosters creativity and not a traditional office kind of structure if you want to solve problems using data science. Then again, your mileage might vary!
Avinash Kaushik has put out an excellent, if long, blog post on building dashboards. A key point he makes is about the difference between dashboards and what he calls “datapukes” (while the name is quite self-explanatory and graphic, it basically refers to a report with a lot of data and little insight). He goes on in the blog post to explain how dashboards need to be tailored to recipients at different levels in the organisation, and the common mistake people make of building a one-size-fits-all dashboard (which is most likely to end up being a datapuke).
Kaushik explains that the higher up you go in an organisation’s hierarchy, the less access to data the managers have, and the less time they have to look into and digest data before coming to a decision – they want the first level of interpretation done for them so that they can proceed to action. In this context, Kaushik explains that dashboards for top management should be “action-oriented”, in that they clearly show the way forward. Such dashboards need to be annotated, he says, with reasoning as to why the numbers look a certain way, and what the company needs to do about it.
Going by Kaushik’s blog post, a dashboard is something that definitely requires human input – it requires an intelligent human to look at and analyse the data, analyse the reasons behind why the data looks a particular way, and then intelligently try and figure out how the top management is likely to use this data, and thus prepare a dashboard.
Now, notice how this requirement of an intelligent human in preparing each dashboard conflicts with the dashboard solutions that a lot of so-called analytics or BI (Business Intelligence) companies offer – basically automated reports with multiple tabs that the manager has to navigate in order to find useful information. In other words, they are datapukes!
Let us take a small digression – when you are at a business lunch, what kind of lunch do you prefer? Given three choices – a la carte, buffet and set menu, which one would you prefer? Assuming the kind of food across the three is broadly the same, there is reason to prefer a set menu over the other two options – at a business lunch you want to maximise the time you spend talking and doing business. Given that the lunch is incidental, it is best if you don’t waste any time or energy getting it (or ordering it)!
It is a similar case with dashboards for top management. While a datapuke might give a much broader view, and give the manager the opportunity to drill down, such luxuries are usually not necessary for a time-starved CXO – all he wants are the distilled insights with a view towards what needs to be done. It is very unlikely that such a person will have the time or inclination to drill down – which can anyway be made possible via an attached datapuke.
It will be interesting to see what happens to the BI and dashboarding industry once more companies figure out that what they want are insightful dashboards and not mere datapukes. Given the requirement of an intelligent human (essentially a business analyst) to make these “real” dashboards, will BI companies respond by assigning dedicated analysts to each of their clients? Or will we see a new layer of service providers (who might call themselves “management consultants”) who take in the datapukes and use human intelligence to produce proper dashboards? Or will we find artificial intelligence building the dashboards?
It will be very interesting to watch this space!
Initial reports yesterday regarding Radamel Falcao’s move to Manchester United mentioned a valuation of GBP 6 million for the one-year loan, i.e. Manchester United had paid Falcao’s parent club AS Monaco GBP 6 million to borrow Falcao for a year. This evidently didn’t make sense, since earlier reports had suggested that Falcao was priced at GBP 55 million for an outright transfer, and that he has four years remaining on his Monaco contract.
In this morning’s reports, however, the value of the loan deal has been corrected to GBP 16 million, which makes more sense in light of his remaining period of contract, age and outright valuation.
So how do you value a loan deal for a player? To answer that, first of all, how do you value a player? The “value” of a player is essentially the amount of money that the player’s parent club is willing to accept in exchange for foregoing his use for the rest of his contract. Hence, for example, in Falcao’s case, GBP 55M is the amount that Monaco was willing to accept for foregoing the remaining four years they have him on contract.
Based on this, you might guess that transfer fees are (among other things) a function of the number of years a player has remaining on his contract with the club – ceteris paribus, the longer the remaining contract, the greater the transfer fee demanded (this is intuitive: you want more compensation for foregoing something for a longer period than for a shorter one).
From this point of view, let us now evaluate what it might take to take Falcao on loan for one year. Conceptually it is straightforward. Let us assume that the value Monaco expects to get from having Falcao on their books for a further four years is a little less than their asking price of GBP 55M – given they were willing to forego their full rights for that amount, their valuation can be any number below it; we’ll assume it was just below. Now, all we need to do is determine how much of this GBP 55M in value would be generated in the first year, how much in the second year, and so on. Whatever value is generated in the first year is the amount that Monaco will demand for a loan.
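To make the arithmetic concrete, here is a toy decomposition of the valuation. The per-year weights below are my own guesses, chosen only to show how a front-loaded value profile for an ageing player implies a large first-year loan fee; they are not actual figures from the deal:

```python
# Toy decomposition of a transfer valuation into per-year values.
# The GBP 55M total comes from the reported asking price; the weights
# are hypothetical, skewed towards year one for a depreciating player.
total_value = 55.0                       # GBP millions, four remaining years
weights = [0.30, 0.28, 0.24, 0.18]       # hypothetical share of value per year

per_year = [total_value * w for w in weights]
loan_fee = per_year[0]                   # fee for borrowing the player for year one
print(f"Implied one-year loan fee: GBP {loan_fee:.1f}M")
```

Under these made-up weights, a 30% first-year share lands close to the GBP 16M fee actually reported – the more front-loaded the player’s value, the closer a one-year loan fee gets to a big chunk of the outright price.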
Now, loans can be of different kinds. Clubs sometimes lend out their young and promising players so that they can get first-team football at a different club – something the parent club would not be able to provide. In such loans, clubs expect the players to come back as better players (Daniel Sturridge’s loan from Chelsea to Bolton is one such example) and thus with a higher valuation. Given these expectations, loan fees are usually zero (or even negative – where the parent club continues to bear part of the loanee’s wages).
Another kind of loan is for a player who is on the books but not particularly wanted for the season. It could happen that the player’s wages are more than what the club hopes to get in terms of his on-field contribution (implying a negative valuation for the player). In such cases, it is possible for the parent club to loan out the player while still covering part of his salary. In that sense, the loan fee paid by the borrowing club is actually negative (since they are, in a sense, being paid by the parent club to take the player on). An example of this kind was Andy Carroll’s loan from Liverpool to West Ham United in the 2012-13 season.
Falcao is currently in the prime of his career (aged 29) and heavily injury-prone. Given his age and injury record, he is likely to be a fast-depreciating asset. By the time he runs down his contract at Monaco (when he will be 33), he is likely to be worth nothing at all. This means that a lion’s share of the value Monaco can derive from him would come in the next one year. This is the primary reason that Monaco have demanded 30% of the four-year fee for one year of loan.
Loaning a player also involves some option valuation – based on his performance on loan, his valuation at the end of the loan period can either increase or decrease. At the time of loaning out this is a random variable, and we can only work on expectations. The thing with Falcao is that, given the stage of his career, the probability of him being a much improved player after a year is small. On the other hand, his brittleness means the probability of him being a lesser player is much larger. This depresses the expected valuation at the end of the loan period and thus pushes up the loan fee. Thinking about it, this should have pushed Falcao’s loan fee above GBP 16M, but another factor – that he has just returned from injury and may not be at peak impact for a couple of months – has depressed the fee.
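The option-value argument can be sketched as a simple expected-value calculation. The scenario probabilities and end-of-loan valuations below are entirely invented; the point is only the mechanics – a fatter downside tail drags the expectation down:

```python
# Hypothetical end-of-loan valuation scenarios for an injury-prone
# player in his prime; probabilities and values are invented.
scenarios = {                 # (probability, end-of-loan value, GBP millions)
    "improves":  (0.15, 45.0),
    "unchanged": (0.45, 30.0),
    "declines":  (0.40, 15.0),
}
expected_value = sum(p * v for p, v in scenarios.values())
# The heavy "declines" scenario pulls the expectation well below the
# "unchanged" value - the parent club expects to get back a less
# valuable asset, which pushes the loan fee up.
print(f"Expected end-of-loan value: GBP {expected_value:.2f}M")
```

The loan fee then has to compensate the parent club both for the value the player generates during the loan year and for this expected drop in his residual value.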
Speaking of option valuation, it is possibly the primary reason why young loan signings to lesser clubs come cheap – the possibility of regular first team football increases significantly the expected valuation of the player at the end of the loan period, and this coupled with the fact that the player is not yet proven (which implies a low “base sale price”) drives the loan valuation close to zero.
Loaning is thus a fairly complex process, but players’ valuations can be done in rather economic terms – based on expected contribution in that time period and option valuation. Loaning can also get bizarre at times – Fernando Torres’s move to Milan, for example, has been classified by Chelsea as a “two year loan”, which is funny given that he has two years remaining on his Chelsea contract. It is likely that the deal has been classified as a loan for accounting purposes so that Chelsea do not write off the GBP 50M they paid for Torres’s rights in 2010 too soon.
There might have been a time in life when you would’ve had some Single Malt whisky and thought that it “doesn’t taste like any other”. In fact, you might have noticed that some single malt whiskies are more distinct than others. It is possible you might want to go on a quest to find the most unique single malts, but given that single malts are expensive and not easily available, some data analysis might help.
There is this dataset of 86 single malts that has been floating about the interwebs for a while now, and there is some simple yet interesting analysis related to that data – for example, check out this simple analysis with a K-means clustering of various single malts. They use the dataset (which scores each of the 86 malts on 12 different axes) to cluster the malts, and analyse which whiskies belong to similar groups.
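For the curious, here is a minimal k-means sketch in Python on invented flavour scores. The whisky names and numbers below are made up (the real dataset scores 86 malts on 12 axes), but the mechanics – assign each point to its nearest centroid, recompute centroids, repeat – are the same:

```python
import random

# Made-up flavour scores: each "whisky" is scored on three
# hypothetical axes (smoky, sweet, medicinal).
whiskies = {
    "PeatyA": [4, 1, 4],
    "PeatyB": [4, 0, 3],
    "SweetA": [1, 4, 0],
    "SweetB": [0, 4, 1],
}

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialise with k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(list(whiskies.values()), k=2)
print(centroids)   # the two "peaty" and two "sweet" malts separate cleanly
```

On real data you would use a library implementation (scikit-learn’s `KMeans`, or R’s `kmeans`) rather than rolling your own, but the toy version shows what the clustering is actually doing.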
When India exited the 2007 Cricket World Cup, broadcasters, advertisers and sponsors faced huge losses. They had made the calculations for the tournament based on the assumption that India would qualify for the second group stage, at least, and when India failed to do so, it possibly led to massive losses for these parties.
Back then I had written this blog post where I explained that one way they could have hedged their exposure to the World Cup was by betting against India’s performance. Placing a bet that India would not get out of their World Cup group would have, I argued, helped mitigate the potential losses from India’s early exit. It is not known if any of them actually hedged their World Cup bets in the betting market.
Looking at the odds in the ongoing Football World Cup, though, it seems like bets are being hedged. The equivalent in the World Cup is Brazil, the home team. While the world football market is reasonably diversified with a large number of teams having a reasonable fan following, the overall financial success of the World Cup depends on Brazil’s performance. An early exit by Brazil (as almost happened on Saturday) can lead to significant financial losses for investors in the tournament, and thus they would like to hedge these bets.
The World Cup simulator is a very interesting website which simulates the remaining games of the World Cup based on a chosen set of parameters (you can choose a linear combination of Elo rating, FIFA ranking, ESPN Soccer Power Index, Home advantage, Players’ Age, Transfer values, etc.). This is achieved by means of a Monte Carlo simulation.
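The core of such a Monte Carlo simulation can be sketched in a few lines. The probabilities below are illustrative inputs, not the simulator’s actual parameters, and only a single knockout tie is simulated rather than the whole bracket:

```python
import random

def simulate_tie(p_win, p_draw, trials=100_000, seed=42):
    """Estimate the probability of progressing from a knockout tie,
    assuming a drawn tie goes to a 50/50 penalty shootout.
    p_win and p_draw are illustrative inputs."""
    rng = random.Random(seed)
    progressed = 0
    for _ in range(trials):
        r = rng.random()
        if r < p_win:                    # win in normal/extra time
            progressed += 1
        elif r < p_win + p_draw:         # draw -> coin-flip shootout
            progressed += rng.random() < 0.5
    return progressed / trials

est = simulate_tie(p_win=0.55, p_draw=0.25)
print(f"Estimated progress probability: {est:.3f}")
```

The full simulator does this over every remaining fixture, with match probabilities derived from the chosen combination of ratings, and aggregates the runs into tournament-level probabilities.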
I was looking at this system’s predictions for the Brazil-Colombia quarter final, and comparing them with the odds on Betfair (perhaps the most liquid betting exchange). Based purely on Elo rating, Brazil has a 77% chance of progress. Adding home advantage increases the probability to 80%. The ESPN SPI is not so charitable to Brazil, though – it gives Brazil a 65% chance of progress, which increases to 71% when home advantage is factored in.
Assuming that home advantage cannot be ignored (though its extent is questionable for games played at non-traditional venues such as Fortaleza or Manaus), we will take the with-home-advantage numbers – that gives Brazil a 70-80% chance of getting past Colombia.
So what does Betfair say? As things stand now, a Brazil win is trading at 1.85, which translates to a 54% chance of a Brazil victory. A draw is trading at 3.8, which translates to a 26% chance. Assuming that teams are equally matched in case of a penalty shootout, this gives Brazil a 67% chance of qualification – which is below the range that is expected based on the SPI and Elo ratings. This discount, I hypothesize, is due to the commercial interest in Brazil’s World Cup performance.
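The conversion from decimal odds to implied probabilities is straightforward: the implied probability is simply the reciprocal of the odds. A quick sketch using the odds quoted above:

```python
# Converting Betfair-style decimal odds into implied probabilities,
# using the odds quoted above (1.85 for a Brazil win, 3.8 for a draw).
win_odds, draw_odds = 1.85, 3.8

p_win = 1 / win_odds             # implied probability of a Brazil win
p_draw = 1 / draw_odds           # implied probability of a draw

# Assume a drawn game goes to a 50/50 penalty shootout
p_qualify = p_win + p_draw / 2
print(f"Implied probability of Brazil progressing: {p_qualify:.0%}")
```

This gives roughly 67%, which sits below the 70-80% range implied by the ratings-based models.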
Given that a large number of entities stand to gain from Brazil’s continued progress in the World Cup, they would want to protect their interest by hedging their bets – or by betting against Brazil. While there might be some commercial interest in betting against Colombia (by the Colombian World Cup broadcaster, perhaps?) this interest would be lower than that of the Brazil interest. As a result, the volume of “hedges” by entities with an exposure to Brazil is likely to pull down the “price” of a Brazil win – in other words, it will lead to undervaluation (in the betting market) of the probability that Brazil will win.
So how can you bet on this? There is no easy answer – since the force is acting only one way, there is no real arbitrage opportunity (all betting exchanges are likely to have the same prices). The only “trade” here is to go long Brazil – since the “real probability” of progress is probably higher than what the betting markets imply. But you need to remember that this is a directional bet contingent upon Brazil’s victory, and you need to be careful!