Those obsessed with Mammon will read Facebook’s IPO prospectus for what it says about making money. Others of us with a more geeky bent will pore over what it reveals about how the company handles data. It starts with arresting stats: 845 million monthly active users; 100 billion friendships; and, every day, 250 million photos uploaded and 2.7 billion likes or comments.
But that is just the eye-candy. The substance is buried deep in the prose, under the heading “Data Management and Personalization Technologies.” Get a load of this:
“loading a user’s home page typically requires accessing hundreds of servers, processing tens of thousands of individual pieces of data, and delivering the information selected in less than one second. In addition, the data relationships have grown exponentially and are constantly changing.”
And then there is this:
“We use a proprietary distributed system that is able to query thousands of pieces of content that may be of interest to an individual user to determine the most relevant and timely stories and deliver them to the user in milliseconds.”
“We store more than 100 petabytes (100 quadrillion bytes) of photos and videos.”
“We use an advanced click prediction system that weighs many real-time updated features using automated learning techniques. Our technology incorporates the estimated click-through rate with both the advertiser’s bid and a user relevancy signal to select the optimal ads to show.”
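The prospectus gives only that one sentence, but the logic it describes — rank candidate ads by combining a predicted click-through rate with the advertiser’s bid and a relevancy signal — is easy to sketch. The following toy ranker is my assumption of how such signals might be combined; Facebook’s actual model, weights, and field names are not disclosed:

```python
# Hypothetical ad-ranking sketch. The scoring formula (bid x estimated CTR
# x relevancy) is an illustrative assumption, not Facebook's actual model.

def rank_ads(ads):
    """Order candidate ads by expected value to show the best ones first."""
    def score(ad):
        # Expected revenue per impression, scaled by how relevant the ad
        # appears to be for this particular user (all fields hypothetical).
        return ad["bid"] * ad["estimated_ctr"] * ad["user_relevancy"]
    return sorted(ads, key=score, reverse=True)

candidates = [
    {"id": "a", "bid": 0.50, "estimated_ctr": 0.02, "user_relevancy": 1.0},
    {"id": "b", "bid": 0.30, "estimated_ctr": 0.05, "user_relevancy": 0.8},
    {"id": "c", "bid": 1.00, "estimated_ctr": 0.01, "user_relevancy": 0.5},
]
print([ad["id"] for ad in rank_ads(candidates)])  # → ['b', 'a', 'c']
```

Note how the highest bidder (“c”) loses to a cheaper but more clickable, more relevant ad — the trade-off the prospectus alludes to.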
But my favorite is this:
“Our research and development expenses were $87 million, $144 million, and $388 million for 2009, 2010, and 2011, respectively.”
So R&D expenses grew almost five-fold in three years. Considering Facebook had $1 billion in profit on $3.7 billion of revenue last year, the company’s research budget came to around 10% of sales. This is very healthy (albeit natural, perhaps, for a company boasting such hefty profit margins). According to the OECD, the top 100 R&D-intensive companies in the IT and telecoms sectors spend an average of nearly 7% of revenue on R&D.
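The arithmetic behind those two claims, using the figures straight from the prospectus, checks out:

```python
# Checking the prospectus arithmetic (figures from the document, in $m).
rnd = {2009: 87, 2010: 144, 2011: 388}
revenue_2011 = 3700  # $3.7 billion

growth = rnd[2011] / rnd[2009]            # ~4.5x, i.e. "almost five-fold"
share_of_sales = rnd[2011] / revenue_2011  # ~0.105, i.e. about 10%

print(f"R&D grew {growth:.1f}x from 2009 to 2011")
print(f"R&D was {share_of_sales:.0%} of 2011 revenue")
```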
Most of the fruits of that R&D are probably kept internal, protected as trade secrets. But for that generous sum, the prospectus informs us:
“As of December 31, 2011, we had 56 issued patents and 503 filed patent applications in the United States and 33 corresponding patents and 149 filed patent applications in foreign countries relating to social networking, web technologies and infrastructure, and related technologies. Our issued patents expire between May 2016 and June 2031.”
But the most interesting thing is how much was not exposed in the prospectus. In a section where Facebook purported to explain its analytics, with an example of how it uses elements on a webpage to determine what ads to show (page 87), the example was so juvenile as to be meaningless.
It is actually funny the way Facebook keeps quiet on analytics, considering that the first time the word appears is on page 12, where Facebook cites it as one of the “risk factors” that could ruin the business:
“our inability to improve our analytics and measurement solutions that demonstrate the value of our ads and other commercial content”
Though it is loath to make too much of it, since it is its main source of value, Facebook is an analytics company before anything else. Google might have been the world’s first big-data IPO. Facebook may be the first analytics one. But you wouldn’t know it from its IPO prospectus.
What is the world of BigData? To define it is to limit it. But here’s one way in: it refers to things that we cannot see with the naked eye, and that are only revealed to us by a huge body of data. So, for example, we may not know which restaurant people think is best in terms of food, service, atmosphere and a good time. But being able to see the percentage left as tips could be a way of learning this, in a way we couldn’t know in the past — when we couldn’t get the data or process it.
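The tip-percentage idea can be made concrete in a few lines. The data below is entirely invented; the point is only that aggregating many small observations reveals something no single diner sees:

```python
# A toy version of the tip-percentage idea: aggregate many small
# observations to reveal a ranking no individual diner can see.
# All restaurant names and figures are invented for illustration.
from statistics import mean

tips = {  # restaurant -> tip percentages left by many diners
    "Chez Foo": [18, 22, 25, 20],
    "Bar Baz":  [10, 12, 8, 15],
}

best = max(tips, key=lambda r: mean(tips[r]))
print(best, mean(tips[best]))  # → Chez Foo 21.25
```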
This is a fictitious example (and one imagined by Mike Driscoll of Metamarkets). But it helps to concentrate the mind on what is new about the BigData revolution taking place, and how information cleverly reused can create new sources of economic value. And this leads me to thinking about the Wall Street Journal’s excellent article “The Really Smart Phone” by Robert Lee Hotz, which is part of the paper’s impressive series “What They Know.”
There’s much to praise in the piece. But rather than recap it, I want to put forward some vital distinctions that the industry needs to consider when thinking about these trends.
First, we need to separate the process of BigData from its output. The article — like the industry — doesn’t really do this. For example, sometimes we talk about being able to track 100 million cellphone users (but don’t note the substance of what is being tracked: calls? location? bills?) And sometimes we talk about what we learn, such as a person’s susceptibility to obesity. But is it because location data shows they’ve been sedentary? Or because they bought lots of ice-cream from their iPhone?
These distinctions are crucial. In one instance, it is anonymized metadata; in another, it is individual information. The ways that entities are allowed to use these different types of data perhaps ought to be different too.
Most people, and most articles in the press, approach the BigData issue from the negative: “if you only knew what they know about you!” But I believe that industry ought to be far more transparent, because if people did know, they’d probably be more impressed than alarmed. (It is a point that I made in my special report “The data deluge” in The Economist last year.) Specifically, what is so unsettling: what they collect? Or what they know? On the surface, we bristle at both. But when we look deeper, it gets fascinating to see what new things can be learned from a big body of data.
Hence, the transition from “Ick!” to “Wow!” But I think the failure of industry to be open about its practices will hold it back. Thus: the public will cry: “Wait!”
People are antsy. Regulators are uneasy. And business is barricading itself. Amazon never discusses it. Google does — and thus invites abuse, alas. Apple is characteristically silent. “Google Inc. defended the way it collects location data from Android phones, while Apple Inc. remained silent for a third day,” the WSJ wrote in a separate article on April 23.
I think with the right outreach, BigData firms can make the case for collecting and processing the information. It will change the debate to the more essential questions: who owns the information, who gets to benefit from it, how it is valued, how it is protected, and what the penalties are if this trust is abused.
This further establishes the distinction I identified at the outset: separating the process from the substance. In fact, we are talking about so many records — millions of people, zillions of data-points of locations or calls — that the practical effect seems to be anonymization, even if it is not done effectively in practice.
To overcome the “Wait!,” I’d urge that we make rules. A starting point for thinking about them I’ll discuss another time. For now, a look at what the WSJ piece did a nice job of highlighting. The large numbers associated with the research were interesting but actually unimportant — they’re just big numbers; the process. The actual output, what is to be learned, is far more interesting. Specifically, cellphone BigData lets us:
– pinpoint “influencers,” the people most likely to make others change their minds
– forecast where people are likely to be at any given time in the future
– predict which people are most likely to defect to other cellphone carriers
– reveal subtle symptoms of mental illness
– foretell movements in the Dow Jones Industrial Average
– chart the spread of political ideas as they move through a community
– expose a cultural split that is driving a historic political crisis (in Belgium)
– deduce that two people were talking about politics
– detect flu symptoms before the students themselves realized they were getting sick
A final note: all of these insights were gleaned by parsing two types of data: location and interconnections among users — metadata. There is a lot more mobile data to collect; we’ve barely scratched the surface. Also, nota bene that none of the data relates to specific content from the phone or user. And it is not clear that the data collected can be traced back to a specific user, other than in cases of academic research in which consent was granted.
In some ways, the data collected looks like the “pen register” information that is less spooky than an outright wiretap: eg, who calls whom and when, but not what they said. It has a lower standard for law enforcement to obtain.
My point is this: the BigData issues we’re confronting now are the easy ones. So this is the moment to start seriously debating them, and arriving at answers — as a precursor to the harder issues coming down the pike.
One possibility comes to mind: perhaps shopping patterns are so statistically consistent, routine and as personal as DNA that information about a person’s previous purchases — or even non-shopping activities — enables an algorithm to know whether the customer is truly the person he or she claims to be.
That is interesting. But, alas, the report looks at more prosaic things, e.g.:
PayPal, Amazon, and Google have all developed sophisticated analytical tools and infrastructure to identify patterns of fraudulent activity. Paypal, for example, has a series of Fraud Management Filters that screen payments and sort out transactions that warrant review because of their amount, their origin, or other factors that can be set by a merchant. […] PayPal and Amazon have developed fraud detection tools that depend on massive datasets containing not only financial details for transactions, but IP addresses, browser information, and other technical data that will help these companies refine models to predict, identify, and prevent fraudulent activity. PayPal and Amazon have had years to amass databases of the transaction details for hundreds of millions of customers across thousands of merchants.
The sort of filtering and checking described above involves no conceptual shift in how to use data. All that is being described is applying the same intuitive techniques that one would long have used in a world of “small data.” The only thing “big” about it is that there’s a lot more data to sift through. The firms are not using the size and depth of data to do anything novel per se.
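To see why such filtering is conceptually “small data,” consider how little code it takes. Here is a toy version of merchant-configurable screening filters of the kind the report describes (the thresholds, field names, and country codes are all illustrative, not PayPal’s actual rules):

```python
# Toy rule-based transaction screening: flag payments for manual review
# based on amount, origin, or an address mismatch. All thresholds and
# field names are illustrative; they are not PayPal's actual filters.

def needs_review(txn, max_amount=500.0, blocked_countries=frozenset({"XX"})):
    """Return the list of reasons a transaction warrants review (empty if none)."""
    reasons = []
    if txn["amount"] > max_amount:
        reasons.append("amount")
    if txn["origin_country"] in blocked_countries:
        reasons.append("origin")
    if txn["shipping_country"] != txn["billing_country"]:
        reasons.append("address mismatch")
    return reasons

txns = [
    {"amount": 42.0, "origin_country": "US",
     "shipping_country": "US", "billing_country": "US"},
    {"amount": 999.0, "origin_country": "XX",
     "shipping_country": "US", "billing_country": "DE"},
]
print([needs_review(t) for t in txns])
# → [[], ['amount', 'origin', 'address mismatch']]
```

Every rule here could have been applied to a paper ledger; adding a billion more rows changes the engineering, not the idea.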
This is a pity. The revolution that is taking place in other dimensions of the Internet industry is that companies can do entirely new things with a big data set that they cannot do with a small one. A former top Google executive once told me that Google Checkout was created in part because the firm realized that learning about a customer’s shopping pattern could better detect fraud, which is the key e-commerce stumbling block.
Likewise, at the O’Reilly Strata conference in February, hallway chitchat was about how a financial-services firm might predict whether someone will repay a loan more accurately using Facebook’s social graph than a FICO score, since the best predictor of whether a person will repay is whether their friends repay their loans. (Actually, the example was told to me as if it were already being done, though not with Facebook’s data.) Yet I think I’m safer considering it apocryphal until I hear it first-hand.
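Apocryphal or not, the idea is simple to state as code: score an applicant by the repayment rate among their friends. The graph and repayment histories below are entirely made up, and a real lender would need far more than this one signal:

```python
# Hypothetical sketch of the social-graph loan idea: score an applicant by
# the fraction of their friends who repaid. All data here is invented.

friends = {
    "alice": ["bob", "carol", "dave"],
    "eve":   ["mallory", "trent"],
}
repaid = {"bob": True, "carol": True, "dave": False,
          "mallory": False, "trent": False}

def friend_repayment_rate(person):
    """Fraction of a person's friends who repaid their loans (None if no friends)."""
    circle = friends.get(person, [])
    if not circle:
        return None  # no social graph, no signal
    return sum(repaid[f] for f in circle) / len(circle)

print(friend_repayment_rate("alice"))  # 2 of 3 friends repaid
print(friend_repayment_rate("eve"))    # 0 of 2 friends repaid
```

Unlike the fraud filters above, this is a genuinely “big data” move: the signal exists only in the network, not in any single customer’s record.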
Does anyone know of incredible stories of how “big data” is being used in new ways to reduce financial fraud? If so, comment here or email me directly.