Kenneth Neil Cukier

Learn to code. And cook, perform open-heart surgery, write kanji, design efficient thermodynamics, blog…

May 16, 2012 cukier

For such an interesting and useful debate, what silly arguments are being advanced. I think the coders need to lift their heads from their screens and spend a year learning the humanities.

Clay Johnson (@cjoh) believes that coding is like literacy: if you don’t learn it, you’re shut out of the world. Matt Galligan (@mg) suggests it is like cooking: it doesn’t matter that you can’t compete against Jamie Oliver with a whisk, simply being knowledgeable is important. (I’ll link to their tweets shortly.)

Both are deeply smart. Yet I beg to differ on two grounds.

First, we live in a resource-constrained world, and one cannot peruse everything one would like. I’ve never read Plutarch despite knowing I’d be a better person for it.

Second, there is a value to concentrating one’s efforts where there is the biggest payoff; the idea of comparative advantage. An old economics textbook — was it Samuelson? — used the example of why President Roosevelt oughtn’t type his own letters even if he is a faster typist than his secretary.

I have no principled objection to learning to code, either rudimentarily or more seriously, if that is what one wishes. But insisting that it is somehow essential to learn is ridiculous.

Surely the same arguments could be made for other things that affect us on an everyday level, such as food (learn to farm!) or health (learn to sequence genomes!). We drive cars: must we learn how they work? We surf the internet: must we tinker with IPv6 header fields and the protocol stack? Where does it end?

Benjamin Franklin detested that schools in his day taught Latin and Greek as standard fare — far better to learn living languages to actually talk to people, he recommended. I’m not as narrowly practical as that. (After all, Latin let scholars across Europe communicate, as the term “Latin Quarter” in Paris suggests. And it opens the mind to a world of great works.) Yet there is a lot of sense to the idea that we should be discriminating about how we spend our time.

Yes, yes. I understand that as more facets of life are dominated by computers, with algorithms making decisions that once were done by people, it is essential that the public has a basic understanding of how software works, so that they can appreciate its limitations — and can act on that knowledge as citizens, voters, consumers, parents, etc. I get it. (I’m even writing a book on big-data that deals with this.)

Still, the principles of software, albeit useful to be familiar with, hold no sacrosanct importance that should let them jump the queue of priorities; coding can hardly claim some sort of categorical imperative that elevates it to something any honorable person must know. Rather, it is like most other things in life: nice to know if you can, but one can avail oneself of the marketplace to bring in the skills when they’re needed. Typists never needed to know the innards of a mechanical typewriter. Newspaper subscribers don’t need to learn about presses, or the stylebook, or HTML5.

What may be most surprising is that people are surprised. So coders urge everyone to code. Priests want parishioners to pray. Boy scouts want us to camp. Generals ask that we be prepared to defend. When you’re a hammer, everything looks like a nail. Frogs see the universe as a pond. Lawyers want us to code too, but not in the way software engineers mean.

In the Middle Ages, music was the fourth of the seven liberal arts that all educated men needed to learn. Should today’s computer scientists think less of themselves if they cannot sight-read a staff?

Categories: Uncategorized

Can a cellphone carrier be stupider?

May 10, 2012 cukier

Even as someone who casts a jaundiced eye on the use of data and analytics in business, I can’t help but think that the introductory message on Vodafone’s help line just couldn’t be worse.

Calling today to solve a small problem with the new account, I am accosted by the “perky British ‘It Girl’” voice. She’s enthusing that the new Samsung smartphone is about to be released. But Vodafone already knows who I am — they can detect that from the cellphone I am calling from. They also know my account information. Certainly they know that I just signed an 18-month contract for an iPhone. Why should I care about a Samsung phone?

The answer is that I shouldn’t — and they know it. In their dunderhead minds, this was probably just an advertising deal in which they agreed to bombard callers to the service line with the advert in return for some dosh. But it doesn’t seem very cost-effective. Why not use the moment to call to my attention something that I might actually be interested in buying? Instead, Vodafone is “training” me to ignore their marketing messages, since they are not relevant.

So Vodafone gets some short-term lucre, but annoys its customers and creates psychological incentives to disregard its adverts, causing longer-term harm.

What is most pathetic about this is that it need not happen. Who is to say that there ought be one advert for all callers? Wouldn’t it make more sense to segment callers and have six or thirty or two hundred different messages?

It underscores that the entities most teeming with data can be the stupidest at using it.

Grrr… I’m still on hold!

Categories: analytics, Apple, iPhone, Samsung, Vodafone Tags: iPhone, Samsung, Vodafone

Help! My iPad is dumber than I am!

April 2, 2012 cukier

If there ever was a case for basic analytics and personalization, this is it. For such a smart machine in so many ways, my iPad couldn’t be dumber when it comes to its recommendations.

The app store and newsstand apparently think it’s OK to make recommendations of items that I already own. On the newsstand, it foists ads for The Economist — seemingly unaware that I already have it on the device, and that I’m a full subscriber. Being bombarded with a useless ad might seem to only cost me something (ie, my attention), but it costs Apple something too: an occasion to show me something relevant, like a subscription offer to The Atlantic or The New York Review of Books, which I may indeed want.

The app store is just as bad. One day I buy an app, and the next day the app store still tries to recommend it to me. It does this even though it actually knows that I already own it, considering that it marks “installed” in the box where the price usually is.
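The missing step is small. As a minimal sketch, assuming the store holds a ranked list of candidate titles and a list of what the customer already owns (the function name and the data here are hypothetical, not Apple's actual system), the basic personalization amounts to dropping owned items before showing anything:

```python
def recommend(candidates, owned, limit=3):
    """Return up to `limit` ranked candidates the user does not already own."""
    owned_set = set(owned)  # set membership keeps the filter fast
    unowned = [item for item in candidates if item not in owned_set]
    return unowned[:limit]


# Hypothetical data: a ranked newsstand list and one existing subscription.
ranked = ["The Economist", "The Atlantic", "The New York Review of Books"]
owned = ["The Economist"]

print(recommend(ranked, owned))
# ['The Atlantic', 'The New York Review of Books']
```

The slot that would have gone to a title already on the device goes to the next-best one instead.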

Are we so inured to information-technology not working that we fail to care when it confirms our presumption? There are two reasons why Apple’s failure to incorporate users’ information into what it recommends is more than just sloppy system design.

First, Apple’s brand promises a premium service and excellent design. Steve Jobs built the company’s reputation on that and trounced rivals. The lack of personalization leads Apple to fall short of the standard it sets for itself.

Second, Apple’s ignorance hurts us both. The company’s fortunes are tied to software and services atop the device. So it effectively forgoes revenue opportunities whenever it tries to sell me something that I already have. And for me as a customer, the irrelevant ads in effect “train” me to give Apple less of my attention when I interact with the service, since I don’t expect the ads to be as useful.

Ultimately, the problem underscores that many people may want personalization and targeted advertising when it brings them value. As the owner of an iPad who wants to cut through the chaff and add functionality to the device, I find Apple’s use of my data useful. The episode shows that customers can be just as angry when the expectation of personalization falls short as when it creepily happens when one doesn’t expect it.

Categories: analytics, Apple, iPad Tags: analytics, Apple, iPad

What Facebook’s IPO reveals about big-data analytics

February 10, 2012 cukier

Those obsessed with Mammon will read Facebook’s IPO prospectus for what it says about making money. Others of us with a more geeky bent will pore over what it reveals about how the company handles data. It starts with arresting stats: 845 million active monthly users; 100 billion friendships, and every day 250 million photos uploaded and 2.7 billion likes or comments.

But that is just the eye-candy. The substance is buried deep in the prose, under the heading “Data Management and Personalization Technologies.” Get a load of this:

“loading a user’s home page typically requires accessing hundreds of servers, processing tens of thousands of individual pieces of data, and delivering the information selected in less than one second. In addition, the data relationships have grown exponentially and are constantly changing.”

And then there is this:

“We use a proprietary distributed system that is able to query thousands of pieces of content that may be of interest to an individual user to determine the most relevant and timely stories and deliver them to the user in milliseconds.”

And this:

“We store more than 100 petabytes (100 quadrillion bytes) of photos and videos.”

And this:

“We use an advanced click prediction system that weighs many real-time updated features using automated learning techniques. Our technology incorporates the estimated click-through rate with both the advertiser’s bid and a user relevancy signal to select the optimal ads to show.”

But my favorite is this:

“Our research and development expenses were $87 million, $144 million, and $388 million for 2009, 2010, and 2011, respectively.”

So R&D expenses grew roughly four-and-a-half-fold between 2009 and 2011. Considering Facebook had $1 billion in profit on $3.7 billion of revenue last year, the company’s research budget came to about 10% of sales. This is very healthy (albeit natural, perhaps, with a company boasting such hefty profit margins). According to the OECD, the top 100 R&D-intensive companies in the IT and telecoms sectors spend an average of nearly 7% of revenue on R&D.
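The figures are easy to check. The sketch below uses only the numbers quoted above (R&D spending from the prospectus, and the rounded $3.7 billion revenue figure); the variable names are mine:

```python
# R&D expenses in $ millions, as quoted from the prospectus.
rd_by_year = {2009: 87, 2010: 144, 2011: 388}

# 2011 revenue in $ millions (the rounded figure cited in the text).
revenue_2011 = 3700

growth = rd_by_year[2011] / rd_by_year[2009]
share_of_revenue = rd_by_year[2011] / revenue_2011

print(f"R&D growth 2009-2011: {growth:.2f}x")
# R&D growth 2009-2011: 4.46x
print(f"R&D as share of 2011 revenue: {share_of_revenue:.1%}")
# R&D as share of 2011 revenue: 10.5%
```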

Most of the fruits of the R&D are probably kept internal and covered under trade secrets. But for that generous sum, the prospectus informs us:

“As of December 31, 2011, we had 56 issued patents and 503 filed patent applications in the United States and 33 corresponding patents and 149 filed patent applications in foreign countries relating to social networking, web technologies and infrastructure, and related technologies. Our issued patents expire between May 2016 and June 2031.”

But the most interesting thing is how much was not exposed in the prospectus. In a section where Facebook purported to explain its analytics, with an example of how it uses elements on a webpage to determine what ads to show (page 87), the example was so juvenile as to be meaningless.

It is actually funny the way Facebook keeps quiet on analytics, considering that the first time the word appears is on page 12, when Facebook cites it as one of the “risk factors” that could ruin the business:

“our inability to improve our analytics and measurement solutions that demonstrate the value of our ads and other commercial content”

Though it is loath to make too much of it, since it is its main source of value, Facebook is an analytics company before anything else. Google might have been the world’s first big-data IPO. Facebook may be the first analytics one. But you wouldn’t know it from its IPO prospectus.

Categories: analytics, big data, Facebook

Data in antiquity — and the info all around us

December 29, 2011 cukier

The always insightful Pete Warden recently penned a blog post on “What the Sumerians can teach us about data.” There is much to praise and react to in his analysis. But I’m struck in particular by a semantic matter: does Pete really mean “data” or “information”? I usually hate this genre of challenge; it’s the most tedious in our business. But this time it deserves to be raised.

The reason is that the idea of quantification is really a phenomenon of the Middle Ages in Europe (laying to rest the old canard that they were “dark ages” devoid of progress). On the other hand, the period of antiquity is typified by man describing his world as one of qualities. (Remember Socrates’s “forms?” And Aristotle’s taxonomy on just about everything?)

To be sure, in the area of money we can talk about quantification and thus data as we think about it today. But in many of Pete’s terrific examples of how the Sumerians recorded their world — in the “fixed media” of clay tablets and the like — I am unsure if the term data fits.

Ought “writing” be considered data? If so, how about caveman paintings? Surely the Egyptian hieroglyphs imparted information — but should we call it “data” per se? The only way to answer that question is to define data.

The word data is the plural of datum, neuter past participle of the Latin dare, “to give”, hence “something given,” instructs Wikipedia. “1. Facts and statistics collected together for reference or analysis. 2. The quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of…” reports a Google definition.

Building on the idea that data may be something different than just recording information, at what point does something go from being simply info to data?

I have a few ideas on how to answer this — I am scribbling away on a large work that looks at this topic among others. But I’m not quite ready to share it with the world, since the thoughts are still fermenting. In the meantime, Pete’s post is a wonderful look at how an early society recorded and used information. Among my favorite points:

* “Written records remove the problem of fallible memories, but replaces it with a second-degree question of provenance. How do you know the data accurately reflects what happened?”

* “We still have a disturbing tendency to trust anything that’s recorded, without understanding the subjective process that went into creating the record.”

* “The main way Sumerians protected the integrity of their data was through curses. This may seem laughable to a modern audience, but I don’t think we’re so different. Do you expect the FBI to actually raid your house if you copy that VHS tape?”

* “In the absence of real answers, we’ll take bogus ones painted with a veneer of data, just like the Sumerians.”

* “If there’s any way you can, please think about how to open up data you control, it’s the best way to pass it on to posterity.”

Having pointed out what I enjoyed most, let me close on a final quibble. Pete writes:

“The Sumerians recorded everything on stone or clay tablets … This data exhaust gives a rich view into trade, worship, life, death, medicine and almost every other aspect of the Sumerian’s world.”

It is absolutely not “data exhaust” in the way that the term has come to be known (and how I helped popularize it in a report a few years ago). The idea is that interacting with information generates further information as a byproduct, which can itself be collected and analyzed. The simplest example is tracking readers’ activities to reveal to website visitors the most-read articles, as a simple heuristic to indicate what might interest them.
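To make the most-read example concrete, here is a minimal sketch. It assumes the site keeps a log with one entry per page view; the log contents and the function name are hypothetical. The point is that nobody set out to record popularity: the ranking falls out of traces that readers leave anyway.

```python
from collections import Counter


def most_read(page_views, n=3):
    """Rank articles by how often they appear in the page-view log."""
    counts = Counter(page_views)
    return [article for article, _ in counts.most_common(n)]


# Hypothetical log: one entry per page view, in the order the views happened.
log = [
    "sumerians", "facebook-ipo", "sumerians",
    "geo-location", "facebook-ipo", "sumerians",
]

print(most_read(log, 2))
# ['sumerians', 'facebook-ipo']
```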

What Pete describes, and what the Sumerians recorded, was information (or perhaps data) pure and simple. No “exhaust” about it — other than that the tablets had been thrown away by the Sumerians before modern archeologists dug them up.

But all this ranting is only meant to add momentum to my appreciation for Pete’s splendid work in this post and others!

Categories: historical, textual info

From insanity to inanity

June 12, 2011 cukier 1 comment

How to craft rules in a BigData world for information access? It is a hard question. But how not to is far clearer.

According to a new US government policy, lawyers representing Guantanamo prisoners are allowed to read Wikileaks’ classified US documents — but not print or save them. The actual policy “guidance” is here (from Politico) and an analysis by Politico’s Josh Gerstein is here.

Are the US officials that devised this policy out of their minds? How could anyone rationally adopt such an inherently inconsistent policy?

If the lawyers cannot read the material, they are blocked from accessing pertinent information that is already in the public domain, which could help them prepare a defense. Allowing access is only sensible. To do otherwise would be to deny reality (that the material is widely available), and might deny justice too.

However, crippling that access by placing arbitrary restrictions on its use makes no sense whatsoever. Why? On what basis is one allowed to read but not print or save? Surely the US does not mean for the frailty of a person’s memory to govern how material is put to use. But that is the policy’s effect.

The irony is that the current policy is actually a slightly more rational shift from previous rules that forbade any access at all. It underscores the fact that the government has no clue how to respond to the new world we’re in regarding BigData leaks.

And it is a longstanding problem. Just this month, the US officially released the trove of documents known as the Pentagon Papers — 40 years after they appeared in the New York Times. (The AP’s story is here.) The Economist, in an article last month about it (“The open society and its ostriches“) argued that the way to think about these cases is that “the illegal disclosures in effect declassify the information.”

When the contradiction between futile policies and the reality on the ground grows so wide as to be preposterous — as it is now — something has to give. It will be the rules, of course, that go. But with government, this takes a long time.

Categories: government policy, textual info, Wikileaks

Scott McNealy’s latest privacy top-ten

May 21, 2011 cukier

Scott McNealy, the co-founder and longtime boss of Sun Microsystems, was famous for his “top ten” riffs on tech trends. Today he’s recreated it on Twitter (follow @scottmcnealy), reprising his famous remark in 1999: “You have zero privacy anyway. Get over it.”

Here’s a compilation of the tweets (followed by a quick analysis relating it to Sony’s Stringer on security):

* * *

Top 10 signs you no longer have privacy and should get over it:
10. The guy behind the McDonalds counter greets you with, “Would you like a salad to help you with your constipation?”
9. A Google search on “white only clubs” has just one result: TaylorMade.
8. Your soon to be ex-spouse produces your iPhone GPS database in settlement hearings.
7. The TSA stops molesting and radiating your 82 year old mom because she is clearly not going to hijack that plane.
6. 20 neighbors show up at same Groupon inspired Spearmint Rhino happy hour in Vegas.
5. IRS starts auditing folks who don’t pay income taxes, not the folks who pay the most.
4. Local police become largest purchaser of camera equipped UAV’s.
3. Your parents require your Facebook, laptop, and phone passwords and actually review your online activity regularly. And you are 40.
2. The UPS driver delivers your small package to your door and, with a smile and wink, asks if you would like batteries with that.
1. Twitter starts suggesting Tweets for you, and they are perfect and better than your own.

* * *

As in 1999, McNealy is right on fact, wrong on what to do about it (as critics argued at the time). Not ensuring some protections is irrational. But whether he’s right or not is beside the point. It is refreshing when a top executive calls it as he sees it — and a bit silly when people quibble with the wording rather than the larger point itself.

Here, I’m thinking of Sony’s boss, Howard Stringer, who recently described the PlayStation Network hack in words that were sure to eviscerate him among tech journos. “Nobody’s system is 100 percent secure,” he said in a conference call. “This is a hiccup in the road to a network future.” (in Bloomberg’s piece). “It’s not a brave new world; it’s a bad new world,” he said (in the WSJ piece).

Stringer has been pounced on by some in the press. He shouldn’t be. Though we’ve long known the point he raises, it is still quite right.

Categories: privacy, security

Killing geo-location in the crib

May 14, 2011 cukier

Is this a return of the US-EU privacy wars of the 1990s, when Brussels’ bureaucrats threatened to halt intercontinental online transactions? They may be coming back. This time it’s over location data.

An upcoming EU report will say that “geo-location data has to be considered as personal data… The rules on personal data apply,” an EU official tells the Wall Street Journal.

The implication is that data collected by cellphones, Twitter, Facebook and others must be handled like names, birth dates, and other personal information: collected with user consent, deleted after a certain period, and kept anonymous.

This is absolutely preposterous. Yes, rules — better, tougher rules — are desperately needed. But to simply drop the data into a pre-existing regulatory bucket (as the EU is doing, of calling it “personal information” which has sweeping regulatory burdens) is asinine. It will hold back the amazing innovations and services that are just starting to emerge, and future ones that we can scarcely imagine today.

Calling something new (geo-location data) something old (personally identifiable information, or PII in the trade) is a far too blunt way to go about upholding legitimate public interest concerns that need to be addressed. It avoids the more humble — and probably more effective — task of trying to figure out the new properties of this type of data, and thus devise appropriate ways to balance personal privacy with innovative services. It’s harder to do this, but sounder.

This of course will happen, but over time, and probably in a different regulatory jurisdiction. Possibly America? Perhaps China? Maybe Brazil? But European geo-loco firms will suffer in the meantime, since they’ll be crammed into a regulatory straitjacket. And to be clear: this is not to say that better rules aren’t needed — they definitely are. But they ought be sensible ones.

Failing to take a more cautious and reflective regulatory approach results in things like the EU’s 1998 privacy directive. It did an excellent job of getting governments into the privacy arena, but it had lots of silly parts too. For one thing, it required an international “safe-harbor” provision in order to do innocuous things like allowing a US firm in France to send its payroll data to headquarters in Detroit. The directive is already out of date, and although it boasts strong enforcement provisions, they’ve barely ever been used.

In fairness, the US has miserable privacy legislation — no country does it well — but the piecemeal approach and building up of a body of regulatory experience is looking like a better way forward. There is no “privacy kommissar” in America, but that hasn’t stopped the FTC from taking serious action often.

A far better way to proceed is the way the US is moving. Sen. Al Franken’s opening statement to hearings on May 10th on cellphone privacy was a paragon of wise policymaking: he wants to find the right balance. He was scorching in his condemnation of current practices:

“Once the maker of a mobile app, a company like Apple or Google, or even your wireless company gets your location information … these companies are free to disclose your location information and other sensitive information to almost anyone they please-without letting you know. And then the companies they share your information with can share and sell it to yet others-again, without letting you know. This is a problem. It’s a serious problem.”

But at the same time, he understood the risks of regulating too soon:

“I just want to be clear that the answer to this problem is not ending location-based services. No one up here wants to stop Apple or Google from producing their products or doing the incredible things that you do. You guys are brilliant. When people think of the word “brilliant” they think of the people that founded and run your companies.”

If this gap in regulatory approach is not settled, the result may well be another round of the privacy wars. Companies like Apple, Google, Facebook, Twitter, Foursquare and others will have to tailor their operations depending on jurisdiction, down to their very code base. The EU will argue that they have to do this any way for language and law. But this still fractures and debilitates the service. And it is hypocritical: the idea behind the EU’s common market and common currency is about the gains from harmonization.

The best way to ward off bad public policy is good case studies of excellent services. Industry basically has its head in the sand hoping this issue will go away (it won’t) or is in hiding, hoping it doesn’t need to disclose how the services work (it does). Their actions are shortsighted. Geo-location services are interesting and useful, and if people really knew what was happening, many would be fine with it, provided a backstop of basic protections exists.

The case must be made publicly. So what are the amazing new services that are emerging that show why the EU’s approach is not quite right? Share your stories here.

Categories: Uncategorized

Tautology, or something more?

May 6, 2011 cukier 1 comment

This might be utterly obvious, but let me posit that one of the most compelling features of the current information avalanche is that (if you will): “big-data solves the problem of big-data.”

The problem is that the amount of information has expanded so much that it has become almost impossible to work with or comprehend in its totality. But new techniques that actually rely on the huge scale make that huge scale manageable and indeed useful. And though it is certainly an overstatement to say it “solves the problem,” I’d argue this is the right way to think about it.

Examples abound, if we look at things the right way. For example, computer translation was hard until the “test data” went from the billions to one trillion words — and then the machines got talking (as Google’s Peter Norvig explains here and Steven Levy tells here). Likewise, Jeff Jonas of IBM recounts a situation years ago when by adding more information to a database of people, the number of records on individuals actually shrank: he was able to identify and consolidate duplicate entries.

But the inspiration for these musings is an article in last week’s The Economist, “The science of science: How to use the web to understand the way ideas evolve.” Researchers came up with a clever way to identify and classify texts by grasping meaning from their content, outside of what the authors felt the classification ought be.

This lets machines parse huge volumes of text, which people can’t do, or can’t do well. Academic authors label the subject areas of their papers, but sometimes use far too many labels as a way to trick people into reading them, or are limited to just five labels, which may be too narrow. Sometimes they are required to use pre-determined labels from library science, which fail to account for emerging areas of scholarship. So for example, Adam Smith never regarded himself an economist — the term didn’t exist in that context — rather, he was a moral philosopher. This system would place him alongside Malthus, a pastor by trade and demographer by study, who incontestably wrote on economics.

Moreover, the system enables one to see how ideas molt and meld over time — just as Smith and Malthus seemed out of step with their “professions” in their time, but were foundational for the new field of economics. And it bears repeating: the reason the technology described in the article works is because there is enough data to make inferences about meaning. As the article states:

“Citation indices, which work only where publications refer to their sources explicitly, form a tiny nebula in the digital universe. News articles, blog posts and e-mails often lack a systematic reference list that could be used to make a citation index. Yet they, too, are part of what makes an idea influential.”

This opens up new areas for researchers to amass sources. For instance, the huge area of “gray literature” (as it’s called in library science) that is slightly outside the mainstream publication world is now more easily retrievable and citable.

It also indirectly overcomes Google’s inherent shortcoming. Google’s PageRank algorithm, at its most basic level, counts inbound links akin to academic citations and presumes that a page with more is more relevant. But basing relevance on link structure invites imperfection because ordinary people are themselves imperfect and may not link to the ideal content, thus creating suboptimal search results. The technique described in the article may help remedy this.
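The link-counting idea can be sketched in a few lines. To be clear, this is not the PageRank algorithm itself, which also weights each link by the standing of the page it comes from; it is only the "most basic level" described above, run on a made-up four-page web with hypothetical names.

```python
from collections import Counter


def inbound_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` maps each page to the list of pages it links out to.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # ignore repeated links from one page
            if target != source:     # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical web: "c" is cited by three pages, "d" by none.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

counts = inbound_counts(web)
ranking = sorted(web, key=lambda page: (-counts[page], page))
print(ranking)
# ['c', 'a', 'b', 'd']
```

The weakness noted above is visible even here: if page "d" happens to hold the best content but nobody links to it, it still ranks last.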

The upshot is that we are generally familiar with the idea that a characteristic of big-data is it seems to exhibit “inverse scaling features”: the more data you add, the better the system gets (rather than deteriorates, as most systems do when under more load). But another step ahead of this point is that “big-data solves the problem of big-data.” With so much info around, the only way to tackle it is to use its huge size to sort itself. This idea sounds like a serpent eating its tail — but it may be more than that.

Categories: big data, economics, Google, metadata

What donations tell us about … more donations

April 26, 2011 cukier

One of the most impressive trends over the past decade (and broadly, the past century) has been the rise of the NGO. In the 1990s they mushroomed like start-ups and attracted “social entrepreneurs.” The bigger shift today is that it’s no longer a person’s full-time job: now actual entrepreneurs toiling at start-ups have their own philanthropic gig on the side. A computer went from a 2-ton, $2 million, room-sized machine to a pocket-sized thing. So did non-profit organizations.

I recently scribbled a few thoughts about the data dimensions of responding to Japan’s crisis for The Economist’s website: “The information equation” on April 24th. I was impressed that a private-sector company was playing the role that a governmental organization or NGO might play. (It’s a Google.org project, to be exact.)

Among the things I learned was that Google collected $5.5 million in donations through its crisis-response page. A small but not insignificant haul. But it got me thinking. The world of BigData is about learning new things from information that is otherwise invisible to the naked eye. What could the donation data tell us about how to more effectively solicit charitable contributions? Specifically, as I wrote in the penultimate paragraph of the article:

The donation data may offer a chance to learn new things about how people contribute. For example, what is the average amount? Does it follow a standard normal deviation (ie, a “bell curve”) in which a few give a little and a lot, with the majority donating around $15? Or is it a power-law distribution, in which there are two or three extremely rich donors, a handful of generous ones, followed by a long tail of $2 contributions? Did they donate using PayPal or credit cards? What time of day do people give? Is it after they have read a news story or clicked a link within an e-mail? The information would help fundraisers tailor how to make their appeals. And the data can be broken down by country or even city via Internet Protocol addresses.
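A rough way to tell the two shapes apart can be sketched with invented numbers (I have no access to Google's data, and the function name and thresholds are my own). In a bell-curve-like sample the mean and median sit close together and no single donor dominates; in a long-tailed sample the mean is dragged far above the median and the top donor accounts for most of the total. This is a quick diagnostic, not a formal statistical test.

```python
from statistics import mean, median


def summarize(donations, top_fraction=0.01):
    """Summarize a list of donation amounts.

    Returns the mean, the median, and the share of the total
    given by the top `top_fraction` of donors (at least one donor).
    """
    ordered = sorted(donations, reverse=True)
    top_n = max(1, int(len(ordered) * top_fraction))
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "top_share": sum(ordered[:top_n]) / sum(ordered),
    }


# Invented samples for illustration only.
bell_like = [10, 12, 14, 15, 15, 15, 16, 18, 20, 15]
long_tail = [5000, 500, 50, 2, 2, 2, 2, 2, 2, 2]

for name, sample in [("bell-like", bell_like), ("long-tail", long_tail)]:
    stats = summarize(sample)
    print(name, stats["mean"], stats["median"], round(stats["top_share"], 2))
```

For the invented samples, the bell-like one has a mean and median of 15, while the long-tailed one has a median of 2 and a single donor supplying roughly nine-tenths of the total.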

I’ve asked Google’s hyper-helpful PR team to run the idea past their number-crunchers, to get access to the findings so I can write a story about this. It’s sort of like Google Flu Trends, but for charities. It would be highly valuable information for NGOs to know — particularly one that is dear to my heart, International Bridges to Justice (where I proudly serve on the board).

Categories: analytics, NGOs
