Don’t, don’t, don’t, don’t . . . We’re brothers of the same mind, unblind

by Andrew

[Image: Planet Hype]

Hype can be irritating but sometimes it’s necessary to get people’s attention (as in the example pictured above). So I think it’s important to keep these two things separate: (a) reactions (positive or negative) to the hype, and (b) attitudes about the subject of the hype.

Overall, I like the idea of “data science” and I think it represents a useful change of focus. I’m on record as saying that statistics is the least important part of data science, and I’m happy if the phrase “data science” can open people up to new ideas and new approaches.

Data science, like just about any new idea you’ve heard of, gets hyped. Indeed, if it weren’t for the hype, you might not have heard of it!

So let me emphasize that, in my criticism of some recent hype, I’m not dissing data science; I’m just trying to help people out a bit by pointing out which directions might be more fruitful than others.

Yes, it’s hype, but I don’t mind

Phillip Middleton writes:

I don’t want to rehash the Data Science / Stats debate yet again. However, I find the following post quite interesting from Vincent Granville, a blogger and heavy promoter of Data Science.

I’m not quite sure whether what he’s saying makes Data Science a ‘new paradigm’ or not. Perhaps it reflects something new apart from classical statistics, but then I would say the same of Bayesian analysis as a paradigm (or at least a still-budding movement) itself. But what he alleges – i.e., that ‘Big Data’ by its very existence implies that the cause of a response/event/observation can be ascertained, and seemingly without any measure of uncertainty – seems rather over-promising and hype-ish.

I am also a bit concerned by what I think he implies regarding ‘black box’ methods – that is, blind reliance upon them by those who are technically non-proficient. The notion that one should always trust ‘the black box’ is not in alignment with reality.

He does appear to discuss dispensing with p-values. In a few cases, like SHT, I’m not totally inclined to disagree (for reasons you speak about frequently), but I don’t think we can be quite so universal about it. That would pretty much throw out most every frequentist test with respect to comparison, goodness-of-fit, what have you.

Overall I get the feeling that he’s casting the ‘new’ era as one of solving problems with certainty, which seems more the ideal than the reality.

What do you think?

OK, so I took a look at Granville’s post, where he characterizes data science as a new paradigm “very different, if not the opposite of old techniques that were designed to be implemented on abacus, rather than computers.”

I think he’s joking about the abacus but I agree with this general point. Let me rephrase it from a statistical perspective.

It’s been said that the most important thing in statistics is not what you do with the data, but, rather, what data you use. What makes new statistical methods great is that they open the door to the use of more data. Just for example:

- Lasso and other regularization approaches allow you to routinely throw in hundreds or thousands of predictors, whereas classical regression models blow up with that many (see the first sketch after this list). Now, just to push this point a bit: back before there was lasso etc., statisticians could still handle large numbers of predictors; they’d just use other tools such as factor analysis for dimension reduction. But lasso, support vector machines, etc., were good because they allowed people to more easily and more automatically include lots of predictors.

- Multiple imputation allows you to routinely work with datasets with missingness, which in turn allows you to work with more variables at once (the second sketch below shows the workflow). Before multiple imputation existed, statisticians could still handle missing data, but they’d need to develop a customized approach for each problem, which is enough of a pain that it would often be easier to simply work with smaller, cleaner datasets.

- Multilevel modeling allows us to use more data without that agonizing decision of whether to combine two datasets or keep them separate. Partial pooling allows this to be done smoothly and (relatively) automatically (the third sketch below shows the shrinkage at work). This can be done in other ways, but the point is that we want to be able to use more data without being tied up in the strong assumptions required to believe in a complete-pooling estimate.

And so on.
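
To make the lasso point concrete, here is a minimal sketch in Python using scikit-learn on simulated data (the dimensions, sparsity pattern, and penalty are arbitrary choices for illustration, not recommendations): with 1,000 predictors and only 100 observations, classical least squares is ill-posed, but the lasso fits happily and zeroes out most of the coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 1000                  # far more predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, 1, -1]    # only 5 predictors actually matter
y = X @ beta + rng.standard_normal(n)

# Classical OLS is ill-posed here (X'X is singular when p > n),
# but the L1 penalty makes the problem well-defined and sparse.
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.sum(fit.coef_ != 0))
```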
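
The multiple-imputation workflow can be sketched too; here scikit-learn’s IterativeImputer stands in for the imputation engine (again simulated data; the missingness rate and the choice of five imputations are arbitrary). The point is that you draw several completed datasets rather than one, run your analysis on each, and combine the results, so the imputation uncertainty is carried through rather than ignored.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
X[:, 3] += X[:, 0]                       # make the columns related
X[rng.random(X.shape) < 0.15] = np.nan   # knock out ~15% of the entries

# Five stochastic imputations: sample_posterior=True draws from the
# predictive distribution instead of filling in a single best guess.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
print(len(imputations), imputations[0].shape)
```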
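
And the partial-pooling point can be shown without fitting a full multilevel model, using the standard precision-weighted shrinkage formula with the within-group and between-group variances plugged in as known (in practice you would estimate them; everything below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Groups of very different sizes, drawn from a common distribution
sizes = [3, 5, 40, 200]
mu_true = rng.normal(0, 1, size=len(sizes))
groups = [rng.normal(m, 2, size=n) for m, n in zip(mu_true, sizes)]

ybar = np.array([g.mean() for g in groups])   # no-pooling estimates
n = np.array(sizes, dtype=float)
grand = np.concatenate(groups).mean()         # complete-pooling estimate

sigma_y2 = 4.0   # within-group variance (treated as known here)
sigma_a2 = 1.0   # between-group variance (likewise a plug-in value)

# Partial pooling: each group mean is pulled toward the grand mean,
# and the small groups are pulled harder than the large ones.
w = (n / sigma_y2) / (n / sigma_y2 + 1 / sigma_a2)
partial = w * ybar + (1 - w) * grand
print(np.round(partial - ybar, 3))   # shrinkage is biggest for small groups
```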

Similarly, the point of data science (as I see it) is to be able to grab the damn data. All the fancy statistics in the world won’t tell you where the data are. To move forward, you have to find the data; you need to know how to scrape and grab and move data from one format into another.
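
That grab-and-reshape step is mundane, which is part of the point. Here is a toy sketch, standard library only and with made-up records, of pulling data out of one format and pushing it into another:

```python
import csv, io, json

# A stand-in for data grabbed from somewhere inconvenient: a CSV blob
raw = io.StringIO("state,year,turnout\nOH,2012,0.64\nOH,2016,0.61\nPA,2016,0.63\n")

rows = list(csv.DictReader(raw))
for r in rows:                        # coerce the numeric fields
    r["year"] = int(r["year"])
    r["turnout"] = float(r["turnout"])

print(json.dumps(rows, indent=2))     # ...and out the other side as JSON
```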

On the other hand, he’s wrong in all the details

But I have to admit that I’m disturbed by how much Granville gets wrong. His buzzwords include “Model-free confidence intervals” (huh?), “non-periodic high-quality random number generators” (??), “identify causes rather than correlations” (yeah, right), and “perform 20,000 A/B tests without having tons of false positives.” OK, sure, whatever you say, as I gradually back away from the door. At this point we’ve moved beyond hype into marketing.
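
To spell out why that last claim earns the eye-roll, here is a quick simulation (all numbers arbitrary): run 20,000 uncorrected t-tests at alpha = 0.05 on data in which nothing whatsoever is going on, and you will flag roughly a thousand spurious winners.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tests, n_per_arm, alpha = 20_000, 100, 0.05

# Every test compares two arms drawn from the same distribution,
# so every "significant" result is a false positive by construction.
a = rng.standard_normal((n_tests, n_per_arm))
b = rng.standard_normal((n_tests, n_per_arm))
_, pvals = stats.ttest_ind(a, b, axis=1)

print((pvals < alpha).sum())   # about 1,000, i.e., 20,000 * 0.05
```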

Can we put aside the cynicism, please?

Granville writes:

Why some people don’t see the unfolding data revolution?
They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost, replacing employees by robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.

Ugh. I hate that sort of thing, the idea that people who disagree with you do so for corrupt reasons. So tacky. Wake up, man! People who disagree with you aren’t “afraid of the truth”; they just have different experiences than yours, they have different perspectives. Your perspective may be closer to the truth—as noted above, I agree with much of what Granville writes—but you’re a fool if you so naively dismiss the perspectives of others.

P.S. I just noticed this post is coming up, and I was reading it—based on the title, I had no idea what it would be about and no recollection of having written it! But the name Vincent Granville rang a bell . . . it turns out that just a few days ago (i.e., a couple months after writing the above post), I happened to get a completely unrelated email from someone else asking about this guy, and the funny thing is, I replied that I’d never heard of Vincent Granville but I thought he had some interesting and some silly things to say. And this other correspondent and I had an email exchange which I decided I’d blog. I’d post that email here but I think it would dilute the points above. So it will appear in a couple of months. It’s funny how I completely forgot this whole thing. Good that I blog; it’s an excellent memory extender.

P.P.S. From comments, I learn that Granville seems to have a habit of propping up his reputation via paid reviews and sock puppets. So perhaps people are taking his writings too seriously: he seems to have had some success grabbing the “data science” label and getting a bunch of hits to his site, but that doesn’t mean that he knows what he’s talking about. Indeed, if he’s actually making it up as he goes along, that would explain why so much of what he writes makes no sense.

The best analogy, perhaps, is to various business-advice books, poker manuals, and fad diets that try to bully the reader into submission with emphatic advice, unpolluted by evidence beyond the apparent success or slimness of the author.

P.P.P.S. Granville seems to be making stuff up about me. I have no interest in dealing with this sort of person and I don’t plan to post anything more about him.
