Shared posts

22 Aug 16:39

Verizon throttled fire department’s “unlimited” data during Calif. wildfire

by Jon Brodkin

A firefighter battling the Mendocino Complex fire on August 7, 2018, near Lodoga, California. (credit: Getty Images | Justin Sullivan)

Update: The Santa Clara fire department has responded to Verizon's claim that the throttling was just a customer service error and "has nothing to do with net neutrality." To the contrary, "Verizon's throttling has everything to do with net neutrality," a county official said.

Verizon Wireless' throttling of a fire department that uses its data services has been submitted as evidence in a lawsuit that seeks to reinstate federal net neutrality rules.

"County Fire has experienced throttling by its ISP, Verizon," Santa Clara County Fire Chief Anthony Bowden wrote in a declaration. "This throttling has had a significant impact on our ability to provide emergency services. Verizon imposed these limitations despite being informed that throttling was actively impeding County Fire's ability to provide crisis-response and essential emergency services."

Bowden's declaration was submitted in an addendum to a brief filed by 22 state attorneys general, the District of Columbia, Santa Clara County, Santa Clara County Central Fire Protection District, and the California Public Utilities Commission. The government agencies are seeking to overturn the recent repeal of net neutrality rules in a lawsuit they filed against the Federal Communications Commission in the US Court of Appeals for the District of Columbia Circuit.


01 Apr 14:33

Filming mosquitoes reveals a completely new approach to flight

by John Timmer

See that vortex on the back edge of the wing? That means lift. (credit: Bomphrey/Nakata/Phillips/Walker)

It's unmistakable. A high-pitched whine tells you you're sharing a room with a mosquito, and you are unlikely to end the evening without some itchy welts. The sound alone is enough to make you shudder.

You're not imagining things. Within the insect world, mosquitoes have a distinctive flight, with a short wing stroke and a very high frequency of wing beats. And now, researchers have figured out the physics behind their flight. They have identified two mechanisms for generating lift that had not previously been seen in any animal. "Much of the aerodynamic force that supports [the mosquito's] weight," the authors conclude, "is generated in a manner unlike any previously described for a flying animal."

The work, done by a small team of Japanese and UK researchers, involved setting up a series of eight high-speed cameras to capture every instant of a mosquito's wing flap from multiple angles. The resulting data allowed them to create a digital model of the wings as they went through a full stroke. This was then used to solve fluid dynamics equations for the air around the wings, letting the researchers track the movement of the air as the wing beat through it.


11 Mar 13:54

How to Think About Fashion

image


For being about something as low-stakes as clothes, online sites about men’s style can get surprisingly contentious. Just visit any online forum or check the comment section at Ivy Style. People outside of fashion tend to have a healthy sense of humor when it comes to dress, but enthusiasts – particularly ones with more classic sensibilities – can work themselves into a rage. Bruce Boyer put it well when he said some people treat this as though it were a bloodsport.

When you boil them down, arguments over clothes always come back to the same things. If it’s a runway show, someone will comment on how the clothes look odd. If it’s about bespoke, someone will ask what’s the point of all that handwork. If it’s someone new to dressing well, they’re sure to complain about prices (OK, that one is fair). 

The thing is, these critiques are rarely about the clothes themselves; they’re about the person’s values. A lot of the arguments are really about how different people value clothing differently. If you take the time to consider another person’s set of values, you’ll have your answer to almost any question – “who would wear such a thing in real life?” or “is bespoke worth it?”



image


Let’s break fashion down into three dimensions:

  • Practicality: On some level, all of us buy clothes for the same reasons. We want to stay comfortable, protect ourselves from the elements, and look good. Or at least look the way we wish to look. 
  • Designs: To get there, designers have to design. They come up with their versions of motorcycle jackets, anoraks, duffle coats, slim suits, sport shirts, chunky knitwear, etc. 
  • Construction: Since clothes are physical objects, they have to be made. It’s just a question of how the construction is done. This can include fusing, canvassing, padding, handsewing, machine-sewing, darts, felled seams, etc. 

Since fashion is a commercial enterprise, not a pure art, brands have to consider all three things when building a collection. They have to think about how something will sell in a store (the practical dimension), along with how they want to design and build a garment. 

In practice, however, most brands will focus on one of these dimensions more than the others. To take some examples:

Most brands focus on the practical. Here, we’re talking about companies such as H&M, Ralph Lauren, and Tom Ford. Whether your budget is $50 or $50,000, there’s a brand out there that will help you look good in straightforward and conventional ways. 

See the photo above. The clothes are so generic that you can’t even make out the company (it’s J. Crew). Such clothes are popular because most people have very practical interests when it comes to fashion – they want to look good, but without standing out from the crowd. That means wearing things such as slim cut chinos, basic button-ups, and conservative, casual outerwear. Even if formal dress codes have mostly disappeared, softly coded dress norms still govern what we wear. 

So, there are a million stores nowadays that will help you dress for the office and holidays; weddings and job interviews; pubs and parties. These are the sort of clothes that will make you look good, while also not making much of a statement. 


image
image


Some brands, on the other hand, focus more on design. This is where fashion tends to lose people: you have to be comfortable with the idea of challenging conventional norms. Or at least the idea of dressing for expressive and artistic reasons – not just getting by in public.

There are a million examples here. Raf Simons’ continual themes around recapturing youth; Martin Margiela’s unique mix of minimalism and deconstruction. In a recent article in the New York Times, Susannah Frankel wrote this of Margiela and his business partner, Jenny Meirens:

However radical, none of these concepts would have meant anything without the clothes. “For me, one of the strongest things about Martin was him taking something very popular, common or cheap and turning it into something chic,” Meirens says. He consistently used inexpensive men’s suiting and shiny black lining fabric. He made chubby coats out of silver Christmas tinsel and sheath dresses out of gold plastic rings. Then there were the gently frayed hems, the reversed seams, the reinventing of vintage finds from leather butchers’ aprons to antique wedding dresses — the turning of clothes literally upside down and inside out. 

Similarly, although not the same, brands such as Kapital and Rare Weaves are known for their patchwork clothes (Rare Weaves is the one that made the indigo coat you see above). To say that these clothes look beaten-up and haphazardly produced would miss the point. They’re beautiful precisely because they draw on ideas of impermanence, imperfection, and austerity. See this old post I wrote on Japanese boro. (Yes, designer versions are just approximations of the real thing, but the same can be said of almost any modern aesthetic tradition). 

Other good examples include Rei Kawakubo, Yohji Yamamoto, and Rick Owens. In Western Europe and the US, traditional pattern-making tends to emphasize the idealized, gendered form. For women, that’s a curvy hourglass silhouette. For men, it’s an athletic, v-shaped torso set upon a strong pair of columnar legs. 

Instead of revealing the body, however, designers such as Yamamoto shroud it with voluminous fabrics, generous pleats, and freer interpretations of the modern silhouette. It’s about finding creative ways to cut clothes, not just sexualizing the body. For a good write-up on how Comme des Garcons challenges conventional norms around age, class, and gender, see this post over at The Rosenrot. The appeal here is just as much about concepts as it is about aesthetics. 


image


The usual critiques here, however, are about the runway (e.g. “who would wear that in real life?”). Oftentimes, the answer can be as simple as “lots of people, particularly those living in big cities.” Other times, it can be about how a designer presents a big idea at a fashion show, but then repackages it for stores. If you think you’ve never been affected by a runway concept, check your closet. Even Brooks Brothers stocks slimmer, shorter-fitting jackets nowadays thanks to the influence of Hedi Slimane and Thom Browne. 

Fundamentally, how a company treats a runway show isn’t too different from how they treat their clothes. Companies that have more practically minded designs, such as Zegna, will often send their models marching down the catwalk – sort of like the automated conveyor at your local dry cleaners. Those that focus on design, on the other hand, will sometimes organize a show as though it’s performance art. 

You can see some of this in the work of Thierry Mugler and John Galliano. Both have cast shows as though they were theatre productions, with models dressed like heroes and villains. Then there’s Martin Margiela, who used creative lighting and video projections to emphasize his artistic, deconstructed creations. Or more recently Rick Owens, who cast a show with step dancers from American sororities. That one was as much a dance performance as it was a fashion show. 

Cynics will say these are just marketing ploys – organized to drum up press and, ultimately, sell clothes. And that’s probably true, but it doesn’t negate their artistic value. Creative designers, both through their collections and presentations, force us to reconsider traditional norms around dress. In order to appreciate this work, you have to consider that clothes can be something else besides practical tools – they can be art, creative concepts, or ways to add to the ongoing conversation about fashion. 

If you’re still skeptical of unconventional designs, I encourage you to check out this Met Museum page on Alexander McQueen. His designs were some of the most brilliant anywhere. (Frankly, if you remain unmoved, it’s possible that you have no soul.)


image


Finally, you have companies that focus on construction. Simon Crompton over at Permanent Style calls this “craft based menswear.” He mostly uses the term to refer to bespoke tailors and shoemakers – who make up the majority of people in this category – but there are contemporary designers employing the same methods. 

It’s important to understand that construction here stands for its own sake, apart from design and practicality. To be sure, some construction techniques are necessary for a certain design or function. A lot of this, however, is about craft. 

Why does Steed attach the waistbands to their trousers by hand? It looks the same either way from the outside, but the point is about producing the finest work possible. A hand-attached waistband looks a little cleaner on the inside, and even if nobody else will know, it’s nice to see that sort of traditional work still being done.

It’s the same reason why someone might appreciate the subtle, slightly raised look of a Milanese buttonhole. There’s no practical advantage; it’s nice just as a piece of hand-executed craft. People can find beauty in craftsmanship just as they can find beauty in conceptual designs. 


image


Much like companies, everyone interested in fashion values all three dimensions – we just weight them differently. See sites such as GQ and Esquire, whose readers are almost exclusively interested in the practical aspects of dressing well (e.g. how to look good on a first date). Or StyleZeitgeist and The Rosenrot, which are more about conceptual design. Or Permanent Style, which is mostly about craft and construction. (Notable is how many of Simon’s posts are about factory visits). 

And here is how arguments happen. When people disagree about clothes, they’re often taking their preferred dimension and just running with it – without necessarily considering how others can value things differently. Take any one of the endless arguments on StyleForum about whether bespoke shoes last longer than good Goodyear welted footwear. The answer is probably yes, but I don’t know if handwelted shoes necessarily even have to justify themselves on practical grounds. You can get Goodyear welted shoes these days for about $300-400; West End bespoke starts around $3,000. It would be hard to make a case that bespoke shoes last ten times longer. (Even Vass makes ready-to-wear handwelted shoes for $600).

So then, what’s the point of bespoke shoes? If you value traditional craft, they’re a joy in their own right. There may not be any added value in terms of comfort or longevity – you’re just enjoying the best of traditional craftsmanship. Answering the question “are bespoke shoes worth the price” depends on what you value. 

On the other end of this equation, take Tom Ford. His suits cost more than bespoke, but they’re made to ready-to-wear patterns and constructed with less handwork. Does it matter that the suits are of “lower quality”?

Again, if you value traditional craft above all else, then maybe. If you’re mostly interested in the practical dimensions of dressing well, then perhaps not. Tom Ford offers a unique look and style – one that you can’t easily get from a bespoke tailor – so handwork is secondary. 

The list could go on forever, but the point is that some things ought to be judged on their own terms. A conceptual design isn’t about its practicality; neither is handsewing (most of the time). To apply one dimension of fashion to another would make as much sense as critiquing free jazz using standards set in classical music. We all have our own interests when it comes to clothes, but it’s useful to recognize that others may value things differently. It’s because of that diversity that fashion is so interesting today. 

04 Jan 18:28

Pho Countryside Is Now Open Kenmore Square

by Dana Hatic

The original Quincy restaurant now has a sibling

A Vietnamese restaurant in Quincy recently opened a new outpost in Kenmore Square, taking over the space at 468 Commonwealth Ave. from a shuttered French restaurant. Pho Countryside has officially opened, as Boston Restaurant Talk reported and an employee reached at the Quincy location Wednesday morning confirmed.

Pho Countryside made plans to take over the Kenmore Square restaurant space in September, as previously reported. The new restaurant offers the same menu as its original Quincy location, which includes plenty of pho and a handful of hot pot options, plus vegetable dishes, rice plates with different toppings, and an assortment of other seafood and meat entrees.

Pho Countryside took over for Josephine, which served classic Parisian dishes and closed down in August 2016 after about a year in business.

In other Boston/Quincy pho expansion news, but in the opposite direction, Chinatown’s Pho Pasteur recently expanded to Quincy.

Pho Countryside to Open in Former Josephine Space in Kenmore Square [BRT]
Quincy’s Pho Countryside Will Expand to Kenmore Square [EBOS]

24 Nov 01:46

Political winds and hair styling

by junkcharts

Washington Post (link) and New York Times (link) published dueling charts last week, showing the swing-swang of the political winds in the U.S. Of course, you know that the pendulum has shifted riotously rightward towards Republican red in this election.

The Post focused its graphic on the urban / not urban division within the country:

Wp_trollhair

Over Twitter, Lazaro Gamio told me they are calling these troll-hair charts. You certainly can see the imagery of hair blowing with the wind. In small counties (right), the wind is strongly to the right. In urban counties (left), the straight hair style has been in vogue since 2008. The numbers at the bottom of the chart drive home the story.

Previously, I discussed the Two Americas map by the NY Times, which covers a similar subject. The Times version emphasizes the geography and is a snapshot, while the Post graphic reveals longer trends.

Meanwhile, the Times published its version of a hair chart.

Nyt_hair_election

This particular graphic highlights the movement among the swing states. (Time moves bottom to top in this chart.) These states shifted left for Obama and marched right for Trump.

The two sets of charts have many similarities. They both use curvy lines (hair) as the main aesthetic feature. The left-right dimension is the anchor of both charts, and sways to the left or right are important tropes. In both presentations, the charts provide visual aids and are nicely embedded within the story. Neither is intended as an exploratory graphic.

But the designers diverged on many decisions, mostly in the D(ata) or V(isual) corner of the Trifecta framework.

***

The Times chart is at the state level while the Post uses county-level data.

The Times plots absolute values while the Post focuses on relative values (cumulative swing from the 2004 position). In the Times version, the reader can see the popular vote margin for any state in any election. The middle vertical line is keyed to the electoral vote (plurality of the popular vote in most states). It is easy to find the crossover states and times.

The Post's designer did some data transformations. Everything is indexed to 2004. Each number in the chart is the county's current leaning relative to 2004. Thus, left of vertical means said county has shifted more blue compared to 2004. The numbers are cumulative moving top to bottom. If a county is 10% left of center in the 2016 election, this effect may have come about this year, or 4 years ago, or 8 years ago, or some combination of the above. Again, left of center does not mean the county voted Democratic in that election. So, the chart must be read with some care.
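
To make the indexing concrete, here is a minimal sketch in Python of the kind of transformation described above. The county name and margins are invented for illustration; this is not the Post's actual data or code.

```python
# Hypothetical example: cumulative swing relative to a 2004 baseline.
# Margins are (Dem share - Rep share) in percentage points; values are invented.
margins = {
    "Example County": {2004: -10, 2008: -4, 2012: -6, 2016: -18},
}

for county, by_year in margins.items():
    base = by_year[2004]
    # Swing relative to 2004: positive = more Democratic-leaning than in 2004,
    # negative = more Republican-leaning. This says nothing about who actually won.
    swings = {year: margin - base for year, margin in sorted(by_year.items())}
    print(county, swings)
```

In this toy example the county reads as 6 points left of center in 2008 and 8 points right of center in 2016, even though it voted Republican in every one of those elections, which is exactly why the chart must be read with care.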

One complaint about anchoring the data is the arbitrary choice of the starting year. Indeed, the Times chart goes back to 2000, another arbitrary choice. But clearly, the two teams were aiming to address slightly different variations of the key question.

There is a design advantage to anchoring the data. The Times chart is noticeably more entangled than the Post chart. There are tons more criss-crossing. This is particularly glaring given that the Times chart contains many fewer lines than the Post chart, due to state versus county.

Anchoring the data to a starting year has the effect of combing one's unruly hair. Mathematically, they are just shifting the lines so that they start at the same location, without altering the curvature. Of course, this is double-edged: the re-centering means the left-blue / right-red interpretation is co-opted.

On the Times chart, they used a different coping strategy. Each version of their charts has a filter, highlighting a set of lines to demonstrate different vignettes: the swing states moved slightly to the right, the Republican states marched right, and the Democratic states also moved right. Without these filters, the readers would be wincing at the Times's bad-hair day.

***

Another decision worth noting: the direction of time. The Post's choice of top to bottom seems more natural to me than the Times's reverse order but I am guessing some of you may have different inclinations.

Finally, what about the thickness of the lines? The Post encoded population (voter) size while the Times used electoral votes. This decision is partly driven by the choice of state versus county level data.

One can consider electoral votes as a kind of log transformation. The effect of electorizing the popular vote is to pull the extreme values to the center. This significantly simplifies the designer's life. To wit, in the Post chart (shown below), they have to apply a filter to highlight key counties, and you notice that those lines are so thick that all the other counties become barely visible.

Wp_trollhair_texas

22 Nov 23:54

NASA’s EM-drive still a WTF-thruster

by Chris Lee

(credit: Aurich Lawson / Thinkstock)

For the past several years, a few corners of the Internet have sporadically lit up with excitement about a new propulsion system, which I'll call the WTF-thruster. The zombie incarnation of the EM-drive has all the best features of a new technology: it generally violates well-established physical principles, there is a badly outlined suggestion for how it might work, and the data that ostensibly demonstrates that it does work is both sparse and inadequately explained.

The buzz returned this week, as the group behind the EM-drive has published a paper describing tests of its operation.

Before getting into the paper, let me step back a bit to set the scene. I am not automatically rejecting the authors' results. I am not even rejecting the possibility that this study may hint at new physics. I am saying that before I will take that possibility seriously, I have to be convinced that the data cannot be explained by the current laws of physics. And currently, I am not convinced. In fact, I am very frustrated by the lack of detail.


12 Nov 16:52

You don't usually look to subway walls for inspiration

by adamg

When Venita Subramanian heard about some artists in New York who put out markers and sticky notes so subway riders could write up inspirational messages on a station wall, she thought Boston should do the same thing.

So Subramanian, a designer at a local company, and some friends put out plenty of markers and notes on the southbound platform of the Red Line at Park Street Friday night. And for three hours, people getting off or waiting for trains crowded along the wall, writing their own messages and reading those others had left.

Boston has America's back

A boy and his note:

Inspirational notes

Lizi Bennett also took note of the notes:

Inspirational notes

12 Nov 14:52

Election surprise, and Three ways of thinking about probability

by Andrew

adorable

Background: Hillary Clinton was given a 65% or 80% or 90% chance of winning the electoral college. She lost.

Naive view: The poll-based models and the prediction markets said Clinton would win, and she lost. The models are wrong!

Slightly sophisticated view: The predictions were probabilistic. 1-in-3 events happen a third of the time. 1-in-10 events happen a tenth of the time. Polls have nonsampling error. We know this, and the more thoughtful of the poll aggregators included this in their model, which is why they were giving probabilities in the range 65% to 90%, not, say, 98% or 99%.

More sophisticated view: Yes, the probability statements are not invalidated by the occurrence of a low-probability event. But we can learn from these low-probability outcomes. In the polling example, yes an error of 2% is within what one might expect from nonsampling error in national poll aggregates, but the point is that nonsampling error has a reason: it’s not just random. In this case it seems to have arisen from a combination of differential nonresponse, unexpected changes in turnout, and some sloppy modeling choices. It makes sense to try to understand this, not to just say that random things happen and leave it at that.

This also came up in our discussions of betting markets’ failures on Trump in the Republican primaries, Leicester City, and Brexit. Dan Goldstein correctly wrote that “Prediction markets have to occasionally ‘get it wrong’ to be calibrated,” but, once we recognize this, we should also, if possible, do what the plane-crash investigators do: open up the “black box” and try to figure out what went wrong that could’ve been anticipated.
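
As a quick illustration of what calibration means here (my own toy simulation, not Goldstein's analysis): a forecaster who assigns 70% to a class of events should see roughly 70% of them occur, which necessarily means about 3 in 10 such forecasts will look "wrong" after the fact.

```python
import numpy as np

# Toy calibration check: events that truly occur 70% of the time, forecast at 70%.
rng = np.random.default_rng(0)
stated_probability = 0.7
outcomes = rng.random(10_000) < stated_probability  # simulate the events
print("stated probability:", stated_probability)
print("observed frequency:", outcomes.mean())       # ~0.7, yet ~30% of forecasts "missed"
```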

Hindsight gets a bad name but we can learn from our failures and even from our successes—if we look with a critical eye and get inside the details of our forecasts rather than just staring at probabilities.

The post Election surprise, and Three ways of thinking about probability appeared first on Statistical Modeling, Causal Inference, and Social Science.

11 Nov 20:23

Spotify is writing massive amounts of junk data to storage drives

by Dan Goodin

SSD modules like this one are being abused by Spotify. (credit: iFixit)

For almost five months—possibly longer—the Spotify music streaming app has been assaulting users' storage devices with enough data to potentially take years off their expected lifespans. Reports of tens or in some cases hundreds of gigabytes being written in an hour aren't uncommon, and occasionally the recorded amounts are measured in terabytes. The overload happens even when Spotify is idle and isn't storing any songs locally.

The behavior poses an unnecessary burden on users' storage devices, particularly solid state drives, which come with a finite amount of write capacity. Continuously writing hundreds of gigabytes of needless data to a drive every day for months or years on end has the potential to cause an SSD to die years earlier than it otherwise would. And yet, Spotify apps for Windows, Mac, and Linux have engaged in this data assault since at least the middle of June, when multiple users reported the problem in the company's official support forum.

"This is a *major* bug that currently affects thousands of users," Spotify user Paul Miller told Ars. "If for example, Castrol Oil lowered your engine's life expectancy by five to 10 years, I imagine most users would want to know, and that fact *should* be reported on."


19 Aug 16:04

Stealing bitcoins with badges: How Silk Road’s dirty cops got caught

by Ars Staff

(credit: Aurich Lawson)

DEA Special Agent Carl Force wanted his money—real cash, not just numbers on a screen—and he wanted it fast.

It was October 2013, and Force had spent the past couple of years working on a Baltimore-based task force investigating the darknet's biggest drug site, Silk Road. During that time, he had also carefully cultivated several lucrative side projects all connected to Bitcoin, the digital currency Force was convinced would make him rich.

One of those schemes had been ripping off the man who ran Silk Road, "Dread Pirate Roberts." That plan was now falling apart. As it turns out, the largest online drug market in history had been run by a 29-year-old named Ross Ulbricht, who wasn’t as safe behind his screen as he imagined he was. Ulbricht had been arrested earlier that month in the San Francisco Public Library by federal agents with their guns drawn.


19 Aug 14:06

Shorter-range electric cars meet the needs of almost all US drivers

by Jonathan M. Gitlin

(credit: Nissan)

The vast majority of American drivers could switch to battery electric vehicles (BEVs) tomorrow and carry on with their lives unaffected, according to a new study in Nature Energy. What's more, those BEVs need not be a $100,000 Tesla, either. That's the conclusion from a team at MIT and the Santa Fe Institute in New Mexico that looked at the potential for BEV adoption in the US in light of current driving patterns. Perhaps most interestingly, the study found that claim to be true for a wide range of cities with very distinct geography and even per-capita gasoline consumption.

The authors—led by MIT's Zachary Needell—used the Nissan Leaf as their representative vehicle. The Leaf is one of the best-selling BEVs on the market, second only to the Tesla Model S in 2015 (10,990 sold vs 13,300 Teslas). But it's not particularly long-legged; although the vehicle got an optional battery bump from 24kWh to 30kWh for 2016, its quoted range is 107 miles on a full charge. You don't need to spend long browsing comment threads or car forums to discover that many drivers think this is too short a range for their particular use cases. Yet, Needell and colleagues disagree.

The authors use the 24kWh Nissan Leaf as the basis for their calculations, building a probabilistic model of BEV range from driving behavior (rather than just comparing average commute distances with BEV range). This involved using information from the National Household Travel Survey, hourly temperature data for 16 US cities, and GPS data from travel surveys in California, Atlanta, and Houston (to calculate second-by-second speed profiles of different trip types).


07 Aug 18:08

Don’t believe the bounce

by Andrew

image001

Alan Abramowitz sent us the above graph, which shows the results from a series of recent national polls, for each plotting Hillary Clinton’s margin in support (that is, Clinton minus Trump in the vote-intention question) vs. the Democratic Party’s advantage in party identification (that is, percentage Democrat minus percentage Republican).

This is about as clear a pattern as you’ll ever see in social science: Swings in the polls are driven by swings in differential nonresponse. After the Republican convention, Trump supporters were stoked, and they were more likely to respond to surveys. After the Democratic convention, the reverse: Democrats were more likely to respond, driving Clinton up in the polls.

David Rothschild and I have the full story up at Slate:

You sort of know there is a convention bounce that you should sort of ignore, but why? What’s actually in a polling bump? The recent Republican National Convention featured conflict and controversy and one very dark acceptance speech—enlivened by some D-list celebrities (welcome back Chachi!)—but it was still enough to give nominee Donald Trump a big, if temporary, boost in many polls. This swing, which occurs predictably in election after election, is typically attributed to the persuasive power of the convention, with displays of party unity persuading partisans to vote for their candidate and cross-party appeals coaxing over independents and voters of the other party.

Recent research, however, suggests that swings in the polls can often be attributed not to changes in voter intention but in changing patterns of survey nonresponse: What seems like a big change in public opinion turns out to be little more than changes in the inclinations of Democrats and Republicans to respond to polls. We learned this from a study we performed [with Sharad Goel and Wei Wang] during the 2012 election campaign using surveys conducted on the Microsoft Xbox. . . .

Our Xbox study showed that very few respondents were changing their vote preferences—less than 2 percent during the final month of the campaign—and that most, fully two-thirds, of the apparent swings in the polls (for example, a big surge for Mitt Romney after the first debate) were explainable by swings in the percentages of Democrats and Republicans responding to the poll. This nonresponse is very loosely correlated with likeliness to vote but mainly reflects passing inclinations to participate in polling. . . . large and systematic changes in nonresponse had the effect of amplifying small changes in actual voter intention. . . .

[See this paper, also with Doug Rivers, with more, including supporting information from other polls.]

We can apply these insights to the 2016 convention bounces. For example, Reuters/Ipsos showed a swing from a 15-point Clinton lead on July 14 to a 2-point Trump lead on July 27. Who was responding in these polls? The pre-convention survey saw 53 percent Democrats, 38 percent Republican, and the rest independent or supporters of other parties. The post-convention respondents looked much different, at 46 percent Democrat, 43 percent Republican. The 17-point swing in the horse-race gap came with a 12-point swing in party identification. Party identification is very stable, and there is no reason to expect any real swings during that period; thus, it seems that about two-thirds of the Clinton-Trump swing in the polls comes from changes in response rates. . . .

Read the whole thing.

The political junkies among you have probably been seeing all sorts of graphs online showing polls and forecasts jumping up and down. These calculations typically don’t adjust for party identification (an idea we wrote about back in 2001, but without realizing the political implications that come from systematic, rather than random, variation in nonresponse) and thus can vastly overestimate swings in preferences.
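
As a rough sketch of what "adjusting for party identification" can look like in practice (a toy reweighting with invented numbers, not the method used in the papers above or by any particular pollster), you can weight each party-ID group so that the sample's party mix matches a stable target before computing the margin:

```python
# Toy party-ID adjustment: reweight a poll so its party-ID mix matches a stable
# target split. All numbers below are invented for illustration.
target = {"D": 0.48, "R": 0.42, "I": 0.10}           # assumed stable party-ID split
sample = {"D": 0.46, "R": 0.43, "I": 0.11}           # observed respondent shares
clinton_support = {"D": 0.90, "R": 0.08, "I": 0.45}  # support within each group
trump_support   = {"D": 0.07, "R": 0.88, "I": 0.40}

raw_margin = sum(sample[g] * (clinton_support[g] - trump_support[g]) for g in sample)
adj_margin = sum(target[g] * (clinton_support[g] - trump_support[g]) for g in target)
print(f"raw margin: {raw_margin:+.3f}   party-ID-adjusted margin: {adj_margin:+.3f}")
```

With these made-up numbers, a modest shortfall of Democratic respondents moves the raw margin by a couple of points, which is exactly the kind of artifact the adjustment is meant to remove.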

The post Don’t believe the bounce appeared first on Statistical Modeling, Causal Inference, and Social Science.

29 Apr 21:21

“Well that escalated quickly.”

by Nathan Yau

Well that escalated quickly

This is the trend line from Google Trends for “Well that escalated quickly.” I know there’s a perfectly good reason for the sudden rise, but I prefer not to know. [via @HPS_Vanessa]

The same goes for this one for Mount Everest:

Mount Everest

[Thanks, @PabloMartinezAlmeida]

Tags: Google Trends, humor

24 Apr 12:25

What is the “true prior distribution”? A hard-nosed answer.

by Andrew

The traditional answer is that the prior distribution represents your state of knowledge, that there is no “true” prior. Or, conversely, that the true prior is an expression of your beliefs, so that different statisticians can have different true priors. Or even that any prior is true by definition, in representing a subjective state of mind.

I say No to all that.

I say there is a true prior, and this prior has a frequentist interpretation.

1. The easy case: the prior for an exchangeable set of parameters in a hierarchical model

Let’s start with the easy case: you have a parameter that is replicated many times, the 8 schools or the 3000 counties or whatever. Here, the true prior is the actual population distribution of the underlying parameter, under the “urn” model in which the parameters are drawn from a common distribution. Sure, it’s still a model, but it’s often a reasonable model, in the same sense that a classical (non-hierarchical) regression has a true error distribution.
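
A minimal simulation of this "urn" picture, with invented hyperparameters rather than any real dataset: parameters are drawn from a population distribution, that population distribution is the true prior, and using it gives the familiar normal-normal shrinkage.

```python
import numpy as np

# Illustrative "urn" model: effects theta_j drawn from a population distribution
# N(mu, tau^2); that population distribution is the true prior for each theta_j.
rng = np.random.default_rng(1)
mu, tau, sigma = 5.0, 3.0, 5.0       # invented hyperparameters and data noise
J = 3000                             # e.g., 3000 counties
theta = rng.normal(mu, tau, size=J)  # true underlying effects
y = rng.normal(theta, sigma)         # one noisy estimate per unit

# Posterior mean for each theta_j under the true prior (normal-normal shrinkage):
shrink = tau**2 / (tau**2 + sigma**2)
theta_hat = mu + shrink * (y - mu)
print("shrinkage factor:", round(shrink, 3))
print("rmse of raw estimates:", np.sqrt(np.mean((y - theta) ** 2)))
print("rmse of shrunken estimates:", np.sqrt(np.mean((theta_hat - theta) ** 2)))
```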

2. The hard case: the prior for a single parameter in a model (or for the hyperparameters in a hierarchical model)

OK, now for the more difficult problem in which there is a unitary parameter. Or a parameter vector, it doesn’t matter; the point is that there’s only one of it, it’s not part of a hierarchical model, and there’s no “urn” that it was drawn from.

In this case, we can understand the true prior by thinking of the set of all problems to which your model might be fit. This is a frequentist interpretation and is based on the idea that statistics is the science of defaults. The true prior is the distribution of underlying parameter values, considering all possible problems for which your particular model (including this prior) will be fit.

Here we are thinking of the statistician as a sort of Turing machine that has assumptions built in, takes data, and performs inference. The only decision this statistician makes is which model to fit to which data (or, for any particular model, which data to fit it to).

We’ll never know what the true prior is in this world, but the point is that it exists, and we can think of any prior that we do use as an approximation to this true distribution of parameter values for the class of problems to which this model will be fit.

3. The hardest case: the prior for a single parameter in a model that is only being used once

And now we come to the most challenging setting: a model that is only used once. For example, we’re doing an experiment to measure the speed of light in a vacuum. The prior for the speed of light is the prior for the speed of light; there is no larger set of problems for which this is a single example.

My short answer is: for a model that is only used once, there is no true prior.

But I also have a long answer, which is that in many cases we can use a judicious transformation to embed this problem into a larger class of exchangeable inference problems. For example, we consider all the settings where we’re trying to estimate some physical constant from experiment and prior information from the literature. We summarize the literature by a N(mu_0, sigma_0) prior. In this case we can think of the inputs to the inference as being mu_0, sigma_0, and the experimental data, in which case the repeated parameter is the prediction error. And, indeed, that is typically how we think of such measurement problems.

For another example, what’s our prior probability that Hillary Clinton will be elected president in November? We can put together what information we have, fit a model, and get a predictive probability. Or even just use the published betting odds, but in either case we are thinking of this election as one of a set of examples for which we would be making such predictions.

What does this do for us?

OK, fine, you might say. But so what? What is gained by thinking of a “true prior” instead of considering each user’s prior as a subjective choice?

I see two benefits. First, the link to frequentist statistics. I see value in the principle of understanding statistical methods through their average properties, and I think the approach described above is the way to bring Bayesian methods into the fold. It’s unreasonable in general to expect a procedure to give the right answer conditional on the true unknown value of the parameter, but it does seem reasonable to try to get the right answer when averaging over the problems to which the model will be fit.

Second, I like the connection to hierarchical models, because in many settings we can think about a parameter of interest as being part of a batch, as in the examples we’ve been talking about recently, of modeling all the forking paths at once. In which case the true prior is the distribution of all these underlying effects.

The post What is the “true prior distribution”? A hard-nosed answer. appeared first on Statistical Modeling, Causal Inference, and Social Science.

19 Apr 18:22

NWS forecasts will ditch all-caps format starting May 11—DON’T PANIC

by Sam Machkovech

Even by 1991, the National Weather Service's all-caps requirement felt dated. We're still waiting, but the mixed-case change will finally arrive in May of this year. (credit: National Weather Service)

After upgrading its supercomputing core in 2015, the National Weather Service is continuing its lumbering slog toward modern systems in a far different way: by saying goodbye to teletype.

After more than two decades of trying, the NWS has finally made every upgrade needed in both the hardware and software chain to remove the all-caps requirement from forecasts and other warnings. The service's Monday announcement kicked off a 30-day transition period so that customers and subscribers can prepare for the switch to mixed-case lettering in all NWS announcements, meaning we'll see the change begin to propagate on May 11.

All-caps messaging was previously required due to the NWS' reliance on teletype machines, which broadcast their text over phone lines and weren't built to distinguish between uppercase and lowercase letters. In addition to removing teletype machines from the information chain, the NWS also had to upgrade its AWIPS 2 software system across the board to recognize mixed-case submissions.


29 Mar 15:01

Racist troll says he sent white supremacist flyers to public printers at colleges

by Sean Gallagher

Andrew "Weev" Auernheimer in 2012. Auernheimer told the New York Times he was behind a wave of racist print jobs that hit universities across the US. (credit: pinguino k)

Public networked printers at a number of universities were part of the target pool of a massive print job sent out by hacker and Internet troll Andrew "Weev" Auernheimer. At least seven universities were among those that printed out flyers laden with swastikas and a white-supremacist message.

Since Auernheimer merely sent printouts to the printers and didn't actually do anything to gain access to the printers that would fall into the realm of unauthorized access, it's unlikely that he will be prosecuted in any way. Auernheimer exploited the open nature of university networks to send print jobs to the networked printers, which in some cases were deliberately left open to the Internet to allow faculty and students to print documents remotely. These printers could easily be found with a network scan of public Internet addresses.

The New York Times reports that the flyers were printed at Princeton University, University of California-Berkeley, University of Massachusetts-Amherst, Brown University, Smith College, and Mount Holyoke College, as well as others. Auernheimer took credit for the printouts in an interview with the Times, saying that he had not specifically targeted the universities but had sent the flyer print job to every publicly accessible printer in North America.


09 Mar 15:11

Bruised and battered, I couldn’t tell what I felt. I was ungeneralizable to myself.

by Andrew

surely

One more rep.

The new thing you just have to read, if you’re following the recent back-and-forth on replication in psychology, is this post at Retraction Watch in which Nosek et al. respond to criticisms from Gilbert et al. regarding the famous replication project.

Gilbert et al. claimed that many of the replications in the replication project were not very good replications at all. Nosek et al. dispute that claim.

And, as I said, you’ll really want to read the details here. They’re fascinating, and they demonstrate how careful the replication team really was.

When reading all this debate, it could be natural as an outsider to want to wash your hands of the whole thing, to say that it’s all a “food fight,” why can’t scientists be more civil, etc. But . . . the topic is important. These people all care deeply about the methods and the substance of psychology research. It makes sense for them to argue and to get annoyed if they feel that important points are being missed. In that sense I have sympathy for all sides in this discussion, and I don’t begrudge anyone their emotions. It’s also good for observers such as Uri Simonsohn, Sanjay Srivastava, Dorothy Bishop, and myself to give our perspectives. Again, there are real issues at stake here, and there’s nothing wrong—nothing wrong at all—with people arguing about the details while at the same time being aware of the big picture.

Before sharing Nosek et al.’s amazing, amazing story, I’ll review where we are so far.

Background and overview

As most of you are aware (see here and here), there is a statistical crisis in science, most notably in social psychology research but also in other fields. For the past several years, top journals such as JPSP, Psych Science, and PPNAS have published lots of papers that have made strong claims based on weak evidence. Standard statistical practice is to take your data and work with it until you get a p-value of less than .05. Run a few experiments like that, attach them to a vaguely plausible (or even, in many cases, implausible) theory, and you got yourself a publication. Give it a bit more of a story and you might get yourself on Ted, NPR, Gladwell, and so forth.

The claims in all those wacky papers have been disputed in three, mutually supporting ways:

1. Statistical analysis shows how it is possible—indeed, easy—to get statistical significance in an uncontrolled study in which rules for data inclusion, data coding, and data analysis are determined after the data have been seen. Simmons, Nelson, and Simonsohn called it “researcher degrees of freedom” and Eric Loken and I called it “the garden of forking paths.” It’s sometimes called “fishing” or “p-hacking” but I don’t like those terms as they can be taken to imply that researchers are actively cheating.

Researchers do cheat, but we don’t have to get into that here. If someone reports a wrong p-value that just happens to be below .05, when the correct calculation would give a result above .05, or if someone claims that a p-value of .08 corresponds to a weak effect, or if someone reports the difference between significant and non-significant, I don’t really care if it’s cheating or just a pattern of sloppy work.
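
To see the mechanism behind point 1, here is a toy simulation of my own (the setup, numbers, and analysis choices are invented, not taken from the cited papers): with no true effect at all, an analyst who is free to pick among a few outcomes and an ad hoc subgroup finds some p < .05 far more than 5% of the time.

```python
import numpy as np
from scipy import stats

# Toy forking-paths simulation: the data are pure noise, but the analyst may
# report any of several outcomes and may analyze either the full sample or a
# post hoc subgroup. All choices here are illustrative assumptions.
rng = np.random.default_rng(2)
n, n_outcomes, n_sims = 50, 4, 2000
hits = 0
for _ in range(n_sims):
    treat = rng.integers(0, 2, size=n).astype(bool)  # treatment assignment
    age = rng.normal(size=n)                         # an irrelevant covariate
    y = rng.normal(size=(n, n_outcomes))             # several null outcomes
    found = False
    for k in range(n_outcomes):
        for subset in (np.ones(n, bool), age > 0):   # full sample or subgroup
            a = y[subset & treat, k]
            b = y[subset & ~treat, k]
            if stats.ttest_ind(a, b).pvalue < 0.05:
                found = True
    hits += found
print("share of null datasets yielding some p < .05:", hits / n_sims)
```

The point is not that anyone runs exactly these eight tests; it is that data-dependent choices of this kind, made honestly and one at a time, are enough to produce "significant" findings from noise.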

2. People try to replicate these studies and the replications don’t show the expected results. Sometimes these failed replications are declared to be successes (as in John Bargh’s notorious quote, “There are already at least two successful replications of that particular study . . . Both articles found the effect but with moderation by a second factor” [actually a different factor in each experiment]), other times they are declared to be failures (as in Bargh’s denial of the relevance of another failed replication which, unlike the others, was preregistered). The silliest of all these was Daryl Bem counting as successful replications several non-preregistered studies which were performed before his original experiment (anything’s legal in ESP research, I guess), and the saddest, from my perspective, came from the ovulation-and-clothing researchers who replicated their own experiment, failed to find the effect they were looking for, and then declared victory because they found a statistically significant interaction with outdoor temperature. That last one saddened me because Eric Loken and I repeatedly advised them to rethink their paradigm but they just fought fought fought and wouldn’t listen. Bargh I guess is beyond redemption, so much of his whole career is at stake, but I was really hoping those younger researchers would be able to break free of their statistical training. I feel so bad partly because this statistical significance stuff is how we all teach introductory statistics, so I, as a representative of the statistics profession, bear much of the blame for these researchers’ misconceptions.

Anyway, back to the main thread, which concerns the three reasons above why it’s ok not to believe in power pose or so many of these other things that you used to read about in Psychological Science.

Here’s the final reason:

3. In many cases there is prior knowledge or substantive theory that the purported large effects are highly implausible. This is most obvious in the case of that ESP study or when there are measurable implications in the real world, for example in that paper that claimed that single women were 20 percentage points more likely to support Obama for president during certain times of the month, or in areas of education research where there is “the familiar, discouraging pattern . . . small-scale experimental efforts staffed by highly motivated people show effects. When they are subject to well-designed large-scale replications, those promising signs attenuate and often evaporate altogether.”

Item 3 rarely stands on its own—researchers can come up with theoretical justifications for just about anything, and indeed research is typically motivated by some theory. Even if I and others might be skeptical of a theory such as embodied cognition or himmicanes, that skepticism is in the eye of the beholder, and even a prior history of null findings (as with ESP) is no guarantee of future failure: again, the researchers studying these things have new ideas all the time. Just cos it wasn’t possible to detect a phenomenon or solve a problem in the past, that doesn’t mean we can’t make progress: scientists do, after all, discover new planets in the sky, cures for certain cancers, cold fusion, etc.

So if my only goal here were to make an ironclad case against certain psychology studies, I might very well omit item 3 as it could distract from my more incontestable arguments. My goal here, though, is scientific not rhetorical, and I do think that theory and prior information should and do inform our understanding of new claims. It’s certainly relevant that in none of these disputed cases is the theory strong enough on its own to hold up a claim. We’re disputing power pose and fat-arms-and-political-attitudes, not gravity, electromagnetism, or evolution.

Putting the evidence together

For many of these disputed research claims, statistical reasoning (item 1 above) is enough for me to declare Not Convinced and move on, but empirical replication (item 2) is also helpful in convincing people. For example, Brian Nosek was convinced by his own 50 Shades of Gray experiment. There’s nothing like having something happen to you to really make it real. And theory and prior experience (item 3) tells us that we should at least consider the possibility that these claimed effects are spurious.

OK, so here we are. 2016. We know the score. A bunch of statistics papers on why “p less than .05” implies so much less than we used to think, a bunch of failed replications of famous papers, a bunch of re-evaluations of famous papers revealing problems with the analysis, researcher degrees of freedom up the wazz, miscalculated p-values, and claimed replications which, when looked at carefully, did not replicate the original claims at all.

This is not to say that all or even most of the social psychology papers in Psychological Science are unreplicable. Just that many of them are, as (probabilistically) shown either directly via failed replications or statistically through a careful inspection of the evidence.

Given everything written above, I think it’s unremarkable to claim that Psychological Science, PPNAS, etc., have been publishing a lot of papers with fatal statistical weaknesses. It’s sometimes framed as a problem of multiple comparisons but I think the deeper problem is that people are studying highly variable and context-dependent effects with noisy research designs and often with treatments that seem almost deliberately designed to be ineffective (for example, burying key cues inside of a word game; see here for a quick description).

So, I was somewhat distressed to read this from a recent note by Gilbert et al., taking no position on whether “some of the surprising results in psychology are theoretical nonsense, knife-­edged, p-­hacked, ungeneralizable, subject to publication bias, and otherwise unlikely to be replicable or true” (see P.S. here).

I could see the virtue of taking an agnostic position on any one of these disputed public claims: Maybe women really are three times more likely to wear red during days 6-14 of their cycle. Maybe elderly-related words really do make people walk more slowly. Maybe Cornell students really do have ESP? Maybe obesity really is contagious? Maybe himmicanes really are less dangerous than hurricanes. Maybe power pose really does help you. Any one of these claims might well be true: even if you study something in such a noisy way that your data are close to useless, even if your p-values mean nothing at all, you could still have a solid underlying theory and have got lucky with your data. So it might seem like a safe position to keep an open mind on any of these claims.

But to take no position on whether some of these “surprising results” have problems? That’s agnosticism taken to a bit of an extreme.

If they do take this view, I hope they’ll also take no position on the following claims which are supported just about as well from the available data: that women are less likely to wear red during days 6-14 of their cycle, that elderly-related words make people walk faster, that Cornell students have an anti-ESP which makes them consistently give bad forecasts (thus explaining that old hot-hand experiment), that obesity is anti-contagious and when one of your friends gets fat, you go on a diet, etc.

Let’s keep an open mind about all these things. I, for one, am looking forward to the Ted talks on the “coiled snake” pose and on the anti-contagion of obesity.

The new story

OK, now you should go here and read the story from Brian Nosek and Elizabeth Gilbert (no relation to the Daniel Gilbert of “Gilbert et al.” discussed above). They take one of the criticisms from Gilbert et al., who purported to show how unfaithful one of the replications was, and carefully and systematically describe the study, the replication, and why the criticism of Gilbert et al. was at best sloppy and misinformed, and at worst a rabble-rousing, misleading bit of rhetoric. As I said, follow the link and read the story. It’s stunning.

In a way it doesn’t really matter, but given the headlines such as “Researchers overturn landmark study on the replicability of psychological science” (that was from Harvard’s press release; I was going to say I’d expect better from that institution where I’ve studied and taught, but it’s not fair to blame the journalist who wrote the press release; he was just doing his job), I’m glad Nosek and E. Gilbert went to the trouble to explain this to all of us.

P.S. I’m about as tired of writing about all this as you are of reading about it. But in this case I thought the overview (in particular, separating items 1, 2, and 3 above) would help. The statistical analysis and the empirical replication studies reinforce each other: the statistics explains how those gaudy p-values could be obtained even in the absence of any real and persistent effect, and the empirical replications are convincing to people who might not understand the statistics.

P.P.S. I just noticed that the Harvard press release featuring Gilbert et al. also says that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.”

100%, huh? Maybe just to be on the safe side you should call it 99.9% so you don’t have to believe that the Cornell ESP study replicates.

What a joke. Surely you can’t be serious. Why didn’t you just say “Statistically indistinguishable from 200%”—that would sound even better!

The post Bruised and battered, I couldn’t tell what I felt. I was ungeneralizable to myself. appeared first on Statistical Modeling, Causal Inference, and Social Science.

04 Mar 14:39

Evaluating a new critique of the Reproducibility Project – The Hardest Science

04 Mar 14:39

More on replication crisis

by Andrew

cham

The replication crisis in social psychology (and science more generally) will not be solved by better statistics or by preregistered replications. It can only be solved by better measurement.

Let me say this more carefully. I think that improved statistics and preregistered replications will have very little direct effect on improving psychological science, but they could have important indirect effects.

Why no big direct effects? Cos if you’re studying a phenomenon that’s tiny, or that is so variable that any main effects will be swamped by interactions, with the interaction changing in each scenario where it’s studied, then better statistics and preregistered replications will just reveal what we already know, which is that existing experimental results say almost nothing about the size, direction, and structure of these effects.

I’m thinking here of various papers we’ve discussed here over the years, examples such as the studies of political moderation and shades of gray, or power pose, or fat arms and political attitudes, or ovulation and vote preference, or ovulation and clothing, or beauty and sex ratios, or elderly-related words and walking speed, or subliminal smiley faces and attitudes toward immigration, or ESP in college students, or baseball players with K in their names being more likely to strike out, or brain scans and political orientation, or the Bible Code, or . . .

Let me put it another way. Lots of the studies that we criticize don’t just have conceptual problems, they have very specific statistical errors—for example, the miscalculated test statistics in Amy Cuddy’s papers, where p-values got shifted below the .05 threshold—and they disappear under attempted replications. But this doesn’t mean that, if these researchers did better statistics or if they routinely replicated, that they’d be getting stronger conclusions. Rather, they’d just have to give up their lines of research, or think much harder about what they’re studying and what they’re measuring.

It could be, however, that improved statistical analysis and preregistered replications could have a positive indirect effect on such work: If these researchers knew ahead of time that their data would be analyzed correctly, and that outside teams would be preparing replications, they might be less willing to stake their reputations on shaky findings.

Think about Marc Hauser: had he been expected ahead of time to make all his monkey videotapes available for the world to see, he would’ve had much less motivation to code them the way he did.

So, yes, I think the prospect of reanalysis of existing data, and replication of studies, concentrates the mind wonderfully.

But . . . all the analysis and replication in the world won’t save you, if what you’re studying just isn’t there, or if any effects are swamped by variation.

That’s why, in my long blog conversations with the ovulation-and-clothing researchers, I never suggested they do a preregistered replication. If they or anyone else wants to do such a replication, fine—so far, I know of two such replications, neither of which found the pattern claimed in the original study but each of which reported a statistically significant comparison on something new, i.e., par for the course—but it’s not something I’d recommend, because then I’d be recommending they waste their time. It’s the same reason I didn’t recommend that the beauty-and-sex-ratio guy gather more samples of size 3000. When your power is 6%, or 5.1%, or 5.01%, or whatever, to gather more data and look for statistical significance is at best a waste of time, at worst a way to confuse yourself with noise.
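To make the power point concrete, here is a minimal simulation (my own sketch, not from the post; the true effect, sample size, and noise level are made-up illustrations) of what the significance filter does in a study with roughly 5% power: the rare estimates that clear p less than .05 wildly exaggerate the true effect and frequently have the wrong sign.

    import numpy as np

    rng = np.random.default_rng(0)
    true_effect = 0.02              # tiny true effect, in sd units (assumed)
    n_per_group = 100               # small two-group study (assumed)
    se = np.sqrt(2 / n_per_group)   # s.e. of the difference in means, sd = 1 per group

    # draw the estimated difference directly from its sampling distribution
    estimates = rng.normal(true_effect, se, size=20_000)
    significant = np.abs(estimates / se) > 1.96

    print("power:", significant.mean())                          # about 0.05
    print("avg |estimate| when significant:",
          np.abs(estimates[significant]).mean(), "vs true effect", true_effect)
    print("share of significant estimates with the wrong sign:",
          (estimates[significant] < 0).mean())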

So . . . as I wrote a few months ago, doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements:

Performing more replicable studies is not just a matter of being more careful in your data analysis (although that can’t hurt) or increasing your sample size (although that, too, should only help) but also it’s about putting real effort into design and measurement. All too often I feel like I’m seeing the attitude that statistical significance is a win or a proof of correctness, and I think this pushes researchers in the direction of going the cheap route, rolling the dice, and hoping for a low p-value that can be published. But when measurements are biased, noisy, and poorly controlled, even if you happen to get that p less than .05, it won’t really be telling you anything.

With this in mind, let me speak specifically of the controversial studies in social priming and evolutionary psychology. One feature of many such studies is that the manipulations are small, sometimes literally imperceptible. Researchers often seem to go to a lot of trouble to do tiny things that won’t be noticed by the participants in the experiments. For example, flashing a smiley face on a computer screen for 39 milliseconds, or burying a few key words in a sham experiment. In other cases, manipulations are hypothesized to have a seemingly unlimited number of interactions with attitudes, relationship status, outdoor temperature, parents’ socioeconomic status, etc. Either way, you’re miles away from the large, stable effects you’d want to be studying if you want to see statistical regularity.

If effects are small, surrounded by variability, but important, then, sure, research them in large, controlled studies. Or go the other way and try to isolate large effects from big treatments. Swing some sledgehammers and see what happens. But a lot of this research has been going in the other direction, studying tiny interventions on small samples.

The work often “succeeds” (in the sense of getting statistical significance, publication in top journals, Ted talks, NPR appearances, etc.) but we know that can happen, what with the garden of forking paths and more.

So, again, in my opinion, the solution to the “replication crisis” is not to replicate everything or to demand that every study be replicated. Rather, the solution is more careful measurement. Improved statistical analysis and replication should help indirectly in reducing the motivation for people to perform analyses that are sloppy or worse, and reducing the motivation for people to think of empirical research as a sort of gambling game where you gather some data and then hope to get statistical significance. Reanalysis of data and replication of studies should reduce the benefit of sloppy science and thus shift the cost-benefit equation in the right direction.

Piss-poor omnicausal social science

One of my favorite blogged phrases comes from political scientist Daniel Drezner, when he decried “piss-poor monocausal social science.”

By analogy, I would characterize a lot of these unreplicable studies in social and evolutionary psychology as “piss-poor omnicausal social science.” Piss-poor because of all the statistical problems mentioned above—which arise from the toxic combination of open-ended theories, noisy data, and huge incentives to obtain “p less than .05,” over and over again. Omnicausal because of the purportedly huge effects of, well, just about everything. During some times of the month you’re three times more likely to wear red or pink—depending on the weather. You’re 20 percentage points more likely to vote Republican during those days—unless you’re single, in which case you’re that much more likely to vote for a Democrat. If you’re a man, your political attitudes are determined in large part by the circumference of your arms. An intervention when you’re 4 years old will increase your earnings by 40%, twenty years down the road. The sex of your baby depends on your attractiveness, on your occupation, on how big and tall you are. How you vote in November is decided by a college football game at the end of October. A few words buried in a long list will change how fast you walk—or not, depending on some other factors. Put this together, and every moment of your life you’re being buffeted by irrelevant stimuli that have huge effects on decisions ranging from how you dress, to how you vote, to where you choose to live, your career, even your success at that career (if you happen to be a baseball player). It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly. A world that, if it truly existed, would be much different from the world we live in.

A reporter asked me if I found the replication rate of various studies in psychology to be “disappointingly low.” I responded that yes it’s low, but is it disappointing? Maybe not. I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes were determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior, a world in which there are large ESP effects just waiting to be discovered. I’m glad that this fad in social psychology may be coming to an end, so in that sense, it’s encouraging, not disappointing, that the replication rate is low. If the replication rate were high, then that would be cause to worry, because it would imply that much of what we know about the world would be wrong. Meanwhile, statistical analysis (of the sort done by Simonsohn and others), and lots of real-world examples (as discussed on this blog and elsewhere) have shown us how it is that researchers could continue to find “p less than .05” over and over again, even in the absence of any real and persistent effects.

The time-reversal heuristic

A couple more papers on psychology replication came in the other day. They were embargoed until 2pm today which is when this post is scheduled to appear.

I don’t really have much to say about the two papers (one by Gilbert et al., one by Nosek et al.). There’s some discussion about how bad is the replication crisis in psychology research (and, by extension, in many other fields of science), and my view is that it depends on what is being studied. The Stroop effect replicates. Elderly-related-words priming, no. Power pose, no. ESP, no. Etc. The replication rate we see in a study-of-studies will depend on the mix of things being studied.

Having read the two papers, I pretty much agree with Nosek et al. (see Sanjay Srivastava for more on this point), and the only thing I’d like to add is to remind you of the time-reversal heuristic for thinking about a published paper followed by an unsuccessful replication:

One helpful (I think) way to think about such an episode is to turn things around. Suppose the attempted replication experiment, with its null finding, had come first. A large study finding no effect. And then someone else runs a replication under slightly different conditions with a much smaller sample size and finds statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any such phenomenon is fragile.

From this point of view, what the original claim has going for it is that (a) statistical significance was obtained in an uncontrolled setting, (b) it was published in a peer-reviewed journal, and (c) this paper came before, rather than after, the attempted replication. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take the apparently successful result as the starting point in our discussion, just because it was published first.

P.S. More here: “Replication crisis crisis: Why I continue in my ‘pessimistic conclusions about reproducibility”

The post More on replication crisis appeared first on Statistical Modeling, Causal Inference, and Social Science.

03 Mar 14:58

Gray’s Anatomy may have been largely plagiarized, written by a scoundrel

by Beth Mole

The history of Gray's famous text may reveal the anatomy of a jerk. (credit: Public Domain)

Gray’s Anatomy is easily recognized worldwide as one of the most revered and influential medical texts of all time. But a closer examination of its medical history turns up tales of a disgraceful birth and hints that its author, Henry Gray, may have been a bit of a fraudster.

Henry Gray, author of Gray's Anatomy. (credit: H. Pollock)

Notes, publications, and diary entries from Gray’s colleagues suggest that the famous author may have plagiarized numerous passages of the text and was pushy, cut-throat, and resented, a new commentary piece in the journal Clinical Anatomy argues. While the allegations are not new, one researcher claims to have fresh data that refutes them, urging a renewed dissection of Gray’s character and actions.

The commentary’s author, Ruth Richardson, a medical historian and visiting scholar at King’s College London, wrote about Gray’s alleged cheating ways in her 2008 book, The Making of Mr. Gray’s Anatomy. But at a 2014 scientific conference, anatomy professor Brion Benninger, of the College of Osteopathic Medicine of the Pacific – Northwest, publicly announced that he and a colleague had carried out a computer analysis of the text and found no such evidence of plagiarism. He said that he intended to publish the analysis. But, in the year since, he has not produced any data.

Read 14 remaining paragraphs | Comments

04 Feb 23:29

Chief Justice sells at least $250K of Microsoft stock in advance of hearing

by Joe Mullin

Supreme Court Chief Justice John Roberts has sold between $250,000 and $500,000 worth of Microsoft stock, according to an Associated Press report out today. It's the largest single stock sale by anyone on the court in more than a decade.

The large stock sale is news in part because the high court agreed a few weeks ago to take a case involving alleged defects in Microsoft's Xbox 360 console. Assuming that Roberts sold all his Microsoft stock, that means he won't have to recuse himself from the case.

The last time Microsoft had a case in front of the Supreme Court was 2011, in which the software giant made a last-ditch attempt to fend off a patent claim brought by i4i, a small Canadian firm. Microsoft asked the court to reconsider the standard of proof used to invalidate patents, but the justices sided with i4i in an 8-0 vote, cementing the firm's $290 million payday. Roberts recused himself from that case.

Read 4 remaining paragraphs | Comments

04 Feb 23:28

5 easy pieces: How Deepmind mastered Go

by xcorr

Google Deepmind announced last week that it created an AI that can play professional-level Go. The game of Go has always been something of a holy grail for game AI, given its large branching factor and the difficulty of evaluating a position.

The new AI, called AlphaGo, beat the current European champion in October. It is scheduled to play in March against legendary player Lee Se-dol, widely considered one of the greatest players of all time, in a match reminiscent of the head-to-head between IBM’s Deep Blue and Garry Kasparov.

AlphaGo is a complex AI system with five separate learning components: three of the components are deep neural networks built with supervised learning; one is a deep neural net built with reinforcement learning; and the final piece of the puzzle is a multi-armed bandit-like algorithm that guides a tree search.

This might appear overwhelming, so in this post I decompose this complex AI into 5 easy pieces to help guide the technically inclined through the paper. An excellent introduction for the layman is here.

Finding the best moves

[Figure] Four different networks used by AlphaGo – from Silver et al. (2016)

There are two underlying principles that allow AlphaGo to play professional-level Go:

  1. If you have a great model for the next best move, you can get a pretty good AI
  2. If you can look ahead and simulate the outcome of your next move, you can get a really good AI

To attack the first problem, Deepmind trained a deep neural network – the policy network – to predict the best move for a given board position. They built a database of 30 million moves played by human experts and trained a 13-layer deep neural network to predict the next expert move based on the current board position.

This is similar to training a categorical deep neural network to classify images into different categories. The trained policy network could predict the best next move 57% of the time.
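As a toy illustration of that classification framing (my own sketch, not DeepMind's code; a single linear softmax layer stands in for the real 13-layer convolutional network), the training loop looks like ordinary multi-class classification with a cross-entropy loss:

    import numpy as np

    rng = np.random.default_rng(0)
    N_POINTS = 19 * 19                    # one feature plane; AlphaGo uses many more
    W = np.zeros((N_POINTS, N_POINTS))    # weights: board features -> move logits

    def policy(board_vec):
        """Softmax distribution over the 361 board points."""
        logits = board_vec @ W
        logits -= logits.max()            # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def sgd_step(board_vec, expert_move, lr=0.01):
        """One cross-entropy gradient step toward the expert's move."""
        global W
        p = policy(board_vec)
        grad = p.copy()
        grad[expert_move] -= 1.0          # d(cross-entropy)/d(logits)
        W -= lr * np.outer(board_vec, grad)

    # fake "expert" data, just to show the shape of the training loop
    for _ in range(1000):
        board_vec = rng.integers(-1, 2, N_POINTS).astype(float)   # -1/0/+1 stones
        sgd_step(board_vec, expert_move=int(rng.integers(N_POINTS)))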

To strengthen this AI, they then pitted it against itself in a round of tournaments. They used policy gradient reinforcement learning (RL) to learn to better play Go. This led to a markedly better AI – the RL network beat the vanilla network more than 80% of the time.

This is enough to create a decent Go AI, and head-to-head matches against current state-of-the-art computer opponents showed a distinct advantage for the system: it won 85% of its games against the state-of-the-art open-source AI for Go, Pachi.

Going deeper

Training a network to play the very next best move at all times is surprisingly powerful, and rather straightforward, but it is not sufficient to beat professional-level players.

[Image] From Inception (Warner Brothers Pictures)

Chess and other adversarial two-player game AIs often work by looking at the game tree, focusing on the sequence of moves that maximizes (max) the expected outcome for the AI assuming the opponent plays his own best moves (mini); this is known as minimax tree search.

Minimax search can reveal lines of play that guarantee a win for the AI. It can also reveal when the opponent is guaranteed to win, in which case the AI resigns.

In other cases, however, it is simply too expensive to exhaustively search the game tree. The search tree grows approximately as M^N, where M is the number of moves available per turn, and N is the number of total turns in a game.

When the search goes too deep, the game tree is truncated, and a predictive algorithm assigns a value to the position – instead of assigning a value of +1 for a sure win and 0 for a loss, it might assign a value of +0.8 for a pretty sure win, for instance.

In chess, for example, one might assign a value of +1 to a checkmate, while an uncertain position would be scored based on the value of the pieces on the board. If we think that pawns have a value of +1, bishops +3, rooks +5, etc., and A stands for the AI and O for its opponent, the value of a position might be scored as:

p(win) = \sigma(\alpha \cdot (P_A - P_O + 3 B_A - 3 B_O + 5 R_A - 5 R_O \ldots))
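A minimal sketch of that kind of hand-rolled evaluation (the piece values and the scaling constant alpha here are conventional placeholders, not taken from the paper):

    import math

    PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

    def p_win(my_pieces, opp_pieces, alpha=0.5):
        """Material balance squashed through a sigmoid, as in the formula above."""
        balance = sum(PIECE_VALUES[k] * (my_pieces.get(k, 0) - opp_pieces.get(k, 0))
                      for k in PIECE_VALUES)
        return 1.0 / (1.0 + math.exp(-alpha * balance))

    # up a bishop: a fairly confident win
    print(p_win({"P": 8, "R": 2, "B": 2}, {"P": 8, "R": 2, "B": 1}))   # about 0.82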

It is not so straightforward to create such a value formula in Go: experienced players can often come to agreement on how good or bad a position is, but it is very difficult to describe the process by which they come up with this evaluation; there is no simple rule of thumb, unlike in chess.

This is where the third learning component comes into play. It is a deep neural network trained to learn a mapping between a position and the probability that it will lead to a win. This is, of course, nothing more than binary classification. This network is known as the value network.

Researchers pitted the RL policy network against other, earlier versions of itself in a tournament. This made it possible to build a large database of simulated games on which to train the value network.

The need for speed

In order to evaluate lots of positions in the game tree, it was necessary to build a system which can evaluate the next best move very rapidly. This is where the fourth learning component comes in. The rollout network, like the policy networks, tries to predict the best move in a given position; it uses the same data set as the policy network to do this.

However, it is rigged to be much faster to evaluate, by a factor of about 1000 compared to the original policy network. That being said, it is also much less effective at predicting the next best move than the full policy network. This turns out to be not so important, because during the game tree search, it is more effective to try 1000 different lines of play which are not quite optimal than one very good line of play.

Sure, we lose money on every sale, but we make it up in volume.

Let’s put this into perspective. There is a policy network which predicts the next best move. Many different potential outcomes of the top moves are then evaluated based on the rollout network. The search tree is evaluated until a certain depth, at which point the outcome is predicted based on the value network.

Again, there’s a slight twist: in addition to the truncated value of a position predicted by the value network, the system also plays the position out once, all the way to the end of the game, using the fast rollout network. Mind you, this is just one potential line of play, so the outcome of this full rollout is noisy. The average of the value network’s prediction and this full rollout, however, gives a better estimate of how good a position is than the output of the value network by itself.
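In code, the leaf evaluation described in this paragraph is just a weighted average (a sketch in my own notation; the paper mixes the two terms with a weight lambda, shown here as an equal split):

    def evaluate_leaf(position, value_net, fast_rollout, lam=0.5):
        v = value_net(position)      # predicted win probability in [0, 1]
        z = fast_rollout(position)   # outcome of one fast playout: 1 win, 0 loss (noisy)
        return (1 - lam) * v + lam * z

    # toy usage with stand-in functions
    print(evaluate_leaf("position", value_net=lambda s: 0.7, fast_rollout=lambda s: 1.0))  # 0.85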

Putting it all together: tree search

The final piece of the puzzle is a very elaborate tree search algorithm. The previous best results in Go have been obtained with Monte Carlo Tree Search (MCTS). The game tree for Go is so large that going through the game tree to even moderate depths is hopeless.

In MCTS, the AI tests out positions by playing against a virtual opponent, choosing moves at random from a policy. By averaging the outcomes of thousands of such plays, it obtains an estimate of how good the very first move was. The important distinction between MCTS and minimax search is that not every line of play is considered: we look at the most interesting lines of play exclusively.

What do we mean by interesting? A line might be interesting because it hasn’t been well-explored yet – or it might be interesting because previous playthroughs seem to indicate a very good outcome. Thus, we want to balance exploitation of promising lines with exploration of other, poorly sampled lines.

This is, in fact, a version of the multi-armed bandit problem in the context of a tree. A very effective heuristic for the bandit problem is the upper confidence bound (UCB) policy. To balance exploration and exploitation at any given time step, UCB chooses the arm which has the highest upper confidence bound.

When faced with two arms which have the same expected reward, UCB chooses the one with the biggest error bar: we get a bigger boost in information from the most uncertain arm, all else being equal. On the other hand, faced with two arms with similar error bars, UCB chooses the one with the highest expected value: this has a higher probability of giving us a higher reward. Thus, UCB policy balances exploration and exploitation in an intuitive way.
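A minimal sketch of the plain UCB1 rule for the bandit setting just described (the standard textbook formulation, not anything AlphaGo-specific):

    import math

    def ucb1_choose(counts, total_rewards):
        """counts[i]: times arm i was pulled; total_rewards[i]: its summed reward."""
        t = sum(counts)
        best, best_score = None, -float("inf")
        for i, n in enumerate(counts):
            if n == 0:
                return i                 # pull every arm at least once first
            score = total_rewards[i] / n + math.sqrt(2 * math.log(t) / n)
            if score > best_score:
                best, best_score = i, score
        return best

    print(ucb1_choose([10, 3, 7], [6.0, 2.0, 3.5]))   # picks the rarely tried arm 1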

The AlphaGo AI uses a variant of this heuristic adapted to trees, known as upper confidence bounds applied to trees, or UCT. It explores branches of the tree – lines of play – which seem the most promising at a given time step. The promise of a move is given by a mix of the evaluations from the policy, value, and rollout networks, plus a bonus proportional to the size of the error bar for this evaluation.
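A sketch of that per-move score (my reading of the selection rule, not a quote of the paper's code: the move's running value estimate plus an exploration bonus that scales with the policy network's prior and shrinks as the move is visited; the constant c is a placeholder):

    import math

    def move_score(q_value, prior, visits, parent_visits, c=5.0):
        """q_value: current value estimate; prior: policy network's probability."""
        bonus = c * prior * math.sqrt(parent_visits) / (1 + visits)
        return q_value + bonus

    # a well-explored strong move vs. a barely explored move the policy network likes
    print(move_score(0.55, prior=0.20, visits=400, parent_visits=1000))   # about 0.63
    print(move_score(0.50, prior=0.30, visits=3,   parent_visits=1000))   # about 12.4 -> explore it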

Much of the methods section of the paper explains how this tree search can be done in an effective way in a distributed environment, which is a huge computer science and engineering undertaking… made a lot more tractable with Google’s resources and expertise, of course.

Conclusion

The AlphaGo AI is a hybrid system consisting of both deep neural networks and traditional Monte Carlo tree search. The neural nets allow the AI to examine at a glance what’s the next best move and what is the value of a position. It gives the AI a certain kind of intuition. The tree search, on the other hand, evaluates hundreds of thousands of potential outcomes by intelligently sampling from the game tree.

The advantage of such a hybrid system is that it scales with computational resources. The more CPUs you throw at the tree search, the better the AI becomes. This is not the case with a deep neural network, which becomes faster, not better, with more computational resources.

AlphaGo shows, once again, the promise of combining reinforcement and deep learning. These approaches could soon lead to intelligent autonomous agents, and perhaps even, as Demis Hassabis envisions, to general intelligence. These are truly exciting times.

Disclaimer: I work for Google, but was not involved in this research, and was not paid to write this introduction. All the info here is based on public information contained in the paper.


Silver D, Huang A, Maddison CJ, Guez A, Sifre L, van den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Sutskever I, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, & Hassabis D (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529 (7587), 484-9 PMID: 26819042

02 Feb 16:18

Holly Crab Brings Cajun Seafood to Boston

by Dana Hatic
Edenovellis

Koreans making Cajun food with a Vietnamese touch?

Packard's Corner runneth over with spicy flavors.

An unassuming storefront in Packard's Corner is now serving seafood that packs a punch.

Holly Crab is the work of Ryan and Rick Kim, and its interior is as simple as the restaurant's menu. The dark wood tables and the whimsical wall decorations provide a welcoming ambiance and accent a chalkboard menu that boasts Holly Crab's range of Cajun seafood options.

"It's more about having freshest ingredients from all of the U.S., and that's why we made it very simple, but all the soups, like the clam chowder and the gumbo, they're all freshly made in the kitchen from scratch," Ryan Kim said. "And fresh oyster, we batter it ourselves. So everything's simple, but there's love to it."

Holly Crab's menu is an array of create-your-own options, beginning with the seafood — everything from Louisiana blue crab to Blue Point oysters and King crab legs from Alaska. The seafood is boiled and served up hot with flavored garlic butter or lemon pepper, or dry, or "Holly Crab" style, with a mix of the flavors. There is also a spice range to choose from: mild, medium, spicy, or Holly X.

On the appetizer side of things, there's the aforementioned clam chowder and chicken gumbo, as well as fried oysters, catfish, chicken fingers, fries, rice balls, and more.

Holly Crab held a soft opening last week, offering a 25 percent discount to customers and getting a test-run setting in return.

"It was good, and we had fun. We're still making the kitchen system very stable, we're still working on it, but sooner or later it's going to be all set," Ryan Kim said. "Once we have enough staff front and back, we should be able to fly."

Ryan Kim initially came to the U.S. in 1999 and ended up in Boston with his parents and uncle. He went to the Culinary Institute of America with Rick, and after they graduated, Ryan moved to Las Vegas to attend UNLV and work in the restaurants there. He said he'd go to class in the day and work 3 p.m. to 11 or 12 at night. In Vegas, he discovered a wealth of flavors and said he wanted to return to Boston to bring a new take on seafood to the already bustling seafood town.

Rick Kim, who spent years developing menus for a company in Korea, had sampled Cajun seafood in California and said he loved the spiciness.

"So spiciness and seafood, what should I say? I started digging out how to make it perfectly, so we just tested together more than 100 times, I guess, and then we just came up with a great recipe and tested with our friends, and everyone loved it," he said.

From there, they decided to launch the concept, nailing down the location at 1098 Commonwealth Ave. (formerly home to a Turkish restaurant called Saray), perfecting their seafood sauce with Cajun seasoning, butter, garlic, and lemon, and figuring out how to give Bostonians delicious seafood that won't break the bank.

"Seafood in this state, especially Boston, it's not cheap. It's expensive. Seafood is always expensive, but if you can take a look at the menu, we're trying our best to provide the seafood with a very reasonable price," Ryan Kim said.

Kim said the concept of Cajun seafood is popular in California and has roots in Louisiana, but it also has Vietnamese influences. Supposedly, he said, when the French conquered Vietnam, there was some ingredient appropriation: butter, baguettes, and cooking techniques became part of Vietnamese cuisine, which came to the U.S. along with Vietnamese immigrants in the 1970s.

"So it's Louisiana spice, but there is a Vietnamese touch to it," Ryan Kim said. "We are actually Korean, but I saw this [as a good move] business-wise. It's really good because we could make it simple, provide good stuff — the freshest stuff, and we're in the right market, so we're very excited." Locally, not too many restaurants offer this style of seafood, aside from the recently opened Shaking Crab in Newton.

Holly Crab is open for dinner seven days a week from 5 to 11 p.m. The team may add lunch service eventually, but for now, they want to keep things simple and delicious.

"When customers are here, we want to provide the perfect service. I don't know if perfect is the right word, but perfect-as-possible service — satisfied service," Ryan Kim said.


02 Feb 03:58

When does peer review make no damn sense?

by Andrew

Disclaimer: This post is not peer reviewed in the traditional sense of being vetted for publication by three people with backgrounds similar to mine. Instead, thousands of commenters, many of whom are not my peers—in the useful sense that, not being my peers, your perspectives are different from mine, and you might catch big conceptual errors or omissions that I never even noticed—have the opportunity to point out errors and gaps in my reasoning, to ask questions, and to draw out various implications of what I wrote. Not “peer reviewed”; actually peer reviewed and more; better than peer reviewed.


Last week we discussed Simmons and Simonsohn’s survey of some of the literature on the so-called power pose, where they wrote:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.

Also:

Even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it.

The first response of one of the power-pose researchers was:

I’m pleased that people are interested in discussing the research on the effects of adopting expansive postures. I hope, as always, that this discussion will help to deepen our understanding of this and related phenomena, and clarify directions for future research. . . . I respectfully disagree with the interpretations and conclusions of Simonsohn et al., but I’m considering these issues very carefully and look forward to further progress on this important topic.

This response was pleasant enough but I found it unsatisfactory because it did not even consider the possibility that her original finding was spurious.

After Kaiser Fung and I publicized Simmons and Simonsohn’s work in Slate, the power-pose author responded more forcefully:

The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture.

Amy Cuddy, the author of this response, did not at any place explain how Simmons and Simonsohn were “flat-out wrong,” nor did she list even one of the mistakes with which their analyses were “riddled.”

Peer review

The part of the above quote I want focus on, though, is the phrase “non-peer-reviewed.” Peer reviewed papers have errors, of course (does the name “Daryl Bem” ring a bell?). Two of my own published peer-reviewed articles had errors so severe as to destroy their conclusions! But that’s ok, nobody’s claiming perfection. The claim, I think, is that peer-reviewed articles are much less likely to contain errors, as compared to non-peer-reviewed articles (or non-peer-reviewed blog posts). And the claim behind that, I think, is that peer review is likely to catch errors.

And this brings up the question I want to address today: What sort of errors can we expect peer review to catch?

I’m well placed to answer this question as I’ve published hundreds of peer-reviewed papers and written thousands of referee reports for journals. And of course I’ve also done a bit of post-publication review in recent years.

To jump to the punch line: the problem with peer review is with the peers.

In short, if an entire group of peers has a misconception, peer review can simply perpetuate error. We’ve seen this a lot in recent years, for example that paper on ovulation and voting was reviewed by peers who didn’t realize the implausibility of 20-percentage-point vote swings during the campaign, peers who also didn’t know about the garden of forking paths. That paper on beauty and sex ratio was reviewed by peers who didn’t know much about the determinants of sex ratio and didn’t know much about the difficulties of estimating tiny effects from small sample sizes.

OK, let’s step back for a minute. What is peer review good for? Peer reviewers can catch typos, they can catch certain logical flaws in an argument, they can notice the absence of references to the relevant literature—that is, the literature that the peers are familiar with. That’s how the peer reviewers for that psychology paper on ovulation and voting didn’t catch the error of claiming that days 6-14 were the most fertile days of the cycle: these reviewers were peers of the people who made the mistake in the first place!

Peer review has its place. But peer reviewers have blind spots. If you want to really review a paper, you need peer reviewers who can tell you if you’re missing something within the literature—and you need outside reviewers who can rescue you from groupthink. If you’re writing a paper on himmicanes and hurricanes, you want a peer reviewer who can connect you to other literature on psychological biases, and you also want an outside reviewer—someone without a personal and intellectual stake in you being right—who can point out all the flaws in your analysis and can maybe talk you out of trying to publish it.

Peer review is subject to groupthink, and peer review is subject to incentives to publishing things that the reviewers are already working on.

This is not to say that a peer-reviewed paper is necessarily bad—I stand by over 99% of my own peer-reviewed publications!—rather, my point is that there are circumstances in which peer review doesn’t give you much.

To return to the example of power pose: There are lots of papers in this literature and there’s a group of scientists who believe that power pose is real, that it’s detectable, and indeed that it can help millions of people. There’s also a group of scientists who believe that any effects of power pose are small, highly variable, and not detectable by the methods used in the leading papers in this literature.

Fine. Scientific disagreements exist. Replication studies have been performed on various power-pose experiments (indeed, it’s the null result from one of these replications that got this discussion going), and the debate can continue.

But, my point here is that peer-review doesn’t get you much. The peers of the power-pose researchers are . . . other power-pose researchers. Or researchers on embodied cognition, or on other debatable claims in experimental psychology. Or maybe other scientists who don’t work in this area but have heard good things about it and want to be supportive of this work.

And sometimes a paper will get unsupportive reviews. The peer review process is no guarantee. But then authors can try again until they get those three magic positive reviews. And peer review—review by true peers of the authors—can be a problem, if the reviewers are trapped in the same set of misconceptions, the same wrong framework.

To put it another way, peer review is conditional. Papers in the Journal of Freudian Studies will give you a good sense of what Freudians believe, papers in the Journal of Marxian Studies will give you a good sense of what Marxians believe, and so forth. This can serve a useful role. If you’re already working in one of these frameworks, or if you’re interested in how these fields operate, it can make sense to get the inside view. I’ve published (and reviewed papers for) the journal Bayesian Analysis. If you’re anti-Bayesian (not so many of these anymore), you’ll probably think all these papers are a crock of poop and you can ignore them, and that’s fine.

(Parts of) the journals Psychological Science and PPNAS have been the house organs for a certain variety of social psychology that a lot of people (not just me!) don’t really trust. Publication in these journals is conditional on the peers who believe the following equation:

“p less than .05” + a plausible-sounding theory = science.

Lots of papers in recent years by Uri Simonsohn, Brian Nosek, John Ioannidis, Katherine Button, etc etc etc., have explored why the above equation is incorrect.

But there are some peers that haven’t got the message yet. Not that they would endorse the above statement when written as crudely as in that equation, but I think this is how they’re operating.

And, perhaps more to the point, many of the papers being discussed are several years or even decades old, dating back to a time when almost nobody (myself included) realized how wrong the above equation is.

Back to power pose

And now back to the power pose paper by Carney et al. It has many garden-of-forking-paths issues (see here for a few of them). Or, as Simonsohn would say, many researcher degrees of freedom.

But this paper was published in 2010! Who knew about the garden of forking paths in 2010? Not the peers of the authors of this paper. Maybe not me either, had it been sent to me to review.

What we really needed (and, luckily, we can get) is post-publication review: not peer reviews, but outside reviews, in this case reviews by people who are outside of the original paper both in research area and in time.

And also this, from another blog comment:

It is also striking how very close to the .05 threshold some of the implied p-values are. For example, for the task where the participants got the opportunity to gamble the reported chi-square is 3.86 which has an associated p-value of .04945.

Of course, this reported chi-square value does not seem to match the data because it appears from what is written on page 4 of the Carney et al. paper that 22 participants were in the high power-pose condition (19 took the gamble, 3 did not) while 20 were in the low power-pose condition (12 took the gamble, 8 did not). The chi-square associated with a 2 x 2 contingency table with this data is 3.7667 and not 3.86 as reported in the paper. The associated p-value is .052 – not less than .05.
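For what it's worth, the recalculation quoted above is easy to check (a quick sketch using scipy; correction=False gives the uncorrected Pearson chi-square):

    from scipy.stats import chi2_contingency

    table = [[19, 3],    # high power-pose condition: 19 took the gamble, 3 did not
             [12, 8]]    # low power-pose condition: 12 took the gamble, 8 did not
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(round(chi2, 4), round(p, 4))   # about 3.7667 and 0.0523, not 3.86 and .049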

You can’t expect peer reviewers to check these sorts of calculations—it’s not like you could require authors to supply their data and an R or Stata script to replicate the analyses, ha ha ha. The real problem is that the peer reviewers were sitting there, ready to wave past the finish line a result with p less than .05, which provides an obvious incentive for the authors to get p less than .05, one way or another.

Commenters also pointed out an earlier paper by one of the same authors, this time on stereotypes of the elderly, from 2005, that had a bunch more garden-of-forking-paths issues and also misreported two t statistics: the actual values were something like 1.79 and 3.34; the reported values were 5.03 and 11.14! Again, you can’t expect peer reviewers to catch these problems (nobody was thinking about forking paths in 2005, and who’d think to recalculate a t statistic?), but outsiders can find them, and did.

At this point one might say that this doesn’t matter, that the weight of the evidence, one way or another, can’t depend on whether a particular comparison in one paper was or was not statistically significant—but if you really believe this, what does it say about the value of the peer-reviewed publication?

Again, I’m not saying that peer review is useless. In particular, peers of the authors should be able to have a good sense of how the storytelling theorizing in the article fits in with the rest of the literature. Just don’t expect peers to do any assessment of the evidence.

Linking as peer review

Now let’s consider the Simmons and Simonsohn blog post. It’s not peer reviewed—except it kinda is! Kaiser Fung and I chose to cite Simmons and Simonsohn in our article. We peer reviewed the Simmons and Simonsohn post.

This is not to say that Kaiser and I are certain that Simmons and Simonsohn made no mistakes in that post; peer review never claims to that sort of perfection.

But I’d argue that our willingness to cite Simmons and Simonsohn is a stronger peer review than whatever was done for those two articles cited above. I say this not just because those papers had demonstrable errors which affect their conclusions (and, yes, in the argot of psychology papers, if a p-value shifts from one side of .05 to the other, it does affect the conclusions).

I say this also because of the process. When Kaiser and I cite Simmons and Simonsohn in the way that we do, we’re putting a little bit of our reputation on the line. If Simmons and Simonsohn made consequential errors—and, hey, maybe they did, I didn’t check their math, any more than the peer reviewers of the power pose papers checked their math—that rebounds negatively on us, that we trusted something untrustworthy. In contrast, the peer reviewers of those two papers are anonymous. The peer review that they did was much less costly, reputationally speaking, than ours. We have skin in the game, they do not.

Beyond this, Simmons and Simonsohn say exactly what they did, so you can work it out yourself. I trust this more than the opinions of 3 peers of the authors in 2010, or 3 other peers in 2005.

Summary

Peer review can serve some useful purposes. But to the extent the reviewers are actually peers of the authors, they can easily have the same blind spots. I think outside review can serve a useful purpose as well.

If the authors of many of these PPNAS or Psychological Science-type papers really don’t know what they’re doing (as seems to be the case), then it’s no surprise that peer review will fail. They’re part of a whole peer group that doesn’t understand statistics. So, from that perspective, perhaps we should trust “peer review” less than we should trust “outside review.”

I am hoping that peer review in this area will improve, given the widespread discussion of researcher degrees of freedom and garden of forking paths. Even so, though, we’ll continue to have a “legacy” problem of previously published papers with all sorts of problems, up to and including t statistics misreported by factors of 3. Perhaps we’ll have to speak of “post-2015 peer-reviewed articles” and “pre-2015 peer-reviewed articles” as different things?

The post When does peer review make no damn sense? appeared first on Statistical Modeling, Causal Inference, and Social Science.

29 Jan 19:18

Levi’s 501s: The Choice of American Icons (and 1990s Heroin Dealers in Mexico)

by breathnaigh


Levi’s 501s: The Choice of American Icons (and 1990s Heroin Dealers in Mexico)

The Atlantic details the cachet Levi’s jeans had in Mexico in the 1990s, when they represented wealth and connection to America.

When Enrique, an aspiring heroin dealer from Nayarit, Mexico, arrived at his uncles’ home in California in 1989, they led him to a closet full of brand-new Levi’s 501 jeans.

“Take what you want,” the uncles, who were deep into the heroin trade, told Enrique.

Mexican dealers, who peddle much of the black-tar heroin in the United States, valued Levi’s 501 jeans almost as much as currency. In fact, they sometimes accepted payment in denim form: One balloon could be had for two pairs. When the dealers came home to their ranchos wearing 501s—or if they brought them as gifts—their families knew they had truly made it.

The scene is taken from a recent book on the opioid epidemic in America: Dreamland, by Sam Quinones.

Read more here.

28 Jan 21:00

Same source, different styles

by Nathan Yau


Jaakko Seppälä drew ten comic characters, each in its original style and in the style of the other nine. It's like the same source material can be shown and seen in different ways, communicating different moods and themes. Imagine that.

See all 10.

Tags: comics

26 Jan 23:22

What Do We Really Know About Osama bin Laden’s Death?

Edenovellis

An article about narrative, news reporting, and anonymous sources.

26 Jan 23:21

Saturday Morning Breakfast Cereal - Science Communication

by admin@smbc-comics.com

Hovertext: Anything I don't understand explains everything I don't understand.


New comic!
Today's News:

It begins.

 

26 Jan 23:17

Brookline's Tatte Is Closed for Renovations

by Dana Hatic

Prepare for a whole new Tatte in two months.

A local bakery is temporarily closed for renovations. Tatte Bakery & Cafe's Brookline location at 1003 Beacon St. will be out of commission for the next two months as the store undergoes extensive interior changes.

The bakery posted about the news on its website, with a note to the store's customers.

"Thank you for the massive love, support and hugs you gave us in the past 8 years. We feel blessed and honored every day by your love. I could never have done it without you, your belief in me and your support and for that I am forever thankful," the note read.

Tatte got its start in 2007 when its founder Tzurit Or, a self-trained pastry chef, began selling pastries at local farmers markets. She grew up in Israel baking with her family and has since created recipes inspired by her roots to share with customers across five Tatte locations in Greater Boston.

Brookline was the first brick-and-mortar location, and the ground-up renovations will completely rework and expand the space. The other locations remain open.

The redone store will have: "more space, more seats, {customers restrooms!!} shorter lines, faster service, more food, pastries and more of Tatte".

Tatte will post progress updates on its website over the next two months.

26 Jan 18:23

The time-reversal heuristic—a new way to think about a published finding that is followed up by a large, preregistered replication (in context of Amy Cuddy’s claims about power pose)

by Andrew


[Note to busy readers: If you’re sick of power pose, there’s still something of general interest in this post; scroll down to the section on the time-reversal heuristic. I really like that idea.]

Someone pointed me to this discussion on Facebook in which Amy Cuddy expresses displeasure with my recent criticism (with Kaiser Fung) of her claims regarding the “power pose” research of Cuddy, Carney, and Yap (see also this post from yesterday). Here’s Cuddy:

This is sickening and, ironically, such an extreme overreach. First, we *published* a response, in Psych Science, to the Ranehill et al conceptual (not direct) replication, which varied methodologically in about a dozen ways — some of which were enormous, such as having people hold the poses for 6 instead of 2 minutes, which is very uncomfortable (and note that even so, somehow people missed that they STILL replicated the effects on feelings of power). So yes, I did respond to the peer-reviewed paper. The fact that Gelman is referring to a non-peer-reviewed blog, which uses a new statistical approach that we now know has all kinds of problems, as the basis of his article is the WORST form of scientific overreach. And I am certainly not obligated to respond to a personal blog. That does not mean I have not closely inspected their analyses. In fact, I have, and they are flat-out wrong. Their analyses are riddled with mistakes, not fully inclusive of all the relevant literature and p-values, and the “correct” analysis shows clear evidential value for the feedback effects of posture. I’ve been quiet and polite long enough.

There’s a difference between having your ideas challenged in constructive way, which is how it used in to be in academia, and attacked in a destructive way. My “popularity” is not relevant. I’m tired of being bullied, and yes, that’s what it is. If you could see what goes on behind the scenes, you’d be sickened.

I will respond here but first let me get a couple things out of the way:

1. Just about nobody likes to be criticized. As Kaiser and I noted in our article, Cuddy’s been getting lots of positive press but she’s had some serious criticisms too, and not just from us. Most notably, Eva Ranehill, Anna Dreber, Magnus Johannesson, Susanne Leiberg, Sunhae Sul, and Roberto Weber published a paper last year in which they tried and failed to replicate the results of Cuddy, Carney, and Yap, concluding “we found no significant effect of power posing on hormonal levels or in any of the three behavioral tasks.” Shortly after, the respected psychology researchers Joe Simmons and Uri Simonsohn published on their blog an evaluation and literature review, writing that “either power-posing overall has no effect, or the effect is too small for the existing samples to have meaningfully studied it” and concluding:

While the simplest explanation is that all studied effects are zero, it may be that one or two of them are real (any more and we would see a right-skewed p-curve). However, at this point the evidence for the basic effect seems too fragile to search for moderators or to advocate for people to engage in power posing to better their lives.

OK, so I get this. You work hard on your research, you find something statistically significant, you get it published in a top journal, you want to draw a line under it and move on. For outsiders to go and question your claim . . . that would be like someone arguing a call in last year’s Super Bowl. The game’s over, man! Time to move on.

So I see how Cuddy can find this criticism frustrating, especially given her success with the Ted talk, the CBS story, the book publication, and so forth.

2. Cuddy writes, “If you could see what goes on behind the scenes, you’d be sickened.” That might be so. I have no idea what goes on behind the scenes.

OK, now on to the discussion

The short story to me is that Cuddy, Carney, and Yap found statistical significance in a small sample, non-preregistered study with a flexible hypothesis (that is, a scientific hypothesis that posture could affect performance, which can map on to many many different data patterns). We already know to watch out for such claims, and in this case a large follow-up study by an outside team did not find a positive effect. Meanwhile, Simmons and Simonsohn analyzed some of the published literature on power pose and found it to be consistent with no effect.

At this point, a natural conclusion is that the existing study by Cuddy et al. was too noisy to reveal much of anything about whatever effects there might be of posture on performance.

This is not the only conclusion one might draw, though. Cuddy draws a different conclusion, which is that her study did find a real effect and that the replication by Ranehill et al. was done under different, less favorable conditions, for which the effect disappeared.

This could be. As Kaiser and I wrote, “This is not to say that the power pose effect can’t be real. It could be real and it could go in either direction.” We question on statistical grounds the strength of the evidence offered by Cuddy et al. And there is also the question of whether a lab result in this area, if it were real, would generalize to the real world.

What frustrates me is that Cuddy in all her responses doesn’t seem to even consider the possibility that the statistically significant pattern they found might mean nothing at all, that it might be an artifact of a noisy sample. It’s happened before: remember Daryl Bem? Remember Satoshi Kanazawa? Remember the ovulation-and-voting researchers? The embodied cognition experiment? The 50 shades of gray? It happens all the time! How can Cuddy be so sure it hasn’t happened to her? I’d say this even before the unsuccessful replication from Ranehill et al.

Response to some specific points

“Sickening,” huh? So, according to Cuddy, her publication is so strong it’s worth a book and promotion in NYT, NPR, CBS, TED, etc. But Ranehill et al.’s paper, that somehow has a lower status, I guess because it was published later? So it’s “sickening” for us to express doubt about Cuddy’s claim, but not “sickening” for her to question the relevance of the work by Ranehill et al.? And Simmons and Simonsohn’s blog, that’s no good because it’s a blog, not a peer reviewed publication. Where does this put Daryl Bem’s work on ESP or that “bible code” paper from a couple decades ago? Maybe we shouldn’t be criticizing them, either?

It’s not clear to me how Simmons, Simonsohn, and I are “bullying” Cuddy. Is it bullying to say that we aren’t convinced by her paper? Are Ranehill, Dreber, etc. “bullying” her too, by reporting a non-replication? Or is that not bullying because it’s in a peer-reviewed journal?

When a published researcher such as Cuddy equates “I don’t believe your claims” with “bullying,” that to me is a problem. And, yes, the popularity of Cuddy’s work is indeed relevant. There’s lots of shaky research that gets published every year and we don’t have time to look into all of it. But when something is so popular and is promoted so heavily, then, yes, it’s worth a look.

Also, Cuddy writes that “somehow people missed that they STILL replicated the effects on feelings of power.” But people did not miss this at all! Here’s Simmons and Simonsohn:

In the replication, power posing affected self-reported power (the manipulation check), but did not impact behavior or hormonal levels. The key point of the TED Talk, that power poses “can significantly change the outcomes of your life”, was not supported.

In any case, it’s amusing that someone who’s based an entire book on an experiment that was not successfully replicated is writing about “extreme overreach.” As I’ve written several times now, I’m open to the possibility that power pose works, but skepticism seems to me to be eminently reasonable, given the evidence currently available.

In the meantime, no, I don’t think that referring to a non-peer-reviewed blog is “the worst form of scientific overreach.” I plan to continue to read and refer to the blog of Simonsohn and his colleagues. I think they do careful work. I don’t agree with everything they write—but, then again, I don’t agree with everything that is published in Psychological Science, either. Simonsohn et al. explain their reasoning carefully and they give their sources.

I have no interest in getting into a fight with Amy Cuddy. She’s making a scientific claim and I don’t think the evidence is as strong as she’s claiming. I’m also interested in how certain media outlets take her claims on faith. That’s all. Nothing sickening, no extreme overreach, just a claim on my part that, once again, a researcher is being misled by the process in which statistical significance, followed by publication in a major journal, is taken as an assurance of truth.

The time-reversal heuristic

One helpful (I think) way to think about this episode is to turn things around. Suppose the Ranehill et al. experiment, with its null finding, had come first. A large study finding no effect. And then Cuddy et al. had run a replication under slightly different conditions with a much smaller sample size and found statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any power-pose effect is fragile.

From this point of view, what Cuddy et al.’s research has going for it is that (a) they found statistical significance, (b) their paper was published in a peer-reviewed journal, and (c) their paper came before, rather than after, the Ranehill et al. paper. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take Cuddy et al. as the starting point in our discussion, just because it was published first.

What next?

I don’t see any of this changing Cuddy’s mind. And I have no idea what Carney and Yap think of all this; they’re coauthors of the original paper but don’t seem to have come up much in the subsequent discussion. I certainly don’t think of Cuddy as any more of an authority on this topic than are Eva Ranehill, Anna Dreber, etc.

And I’m guessing it would take a lot to shake the certainty expressed on the matter by team TED. But maybe people will think twice when the next such study makes its way through the publicity mill?

And, for those of you who can’t get enough of power pose, I just learned that the journal Comprehensive Results in Social Psychology, “the preregistration-only journal for social psychology,” will be having a special issue devoted to replications of power pose! Publication is expected in fall 2016. So you can expect some more blogging on this topic in a few months.

The potential power of self-help

What about the customers of power pose, the people who might buy Amy Cuddy’s book, follow its advice, and change their life? Maybe Cuddy’s advice is just fine, in which case I hope it helps lots of people. It’s perfectly reasonable to give solid, useful advice without any direct empirical backing. I give advice all the time without there being any scientific study behind it. I recommend writing this way, teaching that way, and making this or that sort of graph, typically basing my advice on nothing but a bunch of stories. I’m not the best one to judge whether Cuddy’s advice will be useful for its intended audience. But if it is, that’s great, and I wish her book every success. The advice could be useful in any case. Even if power pose has null or even negative effects, the net effect of all the advice in the book, informed by Cuddy’s experiences teaching business students and so forth, could be positive.

As I wrote in a comment in yesterday’s thread, consider a slightly different claim: Before an interview you should act confident; you should fold in upon yourself and be coiled and powerful; you should be secure about yourself and be ready to spring into action. It would be easy to imagine an alternative world in which Cuddy et al. found an opposite effect and wrote all about the Power Pose, except that the Power Pose would be described not as an expansive posture but as coiled strength. We’d be hearing about how our best role model is not cartoon Wonder Woman but rather the Lean In of the modern corporate world. Etc. And, the funny thing is, that might be good advice too! As they say in chess, it’s important to have a plan. It’s not good to have no plan. It’s better to have some plan, any plan, especially if you’re willing to adapt that plan in light of events. So it could well be that either of these power pose books—Cuddy’s actual book, or the alternative book, giving the exact opposite posture advice, which might have been written had the data in the Cuddy, Carney, and Yap paper come out different—could be useful to readers.

So I want to separate three issues: (1) the general scientific claim that some manipulation of posture will have some effects, (2) the specific claim that the particular poses recommended by Cuddy et al. will have the specific effects claimed in their paper, and (3) possible social benefits from Cuddy’s Ted talk and book. Claim (1) is uncontroversial, claim (2) is suspect (both from the failed replication and from consideration of statistical noise in the original study), and item (3) is a different issue entirely, which is why I wouldn’t want to argue with claims that the talk and the book have helped people.

P.P.S. You might also want to take a look at this post by Uri Simonsohn, who goes into detail on a different example of a published and much-cited result from psychology that did not replicate. Long story short: forking paths mean that it’s possible to get statistical significance from noise, and also that you can keep finding confirmation by doing new studies and postulating new interactions to explain whatever you find. When an independent replication fails, it doesn’t necessarily mean that the original study found something and the replication didn’t; it can mean that the original study was capitalizing on noise. Again, consider the time-reversal heuristic: pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find a statistically significant interaction somewhere.
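Here’s a quick sketch of that last point, again in Python with made-up numbers (the 200 subjects and the ten hypothetical binary moderators are my assumptions, not anything from the actual studies). The overall effect “fails to replicate,” but an analyst who is free to go hunting for interactions will often find some subgroup comparison that hits p < 0.05 from noise alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Pure-noise data: the outcome is unrelated to treatment. The ten binary
# covariates stand in for hypothetical moderators (gender, age group, time
# of day, ...) that a determined analyst might use to define subgroups.
n = 200
treatment = rng.integers(0, 2, n)
outcome = rng.normal(0, 1, n)              # true treatment effect is zero
moderators = rng.integers(0, 2, (n, 10))   # ten irrelevant binary covariates

# The main effect "fails to replicate"...
_, p_main = stats.ttest_ind(outcome[treatment == 1], outcome[treatment == 0])
print(f"main effect: p = {p_main:.2f}")

# ...but go hunting for interactions and some subgroup comparison will
# often look "significant" anyway.
subgroup_p = []
for j in range(moderators.shape[1]):
    in_subgroup = moderators[:, j] == 1
    _, p = stats.ttest_ind(outcome[(treatment == 1) & in_subgroup],
                           outcome[(treatment == 0) & in_subgroup])
    subgroup_p.append(p)
print(f"smallest of ten subgroup p-values: {min(subgroup_p):.3f}")
```

Across repeated runs, the smallest of the ten subgroup p-values drops below 0.05 a large fraction of the time, even though every true effect in the simulation is exactly zero. That’s all it takes to keep a literature of “confirmations” going.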

P.P.P.S. More here from Ranehill and Dreber. I don’t know if Cuddy would consider this as bullying. On one hand, it’s a blog comment, so it’s not like it has been subject to the stringent peer review of Psych Science, PPNAS, etc., ha ha; on the other hand, Ranehill and Dreber do point to some published work:

Finally, we would also like to raise another important point that is often overlooked in discussions of the reliability of Carney et al.’s results, and also absent in the current debate. This issue is raised in Stanton’s earlier commentary to Carney et al., published in the peer-reviewed journal Frontiers in Behavioral Neuroscience (available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3057631/). Apart from pointing out a few statistical issues with the original article, such as collapsing the hormonal analysis over gender, or not providing information on the use of contraceptives, Stanton (footnote 2) points out an inconsistency between the mean change in cortisol reported by Carney et al. in the text, and those displayed in Figure 3, depicting the study’s main hormonal results. Put succinctly, the reported hormone numbers in Carney, et al., “don’t add up.” Thus, it seems that not even the original article presents consistent evidence of the hormonal changes associated with power poses. To our knowledge, Carney, et al., have never provided an explanation for these inconsistencies in the published results.

From the standpoint of studying hormones and behavior, this is all interesting and potentially important. Or we can just think of this generically as some more forks in the path.
