Shared posts

07 May 18:22

Why big data is in trouble: they forgot about applied statistics

by Jeff Leek

This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.

All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components in the spatial trend, investigated why the search terms were predictive, and tried to understand the likely reason that Google Flu Trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case, a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately, the researchers did not correctly account for all the sources of variation in the data set, misapplied statistical methods, and ignored major data integrity problems. The lead author and the editors who handled the paper lacked the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high government debt. Later it was discovered that there was an error in the Excel spreadsheet they used to perform the analysis. More importantly, the choice of weights they used in their regression model was questioned as being unrealistic and as leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data-analytic assumptions that any well-trained applied statistician would have performed.

Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:

One example of this kind of thinking is this insane table from the alumni magazine of the University of California, which I found via this amazing talk by Terry Speed (via Rafa; go watch his talk right now, it gets right to the heart of the issue). It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.

[Screenshot: the table from the University of California alumni magazine]

All of this leads to two questions:

  1. Given the importance of statistical thinking, why aren't statisticians involved in these initiatives?
  2. When thinking about the big data era, what are some statistical ideas we've already figured out?

16 Apr 18:25

Testing on the Toilet: Test Behaviors, Not Methods

by Erik Kuefler (Google Testing Blog)

This article was adapted from a Google Testing on the Toilet (TotT) episode. You can download a printer-friendly version of this TotT episode and post it in your office.

After writing a method, it's easy to write just one test that verifies everything the method does. But it can be harmful to think that tests and public methods should have a 1:1 relationship. What we really want to test are behaviors: a single method can exhibit many behaviors, and a single behavior sometimes spans multiple methods.

Let's take a look at a bad test that verifies an entire method:

@Test public void testProcessTransaction() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction("Pile of Beanie Babies", dollars(3)));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Displaying the name of the purchased item and sending an email about the balance being low are two separate behaviors, but this test looks at both of those behaviors together just because they happen to be triggered by the same method. Tests like this very often become massive and difficult to maintain over time as additional behaviors keep getting added in—eventually it will be very hard to tell which parts of the input are responsible for which assertions. The fact that the test's name is a direct mirror of the method's name is a bad sign.

It's a much better idea to use separate tests to verify separate behaviors:

@Test public void testProcessTransaction_displaysNotification() {
  transactionProcessor.processTransaction(
      new User(), new Transaction("Pile of Beanie Babies"));
  assertContains("You bought a Pile of Beanie Babies", ui.getText());
}

@Test public void testProcessTransaction_sendsEmailWhenBalanceIsLow() {
  User user = newUserWithBalance(LOW_BALANCE_THRESHOLD.plus(dollars(2)));
  transactionProcessor.processTransaction(
      user,
      new Transaction(dollars(3)));
  assertEquals(1, user.getEmails().size());
  assertEquals("Your balance is low", user.getEmails().get(0).getSubject());
}

Now, when someone adds a new behavior, they will write a new test for that behavior. Each test will remain focused and easy to understand, no matter how many behaviors are added. This will make your tests more resilient since adding new behaviors is unlikely to break the existing tests, and clearer since each test contains code to exercise only one behavior.

17 Jul 20:28

Data Community DC: Python for Data Analysis: The Landscape of Tutorials

Python has long been one of the premier general-purpose scripting languages and a major web development language. Numerical computing, data analysis, and scientific programming developed through the packages Numpy and Scipy, which, along with the visualization package Matplotlib, formed the basis for an open-source alternative to Matlab. Numpy provides array objects, cross-language integration, linear algebra, and other functionality. Scipy builds on this and provides optimization, statistics, and basic image analysis capabilities. Matplotlib provides sophisticated 2-D and basic 3-D graphics capabilities with Matlab-like syntax.
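
To make that division of labor concrete, here is a minimal sketch (my own illustration, not from the original post) that touches all three packages: a Numpy array, a Scipy optimization call, and a Matplotlib plot.

import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# Numpy: array objects and vectorized arithmetic
x = np.linspace(-3, 3, 200)
y = (x - 1) ** 2 + 0.5

# Scipy: numerical optimization (find the parabola's minimum)
result = optimize.minimize_scalar(lambda t: (t - 1) ** 2 + 0.5)
print("minimum at x =", result.x)

# Matplotlib: Matlab-like plotting syntax
plt.plot(x, y)
plt.axvline(result.x, linestyle="--")
plt.title("Minimizing (x - 1)^2 + 0.5")
plt.show()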


Further recent development has resulted in a rather complete stack for data manipulation and analysis that includes Sympy for symbolic mathematics, pandas for data structures and analysis, and IPython as an enhanced console and HTML notebook that also facilitates parallel computation.
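
For a taste of what pandas adds on top of Numpy, here is a minimal sketch (the data are made up purely for illustration):

import pandas as pd

# pandas: labeled, tabular data structures with built-in group-by and summaries
df = pd.DataFrame({
    "city": ["DC", "DC", "Baltimore", "Baltimore"],
    "temp": [71, 75, 69, 73],
})
print(df.groupby("city")["temp"].mean())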

An even richer data analysis ecosystem is quickly evolving in Python, led by Enthought and Continuum Analytics and several other independent and associated efforts. We have described this ecosystem here.

This week, as part of the Data Community DC meetups, Peter Wang from Continuum Analytics is presenting on the PyData ecosystem at Statistical Programming DC, and Jonathan Street from NIH is presenting on Scientific Computing in Python at Data Science MD.

How do you get started?

This ecosystem is evolving, but it exists today, so how do you get started using these tools? Fortunately, there are several tutorials available, both as videos and as presentations, that can put you on the path. This listing is of course incomplete and may not include your favorite tool. Tell us about it in the comments!

PyData Workshop 2012

The PyData Workshop 2012 was organized in NYC last October to bring together data scientists, scientists and engineers. It focused on “techniques and tools for management, analytics, and visualization of data of different types and sizes with particular emphasis on big data”. It was primarily sponsored by Continuum Analytics.  The videos for this workshop are aggregated here.

PyData Silicon Valley

The follow-up PyData workshop was held alongside PyCon 2013 in Santa Clara, CA. The videos of the presentations are available here. The topics at the workshop included tutorials for pandas, matplotlib, PySpark (for cluster computing), scikit-learn, Wise.io, Disco (a MapReduce implementation), Naive Bayes, Nodebox, machine learning in Python, and IPython.

PyData Workshop 2013

The next PyData Workshop will be held in Cambridge, MA, July 27-28, 2013.

Tutorials for Particular Tools

Python for Data Analysis

  1. Getting started from Kaggle.com.

IPython

IPython notebooks have become the de facto standard for presenting Python analyses, as evidenced by the recent Scipy conference. There are several tutorials for learning IPython.

  1. The IPython tutorial
  2. Fernando Perez’s talk on IPython (and video)
  3. PyCon 2012 tutorial
  4. Interesting IPython notebooks
  5. IPython notebook examples

Python Data Analysis Library (pandas)

  1. The 10-minute introduction to pandas
  2. The pandas cookbook
  3. 2012 PyData Workshop
  4. The pandas documentation
  5. Randal Olson’s tutorial
  6. Wes McKinney’s tutorials 1 and 2 on Kaggle.
  7. Hernan Rojas’ tutorial
  8. Tutorials on financial data and time series using pandas

Scikit-learn

  1. 2012 PyData Workshop
  2. Official scikit-learn tutorial
  3. Jacob VanderPlas’ tutorial
  4. PyCon 2013 tutorial on advanced machine learning with scikit-learn
  5. More scikit-learn tutorials.
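
If you want to see scikit-learn's basic fit/predict pattern before working through the tutorials above, a minimal sketch looks like this (it uses the bundled iris toy dataset and is illustrative only):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# scikit-learn's estimator pattern: construct, fit, predict
iris = load_iris()
model = LogisticRegression()
model.fit(iris.data, iris.target)
print(model.predict(iris.data[:5]))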

Matplotlib

  1. Official tutorial
  2. N.P. Rougier’s tutorial from EuroSciPy 2012
  3. Jake VanderPlas’ tutorial from PyData NYC 2012
  4. John Hunter’s Advanced Matplotlib Tutorial from PyData 2012
  5. A tutorial from Scigraph.

Sympy

  1. Official tutorial
  2. SciPy 2013 presentations

Numpy and Scipy

  1. The Guide to Numpy
  2. M. Scott Shell’s Introduction to Numpy and Scipy

Databases from Python

  1. SQLite
  2. MySQL
  3. PostgreSQL
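
As a quick illustration of talking to a database from Python, here is a minimal sketch using the standard-library sqlite3 module (the file and table names are made up):

import sqlite3

# sqlite3 ships with Python, so no extra installation is needed
conn = sqlite3.connect("example.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, value REAL)")
conn.execute("INSERT INTO scores VALUES (?, ?)", ("alice", 9.5))
conn.commit()
for row in conn.execute("SELECT name, value FROM scores"):
    print(row)
conn.close()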

Books

  1. Python for Data Analysis by Wes McKinney
  2. Learning IPython for Interactive Computing and Data Visualization by Cyrille Rossant

If you have any additional suggestions, please leave them in the comments section below!

Editor's Note: The book images link out to Amazon, of which we are an affiliate. Thus, if you click the link and buy the book, we get a single-digit percentage cut of the purchase. So please, click and buy the books ;)

The post Python for Data Analysis: The Landscape of Tutorials appeared first on Data Community DC.

16 May 21:59

The lesson of Angelina Jolie's breasts

There are people who teach us how to change the world, starting with ourselves. That is certainly the case of Angelina Jolie who, after undergoing genetic testing, sent vanity into orbit and chose life: she had both breasts removed, before any tumor had appeared. Because she carries two faulty genes, her risk of developing breast cancer was 87%, the same disease that killed her mother while still young. The actress did not want her children to go through the same suffering of losing their mother. Read more (14/05/2013 - 08h03)

13 Apr 19:34

The app from the outskirts and the mobile phone revolution

Starting today, you can use your cell phone to find out about the most relevant things happening in culture and education on the outskirts of the city of São Paulo, extending into the metropolitan regions. The app is called Cultura de Ponta (download it here) and was developed by Fábrica de Aplicativos. The idea is to show that there is interesting cultural production at the edges of the city of São Paulo, production that most of the time remains invisible, even within the periphery itself. Invisible even to those who live there. Read more (12/04/2013 - 08h13)

12 Apr 16:16

Vagrant + Chef

by Miguel Galves

A startup developer writes code, tests, deploys, maintains the infrastructure, designs screens, answers the phone, and washes the coffee cups at the end of the day. The only way to stay sane and productive is to automate everything that can be automated, saving time and avoiding silly mistakes. Moreover, in the era of cloud computing, hardware has become software, and the best way to take advantage of that flexibility is to automate the process of creating and configuring your machines. That way you can expand, shrink, and replace servers with one or two clicks.

At SIGA we make intensive use of Amazon AWS services; we have several servers running (and in some cases that number varies autoMagically with demand), and just one poor soul managing it all: me. I found myself forced to look for solutions, and Fabric fit like a glove. I put together a script capable of managing my machines on Amazon (start, stop, reboot) and the services available on each one, as well as handling initial configuration and code updates.
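
To give a flavor of what such a Fabric script can look like, here is a minimal sketch (the host name and commands are hypothetical, not the actual SIGA setup; it uses the Fabric 1.x API):

from fabric.api import env, run, sudo, task

env.hosts = ["ubuntu@app-server.example.com"]  # hypothetical host

@task
def deploy():
    """Update the application code and restart the service."""
    run("cd /srv/app && git pull")
    sudo("service app restart")

Tasks like this are run from the command line with, for example, fab deploy.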

All well and good, but I eventually reached the limits of this solution, for several reasons:

  1. It was completely tied to Amazon's infrastructure;
  2. It was completely tied to Ubuntu;
  3. It had grown organically and without much planning. The script's portability between projects has improved a lot, but it is still precarious.

While looking for solutions, I started paying closer attention to Chef. And things really got interesting when I was introduced to Vagrant. Summarizing each tool very briefly: Chef is a system that automates machine configuration and package installation through recipes, and it is able to handle the subtleties of many different operating systems.

Chef provides a flexible model for reuse by enabling users to model infrastructure as code to easily and consistently configure and deploy infrastructure across any platform.

Vagrant, in turn, is a virtual machine builder, capable of generating VirtualBox and VMware VMs from a script.

Create a single file for your project to describe the type of machine you want, the software that needs to be installed, and the way you want to access the machine. Store this file with your project code.

As you may have guessed, Vagrant uses Chef to configure the VM and install packages on it. Together they let you turn a machine's configuration into a simple script that can easily be shared among all the developers on a project (which solves another headache: heterogeneous environments. Let whoever has never suffered from that cast the first stone).

I decided to get my hands dirty and put together a basic setup that creates an Ubuntu 12.04 VM with NGINX serving a HELLO WORLD page. It worked, and I decided to share the code with the world. Since this text is already getting long, I will give just the step-by-step recipe for running my setup, which I hope can serve as a starting point for others interested in the subject.

Here we go:

  1. Install VirtualBox, available at https://www.virtualbox.org/;

  2. Install Vagrant >= 1.1.X, available at http://vagrantup.com;

  3. Install Ruby 1.9;

  4. Install Chef:

    > gem install chef

  5. Install the vagrant-omnibus plugin. It ensures that the correct version of Chef is installed in the VM:

    > vagrant plugin install vagrant-omnibus

  6. Add an Ubuntu 12.04 box to Vagrant:

    > vagrant box add precise64 http://dl.dropbox.com/u/1537815/precise64.box

    (if you prefer another flavor of Linux, there is a fairly large list of boxes at http://www.vagrantbox.es/)

  7. Clone my code template from GitHub:

    > git clone git://github.com/mgalves/vagrant-chef-nginx.git

  8. Enter the vagrant-chef-nginx/chef_repo directory;

  9. Run the command:

    > vagrant up

  10. Visit http://192.168.33.10/;

  11. Hello World!

11 Apr 22:13

The Feliciano case

The case of the pastor and federal congressman Marco Feliciano (PSC-SP) embodies several dilemmas Brazil is living through, so I think it is worth a long column trying to bring together the various dimensions of the "imbroglio". Let us begin at the beginning, that is, with the theological-metaphysical aspects of the problem. The congressman is accused of holding racist and homophobic positions. But is that true? Let us look at the evidence. From what I have had the chance to read in scattered reports and articles, Feliciano follows the classic line of religious figures who say they have nothing against gays, only against the homosexual act, which they hold to be strongly condemned in the Bible. Read more (11/04/2013 - 03h00)