Shared posts

13 Jun 04:59

Data skipping

Way back in 2006, I wrote about a cool Netezza feature called the zone map, which in essence allows you to do partition elimination even in the absence of strict range partitioning.

Netezza’s substitute for range partitioning is very simple. Netezza features “zone maps,” which note the minimum and maximum of each column value (if such concepts are meaningful) in each extent. This can amount to effective range partitioning over dates; if data is added over time, there’s a good chance that the data in any particular date range is clustered, and a zone map lets you pick out which data falls in the desired date range.

I further wrote

… that seems to be the primary scenario in which zone maps confer a large benefit.

But I now think that part was too pessimistic. For example, in bulk load scenarios, it’s easy to imagine ways in which data can be clustered or skewed. And in such cases, zone maps can let you skip a large fraction of potential I/O.
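To make the mechanism concrete, here is a minimal sketch in C of zone-map-style data skipping over a single integer date column. The block size, names, and sample data are illustrative assumptions, not Netezza's actual on-disk format; the point is only that a per-block min/max lets a range predicate skip whole blocks without reading them.

    #include <stdio.h>
    #include <limits.h>

    #define BLOCK_ROWS 4   /* rows per block; real extents are of course far larger */

    struct zone_entry {
        int min;  /* smallest value in the block */
        int max;  /* largest value in the block  */
    };

    /* Build one min/max entry per block with a single pass over the column. */
    static void build_zone_map(const int *col, int n_blocks, struct zone_entry *zm)
    {
        for (int b = 0; b < n_blocks; b++) {
            zm[b].min = INT_MAX;
            zm[b].max = INT_MIN;
            for (int i = 0; i < BLOCK_ROWS; i++) {
                int v = col[b * BLOCK_ROWS + i];
                if (v < zm[b].min) zm[b].min = v;
                if (v > zm[b].max) zm[b].max = v;
            }
        }
    }

    /* Count rows with lo <= value <= hi, skipping blocks whose [min, max]
     * range cannot possibly overlap the predicate. */
    static int count_in_range(const int *col, int n_blocks,
                              const struct zone_entry *zm, int lo, int hi)
    {
        int count = 0, blocks_scanned = 0;

        for (int b = 0; b < n_blocks; b++) {
            if (zm[b].max < lo || zm[b].min > hi)
                continue;                 /* skipped: no I/O needed for this block */
            blocks_scanned++;             /* in a real system, this is a disk read */
            for (int i = 0; i < BLOCK_ROWS; i++) {
                int v = col[b * BLOCK_ROWS + i];
                if (v >= lo && v <= hi)
                    count++;
            }
        }
        printf("scanned %d of %d blocks\n", blocks_scanned, n_blocks);
        return count;
    }

    int main(void)
    {
        /* Dates (as yyyymmdd integers) loaded roughly in arrival order, so each
         * block covers a narrow date range even without explicit partitioning. */
        int dates[] = {
            20130101, 20130102, 20130102, 20130103,
            20130210, 20130211, 20130211, 20130212,
            20130305, 20130306, 20130307, 20130307,
            20130401, 20130402, 20130402, 20130403,
        };
        struct zone_entry zm[4];

        build_zone_map(dates, 4, zm);
        printf("matches: %d\n",
               count_in_range(dates, 4, zm, 20130301, 20130331));
        return 0;
    }

A query for March 2013 touches only one of the four blocks; the other three are eliminated purely from their min/max entries, which is the fraction of I/O a zone map saves when the data happens to be clustered.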

Over the years I’ve said that other things were reminiscent of Netezza zone maps, e.g. features of Infobright, SenSage, InfiniDB and even Microsoft SQL Server. But truth be told, when I actually use the phrase “zone map”, people usually give me a blank look.

In a recent briefing about BLU, IBM introduced me to a better term — data skipping. I like it and, unless somebody comes up with a good reason not to, I plan to start using it myself. :)

13 Jun 04:55

Where things stand in US government surveillance

Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.

US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?

My views about domestic data collection start:

  • I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
  • Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
  • Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
  • I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
  • Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after the Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.

*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.

As for foreign data:

  • Last I heard, we were collecting at least 10s of petabytes of satellite images per day. That’s probably too much even for the US government to persist in its entirety at this time. In the installation I heard of, most of the satellite data was deleted within 12-48 hours. But it may fit into the yottabyte-scale data center in Utah.
  • I also once heard the US monitors every radio transmission detectable from North Korea.

Beyond that, use your imagination.

The big question is how much domestic or quasi-domestic communications-content data the US government currently captures. I think it’s a lot more than was previously acknowledged. For example:

  • Both Edward Snowden and William Binney have said things that sound like the NSA is comprehensively storing actual communications content. I guess it’s possible that in each case they misspoke.
  • Other claims to that effect have been more ringing. For example:
    • The secret AT&T room/message splitter story dates back to 2007.
    • The FBI itself states that in 2011 it “checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity.”
  • Much of the PRISM project seems to be about access to communication or file contents.
  • The most visible, emphatic denials — e.g. those from President Obama or various tech companies — seem to leave weasel room if one parses them carefully.

And cost is not a barrier. I would guess the order of magnitude* for all email in the US at 10 petabytes/day uncompressed. (100s of billions of messages, 10s of KB per message.) Phone call volumes are probably less. (Fewer than 10 billion calls per day.) The Feds can afford to store that. Hadoop or NoSQL clusters, for example, can be set up for low six figures per petabyte.** HP Vertica will sell anybody an RDBMS cluster (hardware and software) for around $2 million/petabyte.**

*In the most literal high-school-chemistry sense of the phrase.

**Of raw data; particularly compressible data might be managed yet more cheaply.
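To spell out the back-of-envelope arithmetic, here is a tiny sketch in C. The exact message count, message size, and the $200,000/petabyte figure are assumptions of mine, chosen to fall within the ranges stated above:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed figures within the post's stated ranges: a few hundred
         * billion messages per day, a few tens of KB per message, uncompressed. */
        double messages_per_day  = 3e11;
        double bytes_per_message = 3e4;

        double bytes_per_day = messages_per_day * bytes_per_message;   /* ~9e15 */
        double pb_per_day    = bytes_per_day / 1e15;                   /* ~9 PB  */

        /* Hardware cost to hold one day's worth of raw email, using the price
         * points above: "low six figures" per petabyte (taken here as $200K)
         * for a Hadoop or NoSQL cluster, and ~$2 million per petabyte for an
         * HP Vertica cluster. */
        printf("volume: ~%.0f PB/day uncompressed\n", pb_per_day);
        printf("Hadoop/NoSQL: ~$%.1f million per day of data\n", pb_per_day * 0.2);
        printf("HP Vertica:   ~$%.0f million per day of data\n", pb_per_day * 2.0);
        return 0;
    }

That lands on the order of 10 petabytes and single-digit to low-double-digit millions of dollars per day of data, which is the sense in which cost is not a barrier.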

Coverage of all this has of course been intense.

And my views can be summarized much as they were three years ago:

  • It is inevitable* that governments and other constituencies will obtain huge amounts of information, which can be used to drastically restrict everybody’s privacy and freedom.
  • To protect against this grave threat, multiple layers of defense are needed, technical and legal/regulatory/social/political alike.
  • One particular layer is getting insufficient attention, namely restrictions upon the use (as opposed to the acquisition or retention) of data.

*And indeed in many ways even desirable

30 May 10:28

What's the point of SecureZeroMemory?

by Raymond Chen - MSFT
zedware

Is security a burden or an opportunity for programmers?

The SecureZeroMemory function zeroes out memory in a way that the compiler will not optimize out. But what's the point of doing that? Does it really make the application more secure? I mean, sure the data could go into the swap file or hibernation file, but you need to have Administrator access to access those files anyway, and you can't protect yourself against a rogue Administrator. And if the memory got swapped out before it got zeroed, then the values went into the swap file anyway. Others say that it's to prevent other applications from reading my process memory, but they could always have read the memory before I called SecureZeroMemory. So what's the point?

The SecureZeroMemory function doesn't make things secure; it just makes them more secure. The issue is a matter of degree, not absolutes.

If you have a rogue Administrator or another application that is probing your memory, then that rogue operator has to suck out the data during the window of opportunity between the time you generate the sensitive data and the time you zero it out. This is typically not a very long time, so it makes the attacker work harder to get the data. Similarly, the data has to be swapped out during the window between the sensitive data being generated and the data being zeroed. Whereas if you never called SecureZeroMemory, the attacker could take their sweet time looking for the sensitive information, because it'll just hang around until the memory gets re-used for something else.

Furthermore, the disclosure may not be due to a rogue operative, but may be due to your own program! If your program crashes, and you've signed up your program for Windows Error Reporting, then a crash dump file is generated and uploaded to Microsoft so that you can download it and investigate why your program is failing. In preparation for uploading, the crash dump is saved to a file on the user's hard drive, and an attacker may be able to mine that crash dump for sensitive information. Zeroing out memory which contained sensitive information reduces the likelihood that the information will end up captured in a crash dump.

Another place your program may inadvertently reveal sensitive information is in the use of uninitialized buffers. If you have a bug where you do not fully initialize your buffers, then sensitive information may end up leaking into them and then being accidentally transmitted over the network or written to disk. Using the SecureZeroMemory function when finished with sensitive information is a defense-in-depth way of making it harder for sensitive information to go where it's not supposed to.
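To illustrate the usage pattern described above, here is a minimal sketch in C. The buffer and its contents are hypothetical; only the SecureZeroMemory call itself is the actual Win32 API:

    #include <windows.h>
    #include <string.h>

    int main(void)
    {
        /* Hypothetical buffer that briefly holds sensitive data. */
        char password[64];

        /* ... obtain and use the secret (source omitted here) ... */
        strncpy(password, "correct horse battery staple", sizeof(password) - 1);
        password[sizeof(password) - 1] = '\0';

        /* A plain memset() here could legally be optimized away, because the
         * compiler can see the buffer is never read again. SecureZeroMemory
         * is guaranteed not to be elided, so the plaintext is scrubbed as soon
         * as we are done with it, shrinking the window in which a crash dump,
         * a swap-out, or a memory probe could capture it. */
        SecureZeroMemory(password, sizeof(password));

        return 0;
    }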

08 Apr 05:07

McIntyre: Scanning for assembly code in Free Software packages

by jake
On his blog, Steve McIntyre writes about work he has been doing to identify assembly code in Linux packages:

In the Linaro Enterprise Group, my task for the last several weeks was to work through a huge number of packages looking for assembly code. Why? So that we could identify code that would need porting to work well on AArch64, the new 64-bit execution state coming to the ARM world Real Soon Now.

Working with some Ubuntu and Fedora developers, we generated a list of packages included in each distribution that seemed to contain assembly code of some sort. Then I worked through that list, checking to see:

  1. if there was actually any assembly there;
  2. if so, what it was for; and
  3. whether it was actually used.

That work resulted in a report with his findings.

08 Apr 05:06

Mozilla and Samsung building a new browser engine

by corbet
The Mozilla project has announced a collaboration with Samsung to build "Servo", a next-generation browser rendering engine. "Servo is an attempt to rebuild the Web browser from the ground up on modern hardware, rethinking old assumptions along the way. This means addressing the causes of security vulnerabilities while designing a platform that can fully utilize the performance of tomorrow’s massively parallel hardware to enable new and richer experiences on the Web. To those ends, Servo is written in Rust, a new, safe systems language developed by Mozilla along with a growing community of enthusiasts."

08 Apr 05:06

Google's "Blink" rendering engine

by corbet
Google has announced that it is forking the WebKit rendering engine to make a new project called Blink. "Chromium uses a different multi-process architecture than other WebKit-based browsers, and supporting multiple architectures over the years has led to increasing complexity for both the WebKit and Chromium projects. This has slowed down the collective pace of innovation - so today, we are introducing Blink, a new open source rendering engine based on WebKit."