Shared posts

01 Jul 00:59

Recent Image Sensor Theses

by Vladimir Koifman
There are a number of recently published image sensor theses:

University of Trento: "3D Camera Based on Gain-Modulated CMOS Avalanche Photodiodes", PhD Thesis by Olga Shcherbakova, April 2013

Delft University: "A Data Acquisition System Design for a 160x128 Single-photon Image Sensor", MSc Thesis by Sachin S. Chadha, April 2013

Delft University: "4T CMOS Active Pixel Sensors under Ionizing Radiation" PhD Thesis by Jiaming Tan, April 2013

Delft University: "The design of a 16*16 pixels CMOS image sensor with 0.5 e- RMS noise" MSc Thesis by Yao Q., embargoed till July 28, 2013

University of Pretoria: "Dynamic Range and Sensitivity Improvement of Infrared Detectors Using BiCMOS Technology" MSc Thesis by Johan Venter, Feb. 2013

York University: "Analysis and Design of a Wide Dynamic Range Pulse-Frequency Modulation CMOS Image Sensor" PhD Thesis by Tsung-Hsun Tsai, Dec. 2012

Ecole Centrale De Lyon: "Modeling and design of 3D Imager IC" PhD Thesis by Vijayaragavan Viswanathan, Feb. 2013 (Talks about different aspects of 3D stacking, not 3D imaging)
09 Jun 15:50

G2 and the Rolling Ball

by huerta

I just returned from a month at Hong Kong University, visiting James Fullwood, an algebraic geometer who likes to think about the mathematics of string theory. There, I gave a colloquium on G2 and the rolling ball, a paper John Baez and I wrote that is due to appear in Transactions of the AMS. This project began over a decade ago in conversations between John and Jim Dolan, later continued between Jim and me. Though Jim opted not to be a coauthor, his insights were crucial.

I would like to tell you about this paper, but I’ll warm up with a puzzle—one you’ve seen in several guises if you’ve read John Baez’s posts about this paper, but well-worth revisiting.

Puzzle: Roll a ball of unit radius on a fixed ball of radius R, being careful not to let your ball slip or twist as you roll it. Suppose you roll along a great circle from the North Pole to the South Pole and back to the start at the North. How many 360∘ turns did the ball make as it rolled?

One ball rolling on another

Here’s a variant of this puzzle you can work out very concretely: for R=1, roll one coin around another of the same kind, without slipping, and count the number of times the rolling coin turns.

Below the fold, I’ll give you the answer, and I’ll also tell you about something amazing that happens when R=3, bringing in the exceptional Lie group G2 and a funky 8-dimensional number system called the split octonions.


When Cartan and Killing classified the simple Lie groups, they found something unexpected. Besides the perfectly respectable infinite families, SO(n), SU(n) and Sp(n), there were five exceptions. In order of increasing dimension, these exceptional groups are prosaically called G2, F4, E6, E7 and E8. While the first three kinds of groups are all symmetry groups of vector spaces equipped with some kind of bilinear form, the exceptions do not, at first glance, have a reason to exist. I like to imagine Cartan and Killing reacting like physicist I.I. Rabi to the discovery of the muon: “Who ordered that?” Since their discovery, finding simple models for the exceptional Lie groups has been an important program in mathematics. Here, I’ll tell you about one such model for G2, essentially due to Cartan: G2 is almost the symmetry group of one ball rolling without slipping or twisting on another ball, provided the ratio of radii is 1:3 or 3:1.

When Jim Dolan and I started talking about this problem, we set out specifically to explain that funny ratio. We took our cue from Bor and Montgomery, who in their excellent paper G2 and the rolling distribution, write:

Open problem. Find a geometric or dynamical interpretation for the “3” of the 3:1 ratio.

Why only 1:3 or 3:1? You can see a hint of “threeness” in the Dynkin diagram for G2:

G2 Dynkin diagram.

And that’s probably responsible for the 1:3 ratio, but by the end of the post, I’ll have shown you another explanation, far more geometric in nature. The key is going to be to relate the rolling ball picture to another description of G2, also due to Cartan: G2 is the automorphism group of the ‘split octonions’, 𝕆′.

I’ll explain all of these ideas, so don’t worry if you’re unfamiliar with them. But first, an aside to those in the know: throughout this post, I’ll focus on the adjoint, split real form of G2, which is the only form for which all of these ideas make sense, though a lot of them continue to work for the complex form as well, as long as you replace the split octonions with their complexification. This is in keeping with a well-known theme in Lie theory: split real forms and complex forms behave in almost identical fashion.

The rolling ball


Now let’s get down to business, by working out a mathematical framework to describe a ball rolling on another ball, without slipping or twisting. To start off, I’ll let the fixed ball have radius R, and the rolling ball have radius 1. Only later will we see what happens when we set R=3.

First, a configuration of this system corresponds to a point in S2×SO(3), as follows: for (x,g)∈S2×SO(3), Rx is the point of contact where the rolling ball touches the fixed ball, and g tells us the present orientation of the rolling ball, rotating it from your favorite starting orientation. To show you what I mean, I can’t do better than a picture from the paper of Bor and Montgomery (though note that what I call g they call φ, and they let their rolling ball have arbitrary radius r):

A rolling ball configuration.

That’s how we describe configurations of the rolling ball system. What does it mean for the ball to roll without slipping or twisting? To describe that, we will turn to ‘incidence geometry’. This kind of geometry sounds very bare bones: in its simplest incarnation, an incidence geometry has points, lines, and an incidence relation (a point lies on a line, or is incident on the line). Yet this minimal structure will suffice to capture what we mean by rolling without slipping or twisting.

The rolling ball system defines an incidence geometry where:

  • Points are configurations of the rolling ball—elements of S2×SO(3).
  • Lines are given by rolling without slipping or twisting along great circles.

A line, then, is some one-dimensional submanifold of S2×SO(3). Which one-dimensional submanifolds are lines? Well, that’s the point of the puzzle at the start of this post. If you didn’t think about it before, go back and give it a shot now!

Here's the answer: each time we roll a ball of unit radius around a ball of radius R, the rolling ball turns R+1 times! And that's how we roll. For instance, if we begin rolling from the North Pole and sweep out a central angle θ as we do so, we rotate our ball by an angle (R+1)θ about the axis going into the picture:

Rolling with angles marked.
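
Here's one way to see where that R+1 comes from. When the contact point sweeps out a central angle θ, it traces an arc of length Rθ on the fixed ball, and rolling without slipping along that arc spins the unit ball through an angle Rθ. But the rolling ball is also being carried bodily around the fixed ball, and that orbital motion contributes one more θ of rotation as measured in the fixed frame. Adding the two gives (R+1)θ, so a complete circuit (θ=2π) produces R+1 full turns; for R=1 this is just the familiar fact that a coin rolled around an identical coin turns twice.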

Now that we have an incidence geometry, we can talk about its symmetries. These symmetries are the diffeomorphisms that preserve the lines:

Aut(S2×SO(3))={f∈Diff(S2×SO(3)):f takes lines to lines }.

Since the lines depend on R, so does this group. Alas, this group is not G2, even when R=3.

Remember, I said that G2 was almost the symmetry group of the rolling ball, and we have run head first into that “almost”: as shown by Bor and Montgomery, G2 does not act on S2×SO(3). Instead, it acts on its double cover:

p: S2×SU(2) ⟶ S2×SO(3).

And, if we stretch our imaginations a bit, we can think of this as a variant of the rolling ball system, which we call the ‘rolling spinor’.

The rolling spinor…


What's a rolling spinor? It's like a ball, but thanks to the "double" in the double cover SU(2)→SO(3), a 360∘ rotation does not act like the identity. Instead, we need to rotate by 720∘ to get back where we started.

Spinors show up physically and mathematically. Famously, electrons are spinors, but a far more concrete example is given by the belt trick: put a 360∘ twist in a belt, and you cannot undo the twist just by translating the ends around, but for a 720∘ twist, you can!

I can make more sense of this using quaternions. If I identify SU(2) with the group of unit quaternions:

SU(2)={q∈ℍ:∣q∣=1}

and identify ℝ3 with the subspace of imaginary quaternions:

ℝ3=Im(ℍ)={xi+yj+zk:x,y,z∈ℝ}

then it's easy for me to describe the covering map, SU(2)→SO(3). Each unit quaternion q∈SU(2) goes to the rotation of ℝ3 given by conjugation:

x↦qxq−1,x∈ℝ3.

The kernel of SU(2)→SO(3) is {±1}. If I think of +1 as giving a rotation by 0°, and −1 as giving a rotation by 360°, then the idea that a 720° rotation is the identity becomes (−1)2=1, which is indeed true. If this idea sounds funny, or you want an overview of quaternions, read this talk I gave at Fullerton College.
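
To see the covering map in action, take q=cos(θ/2)+sin(θ/2)k. A short computation gives qiq−1=cos(θ)i+sin(θ)j, so conjugation by q rotates i and j through the angle θ about the k-axis. Setting θ=360° gives q=−1, which conjugates every imaginary quaternion back to itself: the spinor has moved to a genuinely different element of SU(2), even though the rotation it induces on ℝ3 is the identity.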

The rolling spinor system comes with an incidence geometry where:

  • Points are elements of S2×SU(2).
  • Lines are preimages of lines in S2×SO(3).

When R=3, G2 acts on S2×SU(2) as symmetries preserving lines. To understand this action, it helps to introduce yet another variant of the rolling ball system: the rolling spinor on a projective plane.

…on a projective plane


This system is double-covered by the rolling spinor:

q: S2×SU(2) ⟶ ℝP2×SU(2),

where the map q identifies antipodal points of S2. We write (±x,q)∈ℝP2×SU(2) for the equivalence class containing (x,q) and (−x,q). There is an incidence geometry where:

  • Points are elements of ℝP2×SU(2).
  • Lines are images of lines in S2×SU(2).

Unlike the original rolling ball system, when R=3, G2 acts as symmetries of this incidence geometry. To see why, it’s time to introduce the split octonions.

Split octonions


The split octonions 𝕆′ are an eight-dimensional real composition algebra. There are two such algebras over the reals, the other being their more famous cousin, the octonions. We can define the split octonions as pairs of quaternions:

𝕆′=ℍ2

with the following multiplication:

(a,b)(c,d)=(ac+d¯b,a¯d+cb).

Here, the barred quaternions a¯ and d¯ denote the conjugate, in ℍ, of a and d, obtained by flipping the sign of the imaginary part, just as we would do for a complex number. With this multiplication, one can check that the quadratic form:

Q(a,b)=∣a∣2−∣b∣2,(a,b)∈ℍ2

makes 𝕆′ into a composition algebra, with the property:

Q(xy)=Q(x)Q(y),x,y∈𝕆′.

Now we can define G2=Aut(𝕆′). And, as promised, this description of G2 can be related to the rolling ball descriptions I have been sketching. But first, it helps to note that we don’t really need all of 𝕆′: since G2 fixes 1∈𝕆′, we really only need the subspace of imaginary split octonions. This is the subspace orthogonal to 1 with respect to Q. In terms of pairs of quaternions, this is the subspace where the first quaternion is purely imaginary while the second can be arbitrary:

𝕀=Im(ℍ)⊕ℍ

G2 acts on the imaginary split octonions as well, and in fact it is the smallest irreducible representation of G2. We're close now. Because G2 preserves Q, G2 acts on the space of 1d null subspaces of 𝕀:

PC={span(x):x∈𝕀−{0},Q(x)=0}.

This space of null lines is almost the configuration space of the spinor rolling on the projective plane, ℝP2×SU(2). Choosing a representative of each line in Im(ℍ)⊕ℍ, we can normalize its first component to have length 1, forcing the second component to also have unit length. Doing this, we see that:

(S2×SU(2)) / ((x,q)∼(−x,−q)) ≅ PC.

This ought to remind you of the spinor rolling on a projective plane, which has configuration space:

ℝP2×SU(2) ≅ (S2×SU(2)) / ((x,q)∼(−x,q)).

In the first case, we mod out both components by their sign, and in the second, we just mod out the first. And in fact, these spaces are diffeomorphic:

ℝP2×SU(2) → PC, (±x,q) ↦ ±(x,xq).
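
It's worth a quick sanity check that this map really lands in PC and doesn't depend on the choices made: for a unit imaginary quaternion x and a unit quaternion q we have Q(x,xq)=∣x∣2−∣xq∣2=1−1=0, so span(x,xq) really is a null line, and replacing (x,q) by (−x,q) only changes the sign of the representative, which spans the same line.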

Note that this is not true for the ordinary rolling ball. We needed to think about a spinor rolling on a projective plane.

Under the diffeomorphism above, lines in the ℝP2×SU(2) incidence geometry become submanifolds of PC. If and only if R=3, these submanifolds 'straighten out': they are given by projectivizing 2-dimensional null subspaces of the imaginary split octonions 𝕀.

However, not all 2d null subspaces of 𝕀 yield lines in PC. When R=3, the incidence geometry of ℝP2×SU(2) coincides with the incidence geometry on PC with:

  • Points are 1d null subspaces of 𝕀.
  • Lines are 2d null subspaces of 𝕀 on which the product also vanishes.

In fact, if we call a subspace of 𝕀 a null subalgebra if the product of any two elements vanishes, we can give a very snappy description of the incidence geometry on PC, thanks to the helpful fact that a 1d subspace span(x) of 𝕀 is null if and only if x2=0:

  • Points are 1d null subalgebras of 𝕀.
  • Lines are 2d null subalgebras of 𝕀.
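
To check the helpful fact from the multiplication rule above: for x=(a,b)∈𝕀, the quaternion a is imaginary, so a¯=−a, and the formula gives x2=(aa+b¯b, a¯b+ab)=(−∣a∣2+∣b∣2, 0)=−Q(x)1. So x2=0 precisely when Q(x)=0, and the 1d null subspaces are exactly the 1d null subalgebras.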

Now we have what we wanted: the geometry of a spinor rolling on a projective plane, when R=3, is the geometry of null subalgebras of the imaginary split octonions, 𝕀. Hence, G2 acts as symmetries of the spinor rolling on the projective plane!

Coda


We’ve seen how a spinor rolling on a projective plane three times as big acquires G2 symmetry, but we still haven’t seen a completely satisfying explanation of that 1:3 ratio. It’s hiding somewhere in the proofs of the theorems I described above.

Yet there is an intuitive way to see the 1:3 ratio. Let’s think of our spinor as a ball that comes back to itself after turning twice, and let’s think of our projective plane as a ball with antipodal points identified. Rolling from a point on our fixed ball to its antipode, we’ve come back to where we’ve started in our projective plane. For what ratio of radii will the rolling spinor also be back where it started? That is, for what ratio of radii does the rolling ball turn an even number of times as it goes from a point to its antipode? At minimum, we need to turn twice. So:

  • For what ratio of radii does the rolling ball turn twice as it goes from a point to its antipode? Or, put another way, for what ratio of radii does the rolling ball turn four times as it rolls once around the fixed ball?

By the solution to our puzzle, a ball of unit radius turns R+1 times as it rolls once around a fixed ball of radius R. To turn four times, we need R=3.

31 May 22:30

LLVM 3.3 Vectorization Improvements

by noreply@blogger.com (Nadav Rotem)
I would like to give a brief update regarding vectorization in LLVM. When LLVM 3.2 was released, it featured a new experimental loop vectorizer that was disabled by default. Since LLVM 3.2 was released, we have continued to work hard on improving vectorization, and we have some news to share. First, the loop vectorizer has new features and is now enabled by default on -O3. Second, we have a new SLP vectorizer. And finally, we have new clang command line flags to control the vectorizers.

Loop Vectorizer

The LLVM Loop Vectorizer has a number of new features that allow it to vectorize even more complex loops with better performance. One area that we focused on is the vectorization "cost model". When LLVM estimates whether a loop may benefit from vectorization, it uses a detailed description of the processor to estimate the cost of various instructions. We improved both the X86 and ARM cost models. Improving the cost models helped the compiler detect which loops benefit from vectorization and improved the performance of many programs. During the analysis of vectorized programs, we also found and optimized many vector code sequences.

Another important improvement to the loop vectorizer is the ability to unroll during vectorization. When the compiler unrolls loops it generates more independent instructions that modern out-of-order processors can execute in parallel. The loop below adds all of the numbers in the array. When compiling this loop, LLVM creates two independent chains of calculations that can be executed in parallel.

int sum_elements(int *A, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += A[i];
  return sum;
}
The innermost loop of the program above is compiled into the X86 assembly sequence below, which processes 8 elements at once, in two parallel chains of computations. The vector registers XMM0 and XMM1 are used to store the partial sum of different parts of the array. This allows the processor to load two values and add two values simultaneously.

LBB0_4:
movdqu 16(%rdi,%rax,4), %xmm2
paddd %xmm2, %xmm1
movdqu (%rdi,%rax,4), %xmm2
paddd %xmm2, %xmm0
addq $8, %rax
cmpq %rax, %rcx
jne LBB0_4

Another important improvement is the support for loops that contain IFs, and the detection of the popular min/max patterns. LLVM is now able to vectorize the code below:

int find_max(int *A, int n) {
  int mx = A[0];
  for (int i = 0; i < n; ++i)
    if (mx < A[i])
      mx = A[i];
  return mx;
}
In the last release, the loop vectorizer was able to vectorize many, but not all, loops that contained floating point arithmetic. Floating point operations are not associative due to the unique rounding rules. This means that the expression (a + b) + c is not always equal to a + (b + c). The compiler flag -ffast-math tells the compiler not to worry about rounding errors and to optimize for speed. One of the new features of the loop vectorizer is the vectorization of floating point calculations when -ffast-math mode is used. Users who decide to use the -ffast-math flag will notice that many more loops get vectorized with the upcoming 3.3 release of LLVM.
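
To make this concrete, here is a minimal sketch (the function name is made up for illustration) of the kind of floating point reduction that the flag unlocks:

float sum_floats(float *A, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i)
    sum += A[i]; // vectorizing this reduction reorders the additions, which changes rounding
  return sum;
}

Compiled with "clang -O3 file.c" this loop should stay scalar, because reassociating the sum could change the result; adding -ffast-math allows the vectorizer to split it into partial sums, just as it does for the integer sum_elements example above.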

SLP Vectorizer

The SLP vectorizer (short for superword-level parallelism) is a new vectorization pass. Unlike the loop vectorizer, which vectorizes consecutive loop iterations, the SLP vectorizer combines similar independent instructions in straight-line code.
The SLP Vectorizer is now available and will be useful for many people.
The SLP Vectorizer can boost the performance of many programs in the LLVM test suite. In one benchmark, "Olden/Power", the SLP Vectorizer boosts the performance of the program by 16%. Here is one small example of a function that the SLP Vectorizer can vectorize.

void foo(int * restrict A, int * restrict B) {
  A[0] = 7+(B[0] * 11);
  A[1] = 6+(B[1] * 12);
  A[2] = 5+(B[2] * 13);
  A[3] = 4+(B[3] * 14);
}
The code above is compiled into the ARMv7s assembly sequence below. Notice that the 4 additions and 4 multiplication operations became a single Multiply-Accumulate instruction "vmla".

_foo:
adr r2, LCPI0_0
adr r3, LCPI0_1
vld1.32 {d18, d19}, [r1]
vld1.64 {d16, d17}, [r3:128]
vld1.64 {d20, d21}, [r2:128]
vmla.i32 q10, q9, q8
vst1.32 {d20, d21}, [r0]
bx lr

Command Line Flags

We've also added new command line flags to clang to control the vectorizers. The loop vectorizer is enabled by default for -O3, and it can be enabled or disabled for other optimization levels using the command line flags:

$ clang ... -fvectorize / -fno-vectorize file.c
The SLP vectorizer is disabled by default, and it can be enabled using the command line flags:

$ clang ... -fslp-vectorize file.c
LLVM has a second basic block vectorization phase which is more compile-time intensive (BB vectorizer). This optimization can be enabled through clang using the command line flag:

$ clang ... -fslp-vectorize-aggressive file.c
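Putting the flags together, a hypothetical invocation (the file name is a placeholder) that enables the loop vectorizer via -O3, relaxes the floating point rules, enables the SLP vectorizer, and emits assembly for inspection would look like:

$ clang -O3 -ffast-math -fslp-vectorize -S file.c -o file.s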
We've made huge progress in improving vectorization during the development of LLVM 3.3. Special thanks to all of the people who contributed to this effort.
03 May 02:05

GraphLab Challenge @ SC13

by Danny Bickson
Just learned from my boss Prof. Carlos Guestrin about the student cluster competition that is part of the SC13 conference. The interesting part is that GraphLab programming is one of the challenges:


•    GraphLab(rador) (http://graphlab.org)
The GraphLab project started in 2009 to develop a new parallel computation abstraction tailored to machine learning. GraphLab scales to graphs with billions of vertices and edges easily, performing orders of magnitude faster than competing systems. GraphLab combines advances in machine learning algorithms, asynchronous distributed graph computation, prioritized scheduling, and graph placement with optimized low-level system design and efficient data structures to achieve unmatched performance and scalability in challenging machine learning tasks.
The GraphLab project consists of a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. The API is built on top of standard cluster and cloud technologies: interprocess communication is accomplished over TCP/IP and MPI is used to launch and manage GraphLab programs. Each GraphLab process is multithreaded to fully utilize the multicore resources available on modern cluster nodes. GraphLab supports reading and writing to both POSIX and HDFS filesystems.
We will keep an eye out for the outcome of this contest...



03 May 02:05

Beer Mapper

by Patrick Durusau

Beer Mapper: An experimental app to find the right beer for you by Nathan Yau.

Beer map

Nathan reviews an app that, armed with a data set of 10,000 beers, attempts to suggest similar beers based on your scoring of beers.

A clever app but I am betting on Lars Marius besting it more often than not!

03 May 02:05

chipotle sweet potato burgers $6.30 recipe / $1.58 serving

by Beth M
STOP. THE. PRESSES.

I'm sure I was getting some pretty weird looks in the lunchroom today because I was audibly moaning between every bite of my Chipotle Sweet Potato Burger and staring at it in awe. Seriously the best thing I've eaten in quite a while (and those Black Bean & Avocado Enchiladas were pretty spectacular... not to mention the Philly Cheesesteaks).

So here's the cool thing about these burgers - the patty itself is vegan and gluten free. I happened to put them on wheat buns and slather some mayo on the bun, but you could easily use veganaise or wrap this puppy up in a gluten free tortilla. Either way, the burger itself is gonna blow your little taste bud brains.

Non-meat burgers come in a lot of textures... This burger is rather soft and moist compared to most because I didn't use any flour or eggs as a binder. Because of that, you want to be careful while handling them so they don't fall apart on you. It's worth it, I promise.

I tested two cooking methods (I knew you'd want to know). I fried two of them up in a skillet with a little oil and I baked the other two after coating them with non-stick spray. Each method had pros and cons. The skillet was tasty, obvi, because it was cooked in oil. But, because it cooked quickly, it was more moist and therefore softer and more likely to fall apart. The baked patties only browned on the under side, were a bit drier, and therefore more solid. Both tasted great. I'll stick to the skillet method because it's faster and, well, oil=yum.

Chipotle Sweet Potato Burgers

Chipotle Sweet Potato Burgers


Total Recipe cost: $6.30
Servings Per Recipe: 4 (large patties or 6 smaller patties)
Cost per serving: $1.58
Prep time: 20 min. Refrigerate Time: 30 min. Cook time: 20 min. Total: 1 hr. 10 min.

INGREDIENTS COST
1 medium (1 lb.) sweet potato $1.00
1/2 cup frozen corn kernels $0.31
1 (15 oz.) can black beans $1.19
1 whole chipotle pepper in adobo sauce $0.33
1/2 cup (divided) cornmeal $0.12
1/4 tsp garlic powder $0.02
1/2 tsp cumin $0.03
1/2 tsp salt $0.03
1/4 bunch cilantro (optional) $0.19
1 Tbsp vegetable oil (for frying, optional) $0.02
4 medium wheat rolls $1.42
1 medium avocado $1.19
1/4 cup mayonnaise (optional) $0.45
TOTAL $6.30

STEP 1: Wash the sweet potato well and then poke it several times with a fork so that steam can escape while it cooks.
To cook it in the microwave: wrap it loosely with a paper towel, set it on a microwave safe plate, and microwave on high for five minutes. Carefully squeeze the potato to make sure it is soft all the way through. If it is not, microwave longer, in one minute increments, until it is soft all the way through (mine took 8 minutes).
To cook it in the oven: heat the oven to 400 degrees and then bake the potato for 45-60 minutes, or until it is soft all the way through (either directly on the oven rack or on a baking sheet).

STEP 2: While the sweet potato is cooling, prepare the rest of the ingredients. Place the frozen corn in a large bowl. Drain and rinse the black beans. Allow as much excess water to drain away as possible, and then add them to the bowl with the corn. Roughly chop the cilantro leaves and then add them to the bowl, along with half of the cornmeal (1/4 cup), garlic powder, cumin, and salt. Take one chipotle pepper out of the can, mince it, and then add it to the bowl along with about one teaspoon of the adobo sauce from the can.

STEP 3: Once the sweet potato is cool enough to handle, scoop out the flesh and add it to the bowl with the rest of the ingredients. Stir everything together and then either use a potato masher or the back of a fork to slightly mash the beans. Cover and chill the mixture for 30 minutes to allow the cornmeal to absorb some of the moisture.

STEP 4: Divide the sweet potato mixture into four (or six) portions and shape each into a patty, approximately 3/4 inch thick. Use the remaining 1/4 cup of cornmeal to coat the outside of the patties. This will provide a nice crunch to the burger.

STEP 5a: To cook the patties in a skillet, heat 1/2 tablespoon of vegetable oil in a heavy bottomed skillet over a medium flame. When the oil is hot (it should look wavy on the surface), add two of the patties. Cook for about 5 minutes on each side or until the patties are golden brown. Add more oil and cook the remaining two patties. Remember to turn the patties carefully as they are quite soft and can fall apart.

STEP 5b: To cook the patties in the oven, preheat the oven to 400 degrees. Line a baking sheet with foil and then coat lightly with non-stick spray. Place the patties on the foil and then spritz the top of the patties lightly with non-stick spray (this will help it crisp up). Bake for 25 minutes, or until the patties are heated through and the bottoms are golden brown.

STEP 6: Spread one tablespoon of mayonnaise on each bun and top with a chipotle sweet potato patty. Slice the avocado and place 1/4 of the slices on top of each burger. Top with extra cilantro leaves, if desired.


Chipotle Sweet Potato Burgers



Step By Step Photos


cook sweet potato
First cook the sweet potato. I like to do this in the microwave because it takes less than ten minutes, as opposed to almost an hour in the oven. Either way, you'll want to prick the skin several times with a fork to allow steam to escape as it cooks.

cooked sweet potato
To cook it in the microwave, loosely wrap the sweet potato in a paper towel and then cook on high for five minutes. Give the potato a gentle squeeze to make sure that it is soft all the way through. If the center is still stiff, microwave more, in one minute increments, until the center is soft. Allow it to cool for 5-10 minutes. (Or you can bake it in a 400 degree oven for 45-60 minutes).

chipotle peppers
While the sweet potato is cooking, you can prepare the rest of the ingredients. This is what the little can of chipotle peppers in adobo sauce looks like. These peppers have a smokey, spicy flavor. You can usually find them in the Hispanic ingredients section of most major grocery stores.

open chipotles
When you open up the can, this is what you'll see. Little fat peppers and lots of sauce. You can freeze the rest of this can, so that you don't have to buy a new one every time you want to use one or two peppers. Just transfer it to a small freezer bag.

chipotle pepper
The peppers are pretty small, but potent. Just take one pepper out of the can and mince it up (remove the stem if it is attached). Also take about one teaspoon of the sauce from the can and add it to the bowl with the other ingredients.

burger ingredients
Once the sweet potato is cool enough to handle, scoop out the flesh and add it to a large bowl along with the corn kernels, drained and rinsed black beans, roughly chopped cilantro, 1/4 cup of cornmeal, garlic powder, cumin, salt, and minced chipotle pepper + sauce.

mash ingredients
Stir everything together and then use either a potato masher or the back of a fork to mash up the beans. Refrigerate this mixture for about 30 minutes so that the cornmeal has time to soak up some of that moisture.

shape patties
The mixture will be very wet, but you should be able to form it into four patties. These patties were pretty large, so you may even want to do five or six patties. Use the remaining 1/4 cup of cornmeal to coat the outside of the patties. This adds a nice texture to the finished burger.

fry burger
You can either fry the burgers in a skillet with a little oil (about 5 minutes on each side over medium heat) or...

baked burgers
Bake them in a 400 degree oven for about 25 minutes. I sprayed the tops with non-stick spray, although the tops didn't brown at all. The patty in the front is flipped over so you can see the browned bottom. Flipping half way through baking may be a good solution, although I haven't tried it.

dressing
While the patties are cooking, you can prepare the buns and toppings. Need help with slicing avocados? Check out this step by step photo tutorial. I used mayonnaise for a little extra moisture, but that's optional. Why am I just now thinking about how awesome sriracha would be on there? Someone please try it for me.

Chipotle Sweet Potato Burger
And then it's time to experience... NIRVANA.
29 Apr 04:38

roasted vegetable french bread pizza $11.71 recipe / $1.95 serving

by Beth M
Ecstasy. Pure ecstasy.

This recipe turned out even better than I had imagined. You know, when you take that first bite and you're not even done chewing before you have to let out a muffled "Woah..." This roasted vegetable french bread pizza with pesto and ricotta made me do that.

The pesto sinks into the bread and creates an almost garlic bread type base, the ricotta gets warm and creamy, and roasted vegetables are, well, they're just heavenly (one of my favorite things on the planet, actually)! I'm going to be making this one over and over and over.

The cost of this recipe got a little out of hand, but there are several ways you can rein it in. First, if you can find a farmers market, CSA, or some other cheap source of vegetables, that will help a lot. Produce prices at my local grocery store were a bit (okay, more like a lot) on the steep side. This might be a good recipe to stash until the peak of summer when produce is abundant and cheap. You can make your own pesto (use this parsley pesto recipe) and that should help. Also, you could make this on some homemade focaccia instead of French Bread, which would cut a big chunk from the price (although the crispy exterior and fluffy interior of French bread is pretty magical). And lastly, this pizza was actually quite filling, so you might get more like 8 servings out of it.

Oh boy, I can't wait until I get hungry again so that I can have more...

Roasted Vegetable French Bread Pizza with Pesto and Ricotta

Roasted Vegetable French Bread Pizza with Pesto and Ricotta


Total Recipe cost: $11.71
Servings Per Recipe: 6-8
Cost per serving: $1.95 (for six servings)
Prep time: 15 min. Cook time: 55 min. Total: 1 hr. 10 min.

INGREDIENTS COST
1 medium red onion $1.25
1 medium eggplant $1.71
1 medium zucchini $1.31
1 medium yellow squash $0.91
1 medium green bell pepper $0.89
4 cloves garlic $0.32
2 Tbsp olive oil $0.32
to taste salt & pepper $0.05
1 tsp dried basil $0.05
1 large loaf soft French bread* $1.69
1/4 cup pesto $0.81
1 cup ricotta cheese $1.40
1 cup shredded mozzarella $1.00
TOTAL $11.71
*This works best on soft "French bread" rather than real baguettes, which are much tougher.

STEP 1: Preheat your oven to 400 degrees. Prepare two baking sheets by lining with foil and coating lightly with non-stick spray. Chop all of the vegetables into one inch pieces and peel the garlic. Place all of the chopped vegetables into a large bowl or onto one of the baking sheets. Drizzle with olive oil, add the basil, and sprinkle lightly with salt and freshly cracked pepper. Toss the vegetables with your hands until they are well coated in oil and seasoning.

STEP 2: Divide the seasoned vegetables between the two baking sheets. Spread them out so that they are in a single layer and not piled on top of one another. Roast the vegetables in the oven for 45 minutes, or until they are shriveled and the edges are lightly browned. Stir the vegetables and rotate the baking sheets after the first 30 minutes of roasting.

STEP 3: While the vegetables are roasting, cut the French bread open like a hoagie roll. Spread 2 tablespoons of pesto over each open half (spread thinly) followed by 1/2 cup of ricotta on each half. The ricotta and pesto will mix slightly as you spread them on; that is okay.

STEP 4: After the vegetables are finished roasting, give them a taste and season with a bit more salt if needed. Transfer the bread back onto one of the baking sheets used for roasting the vegetables. Pile the roasted vegetables onto the bread and then sprinkle the shredded mozzarella cheese on top.

STEP 5: Pop the pizzas back into the oven (keep it set to 400 degrees after roasting the vegetables) and bake for an additional 5-10 minutes, or until the cheese is melted and the bread is crispy. Divide the pizzas into 6 or 8 equal sections and enjoy!

The bread can be divided up into 6 or 8 pieces before topping and baking, if it's easier to manage that way.

Roasted Vegetable French Bread Pizza with Pesto and Ricotta


Step By Step Photos


vegetables
Start preheating your oven to 400 degrees because it will probably be ready by the time you're done chopping your veggies. I used an eggplant, red onion, a few cloves of garlic, zucchini, and yellow squash. Mushrooms would also be awesome, but they were pretty expensive, so I skipped them.

seasoned vegetables
Cut the vegetables into one inch pieces. Either put them all in a big bowl or just put them on a baking sheet and add the olive oil, some salt and pepper, and the dried basil. Toss them with your hands until they're evenly coated with oil and seasoning. The oil won't fully cover all of them because the eggplant tends to soak it up like a sponge, but that's okay.

divide and roast
Once they're seasoned, divide the veggies between two baking sheets so that they're not piled on top of each other. If the vegetables are overcrowded on a baking sheet, the moisture will get trapped and they'll just steam in their own juices rather than roast. I like to lightly coat the foil with non-stick spray since I don't use a lot of olive oil to coat the vegetables. Roast the vegetables in the pre-heated oven for 45 minutes. Stir the vegetables and rotate the pans 30 minutes in.

cut French bread
I actually only made half a loaf because I'm using the other half for another recipe. Open the French bread up so that the two open sides are exposed. You can either leave the length of the French bread intact while you top it and cut it after baking, or cut it into sections now for easier handling.

ricotta pesto
Now comes the good stuff... Store bought pesto can be expensive, but this little jar (8 oz.) was the least expensive one at $3.25. You can usually find pesto by the pasta sauce, Mediterranean ingredients like artichoke hearts and olives, or sometimes in the produce section.

spread pesto
Spread the pesto on first, so that the oil soaks down into the bread. You don't need a lot of pesto because it's SUPER flavorful. I only used 1 tablespoon for each of these pieces, which were half of the total length of the bread.

ricotta
Next, spread on the ricotta. I used 1/4 cup for each of these pieces (again, half of the total length of the loaf), so you'll want 1 cup for the whole loaf. The ricotta and pesto will mix a little as you spread it on, but that's really delicious, so no worries.

roasted vegetables
Now your roasted vegetables are done. They should be shriveled and a little browned on the edges. Give them a taste and add a little more salt if you want.

pile on veggies
Put the French bread back onto one of the baking sheets you used to roast the vegetables and then pile the roasted vegetables on top.

mozzarella
Sprinkle the mozzarella cheese on top. Again, you don't need a lot because you already have the ricotta underneath. I only used a half cup for this half loaf of bread (both pieces). Put the pizza back into the oven (keep it set to 400 degrees after roasting the vegetables) and bake it for 5-10 minutes, or until the cheese is melted and the bread is crispy on the edges.

Roasted Vegetable French Bread Pizza with Pesto and Ricotta
And then O-M-G. ...I almost died from deliciousness overload.
19 Apr 02:54

Kenneth Reitz: Detect and highlight your heart-rate using just a webcam and this Python app

webcam-pulse-detector is a cross-platform Python application that can detect a person’s heart-rate using their computer’s webcam. I could write 1,000 words about it, or just show you this:

webcam-pulse-detector-screenshot

Pretty rad, huh? If you’re wondering how it all works, you’re in luck! There is an entire section of the README dedicated to the topic.

The app depends on Python 2.7+, OpenCV 2.4+, and OpenMDAO 0.5.5+, so it might take some work to get up and running. From the looks of it, it’d be totally worth the effort.

View the project on GitHub.

The post Detect and highlight your heart-rate using just a webcam and this Python app appeared first on The Changelog.

17 Apr 04:04

Uncovering Hidden Social Information Generates Quite a Buzz

by Greg Toth

We are pleased to have community member Greg Toth present this event review. Greg is a consultant and entrepreneur in the Washington DC area. As a consultant, he helps clients design and build large-scale information systems, process and analyze data, and solve business and technical problems. As an entrepreneur, he connects the dots between what’s possible and what’s needed, and brings people together to pursue new business opportunities. Greg is the president of Tricarta Corporation and the CTO of EIC Data Systems, Inc.

The March 2013 meetup of Data Science DC generated quite a buzz!  Well over a hundred data scientists and practitioners gathered in Chevy Chase to hear Prof. Jennifer Golbeck from the Univ. of Maryland give a very interesting – and at times somewhat startling – talk about how hidden information can be uncovered from people’s online social media activities.


Prof. Golbeck develops methods for discovering things about people online.  She opened her talk with a brief example of how bees reveal specific information to their hive’s social network through the characteristics of their “waggle dance.”  The figure eight patterns of the waggle dance convey distance and direction to pollen sources and water to the rest of the hive – which is a large social network.

Facebook Information Sharing

From there the discussion turned to how Facebook’s information sharing defaults have evolved from 2005 through 2010.  In 2005, Facebook’s default settings shared a relatively narrow set of your personal data with friends and other Facebook users.  At this point none of your information was – by default – shared with the entire Internet.

In subsequent years the default settings changed each year, sharing more and more information with a wider and wider audience.  By 2009, several pieces of your information were being shared openly with anyone on the Internet unless you had changed the default settings.  By 2010 the default settings were sharing significant amounts of information with a large swath of other people, including people you don’t even know.

The Facebook sharing information Prof. Golbeck described came from Matt McKeon’s work, which can be found here:  http://mattmckeon.com/facebook-privacy/

This ever-increasing amount of shared information has opened up new avenues for people to find out things about you, and many people may be shocked at what’s possible.  Prof. Golbeck gave a live demonstration of a web site called Take This Lollipop, using her own Facebook account.  I won’t spoil things by telling you what it does, but suffice to say it was quite startling.  If this piques your interest, check out www.takethislollipop.com

Predicting Personality Traits

From there the discussion shifted to a research project intended to determine whether it’s possible to predict people’s personality traits by analyzing what they put on social media.  First, a group of research participants were asked to identify their core personality traits by going through a standardized psychological evaluation.  The Big Five factors that they measured are openness, conscientiousness, extraversion, agreeableness, and neuroticism.

Next the research team gathered information from these people’s Facebook and Twitter accounts, including language features (e.g. words they use in posts), personal information, activities and preferences, internal Facebook stats, and other factors.  Tweets were processed in an application called LIWC, which stands for Linguistic Inquiry and Word Count.  LIWC is a text analysis program that examines a piece of text and the individual words it contains, and computes numeric values for positive and negative emotions as well as several other factors.

The data gathered from Twitter and Facebook was fed into a personality prediction algorithm developed by the research team and implemented using the Weka machine learning toolkit.  Predicted personality trait values from the algorithm were compared to the original Big Five assessment results to evaluate how well the prediction model performed.  Overall, the difference between predicted and measured personality traits was roughly 10 to 12% for Facebook (considered very good) and roughly 12 to 18% for Twitter (not quite as good).  The overall conclusion was that yes, it is possible to predict personality traits by analyzing what people put on social media.

Predicting Political Preferences

The second research project was about computing political preference in Twitter audiences.  Originally this project started with the intention of looking at the Twitter feeds of news media outlets and trying to predict media bias.  However, the topic of media bias in general was deemed too problematic and controversial and they decided instead to focus on predicting the political preferences of the media audiences.

The objective was to come up with a method for computing the political orientation of people who followed popular news media outlets on Twitter.  To do this, the team computed the political preference of about 1 million Twitter users by finding which Congresspeople they followed on Twitter, and looking at the liberal to conservative ratings of those Congresspeople.  A key assumption was that people’s political preferences will, on average, reflect those of the Congresspeople they follow.

From there, the team looked at 20 different Twitter news outlets and identified who followed each one.  The political preferences of each media outlet’s followers were composited together to compute an overall audience political preference factor ranging from heavily conservative to heavily liberal at the two extremes, with moderate ranges in the middle.  The results showed that Fox News had the most conservative audience, NPR Morning Edition had the most liberal audience, and Good Morning America was in the middle with a balanced mix of both conservative and liberal followers.  Further details on the results can be found in the paper here.

Summary & Wrap-up

An awful lot of things about you can be figured out by looking at public information in your social media streams.  Personality traits and political preferences are but two examples.  Sometimes this information can be used for beneficial purposes, such as showing you useful recommendations.  Likewise, a future employer could use this kind of information to form opinions during the hiring process.  People don’t always think about this (or necessarily even realize what’s possible) when they post things to social media.

Overall Prof. Golbeck’s presentation was well received and generated a number of questions and conversations after the talk.  The key takeaway was that “We know who you are and what you are thinking” and that information can be used for a variety of purposes – in most cases without you even being aware.  The situation was summed up pretty well in one of Prof. Golbeck’s opening slides:

I develop methods for discovering things about people online.

I never want anyone to use those methods on me.

– Jennifer Golbeck

For those who want to delve deeper, several resources are available:

Commentary

Overall I found this presentation to be very worthwhile and thought-provoking.  Prof. Golbeck was an engaging speaker who was both informative and entertaining.  She provided a number of useful references, links and papers for delving deeper into the topics covered.  The venue and logistics were great and there were plenty of opportunities for networking and talking with colleagues both before and after the presentation.

The topic of predicting people’s traits and behaviors is very relevant, particularly in the realm of politics.  At least one other Data Science DC meetup held within the last few months focused on how data science was used in the last presidential election and the tremendous impact it had.  That trend is sure to continue, fueled by research like this coupled with the availability of data, more sophisticated tools, and the right kinds of data scientists to connect the dots and put it all together.

If you have the time, I would recommend listening to the audio recording and following along the slide deck.  There were many more interesting details in the talk than what I could cover here.

My personal opinion is that too few people realize the data footprint they leave when using social media.  That footprint has a long memory and can be used for many purposes, including purposes that haven’t even been invented yet.  Many people seem to think that either the data they put on social media is trivial and doesn’t reveal anything, or think that no-one cares and it’s just “personal stuff.”  But as we’ve seen in this talk, people can discover a lot more than you may think.

The post Uncovering Hidden Social Information Generates Quite a Buzz appeared first on Data Community DC.

16 Apr 01:04

Massively Parallel Monte Carlo Simulation using GPU

by adnanboz

Introduction

In my previous blog posts I explained how you can utilize the GPU on your computer to perform massively parallel computation with the help of NVIDIA CUDA and Thrust technologies. In this blog post I'm diving deeper into Thrust usage scenarios with a simple implementation of a Monte Carlo simulation.

My inspiration was the PI prediction sample on the Thrust web site. That sample runs a Monte Carlo simulation with 10K samples on a unit circle to estimate the number PI. You can visit this Wikipedia page if you are interested in how Monte Carlo simulation can be used to approximate PI; it is a solution to the famous Buffon's Needle problem.

I'm taking the original example one step further to show you how to send device variables to functors in Thrust methods, and I'm also using a slightly different problem. There are probably many other ways to implement the same logic, but in this blog post I'm concentrating on this specific implementation.

Background

About Monte Carlo Simulation

Monte Carlo simulation is an approach that solves deterministic problems through a probabilistic analog. That is exactly what we are doing in our example: estimating the area of intersecting disks. Monte Carlo methods are especially useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures.

Our simulation predicts the intersection area of four overlapping unit disks, as seen in the image below (the intersection of disks A, B, C and D). Actually, the problem can also be solved easily with the help of geometry, as explained here. I've calculated the area as 0.31515; the simulation, on the other hand, estimated 0.3149.
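
To spell out the arithmetic behind that geometric value: the intersection area works out to (pi + 3 - 3*sqrt(3)) / 3 ≈ (3.14159 + 3 - 5.19615) / 3 ≈ 0.31515, which is the number the code comment below checks the estimate against.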

About Thrust

Writing code against the CUDA API is very powerful in terms of controlling the hardware, but there are high level libraries like the Thrust C++ template library, which provides many fundamental parallel primitives such as sorting, prefix-sums, reductions and transformations. The best part is that Thrust consists only of header files and is distributed with the CUDA 4.0 installation.

If you are not familiar with terms like GPGPU and Thrust, I suggest you check out the background information in my previous posts.

Setup

The example is a console application written in C++, but you can easily transform it into a DLL to use it from your C# application (see my previous posts).

I used Visual Studio 2010 to create the C++ console application. If you have not already, you need to install the NVIDIA CUDA Toolkit 4.0 and a supported graphics device driver from the same link. The new CUDA Toolkit 4.1 RC1 is also available at the CUDA Zone, but the project files are built on 4.0. Also do not forget to install the Build Customization BUG FIX Update for CUDA Toolkit 4.0 from the same link.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar with native C++ CUDA enabled projects, please follow the steps below to create one:

  • Create a Visual C++ console project in Visual Studio 2010 by selecting Empty project on the wizard,
  • Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox,
  • Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
  • Add a new empty source file ending with .cu.

You can also skip the steps above and download the example solution and project files directly from here.

Implementation

The main application calls thrust::transform_reduce to run the intersection estimation simulation 50 times. transform_reduce performs a reduction on the transformation of the sequence [first, last) according to unary_op: the unary_op is applied to each element of the sequence and the results are then reduced to a single value with binary_op.

The main code is as follows:

#include <iostream>
#include <iomanip>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>

int main(void)
{
  // use 50 independent seeds
  int M = 50;

  // Create some circles on the host
  thrust::host_vector<CIRCLE> dCircles;
  dCircles.push_back(CIRCLE(0.0f, 0.0f));
  dCircles.push_back(CIRCLE(1.0f, 0.0f));
  dCircles.push_back(CIRCLE(1.0f, 1.0f));
  dCircles.push_back(CIRCLE(0.0f, 1.0f));

  // The kernel can not access a host or device vector directly,
  // therefore get the raw device pointer to the circles to pass to the kernel
  thrust::device_vector<CIRCLE> circles = dCircles;
  CIRCLE * circleArray = thrust::raw_pointer_cast(&circles[0]);

  float estimate = thrust::transform_reduce(thrust::counting_iterator<int>(0),
                                            thrust::counting_iterator<int>(M),
                                            estimate_intersection(circleArray, circles.size()),
                                            0.0f,
                                            thrust::plus<float>());
  estimate /= M;

  std::cout << std::setprecision(6);
  // calculated with geometry: (pi + 3 - 3*sqrt(3)) / 3 = 0.31515
  std::cout << "the area is estimated as " << estimate
            << ". It should be 0.31515." << std::endl;
  return 0;
}

The unary_op is where the Monte Carlo simulation logic lives: it is implemented in the estimate_intersection functor. estimate_intersection derives from the thrust::unary_function class and returns the estimated intersection area as a float. Using estimate_intersection in transform_reduce means estimating the intersection area once for every data element provided to transform_reduce. For the data elements we are using two thrust::counting_iterators. This creates a range filled with the sequence 0 to 49 (M = 50 values), without explicitly storing anything in memory. Using a sequence of numbers lets us assign a different thread id to every estimate_intersection call, which is important for generating a distinct seed for the simulation's random number generator. (I mentioned random number generator seeds in my previous posts.)

For the reduction part of transform_reduce we are using the thrust::plus<float>() binary functor, which sums all the results into one number. Finally, we divide the result by 50 to find the average intersection area.

Our goal with this code is to run the simulation on the device (GPU) and retrieve the result back to the host. Therefore any data we are going to use in the simulation must be placed in device memory. That is exactly what happens before we call thrust::transform_reduce: we prepare the properties of all the circles we will try to intersect using the CIRCLE struct defined below.

struct CIRCLE{
   float x,y;
   CIRCLE(float _x, float _y) : x(_x), y(_y){}
} ;

With thrust::host_vector<CIRCLE> dCircles; in the main code, we define a vector object in host memory. Using a Thrust host vector instead of managing the memory ourselves simplifies transferring data directly to the device with the thrust::device_vector<CIRCLE> circles = dCircles; call. As you may know, transferring data between device and host memory in CUDA C is handled with cudaMemcpy, but Thrust overloads the assignment operator, which allows you to copy memory easily.
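
As a small illustrative sketch (this snippet is not part of the project download), here are the two copy styles side by side; the Thrust assignment performs the same host-to-device transfer as the explicit cudaMemcpy call:

#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

struct CIRCLE {
  float x, y;
  CIRCLE(float _x, float _y) : x(_x), y(_y) {}
};

int main()
{
  thrust::host_vector<CIRCLE> dCircles;
  dCircles.push_back(CIRCLE(0.0f, 0.0f));
  dCircles.push_back(CIRCLE(1.0f, 0.0f));

  // CUDA C style: explicit device allocation and copy
  CIRCLE * d_raw;
  cudaMalloc((void**)&d_raw, dCircles.size() * sizeof(CIRCLE));
  cudaMemcpy(d_raw, &dCircles[0], dCircles.size() * sizeof(CIRCLE), cudaMemcpyHostToDevice);

  // Thrust style: the assignment performs the same host-to-device copy
  thrust::device_vector<CIRCLE> circles = dCircles;

  cudaFree(d_raw);
  return 0;
}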

On the next line we access the raw pointer of the circles object with the help of the thrust::raw_pointer_cast method. We do this because the estimate_intersection method can only accept a device pointer to the CIRCLE object array.

Simulation Method

The estimate_intersection unary functor implements the simulation logic. A unary functor is an object which takes one argument through its () operator overload and returns one value. In our case it takes the unique index number generated by the thrust::counting_iterator and returns the area of the intersection as a float. Another important part of the functor is the constructor (seen below), which takes the device pointer to the CIRCLE array and the number of circles it points to.

// requires <thrust/random.h>; hash() is defined in the downloadable source
struct estimate_intersection : public thrust::unary_function<unsigned int, float>
{
  CIRCLE * Circles;
  int CircleCount;

  estimate_intersection(CIRCLE * circles, int circleCount) :
    Circles(circles), CircleCount(circleCount)
  {
  }

  __host__ __device__
  float operator()(unsigned int thread_id)
  {
    float sum = 0;
    unsigned int N = 30000; // samples per thread
    unsigned int seed = hash(thread_id);

    // seed a random number generator
    thrust::default_random_engine rng(seed);
    // create a mapping from random numbers to [0,1)
    thrust::uniform_real_distribution<float> u01(0, 1);

    // take N samples
    for (unsigned int i = 0; i < N; ++i)
    {
      // draw a sample from the unit square
      double x = u01(rng);
      double y = u01(rng);
      bool inside = false;

      // check if the point is inside all circles
      for (int k = 0; k < CircleCount; ++k)
      {
        double dy, dx;
        // the point is inside this circle if its distance
        // to the center is at most the radius (1)
        dx = Circles[k].x - x;
        dy = Circles[k].y - y;
        if ((dx * dx + dy * dy) <= 1)
        {
          inside = true;
        }
        else
        {
          inside = false;
          break;
        }
      }
      if (inside)
        sum += 1.0f;
    }
    // divide by N
    return sum / N;
  }
};

In order to run the code on the device and call it from the host, the () operator overload has to be defined as __host__ __device__. The rest of the code is the Monte Carlo simulation logic as follows:

1) Initiate the Thrust default random number generator

2) Generate 30K random x and y values

3) Loop through all circles and check whether the point (x, y) lies inside each circle by computing the squared distance to its center

4) If the point is inside all of the circles, increase the count of points found

5) Return the count of points found divided by N

That’s it! I hope you enjoy it.

In addition to the code I included here, there are header includes and a hashing algorithm. You can download the code from here.

About the Implementations

The Monte Carlo simulation I provided in this post is an example, and therefore I'm not guaranteeing that it will perform well enough in your particular solution. Also, for clarity, there is almost no exception handling or logging implemented. This is not an API; my goal is to give you a high level idea of how you can utilize the GPU for simulations. Therefore, it is important that you refactor the code for your own use.

Some of the code is taken from the original sample and is under the Apache License v2; the rest is my own code, which is free to use without any restriction or obligation.

Conclusion

Thrust is a powerful library that provides simple ways to accomplish complicated parallel computation tasks. There are many libraries like Thrust which are built on CUDA C. These libraries will save you many engineering hours of parallel algorithm implementation and allow you to concentrate on your real business problem. You can check out the GPU Computing Webinars for presentations in this area.



16 Apr 01:04

How to set up Amazon EC2 Windows GPU instance for NVIDIA CUDA development

by adnanboz

Introduction

The Amazon Elastic Compute Cloud web service provides a very useful platform in the cloud, especially for software developers who don't have access to expensive hardware. Some time ago, as I was looking for a better CUDA-enabled GPU solution than my MacBook Pro, I realized that it was time to switch from a laptop to a desktop. Luckily, Amazon introduced GPU instances a couple of months ago, running on the Windows Server 2008 OS. I've been using the scalable and cost-efficient Amazon EC2 instances for a couple of years without any problem, and now that they provide a platform with two Tesla M2050s to test my CUDA apps, I just want to say: thank you, Amazon.

In this post I want to share my experience of setting up a full NVIDIA CUDA development environment on a Windows EC2 GPU instance. I'll also walk you through a couple of CUDA examples.

If you have been following my previous blog posts but were not able to try them out because you don't have CUDA-capable hardware, you will have a chance to do so after reading this post.

One of the reasons I'm providing this blog post is to use this information in our HPC & GPU Supercomputing Group of South Florida hands-on lab meetups. If you are from the group, you have most probably already received the AMI, so you can skip the setup part.

Background

About Amazon EC2 GPU Instances

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

The GPU instances provide general-purpose graphics processing units (GPUs) with proportionally high CPU and increased network performance for applications benefitting from highly parallelized processing, including HPC, rendering, and media processing applications. The Windows GPU instance is called the Cluster GPU Quadruple Extra Large instance and has:

22 GB of memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla "Fermi" M2050 GPUs, 1690 GB of local instance storage, a 64-bit platform, and 10 Gigabit Ethernet.

GPGPU

Utilizing GPUs to do general-purpose scientific and engineering computing is called GPGPU (general-purpose computing on graphics processing units). You can visit my previous blog posts where I've explained how to use NVIDIA CUDA-capable GPUs to perform massively parallel computations.

Setup

Browse to http://aws.amazon.com/ec2/ and click the link on the top of the page saying “Sign in to the AWS Management Console”.

Please be aware that you will get charged by Amazon for the usage of their services. Therefore, check for running instances and other active resources before leaving the AWS Management Console. Please check out the Amazon pricing web page for more information.

The next couple of paragraphs explain how to create your AWS account and set up your environment. You can skip this section if you already have an account and are familiar with Amazon EC2.

Registering for Amazon AWS

If you already have an Amazon account you can use it to log in; otherwise you can create a new account from the same screen. Once you have logged into the AWS console, it may ask you to sign up for an Amazon S3 account. In that case just follow the links to finish the sign-up. Once it is done, you should receive a confirmation email. Now, log in to your account to finish the registration and go through a phone verification.

Setting up your AWS environment

Log in to your Amazon AWS account and the AWS Management Console will show up. Select the EC2 tab at the top to see your EC2 dashboard. We will create a security group and a key pair for later use.

First, click the Key Pairs link on the right and then click the Create Key Pair button. Enter a name for your private key file, like My_KeyPair, and then save the .pem file somewhere to use later. You will also see the new key pair on the screen.

Go back to the EC2 dashboard and click the Security Groups link on the right. This will open the security group console. Click the Create Security Group button and create a group named GPGPU_SecurityGroup. Select the Inbound tab for the new group and the rule editor will open. Add an RDP rule by selecting RDP from the rules drop-down and clicking the Add Rule button. Now click the Apply Rule Changes button to save the changes.

Creating the GPU EC2 Instance

  1. Go to the EC2 dashboard and click the Launch Instance button.
  2. Select the Launch Classic Wizard and click Continue.
  3. Find the Microsoft Windows 2008 R2 64-bit for Cluster Instances (AMI Id: ami-c7d81aae) in the list and click the Select button right next to it.
  4. Select Cluster GPU (cg1.4xlarge, 22GB) from the Instance Type drop-down and click Continue. If you have other instances and you are planning to transfer data between them, I suggest selecting the same region for all of them to prevent in-cloud data transfer charges.
  5. Select Continue on the Advanced Instance Options page.
  6. Give a name to your instance. e.g. GPGPU.
  7. Select the Key Pair you have created and click the continue button.
  8. Select the Security Group you have created and click the continue button.
  9. Click the launch button to finish the wizard.

Running the GPU EC2 Instance

You can click the Instances link on the left-hand navigation menu to see the instance you've just created. The instance will be in the pending state for a while until it boots up completely.
Right-click on the newly created instance and select Get Windows Password. You may have to come back after a couple of minutes if the password generation is pending.
Paste the content of the .pem file you saved while creating the key pair into the Private Key field on the password retrieval dialog and click the Decrypt Password button.
Copy the decrypted password to use later when logging into the instance.

Connecting to the Instance using RDP

In order to connect to the newly created instance:

  1. Right click on it and select Connect.
  2. Click “Download shortcut file” link and save the RDP shortcut to your local machine.
  3. Open the saved RDP shortcut and log on to the instance by entering the retrieved password.
  4. Change your randomly generated password from the Control Panel / User Accounts section.

Installing GPGPU Developer Tools

Go to the CUDA Downloads website to see the available downloads. At this time we will download the 4.1 RC2 version from the CUDA Toolkit 4.1 web site.
Download and install the following items in this order:

  1. Visual Studio C++ 2010 Express.
  2. CUDA Toolkit.
  3. GPU Computing SDK.
  4. Developer Drivers for WinVista and Win7 (285.86). The default drivers that come with the instance need to be replaced with these developer drivers.
  5. (Optional) Parallel Nsight 2.1RC2. In order to download this you have to sign up for the Parallel Nsight Registered Developer Program.

Backup the GPU EC2 instance

You will keep being charged for the storage of any instance that is not terminated, even one in the stopped state. Therefore, it is good practice to back up to S3 and terminate your instance once you are done with testing, to prevent any charges during downtime. You can do this in two ways: you can detach the EBS volume (storage) and terminate the instance, or you can take a snapshot and then delete the instance and volume. As of today the EBS volume costs $0.10 per GB-month and the snapshot costs $0.14 per GB-month. You can visit the Amazon EC2 pricing web site for more up-to-date pricing.

Please follow the steps below for a snapshot backup:

  1. Click the volumes link on the navigation bar on the left hand side. You will see the volume ( storage ) attached to your EC2 instance.
  2. Right click on the volume and select Create Snapshot.
  3. Provide a name for the new snapshot and click the Yes, Create button.
  4. Go to the Snapshots section from the navigation menu and click refresh. You should see the new snapshot in pending mode. It will take a while to create the snapshot.

Running CUDA Samples

Now you are ready to compile and run a CUDA sample from the GPU Computing SDK. Please follow these steps:

  1. Login to the instance using the RDP shortcut.
  2. The samples require cutil32d.lib in order to function, so you need to compile the cutil project first. To do that, browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\common folder and open the cutil_vs2010.sln Visual Studio solution file. Compile the solution.
  3. It is convenient to have syntax highlighting on .cu files. Therefore go to C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8 folder and follow the instructions in the readme.txt file.
  4. Our first example is deviceQuery, which shows the properties of your GPU. Browse to the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src\deviceQuery folder and open deviceQuery_vs2010.sln. Compile the solution.
  5. The output executable will be placed in the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win32\Debug folder. Open an administrative command prompt and run deviceQuery.exe.
  6. You should see two Tesla M2050 devices, each with compute capability 2.0, 448 CUDA cores, 3 GB of memory, 515 GFLOPS, and 148 GB/s memory bandwidth; a minimal sketch of querying these properties programmatically follows this list. This feels like 400 hp under the hood!
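If you would rather query these properties from your own code than run the SDK sample, a minimal sketch using the CUDA runtime API looks roughly like this (the selection of fields and the formatting are mine, not the deviceQuery sample's exact output):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA device(s)\n", deviceCount);

    for (int i = 0; i < deviceCount; ++i)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);

        // A few of the fields the deviceQuery sample reports.
        printf("Device %d: %s\n", i, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %.0f MB\n", prop.totalGlobalMem / (1024.0 * 1024.0));
        printf("  GPU clock rate:     %.0f MHz\n", prop.clockRate / 1000.0);
    }
    return 0;
}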

Let’s run one more sample to see the performance difference of our GPUs. The sample we are going to run is matrixMul, located under the same C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\src root folder. On Tesla M2050 this sample will multiply a 640 x 640 matrix with a 640 x 960 matrix to generate a 640 x 960 matrix.

Open the solution, go to the project properties and add the C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\shared\inc path to the Include Directories under the VC++ Directories configuration properties. (I've noticed that otherwise this path cannot be found.)

Compile and run the project in a command window. You should see about 0.001 sec for the CUBLAS kernel execution and 0.021 sec for the plain CUDA kernel execution. CUBLAS is CUDA's Basic Linear Algebra Subroutines library with optimized algorithms.
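For readers curious what the CUBLAS path boils down to, here is a rough, self-contained sketch of a single-precision matrix multiply using the cuBLAS v2 API. This is my own illustration rather than the SDK sample's code; the matrix size and fill values are arbitrary, error checking is omitted, and note that cuBLAS expects column-major storage:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 640;                 // illustrative square matrix size
    const size_t bytes = (size_t)n * n * sizeof(float);

    // Host matrices (column-major for cuBLAS), filled with simple values.
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}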

Let's compare the GPU with the Intel Xeon 2.93 GHz CPU of the current instance. In order to do this we need to modify the code a little:

  1. Open the matrixMul.cu file.
  2. Add the #include <time.h> at line 41, under the kernel include.
  3. Find the line with computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB); (around line 417) and replace it with the following code.
    clock_t startTime, endTime;
    // clock() ticks are milliseconds on Windows (CLK_TCK == 1000), so the
    // difference scaled by CLK_TCK and divided by 1,000,000 gives seconds
    startTime = clock() * CLK_TCK;
    computeGold(reference, h_A, h_B, uiHA, uiWA, uiWB);
    endTime = clock() * CLK_TCK;
    shrLogEx(LOGBOTH | MASTER, 0, "> Host matrixMul Time = %.5f s\n",
             (double)(endTime - startTime) / 1000000.0);

  4. Compile the code and execute it. You should see something around 3.463 sec. This means that the CUBLAS GPU version is about 3500x faster than the single-core CPU version. A fair comparison with all cores utilized can be found on the CUBLAS web site, which reports about a 6-17x speedup.

Conclusion

GPGPU has been on the rise for the last couple of years, and now that Amazon provides a Windows GPU instance, it is much easier to jump onto the massively parallel software track as a Windows developer.


16 Apr 00:47

Compressive System Identification (CSI): Theory and Applications of Exploiting Sparsity in the Analysis of High-Dimensional Dynamical Systems

by Igor
Borhan Sanandaji sent me the following (a compilation of two emails since I did not have time to show the first one a month ago):


Hi Igor,
I hope everything goes well with you. I would like to bring your and Nuit Blanche Reader's attention to my PhD Thesis and Defense Presentation Slides which can be found on mywebpage. In my PhD thesis ``Compressive System Identification (CSI): Theory and Applications of Exploiting Sparsity in the Analysis of High-Dimensional Dynamical Systems,'' I tried to combine the tools and ideas in Compressive Sensing and Sparse Signal Processing with Control Theory when dynamical systems are involved and in applications such as observability of linear systems whose initial state is sparse, identification of systems with sparse representations (impulse response, ARX, etc.), and topology identification of large-scale interconnected dynamical systems with sparse flow..... I [also] would like to bring your attention to a couple of our papers which both have a review/tutorial flavor. The first paper A Review of Sufficient Conditions for Structure Identification in Interconnected Systems reviews some of the recent results on Compressive Topology Identification (CTI) of large-scale but sparse-flow interconnected dynamical systems with a particular focus on sufficient recovery conditions. In the second paper A Tutorial on Recovery Conditions for Compressive System Identification of Sparse Channels we review some of the recent results concerning Compressive System Identification (CSI) (identification from few measurements) of sparse channels (and in general, Finite Impulse Response (FIR) systems) when it is known a priori that the impulse response of the system under study is sparse (high-dimensional but with few nonzero entries) in an appropriate basis. As a quick note, our Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications is now published in the IEEE TSP (See also the companion technical report).  Once again, I would like to thank you for providing the Nuit Blanche which has been a useful source on CS over my PhD studies.

Thanks, Borhan

Thanks Borhan !


Here is Borhan's thesis: Compressive System Identification (CSI): Theory and Applications of Exploiting Sparsity in the Analysis of High-Dimensional Dynamical Systems. The abstract reads:
The information content of many phenomena of practical interest is often much less than what is suggested by their actual size. As an inspiring example, one active research area in biology is to understand the relations between the genes. While the number of genes in a so-called gene network can be large, the number of contributing genes to each given gene in the network is usually small compared to the size of the network. In other words, the behavior of each gene can be expressed as a sparse combination of other genes.
The purpose of this thesis is to develop new theory and algorithms for exploiting this type of simplicity in the analysis of high-dimensional dynamical systems with a particular focus on system identification and estimation. In particular, we consider systems with a high-dimensional but sparse impulse response, large-scale interconnected dynamical systems when the associated graph has a sparse flow, linear time-varying systems with few piecewise-constant parameter changes, and systems with a high-dimensional but sparse initial state. We categorize all of these problems under the common theme of Compressive System Identification (CSI) in which one aims at identifying some facts (e.g., the impulse response of the system, the underlying topology of the interconnected graph, or the initial state of the system) about the system under study from the smallest possible number of observations.
Our work is inspired by the field of Compressive Sensing (CS) which is a recent paradigm in signal processing for sparse signal recovery. The CS recovery problem states that a sparse signal can be recovered from a small number of random linear measurements. Compared to the standard CS setup, however, we deal with structured sparse signals (e.g., block-sparse signals) and structured measurement matrices (e.g., Toeplitz matrices) where the structure is implied by the system under study.

A Review of Sufficient Conditions for Structure Identification in Interconnected Systems by Borhan Sanandaji, Tyrone L. Vincent, Michael B. Wakin
Abstract: Structure identification of large-scale but sparse-flow interconnected dynamical systems from limited data has recently gained much attention in the control and signal processing communities. This paper reviews some of the recent results on Compressive Topology Identification (CTI) of such systems with a particular focus on sufficient recovery conditions. We list and discuss the key elements that influence the recovery performance of CTI, namely, the network topology, the number of measurements, and the input sequence. In regards to the last element, we analyze the recovery conditions with respect to an experiment design.

Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications by Borhan Sanandaji, Tyrone L. Vincent, and Michael B. Wakin
We derive Concentration of Measure (CoM) inequalities for randomized Toeplitz matrices. These inequalities show that the norm of a high-dimensional signal mapped by a Toeplitz matrix to a low-dimensional space concentrates around its mean with a tail probability bound that decays exponentially in the dimension of the range space divided by a quantity which is a function of the signal. For the class of sparse signals, the introduced quantity is bounded by the sparsity level of the signal. However, we observe that this bound is highly pessimistic for most sparse signals and we show that if a random distribution is imposed on the non-zero entries of the signal, the typical value of the quantity is bounded by a term that scales logarithmically in the ambient dimension. As an application of the CoM inequalities, we consider Compressive Binary Detection (CBD).

Why is this work important?

As I mentioned earlier, there is an extreme paucity of tools for blind deconvolution of biochemical networks. Making sense of these networks is clearly key to solving many problems in medical research and potentially synthetic biology. In short, beyond devising the right sensors such as MRI, CT scanners and others, performing these inverse problems, i.e. network identification on graphs, is clearly a tremendous tool for potentially curing living things [1,2] and building a better future.

References:
[1] Reverse Engineering Biochemical Networks and Compressive Sensing, It's quite simply, the stuff of Life...
[2] Instances of Null Spaces: Can Compressive Sensing Help Study Non Steady State Metabolic Networks ?.



12 Apr 01:56

black bean & avocado enchiladas $5.85 recipe / $1.46 serving

by Beth M
Okay, I have to take a little break in the Shredded Beef programming to bring you this absolutely scrumptious vegan delight!

I've been eating this for lunch all week and the fact that I only have one serving left almost makes me want to cry... I might have to make a second batch PRONTO. Yes, they're that good.

What makes these enchiladas so amazing? Not only are they filled with super fresh flavors, but they're smothered in a homemade enchilada sauce. Now, if you've made my homemade enchilada sauce before you know that it takes less than ten minutes to make and is 100 x better than the canned stuff. Well, this time I went one step further and added a little cocoa powder to the enchilada sauce. OMG. You won't believe how rich and amazing the cocoa makes the sauce! It doesn't taste like chocolate, just extra rich and good. It's the same amazing chili/chocolate combo that works magic in the Aztec Cocoa. SO. GOOD.

Black beans give these enchiladas filling power and avocado offers that creaminess that you'd usually get from cheese. If you're vegan, just make sure your tortillas are vegan (some use lard, some don't). Or, try making your own!

And yes, I know that enchiladas are supposed to be wrapped in corn tortillas, not flour. Use whichever type you like. Delicious is more important than correct, in my book.

Black Bean & Avocado Enchiladas


Total Recipe cost: $5.85
Servings Per Recipe: 4 (2 enchiladas each)
Cost per serving: $1.46
Prep time: 20 min. Cook time: 30 min. Total: 50 min.

SAUCE INGREDIENTS COST
2 Tbsp vegetable oil $0.10
2 Tbsp all-purpose flour $0.02
2 Tbsp chili powder $0.30
2 cups water $0.00
3 oz. tomato paste (1/2 6 oz. can) $0.27
1/2 tsp cumin $0.03
1/2 tsp garlic powder $0.03
1/4 tsp cayenne pepper $0.02
2 tsp unsweetened cocoa powder $0.03
1 tsp salt $0.05
FILLING INGREDIENTS COST
1 (15 oz.) can black beans $1.27
1 medium avocado $1.19
1 small tomato $0.37
1-2 whole green onions $0.08
1/2 cup frozen corn kernels $0.26
handful cilantro $0.19
1/4 tsp garlic powder $0.02
1/2 tsp salt $0.03
8 small tortillas (fajita size) $1.59
TOTAL $5.85

STEP 1: First prepare the enchilada sauce. In a medium sauce pot combine the vegetable oil, flour, and chili powder. Heat the mixture over a medium flame until it begins to bubble. Whisk and cook the bubbling paste for 1-2 minutes. Slowly pour in the water while whisking. Add the tomato paste, cumin, garlic powder, cayenne pepper, cocoa powder, and salt. Whisk until smooth and continue to heat over a medium flame. Let the sauce come up to a gentle simmer, at which point the sauce will thicken. Once thickened, turn off the heat and set the sauce aside until you're ready to use it.

STEP 2: Preheat the oven to 350 degrees. Drain and rinse the can of beans, then add them to a large bowl. Cube the avocado, dice the tomato, slice the green onion, and pull a handful of cilantro leaves from their stems. Add them all to the bowl, along with the frozen corn kernels. Stir everything together. Season with a little salt and garlic powder (recommended amounts above, or use fresh minced garlic).

STEP 3: Coat an 8x8 casserole dish with non-stick spray. Warm the tortillas briefly in the microwave to make them soft and pliable. Fill each tortilla with about 1/3 cup of filling and roll tightly. Place the filled tortillas in the casserole dish, seam side down. Once all of the filled enchiladas are in the dish, pour the enchilada sauce over top.

STEP 4: Bake the enchiladas in the preheated oven for about 25 minutes, or until they're heated through and the sauce is bubbly along the edges. Then devour!


Black Bean & Avocado Enchiladas

Step By Step Photos


enchilada sauce
Start with the amazing sauce, so that it will be ready to go on the enchiladas later. Add the vegetable oil, flour, and chili powder to a medium sauce pot. Turn the heat on to medium and begin to whisk them together.

enchilada roux
The mixture will form a paste and will begin to bubble along the edges. Keep whisking and cooking this paste for 1-2 minutes.

enchilada sauce
Whisk in the water first, then add the tomato paste, cumin, garlic powder, cayenne pepper, salt, and cocoa powder. Whisk it all in until smooth, and continue to cook over medium until it comes up to a gentle simmer. Right when it starts simmering it will begin to thicken up. Once it's thickened, just turn off the heat and set it aside until you're ready to use it.

chili cocoa
The magic mix! Take note, the "chili powder" used in this recipe is a mild blend of chiles and other spices. It is NOT the same as cayenne pepper. If you use 2 Tbsp of cayenne pepper, you'll burn a hole through your mouth upon tasting it.

filling
Now on to the filling... Drain and rinse the beans, then add them to a bowl. Dice the avocado and tomato, slice the green onion, and add them all to the bowl. Pull a handful of cilantro leaves from the stems and add it to the bowl, along with the frozen corn kernels. Stir it up and then season with a little salt and garlic powder (or you can add a clove of minced, fresh garlic).

fill enchiladas
I had about 3 cups of filling and 8 tortillas, so I used about 1/3 cup of filling per enchilada.

leftover filling
I had a little filling left over, but NO problem getting rid of it (it went straight into my mouth).

enchiladas in dish
Roll the enchiladas tightly and then arrange them in your dish. Oh, and coat the dish with non-stick spray if you want the enchiladas to come out in one piece.

enchilada sauce
Then pour that amazing sauce over top! Bake the enchiladas in a preheated 350 degree oven for about 25 minutes, or until they're heated through and the sauce is bubbling along the sides.

Black Bean & Avocado Enchiladas
It's sad because the pictures don't do them justice. They're SO GOOD.
09 Apr 23:30

The Long Post of the Month (Part I)

by Igor
For some odd reason, both the search feature on arXiv and my robots did not fare too well in the past few months, so here is part 1 of a list of arXiv papers of note. Enjoy!


The Round Complexity of Small Set Intersection
David P. Woodruff, Grigory Yaroslavtsev
(Submitted on 5 Apr 2013)
The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions from set disjointness are a canonical way of proving lower bounds in data stream algorithms, data structures, and distributed computation. In these applications, often the set sizes $|S|$ and $|T|$ are bounded by a value $k$ which is much smaller than $n$. This is referred to as small set disjointness. A major restriction in the above applications is the number of rounds that the protocol can make, which, e.g., translates to the number of passes in streaming applications. A fundamental question is thus in understanding the round complexity of the small set disjointness problem. For an essentially equivalent problem, called OR-Equality, Brody et al. showed that with $r$ rounds of communication, the randomized communication complexity is $\Omega(k \ilog^r k)$, where $\ilog^r k$ denotes the $r$-th iterated logarithm function. Unfortunately their result requires the error probability of the protocol to be $1/k^{\Theta(1)}$. Since naïve amplification of the success probability of a protocol from constant to $1-1/k^{\Theta(1)}$ blows up the communication by a $\Theta(\log k)$ factor, this destroys their improvements over the well-known lower bound of $\Omega(k)$ which holds for any number of rounds. They pose it as an open question to achieve the same $\Omega(k \ilog^r k)$ lower bound for protocols with constant error probability. We answer this open question by showing that the $r$-round randomized communication complexity of ${\sf OREQ}_{n,k}$, and thus also of small set disjointness, with {\it constant error probability} is $\Omega(k \ilog^r k)$, asymptotically matching known upper bounds for ${\sf OREQ}_{n,k}$ and small set disjointness.

The H-band Emitting Region of the Luminous Blue Variable P Cygni: Spectrophotometry and Interferometry of the Wind
N. D. Richardson, G. H. Schaefer, D. R. Gies, O. Chesneau, J. D. Monnier, F. Baron, X. Che, J. R. Parks, R. A. Matson, Y. Touhami, D. P. Clemens, E. J. Aldoretta, N. D. Morrison, T. A. ten Brummelaar, H. A. McAlister, S. Kraus, S. T. Ridgway, J. Sturmann, L. Sturmann, B. Taylor, N. H. Turner, C. D. Farrington, P. J. Goldfinger
(Submitted on 4 Apr 2013)
We present the first high angular resolution observations in the near-infrared H-band (1.6 microns) of the Luminous Blue Variable star P Cygni. We obtained six-telescope interferometric observations with the CHARA Array and the MIRC beam combiner. These show that the spatial flux distribution is larger than expected for the stellar photosphere. A two component model for the star (uniform disk) plus a halo (two-dimensional Gaussian) yields an excellent fit of the observations, and we suggest that the halo corresponds to flux emitted from the base of the stellar wind. This wind component contributes about 45% of the H-band flux and has an angular FWHM = 0.96 mas, compared to the predicted stellar diameter of 0.41 mas. We show several images reconstructed from the interferometric visibilities and closure phases, and they indicate a generally spherical geometry for the wind. We also obtained near-infrared spectrophotometry of P Cygni from which we derive the flux excess compared to a purely photospheric spectral energy distribution. The H-band flux excess matches that from the wind flux fraction derived from the two component fits to the interferometry. We find evidence of significant near-infrared flux variability over the period from 2006 to 2010 that appears similar to the variations in the H-alpha emission flux from the wind. Future interferometric observations may be capable of recording the spatial variations associated with temporal changes in the wind structure.


Regularly random duality
Mihailo Stojnic
(Submitted on 29 Mar 2013)
In this paper we look at a class of random optimization problems. We discuss ways that can help determine typical behavior of their solutions. When the dimensions of the optimization problems are large such an information often can be obtained without actually solving the original problems. Moreover, we also discover that fairly often one can actually determine many quantities of interest (such as, for example, the typical optimal values of the objective functions) completely analytically. We present a few general ideas and emphasize that the range of applications is enormous.



A framework to characterize performance of LASSO algorithms
Mihailo Stojnic
(Submitted on 29 Mar 2013)
In this paper we consider solving \emph{noisy} under-determined systems of linear equations with sparse solutions. A noiseless equivalent attracted enormous attention in recent years, above all, due to work of \cite{CRT,CanRomTao06,DonohoPol} where it was shown in a statistical and large dimensional context that a sparse unknown vector (of sparsity proportional to the length of the vector) can be recovered from an under-determined system via a simple polynomial $\ell_1$-optimization algorithm. \cite{CanRomTao06} further established that even when the equations are \emph{noisy}, one can, through an SOCP noisy equivalent of $\ell_1$, obtain an approximate solution that is (in an $\ell_2$-norm sense) no further than a constant times the noise from the sparse unknown vector. In our recent works \cite{StojnicCSetam09,StojnicUpper10}, we created a powerful mechanism that helped us characterize exactly the performance of $\ell_1$ optimization in the noiseless case (as shown in \cite{StojnicEquiv10} and as it must be if the axioms of mathematics are well set, the results of \cite{StojnicCSetam09,StojnicUpper10} are in an absolute agreement with the corresponding exact ones from \cite{DonohoPol}). In this paper we design a mechanism, as powerful as those from \cite{StojnicCSetam09,StojnicUpper10}, that can handle the analysis of a LASSO type of algorithm (and many others) that can be (or typically are) used for "solving" noisy under-determined systems. Using the mechanism we then, in a statistical context, compute the exact worst-case $\ell_2$ norm distance between the unknown sparse vector and the approximate one obtained through such a LASSO. The obtained results match the corresponding exact ones obtained in \cite{BayMon10,DonMalMon10}. Moreover, as a by-product of our analysis framework we recognize existence of an SOCP type of algorithm that achieves the same performance.


Upper-bounding $\ell_1$-optimization weak thresholds
Mihailo Stojnic
(Submitted on 29 Mar 2013)
In our recent work \cite{StojnicCSetam09} we considered solving under-determined systems of linear equations with sparse solutions. In a large dimensional and statistical context we proved that if the number of equations in the system is proportional to the length of the unknown vector then there is a sparsity (number of non-zero elements of the unknown vector) also proportional to the length of the unknown vector such that a polynomial $\ell_1$-optimization technique succeeds in solving the system. We provided lower bounds on the proportionality constants that are in a solid numerical agreement with what one can observe through numerical experiments. Here we create a mechanism that can be used to derive the upper bounds on the proportionality constants. Moreover, the upper bounds obtained through such a mechanism match the lower bounds from \cite{StojnicCSetam09} and ultimately make the latter ones optimal.




A rigorous geometry-probability equivalence in characterization of $\ell_1$-optimization
Mihailo Stojnic
(Submitted on 29 Mar 2013)
In this paper we consider under-determined systems of linear equations that have sparse solutions. This subject attracted enormous amount of interest in recent years primarily due to influential works \cite{CRT,DonohoPol}. In a statistical context it was rigorously established for the first time in \cite{CRT,DonohoPol} that if the number of equations is smaller than but still linearly proportional to the number of unknowns then a sparse vector of sparsity also linearly proportional to the number of unknowns can be recovered through a polynomial $\ell_1$-optimization algorithm (of course, this assuming that such a sparse solution vector exists). Moreover, the geometric approach of \cite{DonohoPol} produced the exact values for the proportionalities in question. In our recent work \cite{StojnicCSetam09} we introduced an alternative statistical approach that produced attainable values of the proportionalities. Those happened to be in an excellent numerical agreement with the ones of \cite{DonohoPol}. In this paper we give a rigorous analytical confirmation that the results of \cite{StojnicCSetam09} indeed match those from \cite{DonohoPol}.




Convergence of a data-driven time-frequency analysis method
Thomas Y. Hou, Zuoqiang Shi, Peyman Tavallali
(Submitted on 28 Mar 2013)
In a recent paper, Hou and Shi introduced a new adaptive data analysis method to analyze nonlinear and non-stationary data. The main idea is to look for the sparsest representation of multiscale data within the largest possible dictionary consisting of intrinsic mode functions of the form $\{a(t) \cos(\theta(t))\}$, where $a \in V(\theta)$, $V(\theta)$ consists of the functions smoother than $\cos(\theta(t))$ and $\theta'\ge 0$. This problem was formulated as a nonlinear $L^0$ optimization problem and an iterative nonlinear matching pursuit method was proposed to solve this nonlinear optimization problem. In this paper, we prove the convergence of this nonlinear matching pursuit method under some sparsity assumption on the signal. We consider both well-resolved and sparse sampled signals. In the case without noise, we prove that our method gives exact recovery of the original signal.




Sparse approximation and recovery by greedy algorithms in Banach spaces
Vladimir Temlyakov
(Submitted on 27 Mar 2013)
We study sparse approximation by greedy algorithms. We prove the Lebesgue-type inequalities for the Weak Chebyshev Greedy Algorithm (WCGA), a generalization of the Weak Orthogonal Matching Pursuit to the case of a Banach space. The main novelty of these results is a Banach space setting instead of a Hilbert space setting. The results are proved for redundant dictionaries satisfying certain conditions. Then we apply these general results to the case of bases. In particular, we prove that the WCGA provides almost optimal sparse approximation for the trigonometric system in $L_p$, $2\le p<\infty$.






Sketching Sparse Matrices
Gautam Dasarathy, Parikshit Shah, Badri Narayan Bhaskar, Robert Nowak
(Submitted on 26 Mar 2013)
This paper considers the problem of recovering an unknown sparse p\times p matrix X from an m\times m matrix Y=AXB^T, where A and B are known m \times p matrices with m << p.
The main result shows that there exist constructions of the "sketching" matrices A and B so that even if X has O(p) non-zeros, it can be recovered exactly and efficiently using a convex program as long as these non-zeros are not concentrated in any single row/column of X. Furthermore, it suffices for the size of Y (the sketch dimension) to scale as m = O(\sqrt{# nonzeros in X} \times log p). The results also show that the recovery is robust and stable in the sense that if X is equal to a sparse matrix plus a perturbation, then the convex program we propose produces an approximation with accuracy proportional to the size of the perturbation. Unlike traditional results on sparse recovery, where the sensing matrix produces independent measurements, our sensing operator is highly constrained (it assumes a tensor product structure). Therefore, proving recovery guarantees require non-standard techniques. Indeed our approach relies on a novel result concerning tensor products of bipartite graphs, which may be of independent interest.
This problem is motivated by the following application, among others. Consider a p\times n data matrix D, consisting of n observations of p variables. Assume that the correlation matrix X:=DD^{T} is (approximately) sparse in the sense that each of the p variables is significantly correlated with only a few others. Our results show that these significant correlations can be detected even if we have access to only a sketch of the data S=AD with A \in R^{m\times p}.


Towards an information-theoretic system theory
Bernhard C. Geiger, Gernot Kubin
(Submitted on 26 Mar 2013)
In this work the information loss in deterministic, memoryless systems is investigated by evaluating the conditional entropy of the input random variable given the output random variable. It is shown that for a large class of systems the information loss is finite, even if the input is continuously distributed. Based on this finiteness, the problem of perfectly reconstructing the input is addressed and Fano-type bounds between the information loss and the reconstruction error probability are derived.
For systems with infinite information loss a relative measure is defined and shown to be tightly related to Renyi's information dimension. Employing another Fano-type argument, the reconstruction error probability is bounded by the relative information loss from below.
In view of developing a system theory from an information theoretic point-of-view, the theoretical results are illustrated at the hand of a few example systems, among them a multi-channel autocorrelation receiver.







Phase Transition Analysis of Sparse Support Detection from Noisy Measurements
Jaewook Kang, Heung-No Lee, Kiseon Kim
(Submitted on 26 Mar 2013 (v1), last revised 8 Apr 2013 (this version, v2))
This paper investigates the problem of sparse support detection (SSD) via a detection-oriented algorithm named Bayesian hypothesis test via belief propagation (BHT-BP). Our main focus is to compare BHT-BP to an estimation-based algorithm, called CS-BP, and show its superiority in the SSD problem. For this investigation, we perform a phase transition (PT) analysis over the plain of the noise level and signal magnitude on the signal support. This PT analysis sharply specifies the required signal magnitude for the detection under a certain noise level. In addition, we provide an experimental validation to assure the PT analysis. Our analytical and experimental results show the fact that BHT-BP detects the signal support against additive noise more robustly than CS-BP does.




Message Passing Algorithm for Distributed Downlink Regularized Zero-forcing Beamforming with Cooperative Base Stations
Chao-Kai Wen, Jung-Chieh Chen, Kai-Kit Wong, Pangan Ting
(Submitted on 26 Mar 2013)
Base station (BS) cooperation can turn unwanted interference to useful signal energy for enhancing system performance. In the cooperative downlink, zero-forcing beamforming (ZFBF) with a simple scheduler is well known to obtain nearly the performance of the capacity-achieving dirty-paper coding. However, the centralized ZFBF approach is prohibitively complex as the network size grows. In this paper, we devise message passing algorithms for realizing the regularized ZFBF (RZFBF) in a distributed manner using belief propagation. In the proposed methods, the overall computational cost is decomposed into many smaller computation tasks carried out by groups of neighboring BSs and communications is only required between neighboring BSs. More importantly, some exchanged messages can be computed based on channel statistics rather than instantaneous channel state information, leading to significant reduction in computational complexity. Simulation results demonstrate that the proposed algorithms converge quickly to the exact RZFBF and much faster compared to conventional methods.




Multi-Group Testing
Hong-Bin Chen, Fei-Huang Chang, Jun-Yi Guo, Yu-Pei Huang
(Submitted on 25 Mar 2013)
This paper proposes a novel generalization of group testing, called multi-group testing, which relaxes the notion of "testing subset" in group testing to "testing multi-set". The generalization aims to learn more information of each item to be tested rather than identify only defectives as was done in conventional group testing. This paper provides efficient nonadaptive strategies for the multi-group testing problem. The major tool is a new structure, $q$-ary additive $(w,d)$-disjunct matrix, which is a generalization of the well-known binary disjunct matrix introduced by Kautz and Singleton in 1964.



Circular law for random matrices with unconditional log-concave distribution
Radosław Adamczak, Djalil Chafai (LAMA)
(Submitted on 23 Mar 2013)
We explore the validity of the circular law for random matrices with non i.i.d. entries. Let A be a random n \times n real matrix having as a random vector in R^{n^2} a log-concave isotropic unconditional law. In particular, the entries are uncorrelated and have a symmetric law of zero mean and unit variance. This allows for some dependence and non equidistribution among the entries, while keeping the special case of i.i.d. standard Gaussian entries. Our main result states that as n goes to infinity, the empirical spectral distribution of n^{-1/2}A tends to the uniform law on the unit disc of the complex plane.


Sparse Factor Analysis for Learning and Content Analytics
Andrew S. Lan, Andrew E. Waters, Christoph Studer, Richard G. Baraniuk
(Submitted on 22 Mar 2013)
We develop a new model and algorithms for machine learning-based learning analytics, which estimate a learner's knowledge of the concepts underlying a domain, and content analytics, which estimate the relationships among a collection of questions and those concepts. Our model represents the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question's intrinsic difficulty. We estimate these factors given the graded responses to a collection of questions. The underlying estimation problem is ill-posed in general, especially when only a subset of the questions are answered. The key observation that enables a well-posed solution is the fact that typical educational domains of interest involve only a small number of key concepts. Leveraging this observation, we develop both a bi-convex maximum-likelihood and a Bayesian solution to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate user-defined tags on questions to facilitate the interpretability of the estimated factors. Experiments with synthetic and real-world data demonstrate the efficacy of our approach. Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary learning that is of independent interest.






Can we allow linear dependencies in the dictionary in the sparse synthesis framework?
Raja Giryes, Michael Elad
(Submitted on 22 Mar 2013)
Signal recovery from a given set of linear measurements using a sparsity prior has been a major subject of research in recent years. In this model, the signal is assumed to have a sparse representation under a given dictionary. Most of the work dealing with this subject has focused on the reconstruction of the signal's representation as the means for recovering the signal itself. This approach forced the dictionary to be of low coherence and with no linear dependencies between its columns. Recently, a series of contributions that focus on signal recovery using the analysis model find that linear dependencies in the analysis dictionary are in fact permitted and beneficial. In this paper we show theoretically that the same holds also for signal recovery in the synthesis case for the l0- synthesis minimization problem. In addition, we demonstrate empirically the relevance of our conclusions for recovering the signal using an l1-relaxation.



Nonlocal imaging by conditional averaging of random reference measurements
Kai-Hong Luo, Boqiang Huang, Wei-Mou Zheng, Ling-An Wu
(Submitted on 22 Mar 2013)
We report the nonlocal imaging of an object by conditional averaging of the random exposure frames of a reference detector, which only sees the freely propagating field from a thermal light source. A bucket detector, synchronized with the reference detector, records the intensity fluctuations of an identical beam passing through the object mask. These fluctuations are sorted according to their values relative to the mean, then the reference data in the corresponding time-bins for a given fluctuation range are averaged, to produce either positive or negative images. Since no correlation calculations are involved, this correspondence imaging technique challenges our former interpretations of "ghost" imaging. Compared with conventional correlation imaging or compressed sensing schemes, both the number of exposures and computation time are greatly reduced, while the visibility is much improved. A simple statistical model is presented to explain the phenomenon.


Sample Distortion for Compressed Imaging
Chunli Guo, Mike E. Davies
(Submitted on 22 Mar 2013)
We propose the notion of sample distortion function for i.i.d compressive distributions with the aim to fundamentally quantify the achievable reconstruction performance of compressed sensing for certain encoder-decoder pairs at a given undersampling ratio. The theoretical SD function is derived for the Gaussian encoder and Bayesian optimal approximate message passing decoder thanks to the rigorous analysis of the AMP algorithm. We also show the convexity of the general SD function and derive two lower bounds. We then apply the SD framework to analyse compressed sensing for natural images using a multi-resolution statistical image model with both generalized Gaussian distribution and the two-state Gaussian mixture distribution. For this scenario we are able to achieve an optimal bandwise sample allocation and the corresponding SD function for natural images to accurately predict the possible compressed sensing performance gains. We further adopt Som and Schniter's turbo message passing approach to integrate the bandwise sampling with the exploitation of the hidden Markov tree structure of wavelet coefficients. Natural image simulation confirms the theoretical improvements and the effectiveness of bandwise sampling.









Projection Onto The k-Cosparse Set is NP-Hard
Andreas M. Tillmann, Rémi Gribonval, Marc E. Pfetsch
The computational complexity of a problem arising in the context of sparse optimization is considered, namely, the projection onto the set of k-cosparse vectors w.r.t. some given matrix {\Omega}. It is shown that this projection problem is (strongly) NP-hard, even in the special cases where the matrix {\Omega} contains only ternary or bipolar coefficients. Interestingly, this is in contrast to the projection onto the set of k-sparse vectors, which is trivially solved by keeping only the k largest coefficients.










Multi-dimensional sparse structured signal approximation using split Bregman iterations
Yoann Isaac, Quentin Barthélemy, Jamal Atif, Cédric Gouy-Pailler, Michèle Sebag
(Submitted on 21 Mar 2013 (v1), last revised 25 Mar 2013 (this version, v2))
The paper focuses on the sparse approximation of signals using overcomplete representations, such that it preserves the (prior) structure of multi-dimensional signals. The underlying optimization problem is tackled using a multi-dimensional split Bregman optimization approach. An extensive empirical evaluation shows how the proposed approach compares to the state of the art depending on the signal features.






On the optimality of a L1/L1 solver for sparse signal recovery from sparsely corrupted compressive measurements
Laurent Jacques
(Submitted on 20 Mar 2013)
This short note proves the $\ell_2-\ell_1$ instance optimality of a $\ell_1/\ell_1$ solver, i.e a variant of \emph{basis pursuit denoising} with a $\ell_1$ fidelity constraint, when applied to the estimation of sparse (or compressible) signals observed by sparsely corrupted compressive measurements. The approach simply combines two known results due to Y. Plan, R. Vershynin and E. Cand\`es.






Compressive Shift Retrieval
Henrik Ohlsson, Yonina C. Eldar, Allen Y. Yang, Shankar S. Sastry
(Submitted on 20 Mar 2013)
The classical shift retrieval problem considers two signals in vector form that are related by a cyclic shift. In this paper, we develop a compressive variant where the measurement of the signals is undersampled. While the standard procedure to shift retrieval is to maximize the real part of their dot product, we show that the shift can be exactly recovered from the corresponding compressed measurements if the sensing matrix satisfies certain conditions. A special case is the partial Fourier matrix. In this setting we show that the true shift can be found by as low as two measurements. We further show that the shift can often be recovered when the measurements are perturbed by noise.



Universal Numerical Encoder and Profiler Reduces Computing's Memory Wall with Software, FPGA, and SoC Implementations
Albert Wegener
(Submitted on 20 Mar 2013)
In the multicore era, the time to computational results is increasingly determined by how quickly operands are accessed by cores, rather than by the speed of computation per operand. From high-performance computing (HPC) to mobile application processors, low multicore utilization rates result from the slowness of accessing off-chip operands, i.e. the memory wall. The APplication AXcelerator (APAX) universal numerical encoder reduces computing's memory wall by compressing numerical operands (integers and floats), thereby decreasing CPU access time by 3:1 to 10:1 as operands stream between memory and cores. APAX encodes numbers using a low-complexity algorithm designed both for time series sensor data and for multi-dimensional data, including images. APAX encoding parameters are determined by a profiler that quantifies the uncertainty inherent in numerical datasets and recommends encoding parameters reflecting this uncertainty. Compatible software, FPGA, and systemon-chip (SoC) implementations efficiently support encoding rates between 150 MByte/sec and 1.5 GByte/sec at low power. On 25 integer and floating-point datasets, we achieved encoding rates between 3:1 and 10:1, with average correlation of 0.999959, while accelerating computational "time to results."






Greedy Feature Selection for Subspace Clustering
Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
(Submitted on 19 Mar 2013)
Unions of subspaces are powerful nonlinear signal models for collections of high-dimensional data. However, existing methods that exploit this structure require that the subspaces the signals of interest occupy be known a priori or be learned from the data directly. In this work, we analyze the performance of greedy feature selection strategies for learning unions of subspaces from an ensemble of data. We develop sufficient conditions that are required for orthogonal matching pursuit (OMP) to recover subsets of points from the ensemble that live in the same subspace, a property we refer to as exact feature selection (EFS). Following this analysis, we provide an empirical study of greedy feature selection strategies for subspace clustering and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. We demonstrate that the gap between sparse recovery and NN methods is particularly pronounced when the tiling of subspaces in the ensemble is sparse, suggesting that sparse recovery methods can be used in a number of regimes where nearest neighbor approaches fail to reveal the subspace membership of points in the ensemble.



Gradient methods for convex minimization: better rates under weaker conditions
Hui Zhang, Wotao Yin
(Submitted on 19 Mar 2013)
The convergence behavior of gradient methods for minimizing convex differentiable functions is one of the core questions in convex optimization. This paper shows that their well-known complexities can be achieved under conditions weaker than the commonly accepted ones. We relax the common gradient Lipschitz-continuity condition and strong convexity condition to ones that hold only over certain line segments. Specifically, we establish complexities $O(\frac{R}{\epsilon})$ and $O(\sqrt{\frac{R}{\epsilon}})$ for the ordinary and accelerate gradient methods, respectively, assuming that $\nabla f$ is Lipschitz continuous with constant $R$ over the line segment joining $x$ and $x-\frac{1}{R}\nabla f$ for each $x\in\dom f$. Then we improve them to $O(\frac{R}{\nu}\log(\frac{1}{\epsilon}))$ and $O(\sqrt{\frac{R}{\nu}}\log(\frac{1}{\epsilon}))$ for function $f$ that also satisfies the secant inequality $\ < \nabla f(x), x- x^*\ > \ge \nu\|x-x^*\|^2$ for each $x\in \dom f$ and its projection $x^*$ to the minimizer set of $f$. The secant condition is also shown to be necessary for the geometric decay of solution error. Not only are the relaxed conditions met by more functions, the restrictions give smaller $R$ and larger $\nu$ than they are without the restrictions and thus lead to better complexity bounds. We apply these results to sparse optimization and demonstrate a faster algorithm.






A General Iterative Shrinkage and Thresholding Algorithm for Non-convex Regularized Optimization Problems
Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, Jieping Ye
(Submitted on 18 Mar 2013)
Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge. A commonly used approach is the Multi-Stage (MS) convex relaxation (or DC programming), which relaxes the original non-convex problem to a sequence of convex problems. This approach is usually not very practical for large-scale problems because its computational cost is a multiple of solving a single convex problem. In this paper, we propose a General Iterative Shrinkage and Thresholding (GIST) algorithm to solve the nonconvex optimization problem for a large class of non-convex penalties. The GIST algorithm iteratively solves a proximal operator problem, which in turn has a closed-form solution for many commonly used penalties. At each outer iteration of the algorithm, we use a line search initialized by the Barzilai-Borwein (BB) rule that allows finding an appropriate step size quickly. The paper also presents a detailed convergence analysis of the GIST algorithm. The efficiency of the proposed algorithm is demonstrated by extensive experiments on large-scale data sets.






Toward real-time quantum imaging with a single pixel camera
B.J. Lawrie, R.C. Pooser
(Submitted on 15 Mar 2013)
We present a workbench for the study of real-time quantum imaging by measuring the frame-by-frame quantum noise reduction of multi-spatial-mode twin beams generated by four wave mixing in Rb vapor. Exploiting the multiple spatial modes of this squeezed light source, we utilize spatial light modulators to selectively pass macropixels of quantum correlated modes from each of the twin beams to a high quantum efficiency balanced detector. In low-light-level imaging applications, the ability to measure the quantum correlations between individual spatial modes and macropixels of spatial modes with a single pixel camera will facilitate compressive quantum imaging with sensitivity below the photon shot noise limit.






Sparse approximation and recovery by greedy algorithms


Eugene Livshitz, Vladimir Temlyakov


(Submitted on 14 Mar 2013)


We study sparse approximation by greedy algorithms. Our contribution is two-fold. First, we prove exact recovery with high probability of random $K$-sparse signals within $\lceil K(1+\epsilon)\rceil$ iterations of the Orthogonal Matching Pursuit (OMP). This result shows that in a probabilistic sense the OMP is almost optimal for exact recovery. Second, we prove Lebesgue-type inequalities for the Weak Chebyshev Greedy Algorithm, a generalization of the Weak Orthogonal Matching Pursuit to the case of a Banach space. The main novelty of these results is a Banach space setting instead of a Hilbert space setting. However, even in the case of a Hilbert space our results add some new elements to known results on the Lebesgue-type inequalities for the RIP dictionaries. Our technique is a development of the recent technique created by Zhang.
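
For readers who want to experiment, a plain textbook version of Orthogonal Matching Pursuit (not code from the paper) fits in a few lines of R: each iteration selects the column most correlated with the residual and refits by least squares on the chosen support.

omp <- function(A, y, K) {
  support <- integer(0)
  r <- y
  for (k in 1:K) {
    j <- which.max(abs(t(A) %*% r))                    # greedy selection step
    support <- union(support, j)
    coef <- qr.solve(A[, support, drop = FALSE], y)    # least-squares refit on the support
    r <- y - A[, support, drop = FALSE] %*% coef       # update the residual
  }
  x <- rep(0, ncol(A)); x[support] <- coef
  x
}

# Toy usage: recover a 3-sparse vector from 30 Gaussian measurements.
set.seed(2)
A <- matrix(rnorm(30 * 100), 30, 100)
x0 <- rep(0, 100); x0[c(5, 40, 77)] <- c(1, -2, 1.5)
x_hat <- omp(A, A %*% x0, K = 3)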






Tractability of Interpretability via Selection of Group-Sparse Models


Luca Baldassarre, Nirav Bhan, Volkan Cevher, Anastasios Kyrillidis


(Submitted on 13 Mar 2013)


Group-based sparsity models are proven instrumental in linear regression problems for recovering signals from much fewer measurements than standard compressive sensing. The main promise of these models is the recovery of "interpretable" signals along with the identification of their constituent groups. To this end, we establish a combinatorial framework for group-model selection problems and highlight the underlying tractability issues revolving around such notions of interpretability when the regression matrix is simply the identity operator. We show that, in general, claims of correctly identifying the groups with convex relaxations would lead to polynomial time solution algorithms for a well-known NP-hard problem, called the weighted maximum cover problem. Instead, leveraging a graph-based understanding of group models, we describe group structures which enable correct model identification in polynomial time via dynamic programming. We also show that group structures that lead to totally unimodular constraints have tractable discrete as well as convex relaxations. Finally, we study the Pareto frontier of budgeted group-sparse approximations for the tree-based sparsity model of \cite{baraniuk2010model} and illustrate identification and computation trade-offs between our framework and the existing convex relaxations.






Kernel Sparse Models for Automated Tumor Segmentation


Jayaraman J. Thiagarajan, Karthikeyan Natesan Ramamurthy, Deepta Rajan, Anup Puri, David Frakes, Andreas Spanias


(Submitted on 11 Mar 2013)


In this paper, we propose sparse coding-based approaches for segmentation of tumor regions from MR images. Sparse coding with data-adapted dictionaries has been successfully employed in several image recovery and vision problems. The proposed approaches obtain sparse codes for each pixel in brain magnetic resonance images considering their intensity values and location information. Since it is trivial to obtain pixel-wise sparse codes, and combining multiple features in the sparse coding setup is not straightforward, we propose to perform sparse coding in a high-dimensional feature space where non-linear similarities can be effectively modeled. We use the training data from expert-segmented images to obtain kernel dictionaries with the kernel K-lines clustering procedure. For a test image, sparse codes are computed with these kernel dictionaries, and they are used to identify the tumor regions. This approach is completely automated, and does not require user intervention to initialize the tumor regions in a test image. Furthermore, a low complexity segmentation approach based on kernel sparse codes, which allows the user to initialize the tumor region, is also presented. Results obtained with both the proposed approaches are validated against manual segmentation by an expert radiologist, and the proposed methods lead to accurate tumor identification.






Predictive Correlation Screening: Application to Two-stage Predictor Design in High Dimension


Hamed Firouzi, Bala Rajaratnam, Alfred Hero


(Submitted on 10 Mar 2013)


We introduce a new approach to variable selection, called Predictive Correlation Screening, for predictor design. Predictive Correlation Screening (PCS) implements false positive control on the selected variables, is well suited to small sample sizes, and is scalable to high dimensions. We establish asymptotic bounds for Familywise Error Rate (FWER), and resultant mean square error of a linear predictor on the selected variables. We apply Predictive Correlation Screening to the following two-stage predictor design problem. An experimenter wants to learn a multivariate predictor of gene expression based on successive biological samples assayed on mRNA arrays. She assays the whole genome on a few samples and from these assays she selects a small number of variables using Predictive Correlation Screening. To reduce assay cost, she subsequently assays only the selected variables on the remaining samples, to learn the predictor coefficients. We show superiority of Predictive Correlation Screening relative to LASSO and correlation learning in terms of performance and computational complexity.






l_0 Norm Constraint LMS Algorithm for Sparse System Identification


Yuantao Gu, Jian Jin, Shunliang Mei


(Submitted on 9 Mar 2013)


In order to improve the performance of Least Mean Square (LMS) based system identification of sparse systems, a new adaptive algorithm is proposed which utilizes the sparsity property of such systems. A general approximating approach to the $l_0$ norm -- a typical metric of system sparsity -- is proposed and integrated into the cost function of the LMS algorithm. This integration is equivalent to adding a zero attractor to the iterations, by which the convergence rate of the small coefficients that dominate the sparse system can be effectively improved. Moreover, using a partial updating method, the computational complexity is reduced. The simulations demonstrate that the proposed algorithm can effectively improve the performance of LMS-based identification algorithms on sparse systems.
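
A minimal sketch of the zero-attraction idea follows (illustrative only: a simple sign-based attractor stands in for the paper's l0-norm approximation, and the filter length and constants are assumptions).

za_lms <- function(x, d, n_taps = 16, mu = 0.01, rho = 5e-4) {
  w <- rep(0, n_taps)                            # adaptive filter coefficients
  for (n in n_taps:length(x)) {
    u <- x[n:(n - n_taps + 1)]                   # most recent n_taps input samples
    e <- d[n] - sum(w * u)                       # a priori estimation error
    w <- w + mu * e * u - rho * sign(w)          # standard LMS step plus zero attraction
  }
  w
}

# Toy usage: identify a sparse FIR channel driven by white noise.
set.seed(5)
h <- rep(0, 16); h[c(1, 5, 12)] <- c(0.8, -0.4, 0.2)
x <- rnorm(5000)
d <- stats::filter(x, h, sides = 1); d[is.na(d)] <- 0
w_hat <- za_lms(x, as.numeric(d))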






New Understanding of the Bethe Approximation and the Replica Method


Ryuhei Mori


(Submitted on 9 Mar 2013)


In this thesis, new generalizations of the Bethe approximation and new understanding of the replica method are proposed. The Bethe approximation is an efficient approximation for graphical models, which gives an asymptotically accurate estimate of the partition function for many graphical models. The Bethe approximation explains the well-known message passing algorithm, belief propagation, which is exact for tree graphical models. It is also known that the cluster variational method gives the generalized Bethe approximation, called the Kikuchi approximation, yielding the generalized belief propagation. In the thesis, a new series of generalization of the Bethe approximation is proposed, which is named the asymptotic Bethe approximation. The asymptotic Bethe approximation is derived from the characterization of the Bethe free energy using graph covers, which was recently obtained by Vontobel. The asymptotic Bethe approximation can be expressed in terms of the edge zeta function by using Watanabe and Fukumizu's result about the Hessian of the Bethe entropy. The asymptotic Bethe approximation is confirmed to be better than the conventional Bethe approximation on some conditions. For this purpose, Chertkov and Chernyak's loop calculus formula is employed, which shows that the error of the Bethe approximation can be expressed as a sum of weights corresponding to generalized loops, and generalized for non-binary finite alphabets by using concepts of information geometry.






Bayesian Compressed Regression


Rajarshi Guhaniyogi, David B. Dunson


(Submitted on 4 Mar 2013 (v1), last revised 22 Mar 2013 (this version, v2))


As an alternative to variable selection or shrinkage in high dimensional regression, we propose to randomly compress the predictors prior to analysis. This dramatically reduces storage and computational bottlenecks, performing well when the predictors can be projected to a low dimensional linear subspace with minimal loss of information about the response. As opposed to existing Bayesian dimensionality reduction approaches, the exact posterior distribution conditional on the compressed data is available analytically, speeding up computation by many orders of magnitude while also bypassing robustness issues due to convergence and mixing problems with MCMC. Model averaging is used to reduce sensitivity to the random projection matrix, while accommodating uncertainty in the subspace dimension. Strong theoretical support is provided for the approach by showing near parametric convergence rates for the predictive density in the large p small n asymptotic paradigm. Practical performance relative to competitors is illustrated in simulations and real data applications.
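
To illustrate the basic ingredient (a bare-bones sketch with a fixed Gaussian prior and known noise variance, not the authors' model-averaging procedure; all parameter values are assumptions):

set.seed(3)
n <- 50; p <- 2000; m <- 10
X <- matrix(rnorm(n * p), n, p)                       # high-dimensional predictors
beta <- c(rnorm(5), rep(0, p - 5))
y <- X %*% beta + rnorm(n)

Phi <- matrix(rnorm(m * p, sd = 1 / sqrt(m)), m, p)   # random compression matrix
Z <- X %*% t(Phi)                                     # n x m compressed predictors

tau2 <- 1; sigma2 <- 1                                # fixed prior and noise variances (assumed)
V  <- solve(t(Z) %*% Z / sigma2 + diag(m) / tau2)     # closed-form posterior covariance
mu <- V %*% t(Z) %*% y / sigma2                       # closed-form posterior mean
y_fit <- Z %*% mu                                     # fitted values from the compressed model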









Random Subdictionaries and Coherence Conditions for Sparse Signal Recovery


Alexander Barg, Arya Mazumdar, Rongrong Wang


(Submitted on 7 Mar 2013)


The most frequently used condition for sampling matrices employed in compressive sampling is the restricted isometry (RIP) property of the matrix when restricted to sparse signals. At the same time, imposing this condition makes it difficult to find explicit matrices that support recovery of signals from sketches of the optimal (smallest possible) dimension. A number of attempts have been made to relax or replace the RIP property in sparse recovery algorithms. We focus on the relaxation under which the near-isometry property holds for most rather than for all submatrices of the sampling matrix, known as the statistical RIP or StRIP condition. We show that sampling matrices of dimensions $m\times N$ with maximum coherence $\mu=O((k\log^3 N)^{-1/4})$ and mean square coherence $\bar \mu^2=O(1/(k\log N))$ support stable recovery of $k$-sparse signals using Basis Pursuit. These assumptions are satisfied in many examples. As a result, we are able to construct sampling matrices that support recovery with low error for sparsity $k$ higher than $\sqrt m,$ which exceeds the range of parameters of the known classes of RIP matrices.



On Robust Face Recognition via Sparse Encoding: the Good, the Bad, and the Ugly


Yongkang Wong, Mehrtash T. Harandi, Conrad Sanderson


(Submitted on 7 Mar 2013)


In the field of face recognition, Sparse Representation (SR) has received considerable attention during the past few years. Most of the relevant literature focuses on holistic descriptors in closed-set identification applications. The underlying assumption in SR-based methods is that each class in the gallery has sufficient samples and the query lies on the subspace spanned by the gallery of the same class. Unfortunately, such an assumption is easily violated in the more challenging face verification scenario, where an algorithm is required to determine if two faces (where one or both have not been seen before) belong to the same person. In this paper, we first discuss why previous attempts with SR might not be applicable to verification problems. We then propose an alternative approach to face verification via SR. Specifically, we propose to use explicit SR encoding on local image patches rather than the entire face. The obtained sparse signals are pooled via averaging to form multiple region descriptors, which are then concatenated to form an overall face descriptor. Due to the deliberate loss of spatial relations within each region (caused by averaging), the resulting descriptor is robust to misalignment and various image deformations. Within the proposed framework, we evaluate several SR encoding techniques: l1-minimisation, Sparse Autoencoder Neural Network (SANN), and an implicit probabilistic technique based on Gaussian Mixture Models. Thorough experiments on AR, FERET, exYaleB, BANCA and ChokePoint datasets show that the proposed local SR approach obtains considerably better and more robust performance than several previous state-of-the-art holistic SR methods, in both verification and closed-set identification problems. The experiments also show that l1-minimisation based encoding has a considerably higher computational cost than the other techniques, but leads to higher recognition rates.









A Fast Iterative Bayesian Inference Algorithm for Sparse Channel Estimation


Niels Lovmand Pedersen, Carles Navarro Manchón, Bernard Henri Fleury


(Submitted on 6 Mar 2013)


In this paper, we present a Bayesian channel estimation algorithm for multicarrier receivers based on pilot symbol observations. The inherent sparse nature of wireless multipath channels is exploited by modeling the prior distribution of multipath components' gains with a hierarchical representation of the Bessel K probability density function; a highly efficient, fast iterative Bayesian inference method is then applied to the proposed model. The resulting estimator outperforms other state-of-the-art Bayesian and non-Bayesian estimators, either by yielding lower mean squared estimation error or by attaining the same accuracy with improved convergence rate, as shown in our numerical evaluation.



Impulsive Noise Mitigation in Powerline Communications Using Sparse Bayesian Learning


Jing Lin, Marcel Nassar, Brian L. Evans


(Submitted on 5 Mar 2013)


Additive asynchronous and cyclostationary impulsive noise limits communication performance in OFDM powerline communication (PLC) systems. Conventional OFDM receivers assume additive white Gaussian noise and hence experience degradation in communication performance in impulsive noise. Alternate designs assume a parametric statistical model of impulsive noise and use the model parameters in mitigating impulsive noise. These receivers require overhead in training and parameter estimation, and degrade due to model and parameter mismatch, especially in highly dynamic environments. In this paper, we model impulsive noise as a sparse vector in the time domain without any other assumptions, and apply sparse Bayesian learning methods for estimation and mitigation without training. We propose three iterative algorithms with different complexity vs. performance trade-offs: (1) we utilize the noise projection onto null and pilot tones to estimate and subtract the noise impulses; (2) we add the information in the data tones to perform joint noise estimation and OFDM detection; (3) we embed our algorithm into a decision feedback structure to further enhance the performance of coded systems. When compared to conventional OFDM PLC receivers, the proposed receivers achieve SNR gains of up to 9 dB in coded and 10 dB in uncoded systems in the presence of impulsive noise.






Recursive Sparse Recovery in Large but Structured Noise - Part 2


Chenlu Qiu, Namrata Vaswani


(Submitted on 5 Mar 2013)


We study the problem of recursively recovering a time sequence of sparse vectors, St, from measurements Mt := St + Lt that are corrupted by structured noise Lt which is dense and can have large magnitude. The structure that we require is that Lt should lie in a low dimensional subspace that is either fixed or changes "slowly enough"; and the eigenvalues of its covariance matrix are "clustered". We do not assume any model on the sequence of sparse vectors. Their support sets and their nonzero element values may be either independent or correlated over time (usually in many applications they are correlated). The only thing required is that there be some support change every so often. We introduce a novel solution approach called Recursive Projected Compressive Sensing with cluster-PCA (ReProCS-cPCA) that addresses some of the limitations of earlier work. Under mild assumptions, we show that, with high probability, ReProCS-cPCA can exactly recover the support set of St at all times; and the reconstruction errors of both St and Lt are upper bounded by a time-invariant and small value.







On sparse sensing and sparse sampling of coded signals at sub-Landau rates


Michael Peleg, Shlomo Shamai


(Submitted on 1 Mar 2013)


Advances in the information-theoretic understanding of sparse sampling of continuous uncoded signals at sampling rates exceeding the Landau rate were reported in recent works. This work examines sparse sampling of coded signals at sub-Landau sampling rates. It is shown that with coded signals the Landau condition may be relaxed and the sampling rate required for signal reconstruction and for support detection can be lower than the effective bandwidth. Equivalently, the number of measurements in the corresponding sparse sensing problem can be smaller than the support size. Tight bounds on information rates and on signal and support detection performance are derived for the Gaussian sparsely sampled channel and for the frequency-sparse channel using the context of state dependent channels. Support detection results are verified by a simulation. When the system is high-dimensional the required SNR is shown to be finite but high, rising with decreasing sampling rate; in some practical applications it can be lowered by reducing the a priori uncertainty about the support, e.g. by concentrating the frequency support into a finite number of subbands.






Randomized Low-Memory Singular Value Projection


Stephen Becker, Volkan Cevher, Anastasios Kyrillidis


(Submitted on 1 Mar 2013 (v1), last revised 19 Mar 2013 (this version, v2))


Affine rank minimization algorithms typically rely on calculating the gradient of a data error followed by singular value decompositions at every iteration. Because these two steps are expensive, heuristics are often used. In this paper, we propose one recovery scheme that merges the two steps and show that it actually admits provable recovery guarantees while operating on space proportional to the degrees of freedom in the problem.
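
For background, a plain (non-randomized, full-SVD) singular value projection iteration for matrix completion looks like the sketch below; the paper's contribution is precisely to avoid the per-iteration SVD and the dense iterate, which this illustration does not attempt.

svp_complete <- function(M_obs, mask, r, step = 1, iters = 100) {
  # M_obs: observed entries (zeros elsewhere); mask: 0/1 matrix of observed positions
  X <- matrix(0, nrow(M_obs), ncol(M_obs))
  for (k in 1:iters) {
    G <- mask * (X - M_obs)                          # gradient of 0.5 * ||P_Omega(X) - P_Omega(M)||^2
    Y <- X - step * G
    s <- svd(Y, nu = r, nv = r)
    X <- s$u %*% diag(s$d[1:r], r, r) %*% t(s$v)     # project back onto rank-r matrices
  }
  X
}

# Toy usage: complete a random rank-2 matrix from roughly 60% observed entries.
set.seed(6)
M <- matrix(rnorm(40), 20, 2) %*% matrix(rnorm(30), 2, 15)
mask <- matrix(rbinom(20 * 15, 1, 0.6), 20, 15)
X_hat <- svp_complete(M * mask, mask, r = 2)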














An Augmented Lagrangian Method for Conic Convex Programming


Necdet Serhat Aybat, Garud Iyengar


(Submitted on 26 Feb 2013)


We propose a new first-order augmented Lagrangian algorithm ALCC for solving convex conic programs of the form min{rho(x)+gamma(x): Ax-b in K, x in chi}, where rho and gamma are closed convex functions, gamma has a Lipschitz continuous gradient, A is an m x n real matrix, K is a closed convex cone, and chi is a "simple" convex compact set such that optimization problems of the form min{rho(x)+|x-x0|_2^2: x in chi} can be efficiently solved for any given x0. We show that any limit point of the primal ALCC iterates is an optimal solution of the conic convex problem, and the dual ALCC iterates have a unique limit point that is a Karush-Kuhn-Tucker (KKT) point of the conic program. We also show that for any epsilon>0, the primal ALCC iterates are epsilon-feasible and epsilon-optimal after O(log(1/epsilon)) iterations, which require solving O(1/epsilon log(1/epsilon)) problems of the form min{rho(x)+|x-x0|_2^2: x in chi}.




Super-resolution via superset selection and pruning


Laurent Demanet, Deanna Needell, Nam Nguyen


(Submitted on 26 Feb 2013)


We present a pursuit-like algorithm that we call the "superset method" for recovery of sparse vectors from consecutive Fourier measurements in the super-resolution regime. The algorithm has a subspace identification step that hinges on the translation invariance of the Fourier transform, followed by a removal step to estimate the solution's support. The superset method is always successful in the noiseless regime (unlike L1-minimization) and generalizes to higher dimensions (unlike the matrix pencil method). Relative robustness to noise is demonstrated numerically.




Sound localization using compressive sensing


Hong Jiang, Boyd Mathews, Paul Wilford


(Submitted on 28 Feb 2013)


In a sensor network with remote sensor devices, it is important to have a method that can accurately localize a sound event with a small amount of data transmitted from the sensors. In this paper, we propose a novel method for localization of a sound source using compressive sensing. Instead of sampling a large amount of data at the Nyquist sampling rate in time domain, the acoustic sensors take compressive measurements integrated in time. The compressive measurements can be used to accurately compute the location of a sound source.
Ensemble Sparse Models for Image Analysis


Karthikeyan Natesan Ramamurthy, Jayaraman J. Thiagarajan, Prasanna Sattigeri, Andreas Spanias


(Submitted on 27 Feb 2013)


Sparse representations with learned dictionaries have been successful in several image analysis applications. In this paper, we propose and analyze the framework of ensemble sparse models, and demonstrate their utility in image restoration and unsupervised clustering. The proposed ensemble model approximates the data as a linear combination of approximations from multiple \textit{weak} sparse models. Theoretical analysis of the ensemble model reveals that even in the worst-case, the ensemble can perform better than any of its constituent individual models. The dictionaries corresponding to the individual sparse models are obtained using either random example selection or boosted approaches. Boosted approaches learn one dictionary per round such that the dictionary learned in a particular round is optimized for the training examples having high reconstruction error in the previous round. Results with compressed recovery show that the ensemble representations lead to a better performance compared to using a single dictionary obtained with the conventional alternating minimization approach. The proposed ensemble models are also used for single image superresolution, and we show that they perform comparably to the recent approaches. In unsupervised clustering, experiments show that the proposed model performs better than baseline approaches in several standard datasets.
Compressed Sensing with Sparse Binary Matrices: Instance Optimal Error Guarantees in Near-Optimal Time


M. A. Iwen


(Submitted on 24 Feb 2013)


A compressed sensing method consists of a rectangular measurement matrix, $M \in \mathbbm{R}^{m \times N}$ with $m \ll N$, together with an associated recovery algorithm, $\mathcal{A}: \mathbbm{R}^m \rightarrow \mathbbm{R}^N$. Compressed sensing methods aim to construct a high quality approximation to any given input vector ${\bf x} \in \mathbbm{R}^N$ using only $M {\bf x} \in \mathbbm{R}^m$ as input. In particular, we focus herein on instance optimal nonlinear approximation error bounds for $M$ and $\mathcal{A}$ of the form $ \| {\bf x} - \mathcal{A} (M {\bf x}) \|_p \leq \| {\bf x} - {\bf x}^{\rm opt}_k \|_p + C k^{1/p - 1/q} \| {\bf x} - {\bf x}^{\rm opt}_k \|_q$ for ${\bf x} \in \mathbbm{R}^N$, where ${\bf x}^{\rm opt}_k$ is the best possible $k$-term approximation to ${\bf x}$.


In this paper we develop a compressed sensing method whose associated recovery algorithm, $\mathcal{A}$, runs in $O((k \log k) \log N)$-time, matching a lower bound up to a $O(\log k)$ factor. This runtime is obtained by using a new class of sparse binary compressed sensing matrices of near optimal size in combination with sublinear-time recovery techniques motivated by sketching algorithms for high-volume data streams. The new class of matrices is constructed by randomly subsampling rows from well-chosen incoherent matrix constructions which already have a sub-linear number of rows. As a consequence, fewer random bits than previously required are needed in order to select the rows utilized by the fast reconstruction algorithms considered herein.
A Multi-Scale Spatial Model for RSS-based Device-Free Localization


Ossi Kaltiokallio, Maurizio Bocca, Neal Patwari


(Submitted on 24 Feb 2013)


RSS-based device-free localization (DFL) monitors changes in the received signal strength (RSS) measured by a network of static wireless nodes to locate people without requiring them to carry or wear any electronic device. Current models assume that the spatial impact area, i.e., the area in which a person affects a link's RSS, has constant size. This paper shows that the spatial impact area varies considerably for each link. Data from extensive experiments are used to derive a multi-scale spatial weight model that is a function of the fade level, i.e., the difference between the predicted and measured RSS, and of the direction of RSS change. In addition, a measurement model is proposed which gives a probability of a person locating inside the derived spatial model for each given RSS measurement. A real-time radio tomographic imaging system is described which uses channel diversity and the presented models. Experiments in an open indoor environment, in a typical one-bedroom apartment and in a through-wall scenario are conducted to determine the accuracy of the system. We demonstrate that the new system is capable of localizing and tracking a person with high accuracy (
Sparse Signal Estimation by Maximally Sparse Convex Optimization


Ivan W. Selesnick, Ilker Bayram


(Submitted on 22 Feb 2013)


This paper addresses the problem of sparsity penalized least squares for applications in sparse signal processing, e.g. sparse deconvolution. This paper aims to induce sparsity more strongly than L1 norm regularization, while avoiding non-convex optimization. For this purpose, this paper describes the design and use of non-convex penalty functions (regularizers) constrained so as to ensure the convexity of the total cost function, F, to be minimized. The method is based on parametric penalty functions, the parameters of which are constrained to ensure convexity of F. It is shown that optimal parameters can be obtained by semidefinite programming (SDP). This maximally sparse convex (MSC) approach yields maximally non-convex sparsity-inducing penalty functions constrained such that the total cost function, F, is convex. It is demonstrated that iterative MSC (IMSC) can yield solutions substantially more sparse than the standard convex sparsity-inducing approach, i.e., L1 norm minimization.
Stable phase retrieval with low-redundancy frames


Bernhard G. Bodmann, Nathaniel Hammen


(Submitted on 22 Feb 2013)


We investigate the recovery of vectors from magnitudes of frame coefficients when the frames have a low redundancy, meaning a small number of frame vectors compared to the dimension of the Hilbert space. We first show that for vectors in d dimensions, 4d-4 suitably chosen frame vectors are sufficient to uniquely determine each signal, up to an overall unimodular constant, from the magnitudes of its frame coefficients. Then we discuss the effect of noise and show that 8d-4 frame vectors provide a stable recovery if part of the frame coefficients is bounded away from zero. In this regime, perturbing the magnitudes of the frame coefficients by noise that is sufficiently small results in a recovery error that is at most proportional to the noise level.
q-ary Compressive Sensing


Youssef Mroueh, Lorenzo Rosasco


(Submitted on 21 Feb 2013)


We introduce q-ary compressive sensing, an extension of 1-bit compressive sensing. We propose a novel sensing mechanism and a corresponding recovery procedure. The recovery properties of the proposed approach are analyzed both theoretically and empirically. Results in 1-bit compressive sensing are recovered as a special case. Our theoretical results suggest a tradeoff between the quantization parameter q and the number of measurements m in the control of the error of the resulting recovery algorithm, as well as its robustness to noise.
Is Matching Pursuit Solving Convex Problems?


Mingkui Tan, Ivor W. Tsang, Li Wang


(Submitted on 20 Feb 2013)


Sparse recovery ({\tt SR}) has emerged as a very powerful tool for signal processing, data mining and pattern recognition. To solve {\tt SR}, many efficient matching pursuit (\texttt{MP}) algorithms have been proposed. However, it is still not clear whether {\tt SR} can be formulated as a convex problem that is solvable using \texttt{MP} algorithms. To answer this, in this paper, a novel convex relaxation model is presented, which is solved by a general matching pursuit (\texttt{GMP}) algorithm under the convex programming framework. {\tt GMP} has several advantages over existing methods. First, it solves a convex problem and is guaranteed to converge to an optimum. In addition, with $\ell_1$-regularization, it can recover any $k$-sparse signals if the restricted isometry constant $\sigma_k\leq 0.307-\nu$, where $\nu$ can be arbitrarily close to 0. Finally, when dealing with a batch of signals, the computational burden can be much reduced using a batch-mode \texttt{GMP}. Comprehensive numerical experiments show that \texttt{GMP} achieves better performance than other methods in terms of sparse recovery ability and efficiency. We also apply \texttt{GMP} to face recognition tasks on two well-known face databases, namely, \emph{Extended Yale B} and \emph{AR}. Experimental results demonstrate that {\tt GMP} can achieve better recognition performance than the considered state-of-the-art methods within acceptable time. {Particularly, the batch-mode {\tt GMP} can be up to 500 times faster than the considered $\ell_1$ methods.}
An SVD-free Pareto curve approach to rank minimization


Aleksandr Y. Aravkin, Rajiv Mittal, Hassan Mansour, Ben Recht, Felix J. Herrmann


(Submitted on 20 Feb 2013)


Recent SVD-free matrix factorization formulations have enabled rank optimization for extremely large-scale systems (millions of rows and columns). In this paper, we consider rank-regularized formulations that only require a target data-fitting error level, and propose an algorithm for the corresponding problem. We illustrate the advantages of the new approach using the Netflix problem, and use it to obtain high quality results for seismic trace interpolation, a key application in exploration geophysics. We show that factor rank can be easily adjusted as the inversion proceeds, and propose a weighted extension that allows known subspace information to improve the results of matrix completion formulations. Using these methods, we obtain high-quality reconstructions for large scale seismic interpolation problems with real data.
Moving target inference with hierarchical Bayesian models in synthetic aperture radar imagery


Gregory E. Newstadt, Edmund G. Zelnio, Alfred O. Hero III


(Submitted on 19 Feb 2013)


In synthetic aperture radar (SAR), images are formed by focusing the response of stationary objects to a single spatial location. On the other hand, moving targets cause phase errors in the standard formation of SAR images that cause displacement and defocusing effects. SAR imagery also contains significant sources of non-stationary spatially-varying noises, including antenna gain discrepancies, angular scintillation (glints) and complex speckle. In order to account for this intricate phenomenology, this work combines the knowledge of the physical, kinematic, and statistical properties of SAR imaging into a single unified Bayesian structure that simultaneously (a) estimates the nuisance parameters such as clutter distributions and antenna miscalibrations and (b) estimates the target signature required for detection/inference of the target state. Moreover, we provide a Monte Carlo estimate of the posterior distribution for the target state and nuisance parameters that infers the parameters of the model directly from the data, largely eliminating tuning of algorithm parameters. We demonstrate that our algorithm competes at least as well on a synthetic dataset as state-of-the-art algorithms for estimating sparse signals. Finally, performance analysis on a measured dataset demonstrates that the proposed algorithm is robust at detecting/estimating targets over a wide area and performs at least as well as popular algorithms for SAR moving target detection.




Compressive Classification


Hugo Reboredo (1), Francesco Renna (1), Robert Calderbank (2), Miguel R. D. Rodrigues (3) ((1) Instituto de Telecomunicações, Universidade do Porto, Portugal, (2) Department of ECE, Duke University, NC, USA, (3) Department of E&EE, University College London, UK,)


(Submitted on 19 Feb 2013)


This paper derives fundamental limits associated with compressive classification of Gaussian mixture source models. In particular, we offer an asymptotic characterization of the behavior of the (upper bound to the) misclassification probability associated with the optimal Maximum-A-Posteriori (MAP) classifier that depends on quantities that are dual to the concepts of diversity gain and coding gain in multi-antenna communications. The diversity, which is shown to determine the rate at which the probability of misclassification decays in the low noise regime, is shown to depend on the geometry of the source, the geometry of the measurement system and their interplay. The measurement gain, which represents the counterpart of the coding gain, is also shown to depend on geometrical quantities. It is argued that the diversity order and the measurement gain also offer an optimization criterion to perform dictionary learning for compressive classification applications.
Saving phase: Injectivity and stability for phase retrieval


Afonso S. Bandeira, Jameson Cahill, Dustin G. Mixon, Aaron A. Nelson


(Submitted on 19 Feb 2013 (v1), last revised 14 Mar 2013 (this version, v2))


Recent advances in convex optimization have led to new strides in the phase retrieval problem over finite-dimensional vector spaces. However, certain fundamental questions remain: What sorts of measurement vectors uniquely determine every signal up to a global phase factor, and how many are needed to do so? Furthermore, which measurement ensembles lend stability? This paper presents several results that address each of these questions. We begin by characterizing injectivity, and we identify that the complement property is indeed a necessary condition in the complex case. We then pose a conjecture that 4M-4 generic measurement vectors are both necessary and sufficient for injectivity in M dimensions, and we prove this conjecture in the special cases where M=2,3. Next, we shift our attention to stability, both in the worst and average cases. Here, we characterize worst-case stability in the real case by introducing a numerical version of the complement property. This new property bears some resemblance to the restricted isometry property of compressed sensing and can be used to derive a sharp lower Lipschitz bound on the intensity measurement mapping. Localized frames are shown to lack this property (suggesting instability), whereas Gaussian random measurements are shown to satisfy this property with high probability. We conclude by presenting results that use a stochastic noise model in both the real and complex cases, and we leverage Cramer-Rao lower bounds to identify stability with stronger versions of the injectivity characterizations.
Adaptive low rank and sparse decomposition of video using compressive sensing


Fei Yang, Hong Jiang, Zuowei Shen, Wei Deng, Dimitris Metaxas


(Submitted on 6 Feb 2013)


We address the problem of reconstructing and analyzing surveillance videos using compressive sensing. We develop a new method that performs video reconstruction by low rank and sparse decomposition adaptively. Background subtraction becomes part of the reconstruction. In our method, a background model is used in which the background is learned adaptively as the compressive measurements are processed. The adaptive method has low latency, and is more robust than previous methods. We will present experimental results to demonstrate the advantages of the proposed method.




Wideband Spectrum Sensing for Cognitive Radio Networks: A Survey


Hongjian Sun, Arumugam Nallanathan, Cheng-Xiang Wang, Yunfei Chen


(Submitted on 7 Feb 2013 (v1), last revised 4 Mar 2013 (this version, v2))


Cognitive radio has emerged as one of the most promising candidate solutions to improve spectrum utilization in next generation cellular networks. A crucial requirement for future cognitive radio networks is wideband spectrum sensing: secondary users reliably detect spectral opportunities across a wide frequency range. In this article, various wideband spectrum sensing algorithms are presented, together with a discussion of the pros and cons of each algorithm and the challenging issues. Special attention is paid to the use of sub-Nyquist techniques, including compressive sensing and multi-channel sub-Nyquist sampling techniques.








Lensless Compressive Sensing Imaging


Gang Huang, Hong Jiang, Kim Matthews, Paul Wilford


(Submitted on 7 Feb 2013)


In this paper, we propose a lensless compressive sensing imaging architecture. The architecture consists of two components, an aperture assembly and a sensor. No lens is used. The aperture assembly consists of a two dimensional array of aperture elements. The transmittance of each aperture element is independently controllable. The sensor is a single detection element, such as a single photo-conductive cell. Each aperture element together with the sensor defines a cone of a bundle of rays, and the cones of the aperture assembly define the pixels of an image. Each pixel value of an image is the integration of the bundle of rays in a cone. The sensor is used for taking compressive measurements. Each measurement is the integration of rays in the cones modulated by the transmittance of the aperture elements. A compressive sensing matrix is implemented by adjusting the transmittance of the individual aperture elements according to the values of the sensing matrix. The proposed architecture is simple and reliable because no lens is used. Furthermore, the sharpness of an image from our device is only limited by the resolution of the aperture assembly, but not affected by blurring due to defocus. The architecture can be used for capturing images of visible light, and other spectra such as infrared, or millimeter waves. Such devices may be used in surveillance applications for detecting anomalies or extracting features such as speed of moving objects. Multiple sensors may be used with a single aperture assembly to capture multi-view images simultaneously. A prototype was built by using a LCD panel and a photoelectric sensor for capturing images of visible spectrum.
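
The measurement model described above is easy to simulate. The sketch below (purely illustrative; the image, pattern size and number of measurements are assumptions) forms each compressive measurement as the sum of pixel values weighted by a random 0/1 transmittance pattern on the aperture assembly.

set.seed(7)
img <- matrix(0, 16, 16); img[5:12, 7:10] <- 1            # a simple bright rectangle
x <- as.vector(img)                                       # pixel values defined by the aperture cones
m <- 64                                                   # number of compressive measurements
A <- matrix(rbinom(m * length(x), 1, 0.5), m, length(x))  # random 0/1 transmittance patterns
y <- A %*% x                                              # light integrated by the single sensor

# A real reconstruction would use a sparse or TV solver; a crude minimum-norm baseline:
x_mn <- t(A) %*% solve(A %*% t(A), y)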




Adaptive Compressive Spectrum Sensing for Wideband Cognitive Radios


Hongjian Sun, Wei-Yu Chiu, A. Nallanathan


(Submitted on 7 Feb 2013)


This letter presents an adaptive spectrum sensing algorithm that detects wideband spectrum using sub-Nyquist sampling rates. By taking advantage of compressed sensing (CS), the proposed algorithm reconstructs the wideband spectrum from compressed samples. Furthermore, an l2 norm validation approach is proposed that enables cognitive radios (CRs) to automatically terminate the signal acquisition once the current spectral recovery is satisfactory, leading to enhanced CR throughput. Numerical results show that the proposed algorithm can not only shorten the spectrum sensing interval, but also improve the throughput of wideband CRs.




Wideband Spectrum Sensing with Sub-Nyquist Sampling in Cognitive Radios


Hongjian Sun, Wei-Yu Chiu, Jing Jiang, Arumugam Nallanathan, H. Vincent Poor


(Submitted on 7 Feb 2013)


Multi-rate asynchronous sub-Nyquist sampling (MASS) is proposed for wideband spectrum sensing. Corresponding spectral recovery conditions are derived and the probability of successful recovery is given. Compared to previous approaches, MASS offers lower sampling rate, and is an attractive approach for cognitive radio networks.
Surveillance Video Processing Using Compressive Sensing


Hong Jiang, Wei Deng, Zuowei Shen


(Submitted on 8 Feb 2013)


A compressive sensing method combined with decomposition of a matrix formed with image frames of a surveillance video into low rank and sparse matrices is proposed to segment the background and extract moving objects in a surveillance video. The video is acquired by compressive measurements, and the measurements are used to reconstruct the video by a low rank and sparse decomposition of matrix. The low rank component represents the background, and the sparse component is used to identify moving objects in the surveillance video. The decomposition is performed by an augmented Lagrangian alternating direction method. Experiments are carried out to demonstrate that moving objects can be reliably extracted with a small amount of measurements.
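
The low-rank plus sparse decomposition at the heart of this approach can be sketched in a few lines of R. This is a generic augmented-Lagrangian alternating-direction RPCA with common default parameters, not the authors' reconstruction-from-compressive-measurements algorithm; each column of M would hold one vectorized video frame.

svt <- function(X, tau) {                         # singular value thresholding
  s <- svd(X); k <- length(s$d)
  s$u %*% diag(pmax(s$d - tau, 0), k, k) %*% t(s$v)
}
soft <- function(X, tau) sign(X) * pmax(abs(X) - tau, 0)

rpca <- function(M, lambda = 1 / sqrt(max(dim(M))), iters = 200) {
  mu <- 1.25 / norm(M, "2")
  Y <- matrix(0, nrow(M), ncol(M)); S <- Y
  for (k in 1:iters) {
    L <- svt(M - S + Y / mu, 1 / mu)              # low-rank (background) update
    S <- soft(M - L + Y / mu, lambda / mu)        # sparse (moving objects) update
    Y <- Y + mu * (M - L - S)                     # dual variable update
  }
  list(L = L, S = S)
}

# Toy usage: a rank-1 "background" plus a few large sparse entries.
set.seed(8)
M <- outer(rnorm(50), rnorm(30)); M[sample(length(M), 20)] <- 5
dec <- rpca(M)
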
A new compressive video sensing framework for mobile broadcast


Chengbo Li, Hong Jiang, Paul Wilford, Yin Zhang, Mike Scheutzow


(Submitted on 8 Feb 2013)


A new video coding method based on compressive sampling is proposed. In this method, a video is coded using compressive measurements on video cubes. Video reconstruction is performed by minimization of total variation (TV) of the pixelwise DCT coefficients along the temporal direction. A new reconstruction algorithm is developed from TVAL3, an efficient TV minimization algorithm based on the alternating minimization and augmented Lagrangian methods. Video coding with this method is inherently scalable, and has applications in mobile broadcast.
Efficient Data Gathering in Wireless Sensor Networks Based on Matrix Completion and Compressive Sensing


Jiping Xiong, Jian Zhao, Lei Chen


(Submitted on 9 Feb 2013)


Gathering data in an energy efficient manner in wireless sensor networks is an important design challenge. In wireless sensor networks, the readings of sensors always exhibit intra-temporal and inter-spatial correlations. Therefore, in this letter, we use low rank matrix completion theory to explore the inter-spatial correlation and use compressive sensing theory to take advantage of the intra-temporal correlation. Our method, dubbed MCCS, can significantly reduce the amount of data that each sensor must send through the network to the sink, thus prolonging the lifetime of the whole network. Experiments using real datasets demonstrate the feasibility and efficacy of our MCCS method.




On the list decodability of random linear codes with large error rate


Mary Wootters


(Submitted on 9 Feb 2013)


It is well-known that a random q-ary code of rate \Omega(\epsilon^2) is list decodable up to radius (1 - 1/q - \epsilon) with list sizes on the order of 1/\epsilon^2, with probability 1 - o(1). However, until recently, a similar statement about random linear codes had remained elusive. In a recent paper, Cheraghchi, Guruswami, and Velingker show a connection between the list decodability of random linear codes and the Restricted Isometry Property from compressed sensing, and use this connection to prove that a random linear code of rate \Omega(\epsilon^2 / log^3(1/\epsilon)) achieves the list decoding properties above, with constant probability. We improve on their result to show that in fact we may take the rate to be \Omega(\epsilon^2), which is optimal, and further that the success probability is 1 - o(1), rather than constant. As an added benefit, our proof is relatively simple. Finally, we extend our methods to more general ensembles of linear codes. As an example, we show that randomly punctured Reed-Muller codes have the same list decoding properties as the original codes, even when the rate is improved to a constant.




Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization


Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski


(Submitted on 10 Feb 2013 (v1), last revised 7 Mar 2013 (this version, v3))


Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone $K$, a norm $\|\cdot\|$ and a smooth convex function $f$, we want either 1) to minimize the norm over the intersection of the cone and a level set of $f$, or 2) to minimize over the cone the sum of $f$ and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, (b) $\|\cdot\|$ is "too complicated" to allow for computationally cheap Bregman projections required in the first order proximal algorithms. On the other hand, we assume that the intersection of the unit ball of $\|\cdot\|$ with $K$ allows for a cheap Minimization Oracle capable of minimizing linear forms over the intersection. Motivating examples are given by the nuclear norm with $K$ being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm (in its original form aimed at minimizing smooth convex functions over bounded domains given by minimization oracles) capable of handling our problems of interest, provide the related theoretical efficiency estimates and outline some applications.
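
For reference, the classical conditional gradient (Frank-Wolfe) iteration that the paper builds on can be sketched as follows; lmo denotes the linear minimization oracle over the feasible set, and the l1-ball oracle given as an example is an illustrative assumption.

frank_wolfe <- function(grad, lmo, x0, iters = 200) {
  x <- x0
  for (k in 1:iters) {
    s <- lmo(grad(x))              # argmin over the feasible set of <grad(x), s>
    gamma <- 2 / (k + 2)           # standard step-size schedule
    x <- (1 - gamma) * x + gamma * s
  }
  x
}

# Example oracle for the unit l1 ball: return a signed vertex.
lmo_l1 <- function(g) { s <- rep(0, length(g)); i <- which.max(abs(g)); s[i] <- -sign(g[i]); s }

# Toy usage: minimize 0.5 * ||x - e1||^2 over the unit l1 ball.
x_star <- frank_wolfe(function(x) x - c(1, rep(0, 9)), lmo_l1, rep(0, 10))
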
Adaptive Space-Time Beamforming in Radar Systems


Rodrigo C. de Lamare


(Submitted on 10 Feb 2013)


The goal of this chapter is to review the recent work and advances in the area of space-time beamforming algorithms and their application to radar systems. These systems include phased-array \cite{melvin} and multi-input multi-output (MIMO) radar systems \cite{haimo_08}, mono-static and bi-static radar systems and other configurations \cite{melvin}. Furthermore, this chapter also describes in detail some of the most successful space-time beamforming algorithms that exploit low-rank and sparsity properties as well as the use of prior-knowledge to improve the performance of STAP algorithms in radar systems.
Compressed Sensing with Incremental Sparse Measurements


Xiaofu Wu, Zhen Yang, Lu Gan


(Submitted on 11 Feb 2013)


This paper proposes a verification-based decoding approach for reconstruction of a sparse signal with incremental sparse measurements. In its first step, the verification-bas...
09 Apr 04:07

Information theory + language

by Konstantinos
Check this out + prepare to have your mind blown. From the series “Through the Wormhole”.
09 Apr 00:17

Insider ML Jobs

by Danny Bickson
In this blog post I will publish some open ML positions related to big data analytics that I got from my contacts. These positions are not yet public and are being published here first.

Post-doctoral fellow in biomedical informatics for 1 year; requires a PhD in CS, bioinformatics, EE, physics, or statistics. Emphasis on applications of Big Data to medicine. The fellow will be involved in projects that mine the electronic health record of the Veterans Affairs system.

For details contact Alon Ben-Ari, MD, department of anesthesiology VA Puget Sound Seattle, Washington.





Stay tuned - more jobs to be posted soon..
08 Apr 23:54

Linear SVM, Ball SVM, and minimum enclosing ball

Linear SVM and the kernel trick are certainly among the cornerstones of supervised machine learning. In fact, they can be generalized to containment machines (geometric machines): instead of using halfspace containment as in SVM, one can choose ball containment and build Ball SVM, and so on.
In Approximating smallest enclosing balls with applications to machine learning, we investigate several algorithms for computing enclosing balls. There is a special emphasis on the nice covering/piercing duality, illustrated by this figure:
[Figure: primal/dual piercing-covering ball duality (PrimalDualPiercingCoveringBall.png)]
See also Approximating smallest enclosing balls
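
A minimal sketch of one such approximation scheme (assuming the standard Badoiu-Clarkson update, which may differ in detail from the algorithms studied in the paper): the center is repeatedly pulled a 1/(t+1) fraction of the way toward the current farthest point, which yields a (1+eps)-approximate enclosing ball after on the order of 1/eps^2 iterations.

approx_meb <- function(P, iters = 1000) {        # P: one point per row
  ctr <- P[1, ]
  for (t in 1:iters) {
    d2 <- rowSums(sweep(P, 2, ctr)^2)            # squared distances to the current center
    far <- which.max(d2)
    ctr <- ctr + (P[far, ] - ctr) / (t + 1)      # pull the center toward the farthest point
  }
  list(center = ctr,
       radius = sqrt(max(rowSums(sweep(P, 2, ctr)^2))))
}

set.seed(4)
ball <- approx_meb(matrix(rnorm(400), 200, 2))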

08 Apr 04:34

Dirichlet Process, Infinite Mixture Models, and Clustering

by Wesley

(This article was first published on Statistical Research » R, and kindly contributed to R-bloggers)

The Dirichlet process provides a very interesting approach to understanding group assignments and models for clustering effects. Often we encounter the k-means approach; however, it requires a fixed number of clusters, and we frequently face situations where we don't know how many clusters we need. Suppose we're trying to identify groups of voters. We could use political partisanship (e.g. low/medium/high Democratic vote), but that may not necessarily describe the data appropriately. In that case we can turn to Bayesian nonparametrics and the Dirichlet process and use some of its constructions to solve this problem. Three in particular are commonly used as examples: the Chinese Restaurant Model, Pólya's Urn, and Stick Breaking.

Chinese Restaurant Model


The Chinese Restaurant Model is based on the idea that there is a restaurant with an infinite number of tables, and at each table there is an infinite number of seats. The first customer arrives and sits down at a table. The second customer then arrives and selects a table: they join the table where the first customer is sitting with probability 1/(1+\alpha), or select a new table with probability \alpha/(1+\alpha). This continues on to the (n+1)^{st} customer, who joins a table already occupied by n_{k} customers with probability n_{k}/(n+\alpha), or starts a new table with probability \alpha/(n+\alpha).

crp <- function(num.customers, alpha) {
  table <- c(1)                       # table assignment of each customer; customer 1 sits at table 1
  next.table <- 2
  for (i in 1:num.customers) {
    if (runif(1, 0, 1) < alpha / (alpha + i)) {
      # Seat the customer at a brand-new table.
      table <- c(table, next.table)
      next.table <- next.table + 1
    } else {
      # Seat the customer at an occupied table, chosen with
      # probability proportional to the number of people already there.
      select.table <- table[sample(1:length(table), 1)]
      table <- c(table, select.table)
    }
  }
  table
}
crp(100, 4)
plot(
  table(crp(10000, 2)),
  xlab = "Table Index", ylab = "Frequency"
)

 

Pólya’s Urn Model

In the Pólya Urn model we take the approach where there exists an urn of colored balls. We take a ball out of the urn and note its color. We then place the ball back into the urn and add an additional ball of the same color (or, with probability \alpha/(\alpha+n), a ball of a brand-new color, as in the code below). This process can continue on infinitely.

rgb2hex <- function(x){
hex.part = ""
hex <- ""
for(i in 1:3){
b16 <- x[i]/16  # x is a length-3 vector of RGB values in 0-255
int.one <- trunc(b16)
if(int.one>=10){
val.one <- letters[int.one-10+1]
} else {
val.one <- int.one
}
fract <- abs( b16 - int.one )
int.two <- fract*16
if(int.two>=10){
val.two <- letters[int.two-10+1]
} else {
val.two <- int.two
}
hex.part[i] <- paste(val.one,val.two, sep="")
hex <- paste(hex,hex.part[i], sep="")
}
hex
}
polyaUrnModel = function(baseColorDistribution, num.balls, alpha) {
balls <- c()
for (i in 1:num.balls) {
if (runif(1,0,1) < alpha / (alpha + length(balls))) {
# Add a new ball color.
library(colorspace)
color.comb <- expand.grid(x=seq(0,255),y=seq(0,255),z=seq(0,255))
location.picker <- min(nrow(color.comb), max(1, round(rnorm(1, nrow(color.comb)/2, (nrow(color.comb)-1)/4)))) # clamp to a valid row index
the.color <- c( color.comb[location.picker,1], color.comb[location.picker,2], color.comb[location.picker,3])
the.hex <- paste("#",rgb2hex(the.color), sep="")
new.color <- the.hex
balls = c(balls, new.color)
} else {
# Pick out a ball from the urn, and add back a
# ball of the same color.
ball = balls[sample(1:length(balls), 1)]
balls = c(balls, ball)
}
}
balls
}
pum <- polyaUrnModel(function() rnorm(1,0,1), 100, 1)
barplot( table(pum), col=names(table(pum)), pch=10 )

 

[Figure: Pólya Urn Model]

Stick Breaking Model

With this third model we simply start breaking a stick and continue to break it into smaller pieces. The process starts from a stick of length 1.0. We generate one random number from the Beta distribution, \beta_{1} ~ Beta(1,\alpha), and break the stick at \beta_1; the piece to the left we call \nu_1 = \beta_1. We then take the remaining stick to the right and break it again at a fraction \beta_{2} ~ Beta(1, \alpha); once again the piece to the left we call \nu_2 = \left(1-\beta_1\right) \cdot \beta_2. Continuing in this way, the weights generated sum to 1.0.


stickBreakingModel = function(num.vals, alpha) {
betas = rbeta(num.vals, 1, alpha)
stick.to.right = c(1, cumprod(1 - betas))[1:num.vals]
weights = stick.to.right * betas
weights
}

plot( stickBreakingModel(100,5), pch=16, cex=.5 )

 

[Figure: Stick Breaking Probabilities]

Multivariate Clustering


##
# Generate some fake data with some uniform random means
##
generateFakeData <- function( num.vars=3, n=100, num.clusters=5, seed=NULL ) {
if(is.null(seed)){
set.seed(runif(1,0,100))
} else {
set.seed(seed)
}
data <- data.frame(matrix(NA, nrow=n, ncol=num.vars+1))

mu <- NULL
for(m in 1:num.vars){
mu <- cbind(mu,rnorm(num.clusters, runif(1,-10,15), 5))
}

for (i in 1:n) {
cluster <- sample(1:num.clusters, 1)
data[i, 1] <- cluster
for(j in 1:num.vars){
data[i, j+1] <- rnorm(1, mu[cluster,j], 1)
}
}

data$X1 <- factor(data$X1)
var.names <- paste("VAR",seq(1,ncol(data)-1), sep="")
names(data) <- c("cluster",var.names)

return(data)
}

##
# Set up a procedure to calculate the cluster means using squared distance
##
dirichletClusters <- function(orig.data, disp.param = NULL, max.iter = 100, tolerance = .001)
{
n <- nrow( orig.data )

data <- as.matrix( orig.data )
pick.clusters <- rep(1, n)
k <- 1

mu <- matrix( apply(data,2,mean), nrow=1, ncol=ncol(data) )

is.converged <- FALSE
iteration <- 0

ss.old <- Inf
ss.curr <- Inf

while ( !is.converged & iteration < max.iter ) { # Iterate until convergence
iteration <- iteration + 1

for( i in 1:n ) { # Iterate over each observation and measure its squared distance from each cluster center
distances <- rep(NA, k)

for( j in 1:k ){
distances[j] <- sum( (data[i, ] - mu[j, ])^2 ) # Distance formula.
}

if( min(distances) > disp.param ) { # If the dispersion parameter is still less than the minimum distance then create a new cluster
k <- k + 1
pick.clusters[i] <- k
mu <- rbind(mu, data[i, ])
} else {
pick.clusters[i] <- which(distances == min(distances))
}

}

##
# Calculate new cluster means
##
for( j in 1:k ) {
if( sum(pick.clusters == j) > 0 ) { # at least one observation is currently assigned to cluster j
mu[j, ] <- apply(subset(data, pick.clusters == j), 2, mean)
}
}

##
# Test for convergence
##
ss.curr <- 0
for( i in 1:n ) {
ss.curr <- ss.curr +
sum( (data[i, ] - mu[pick.clusters[i], ])^2 )
}
ss.diff <- ss.old - ss.curr
ss.old <- ss.curr
if( !is.nan( ss.diff ) & ss.diff < tolerance ) {
is.converged <- TRUE
}

}

centers <- data.frame(mu)
ret.val <- list("centers" = centers, "cluster" = factor(pick.clusters),
"k" = k, "iterations" = iteration)

return(ret.val)
}

create.num.vars <- 3
orig.data <- generateFakeData(create.num.vars, num.clusters=3, n=1000, seed=123)
dp.update <- dirichletClusters(orig.data[, 2:(create.num.vars+1)], disp.param=25) # parentheses needed so that the variable columns 2..(k+1) are selected
library(ggplot2)
ggplot(orig.data, aes(x = VAR1, y = VAR3, color = cluster)) + geom_point()

[Figure: Clustering 1]
In this example I have provided some R code that clusters observations based on any given number of variables. The measure of distance from the group centroid is the multivariate sum of squared distances, though there are many other distance measures that could be implemented (e.g. Manhattan, Euclidean, etc.).
[Figure: Clustering 2]

08 Apr 02:17

Structural Information in Nanopore Sequencing ?

by Igor

I recently mentioned a comparison between nanopore sequencing and well logging (see Of Well Logging and Nanopore Sequencing). It recently occurred to me that one possible missing element of this analysis is knots, or more precisely how to untie them.



A recent study showed that gene activation seems to be related to how entangled different parts of the DNA are [1]. How is this related to nanopore sequencing and well logging? Quite simply: in the case of DNA sequencing, the 2009 paper that started it all [2] shows a figure where one can witness a non-constant traveling of the DNA strand through the nanopore, with G, T, C, A being separated by uneven distances:
In other words, the interdistance between the G, T, C, A readings in that figure indicates how much resistance the DNA strand offered as it was pushed through the nanopore; that is, the interdistance is a feature reflecting how knots are being forced through the nanopore. Said another way, those interdistances are related to the structure of the DNA being sequenced, information you cannot obtain with current sequencing technology. I am sure the folks at Oxford Nanopore Technologies, or the folks who attended CASP10 and the Future of Structure in Biology, are aware of this. One certainly wonders whether this interdistance effect is a reproducible feature.
In well logging, that type of measurement could provide additional information about the soil being investigated.

[1] Colocalization of Coregulated Genes: A Steered Molecular Dynamics Study of Human Chromosome 19 by Marco Di Stefano, Angelo Rosa, Vincenzo Belcastro, Diego di Bernardo, Cristian Micheletti
The connection between chromatin nuclear organization and gene activity is vividly illustrated by the observation that transcriptional coregulation of certain genes appears to be directly influenced by their spatial proximity. This fact poses the more general question of whether it is at all feasible that the numerous genes that are coregulated on a given chromosome, especially those at large genomic distances, might become proximate inside the nucleus. This problem is studied here using steered molecular dynamics simulations in order to enforce the colocalization of thousands of knowledge-based gene sequences on a model for the gene-rich human chromosome 19. Remarkably, it is found that most () gene pairs can be brought simultaneously into contact. This is made possible by the low degree of intra-chromosome entanglement and the large number of cliques in the gene coregulatory network. A clique is a set of genes coregulated all together as a group. The constrained conformations for the model chromosome 19 are further shown to be organized in spatial macrodomains that are similar to those inferred from recent HiC measurements. The findings indicate that gene coregulation and colocalization are largely compatible and that this relationship can be exploited to draft the overall spatial organization of the chromosome in vivo. The more general validity and implications of these findings could be investigated by applying to other eukaryotic chromosomes the general and transferable computational strategy introduced here. Author Summary Recent high-throughput experiments have shown that chromosome regions (loci) which accommodate specific sets of coregulated genes can be in close spatial proximity despite their possibly large sequence separation. The findings pose the question of whether gene coregulation and gene colocalization are related in general. Here, we tackle this problem using a knowledge-based coarse-grained model of human chromosome 19. Specifically, we carry out steered molecular dynamics simulations to promote the colocalization of hundreds of gene pairs that are known to be significantly coregulated. We show that most () of such pairs can be simultaneously colocalized. This result is, in turn, shown to depend on at least two distinctive chromosomal features: the remarkably low degree of intra-chain entanglement found in chromosomes inside the nucleus and the large number of cliques present in the gene coregulatory network. The results are therefore largely consistent with the coregulation-colocalization hypothesis. Furthermore, the model chromosome conformations obtained by applying the coregulation constraints are found to display spatial macrodomains that have significant similarities with those inferred from HiC measurements of human chromosome 19. This finding suggests that suitable extensions of the present approach might be used to propose viable ensembles of eukaryotic chromosome conformations in vivo.

[2] James Clarke, Hai-Chen Wu, Lakmal Jayasinghe, Alpesh Patel, Stuart Reid, Hagan Bayley (2009). Continuous base identification for single-molecule nanopore DNA sequencing Nature Nanotechnology DOI:10.1038/nnano.2009.12

08 Apr 02:09

The Pragmatic Haskeller – Episode 1

by Patrick Durusau

The Pragmatic Haskeller – Episode 1 by Alfredo Di Napoli.

The first episode of “The Pragmatic Haskeller” starts with:

In the beginning was XML, and then JSON.

When I read that sort of thing, it is hard to know whether to weep or pitch a fit.

Neither one is terribly productive but if you are interested in the rich heritage that XML relies upon drop me a line.

The first lesson is a flying start on Haskell data and moving it between JSON and XML formats.

07 Apr 20:52

Faster Cortex-A8 16-bit Multiplies

by Nils

I did a small and fun assembler SIMD optimization job last week. The target architecture was ARMv6, but since the code will run on the iPhone I tried to keep the code fast on the Cortex-A8 as well.

When I did some profiling on my BeagleBoard, I got some surprising results: the code ran faster than it should have. This was odd. That had never happened to me before.

Fast forward 5 hours and lots of micro-benchmarking:

The 16 bit multiplies SMULxy on the Cortex-A8 are a cycle faster than documented!

They take one cycle for issue and have a result-latency of three cycles (rule of thumb, it’s a bit more complicated than that). And this applies to all variants of this instruction: SMULBB, SMULBT, SMULTT and SMULTB.

The multiply-accumulate variants of the 16-bit multiplies execute as documented: two cycles issue and three cycles result-latency.

This is nice. I have used the 16-bit multiplies a lot in the past but stopped using them because I thought they offered no benefit over the more general MUL instruction on the Cortex-A8. The SMULxy multiplies mix very well with the ARMv6 SIMD multiplies. Both of them work on 16-bit integers, but the SIMD instructions take packed 16-bit arguments, while for SMULxy you specify whether you want the top or bottom 16 bits of each argument. Very flexible.

All this leads to nice code sequences. For example a three element dot-product of signed 16 bit quantities. Something that is used quite a lot for color-space conversion.

Assume this register-values on entry:

              MSB16      LSB16

          +----------+----------+
      r1: | ignored  |    a1    |
          +----------+----------+
          +----------+----------+
      r2: | ignored  |    a2    |
          +----------+----------+
          +----------+----------+
      r3: |    b1    |    c1    |
          +----------+----------+
          +----------+----------+
      r4: |    b2    |    c2    |
          +----------+----------+

And this code sequence:

    smulbb      r0, r1, r2
    smlad       r0, r3, r4, r0
 

Gives a result: r0 = (a1*a2) + (b1*b2) + (c1*c2)

On the Cortex-A8 this will schedule like this:

  Cycle0:

    smulbb      r0, r1, r2          Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)

  Cycle1:        

    smlad       r0, r3, r4, r0      Pipe0
    nop                             Pipe1   (free, can be used for non-multiplies)

  Cycle2:        

    blocked, because smlad is a multi-cycle instruction.

The result (r0) will be available three cycles later (cycle 6) for most instructions. You can execute whatever you want in-between as long as you don’t touch r0.
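
For reference, here is what that two-instruction sequence computes, written in plain C++ (a sketch added for clarity; a compiler would not necessarily emit the smulbb/smlad pair shown above):

// Scalar reference for the smulbb + smlad sequence: each operand is a signed
// 16-bit value, and the three products are accumulated in 32 bits.
#include <cstdint>

int32_t dot3(int16_t a1, int16_t a2,
             int16_t b1, int16_t b2,
             int16_t c1, int16_t c2) {
    return int32_t(a1) * a2 + int32_t(b1) * b2 + int32_t(c1) * c2;
}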

Note that this is a special case: The SMULBB instruction in cycle0 generates the result in R0. If the next instruction is one of the multiply-accumulate family, and the register is used as the accumulate argument a special forward path of the Cortex-A8 kicks in and the result latency is lowered to just one cycle. Cool, ain’t it?

Btw: Thanks to Måns/MRU. He was so kind and verified the timing on his beagleboard.

07 Apr 17:39

simple portabella pasta $4.94 recipe / $1.24 serving

by Beth M
My favorite recipes are those that feature just a few ingredients so that the subtle and intricate flavors are not lost. ...not to mention, they're just easier to make! ;)

Nearly every time I go to the grocery store I look at the portabella mushrooms longingly. They're so good, but so expensive. This time I decided to take the plunge. I figured I could just employ the basic budget bytes principle of combining expensive ingredients with bulkier inexpensive ingredients to keep the cost down. Well, it worked marvelously!

If you have the option to buy the portabella caps loose (by the pound) rather than packaged, you're likely to get a better deal. My grocery store sells them both ways, $5.99 for a package of three small-ish caps, or $5.99/lb. Portabella mushrooms are very light and spongy, so they don't weigh a lot. I got two large caps for just $2.70. Of course, weigh them out first and compare.

This pasta is super simple, but packed with flavor. Rich portabella mushrooms give a deep earthy flavor, while garlic, thyme, parmesan, and a splash of red wine vinegar add zing. All that marvelous flavor grabs a hold of the pasta and rides right into your happy mouth. Your very, very happy mouth.

Want to take it a step further? Try adding either spinach, sun dried tomatoes, feta cheese, or goat cheese. They all go marvelously with portabella mushrooms.

Simple Portabella Pasta

Total Recipe cost: $4.94
Servings Per Recipe: 4-6
Cost per serving: $1.24 (for four large servings)
Prep time: 15 min. Cook time: 30 min. Total: 45 min.

INGREDIENTS COST
2 large portabella mushroom caps $2.70
3 Tbsp olive oil (divided) $0.48
1 Tbsp red wine vinegar $0.04
12 oz. rigatoni $0.82
2 cloves garlic $0.16
1/2 tsp dried thyme $0.03
1/4 cup grated parmesan $0.42
to taste salt & pepper $0.10
1/4 bunch flat leaf parsley $0.19
TOTAL $4.94

STEP 1: Preheat the oven to 400 degrees. Line a baking sheet with foil and then spritz lightly with non-stick spray. Lightly brush off any dirt or debris from the portabella mushrooms and then drizzle each cap with 1/2 tablespoon of olive oil, 1/2 tablespoon of red wine vinegar, and a slight sprinkle of salt and pepper (both sides). Place the seasoned mushrooms on the prepared baking sheet (gill side down) and roast in the preheated oven for 30 minutes.

STEP 2: While the mushrooms are roasting, begin to boil a large pot of water to cook the pasta. Once the water reaches a full boil, add the pasta and continue to boil for 7-10 minutes or until tender. Reserve about a cup of the starchy cooking water before draining the pasta in a colander.

STEP 3: Mince the garlic and add it to a large skillet along with the remaining 2 tablespoons of olive oil and the dried thyme. Cook the garlic and thyme over medium-low heat for 2-3 minutes, or until the garlic is tender. Remove the skillet from the heat. Roughly chop the parsley.

STEP 4: When the mushrooms have finished roasting, carefully cut each cap in half and then slice crosswise into thin strips. Add the drained pasta, sliced mushrooms, and chopped parsley to the skillet with the garlic (heat turned off). Toss to coat. Use a small amount of the reserved pasta water to loosen the pasta if it becomes dry. Add the parmesan cheese and toss to coat once again. If desired, add an extra splash of red wine vinegar before serving.

Simple Portabella Pasta


Step By Step Photos



season mushrooms
Preheat the oven to 400 degrees. Cover a baking sheet with foil and spritz lightly with non-stick spray. Drizzle the mushrooms with about a 1/2 Tbsp of olive oil and red wine vinegar (each) and sprinkle lightly with salt and pepper. Turn the caps over so that they're gill side down (opposite of this picture) and roast for 30 minutes in the oven.

boil pasta
Meanwhile, bring a large pot of water to a rolling boil and then add the pasta. Boil the pasta for 7-10 minutes, or until it is tender. Save some of the starchy pasta water before draining the pasta in a colander.

garlic & thyme
Mince the garlic and add it to a large skillet along with the remaining 2 Tbsp of olive oil and the dried thyme. Cook over medium-low heat for a couple of minutes, or until the garlic is tender. Once it's tender, turn the heat off.

slice mushrooms
When the mushrooms are finished roasting, cut each in half and then slice crosswise into thin strips. Be careful, as they're very hot. Cut as thin as possible so that you can get a little bit of mushroom in every bite!

mushrooms skillet
Add the sliced mushrooms to the skillet...

pasta parsley
...along with the drained pasta and roughly chopped parsley. Toss everything to coat in the olive oil mixture. If your pasta gets dry or sticky, add a touch of the reserved pasta water.

parmesan
Lastly, add the parmesan and toss to coat again. I like to add the parmesan last, after everything has had a chance to cool slightly so that the parmesan coats the pasta in its grated form, rather than melting and sticking together.

finished pasta
And then it's ready to eat! For extra zing, you can add another splash of red wine vinegar just before serving.

Simple Portabella Pasta

YUM.

07 Apr 17:35

System Architectures for Personalization and Recommendation

by noreply@blogger.com (Xavier Amatriain)
by Xavier Amatriain and Justin Basilico


In our previous posts about Netflix personalization, we highlighted the importance of using both data and algorithms to create the best possible experience for Netflix members. We also talked about the importance of enriching the interaction and engaging the user with the recommendation system. Today we're exploring another important piece of the puzzle: how to create a software architecture that can deliver this experience and support rapid innovation. Coming up with a software architecture that handles large volumes of existing data, is responsive to user interactions, and makes it easy to experiment with new recommendation approaches is not a trivial task. In this post we will describe how we address some of these challenges at Netflix.

To start with, we present an overall system diagram for recommendation systems in the following figure. The main components of the architecture contain one or more machine learning algorithms. 


The simplest thing we can do with data is to store it for later offline processing, which leads to part of the architecture for managing Offline jobs. However, computation can be done offline, nearline, or online. Online computation can respond better to recent events and user interaction, but has to respond to requests in real-time. This can limit the computational complexity of the algorithms employed as well as the amount of data that can be processed. Offline computation has less limitations on the amount of data and the computational complexity of the algorithms since it runs in a batch manner with relaxed timing requirements. However, it can easily grow stale between updates because the most recent data is not incorporated. One of the key issues in a personalization architecture is how to combine and manage online and offline computation in a seamless manner. Nearline computation is an intermediate compromise between these two modes in which we can perform online-like computations, but do not require them to be served in real-time. Model training is another form of computation that uses existing data to generate a model that will later be used during the actual computation of results. Another part of the architecture describes how the different kinds of events and data need to be handled by the Event and Data Distribution system. A related issue is how to combine the different Signals and Models that are needed across the offline, nearline, and online regimes. Finally, we also need to figure out how to combine intermediate Recommendation Results in a way that makes sense for the user. The rest of this post will detail these components of this architecture as well as their interactions. In order to do so, we will break the general diagram into different sub-systems and we will go into the details of each of them. As you read on, it is worth keeping in mind that our whole infrastructure runs across the public Amazon Web Services cloud.

Offline, Nearline, and Online Computation



As mentioned above, our algorithmic results can be computed either online in real-time, offline in batch, or nearline in between. Each approach has its advantages and disadvantages, which need to be taken into account for each use case.

Online computation can respond quickly to events and use the most recent data. An example is to assemble a gallery of action movies sorted for the member using the current context. Online components are subject to availability and response time Service Level Agreements (SLAs) that specify the maximum latency of the process in responding to requests from client applications while our member is waiting for recommendations to appear. This can make it harder to fit complex and computationally costly algorithms in this approach. Also, a purely online computation may fail to meet its SLA in some circumstances, so it is always important to think of a fast fallback mechanism such as reverting to a precomputed result. Computing online also means that the various data sources involved need to be available online, which can require additional infrastructure.
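
As a concrete illustration of that fallback idea, here is a minimal sketch (not Netflix code; computeOnline and loadPrecomputed are hypothetical stand-ins) of serving a request under a latency budget and reverting to a precomputed result when the online path cannot answer in time:

#include <chrono>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-ins: the real online algorithm and the offline result
// store are internal services and are not shown here.
std::vector<std::string> computeOnline(int memberId) {
  return {"fresh-reco-for-" + std::to_string(memberId)};
}
std::vector<std::string> loadPrecomputed(int memberId) {
  return {"offline-reco-for-" + std::to_string(memberId)};
}

// Serve recommendations under an SLA: use the online result if it arrives
// within the latency budget, otherwise fall back to the offline batch result.
std::vector<std::string> recommend(int memberId,
                                   std::chrono::milliseconds slaBudget) {
  auto pending = std::async(std::launch::async, computeOnline, memberId);
  if (pending.wait_for(slaBudget) == std::future_status::ready)
    return pending.get();            // fresh, context-aware result
  return loadPrecomputed(memberId);  // stale but always available
}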

On the other end of the spectrum, offline computation allows for more choices in algorithmic approach such as complex algorithms and less limitations on the amount of data that is used. A trivial example might be to periodically aggregate statistics from millions of movie play events to compile baseline popularity metrics for recommendations. Offline systems also have simpler engineering requirements. For example, relaxed response time SLAs imposed by clients can be easily met. New algorithms can be deployed in production without the need to put too much effort into performance tuning. This flexibility supports agile innovation. At Netflix we take advantage of this to support rapid experimentation: if a new experimental algorithm is slower to execute, we can choose to simply deploy more Amazon EC2 instances to achieve the throughput required to run the experiment, instead of spending valuable engineering time optimizing performance for an algorithm that may prove to be of little business value. However, because offline processing does not have strong latency requirements, it will not react quickly to changes in context or new data. Ultimately, this can lead to staleness that may degrade the member experience. Offline computation also requires having infrastructure for storing, computing, and accessing large sets of precomputed results.

Nearline computation can be seen as a compromise between the two previous modes. In this case, computation is performed exactly like in the online case. However, we remove the requirement to serve results as soon as they are computed and can instead store them, allowing it to be asynchronous. The nearline computation is done in response to user events so that the system can be more responsive between requests. This opens the door for potentially more complex processing to be done per event. An example is to update recommendations to reflect that a movie has been watched immediately after a member begins to watch it. Results can be stored in an intermediate caching or storage back-end. Nearline computation is also a natural setting for applying incremental learning algorithms.

In any case, the choice of online/nearline/offline processing is not an either/or question. All approaches can and should be combined. There are many ways to combine them. We already mentioned the idea of using offline computation as a fallback. Another option is to precompute part of a result with an offline process and leave the less costly or more context-sensitive parts of the algorithms for online computation.

Even the modeling part can be done in a hybrid offline/online manner. This is not a natural fit for traditional supervised classification applications where the classifier has to be trained in batch from labeled data and will only be applied online to classify new inputs. However, approaches such as Matrix Factorization are a more natural fit for hybrid online/offline modeling: some factors can be precomputed offline while others can be updated in real-time to create a more fresh result. Other unsupervised approaches such as clustering also allow for offline computation of the cluster centers and online assignment of clusters. These examples point to the possibility of separating our model training into a large-scale and potentially complex global model training on the one hand and a lighter user-specific model training or updating phase that can be performed online.
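
To make the hybrid idea concrete, here is a small sketch (an illustration added here, not Netflix's implementation): item factors are assumed to come from an offline batch factorization, the user factor is nudged online after each play event, and scoring is just an inner product:

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Inner product of two factor vectors; this is the (cheap) online scoring step.
double score(const Vec& user, const Vec& item) {
  double s = 0.0;
  for (std::size_t i = 0; i < user.size(); ++i) s += user[i] * item[i];
  return s;
}

// Crude online update: move the user factor toward the factor of an item the
// member just watched. The learning rate is arbitrary and purely illustrative.
void onPlayEvent(Vec& user, const Vec& watchedItem, double rate = 0.1) {
  for (std::size_t i = 0; i < user.size(); ++i)
    user[i] += rate * (watchedItem[i] - user[i]);
}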

Offline Jobs



Much of the computation we need to do when running personalization machine learning algorithms can be done offline. This means that the jobs can be scheduled to be executed periodically and their execution does not need to be synchronous with the request or presentation of the results. There are two main kinds of tasks that fall in this category: model training and batch computation of intermediate or final results. In the model training jobs, we collect relevant existing data and apply a machine learning algorithm that produces a set of model parameters (which we will henceforth refer to as the model). This model will usually be encoded and stored in a file for later consumption. Although most of the models are trained offline in batch mode, we also have some online learning techniques where incremental training is indeed performed online. Batch computation of results is the offline computation process defined above in which we use existing models and corresponding input data to compute results that will be used at a later time either for subsequent online processing or direct presentation to the user.

Both of these tasks need refined data to process, which usually is generated by running a database query. Since these queries run over large amounts of data, it can be beneficial to run them in a distributed fashion, which makes them very good candidates for running on Hadoop via either Hive or Pig jobs. Once the queries have completed, we need a mechanism for publishing the resulting data. We have several requirements for that mechanism: First, it should notify subscribers when the result of a query is ready. Second, it should support different repositories (not only HDFS, but also S3 or Cassandra, for instance). Finally, it should transparently handle errors, allow for monitoring, and alerting. At Netflix we use an internal tool named Hermes that provides all of these capabilities and integrates them into a coherent publish-subscribe framework. It allows data to be delivered to subscribers in near real-time. In some sense, it covers some of the same use cases as Apache Kafka, but it is not a message/event queue system.

Signals & Models



Regardless of whether we are doing an online or offline computation, we need to think about how an algorithm will handle three kinds of inputs: models, data, and signals. Models are usually small files of parameters that have been previously trained offline. Data is previously processed information that has been stored in some sort of database, such as movie metadata or popularity. We use the term "signals" to refer to fresh information we input to algorithms. This data is obtained from live services and can be made of user-related information, such as what the member has watched recently, or context data such as session, device, date, or time.

Event & Data Distribution


Our goal is to turn member interaction data into insights that can be used to improve the member's experience. For that reason, we would like the various Netflix user interface applications (Smart TVs, tablets, game consoles, etc.) to not only deliver a delightful user experience but also collect as many user events as possible. These actions can be related to clicks, browsing, viewing, or even the content of the viewport at any time. Events can then be aggregated to provide base data for our algorithms. Here we try to make a distinction between data and events, although the boundary is certainly blurry. We think of events as small units of time-sensitive information that need to be processed with the least amount of latency possible. These events are routed to trigger a subsequent action or process, such as updating a nearline result set. On the other hand, we think of data as more dense information units that might need to be processed and stored for later use. Here the latency is not as important as the information quality and quantity. Of course, there are user events that can be treated as both events and data and therefore sent to both flows.

At Netflix, our near-real-time event flow is managed through an internal framework called Manhattan. Manhattan is a distributed computation system that is central to our algorithmic architecture for recommendation. It is somewhat similar to Twitter's Storm, but it addresses different concerns and responds to a different set of internal requirements. The data flow is managed mostly through logging through Chukwa to Hadoop for the initial steps of the process. Later we use Hermes as our publish-subscribe mechanism.

Recommendation Results


The goal of our machine learning approach is to come up with personalized recommendations. These recommendation results can be serviced directly from lists that we have previously computed or they can be generated on the fly by online algorithms. Of course, we can think of using a combination of both where the bulk of the recommendations are computed offline and we add some freshness by post-processing the lists with online algorithms that use real-time signals.

At Netflix, we store offline and intermediate results in various repositories to be later consumed at request time: the primary data stores we use are Cassandra, EVCache, and MySQL. Each solution has advantages and disadvantages over the others. MySQL allows for storage of structured relational data that might be required for some future process through general-purpose querying. However, the generality comes at the cost of scalability issues in distributed environments. Cassandra and EVCache both offer the advantages of key-value stores. Cassandra is a well-known and standard solution when in need of a distributed and scalable no-SQL store. Cassandra works well in some situations, however in cases where we need intensive and constant write operations we find EVCache to be a better fit. The key issue, however, is not so much where to store them as to how to handle the requirements in a way that conflicting goals such as query complexity, read/write latency, and transactional consistency meet at an optimal point for each use case.

Conclusions

In previous posts, we have highlighted the importance of data, models, and user interfaces for creating a world-class recommendation system. When building such a system it is critical to also think of the software architecture in which it will be deployed. We want the ability to use sophisticated machine learning algorithms that can grow to arbitrary complexity and can deal with huge amounts of data. We also want an architecture that allows for flexible and agile innovation where new approaches can be developed and plugged-in easily. Plus, we want our recommendation results to be fresh and respond quickly to new data and user actions. Finding the sweet spot between these desires is not trivial: it requires a thoughtful analysis of requirements, careful selection of technologies, and a strategic decomposition of recommendation algorithms to achieve the best outcomes for our members. We are always looking for great engineers to join our team. If you think you can help us, be sure to look at our jobs page.



07 Apr 17:33

Life of an instruction in LLVM

by eliben

LLVM is a complex piece of software. There are several paths one may take on the quest of understanding how it works, none of which is simple. I recently had to dig in some areas of LLVM I was not previously familiar with, and this article is one of the outcomes of this quest.

What I aim to do here is follow the various incarnations an "instruction" takes when it goes through LLVM’s multiple compilation stages, starting from a syntactic construct in the source language and until being encoded as binary machine code in an output object file.

This article in itself will not teach one how LLVM works. It assumes some existing familiarity with LLVM’s design and code base, and leaves a lot of "obvious" details out. Note that unless otherwise stated, the information here is relevant to LLVM 3.2. LLVM and Clang are fast-moving projects, and future changes may render parts of this article incorrect. If you notice any discrepancies, please let me know and I’ll do my best to fix them.

Input code

I want to start this exploration process at the beginning – C source. Here’s the simple function we’re going to work with:

int foo(int aa, int bb, int cc) {
  int sum = aa + bb;
  return sum / cc;
}

The focus of this article is going to be on the division operation.

Clang

Clang serves as the front-end for LLVM, responsible for converting C, C++ and ObjC source into LLVM IR. Clang’s main complexity comes from the ability to correctly parse and semantically analyze C++; the flow for a simple C-level operation is actually quite straightforward.

Clang’s parser builds an Abstract Syntax Tree (AST) out of the input. The AST is the main "currency" in which various parts of Clang deal. For our division operation, a BinaryOperator node is created in the AST, carrying the BO_Div "operator kind" [1]. Clang’s code generator then goes on to emit an sdiv LLVM IR instruction from the node, since this is a division of signed integral types.

LLVM IR

Here is the LLVM IR created for the function [2]:

define i32 @foo(i32 %aa, i32 %bb, i32 %cc) nounwind {
entry:
  %add = add nsw i32 %aa, %bb
  %div = sdiv i32 %add, %cc
  ret i32 %div
}

In LLVM IR, sdiv is a BinaryOperator, which is a subclass of Instruction with the opcode SDiv [3]. Like any other instruction, it can be processed by the LLVM analysis and transformation passes. For a specific example targeted at SDiv, take a look at SimplifySDivInst. Since all through the LLVM "middle-end" layer the instruction remains in its IR form, I won’t spend much time talking about it. To witness its next incarnation, we’ll have to look at the LLVM code generator.
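
As an aside, here is a rough sketch (added for illustration, written against the iteration style of more recent LLVM releases rather than 3.2 exactly) of how a pass might spot the SDiv opcode while walking a function:

#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Count the signed-division instructions in a function by checking each
// instruction's opcode; sdiv is a BinaryOperator carrying the SDiv opcode.
static unsigned countSignedDivisions(Function &F) {
  unsigned N = 0;
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (I.getOpcode() == Instruction::SDiv)
        ++N;
  return N;
}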

The code generator is one of the most complex parts of LLVM. Its task is to "lower" the relatively high-level, target-independent LLVM IR into low-level, target-dependent "machine instructions" (MachineInstr). On its way to a MachineInstr, an LLVM IR instruction passes through a "selection DAG node" incarnation, which is what I’m going to discuss next.

SelectionDAG node

Selection DAG [4] nodes are created by the SelectionDAGBuilder class acting "at the service of" SelectionDAGISel, which is the main base class for instruction selection. SelectionDAGISel goes over all the IR instructions and calls the SelectionDAGBuilder::visit dispatcher on them. The method handling an SDiv instruction is SelectionDAGBuilder::visitSDiv. It requests a new SDNode from the DAG with the opcode ISD::SDIV, which becomes a node in the DAG.

The initial DAG constructed this way is still only partially target dependent. In LLVM nomenclature it’s called "illegal" – the types it contains may not be directly supported by the target; the same is true for the operations it contains.

There are a couple of ways to visualize the DAG. One is to pass the -debug flag to llc, which will cause it to create a textual dump of the DAG during all the selection phases. Another is to pass one of the -view options which causes it to dump and display an actual image of the graph (more details in the code generator docs). Here’s the relevant portion of the DAG showing our SDiv node, right after DAG creation (the sdiv node is in the bottom):

http://eli.thegreenplace.net/wp-content/uploads/2012/11/sdiv_initial_dag.png

Before the SelectionDAG machinery actually emits machine instructions from DAG nodes, these undergo a few other transformations. The most important are the type and operation legalization steps, which use target-specific hooks to convert all operations and types into ones that the target actually supports.

"Legalizing" sdiv into sdivrem on x86

The division instruction (idiv for signed operands) of x86 computes both the quotient and the remainder of the operation, and stores them in two separate registers. Since LLVM’s instruction selection distinguishes between such operations (called ISD::SDIVREM) and division that only computes the quotient (ISD::SDIV), our DAG node will be "legalized" during the DAG legalization phase when the target is x86. Here’s how it happens.

An important interface used by the code generator to convey target-specific information to the generally target-independent algorithms is TargetLowering. Targets implement this interface to describe how LLVM IR instructions should be lowered to legal SelectionDAG operations. The x86 implementation of this interface is X86TargetLowering [5]. In its constructor it marks which operations need to be "expanded" by operation legalization, and ISD::SDIV is one of them. Here’s an interesting comment from the code:

// Scalar integer divide and remainder are lowered to use operations that
// produce two results, to match the available instructions. This exposes
// the two-result form to trivial CSE, which is able to combine x/y and x%y
// into a single instruction.

When SelectionDAGLegalize::LegalizeOp sees the Expand flag on a SDIV node [6] it replaces it by ISD::SDIVREM. This is an interesting example to demonstrate the transformation an operation can undergo while in the selection DAG form.

Instruction selection – from SDNode to MachineSDNode

The next step in the code generation process [7] is instruction selection. LLVM provides a generic table-based instruction selection mechanism that is auto-generated with the help of TableGen. Many target backends, however, choose to write custom code in their SelectionDAGISel::Select implementations to handle some instructions manually. Other instructions are then sent to the auto-generated selector by calling SelectCode.

The X86 backend handles ISD::SDIVREM manually in order to take care of some special cases and optimizations. The DAG node created at this step is a MachineSDNode, a subclass of SDNode which holds the information required to construct an actual machine instruction, but still in DAG node form. At this point the actual X86 instruction opcode is selected – X86::IDIV32r in our case.

Scheduling and emitting a MachineInstr

The code we have at this point is still represented as a DAG. But CPUs don’t execute DAGs, they execute a linear sequence of instructions. The goal of the scheduling step is to linearize the DAG by assigning an order to its operations (nodes). The simplest approach would be to just sort the DAG topologically, but LLVM’s code generator employs clever heuristics (such as register pressure reduction) to try and produce a schedule that would result in faster code.

Each target has some hooks it can implement to affect the way scheduling is done. I won’t dwell on this topic here, however.

Finally, the scheduler emits a list of instructions into a MachineBasicBlock, using InstrEmitter::EmitMachineNode to translate from SDNode. The instructions here take the MachineInstr form ("MI form" from now on), and the DAG can be destroyed.

We can examine the machine instructions emitted in this step by calling llc with the -print-machineinstrs flag and looking at the first output that says "After instruction selection":

# After Instruction Selection:
# Machine code for function foo: SSA
Function Live Ins: %EDI in %vreg0, %ESI in %vreg1, %EDX in %vreg2
Function Live Outs: %EAX

BB#0: derived from LLVM BB %entry
    Live Ins: %EDI %ESI %EDX
        %vreg2<def> = COPY %EDX; GR32:%vreg2
        %vreg1<def> = COPY %ESI; GR32:%vreg1
        %vreg0<def> = COPY %EDI; GR32:%vreg0
        %vreg3<def,tied1> = ADD32rr %vreg0<tied0>, %vreg1, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg0,%vreg1
        %EAX<def> = COPY %vreg3; GR32:%vreg3
        CDQ %EAX<imp-def>, %EDX<imp-def>, %EAX<imp-use>
        IDIV32r %vreg2, %EAX<imp-def>, %EDX<imp-def,dead>, %EFLAGS<imp-def,dead>, %EAX<imp-use>, %EDX<imp-use>; GR32:%vreg2
        %vreg4<def> = COPY %EAX; GR32:%vreg4
        %EAX<def> = COPY %vreg4; GR32:%vreg4
        RET

# End machine code for function foo.

Note that the output mentions that the code is in SSA form, and we can see that some registers being used are "virtual" registers (e.g. %vreg1).

Register allocation – from SSA to non-SSA machine instructions

Apart from some well-defined exceptions, the code generated from the instruction selector is in SSA form. In particular, it assumes it has an infinite set of "virtual" registers to act on. This, of course, isn’t true. Therefore, the next step of the code generator is to invoke a "register allocator", whose task is to replace virtual by physical registers, from the target’s register bank.

The exceptions mentioned above are also important and interesting, so let’s talk about them a bit more.

Some instructions in some architectures require fixed registers. A good example is our division instruction in x86, which requires its inputs to be in the EDX and EAX registers. The instruction selector knows about these restrictions, so as we can see in the code above, the inputs to IDIV32r are physical, not virtual registers. This assignment is done by X86DAGToDAGISel::Select.

The register allocator takes care of all the non-fixed registers. There are a few more optimization (and pseudo-instruction expansion) steps that happen on machine instructions in SSA form, but I’m going to skip these. Similarly, I’m not going to discuss the steps performed after register allocation, since these don’t change the basic form operations appear in (MachineInstr, at this point). If you’re interested, take a look at TargetPassConfig::addMachinePasses.

Emitting code

So we now have our original C function translated to MI form – a MachineFunction filled with instruction objects (MachineInstr). This is the point at which the code generator has finished its job and we can emit the code. In current LLVM, there are two ways to do that. One is the (legacy) JIT which emits executable, ready-to-run code directly into memory. The other is MC, which is an ambitious object-file-and-assembly framework that’s been part of LLVM for a couple of years, replacing the previous assembly generator. MC is currently being used for assembly and object file emission for all (or at least the important) LLVM targets. MC also enables "MCJIT", which is a JIT-ting framework based on the MC layer. This is why I’m referring to LLVM’s JIT module as legacy.

I will first say a few words about the legacy JIT and then turn to MC, which is more universally interesting.

The sequence of passes to JIT-emit code is defined by LLVMTargetMachine::addPassesToEmitMachineCode. It calls addPassesToGenerateCode, which defines all the passes required to do what most of this article has been talking about until now – turning IR into MI form. Next, it calls addCodeEmitter, which is a target-specific pass for converting MIs into actual machine code. Since MIs are already very low-level, it’s fairly straightforward to translate them to runnable machine code [8]. The x86 code for that lives in lib/Target/X86/X86CodeEmitter.cpp. For our division instruction there’s no special handling here, because the MachineInstr it’s packaged in already contains its opcode and operands. It is handled generically with other instructions in emitInstruction.

MCInst

When LLVM is used as a static compiler (as part of clang, for instance), MIs are passed down to the MC layer which handles the object-file emission (it can also emit textual assembly files). Much can be said about MC, but that would require an article of its own. A good reference is this post from the LLVM blog. I will keep focusing on the path a single instruction takes.

LLVMTargetMachine::addPassesToEmitFile is responsible for defining the sequence of actions required to emit an object file. The actual MI-to-MCInst translation is done in the EmitInstruction of the AsmPrinter interface. For x86, this method is implemented by X86AsmPrinter::EmitInstruction, which delegates the work to the X86MCInstLower class. Similarly to the JIT path, there is no special handling for our division instruction at this point, and it’s treated generically with other instructions.

By passing -show-mc-inst to llc, we can see the MC-level instructions it creates, alongside the actual assembly code:

foo:                                    # @foo
# BB#0:                                 # %entry
        movl    %edx, %ecx              # <MCInst #1483 MOV32rr
                                        #  <MCOperand Reg:46>
                                        #  <MCOperand Reg:48>>
        leal    (%rdi,%rsi), %eax       # <MCInst #1096 LEA64_32r
                                        #  <MCOperand Reg:43>
                                        #  <MCOperand Reg:110>
                                        #  <MCOperand Imm:1>
                                        #  <MCOperand Reg:114>
                                        #  <MCOperand Imm:0>
                                        #  <MCOperand Reg:0>>
        cltd                            # <MCInst #352 CDQ>
        idivl   %ecx                    # <MCInst #841 IDIV32r
                                        #  <MCOperand Reg:46>>
        ret                             # <MCInst #2227 RET>
.Ltmp0:
        .size   foo, .Ltmp0-foo

The object file (or assembly code) emission is done by implementing the MCStreamer interface. Object files are emitted by MCObjectStreamer, which is further subclassed according to the actual object file format. For example, ELF emission is implemented in MCELFStreamer. The rough path an MCInst travels through the streamers is MCObjectStreamer::EmitInstruction followed by a format-specific EmitInstToData. The final emission of the instruction in binary form is, of course, target-specific. It’s handled by the MCCodeEmitter interface (for example X86MCCodeEmitter). While code in the rest of LLVM is often tricky because it has to maintain a separation between target-independent and target-specific capabilities, MC is even more challenging because it adds another dimension – different object file formats. So some code is completely generic, some code is format-dependent, and some code is target-dependent.

Assemblers and disassemblers

A MCInst is deliberately a very simple representation. It tries to shed as much semantic information as possible, keeping only the instruction opcode and list of operands (and a source location for assembler diagnostics). Like LLVM IR, it’s an internal representation with multiple possible encodings. The two most obvious are assembly (as shown above) and binary object files.

llvm-mc is a tool that uses the MC framework to implement assemblers and disassemblers. Internally, MCInst is the representation used to translate between the binary and textual forms. At this point the tool doesn’t care which compiler produced the assembly / object file.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] To examine the AST created by Clang, compile a source file with the -cc1 -ast-dump options.
[2] I ran this IR via opt -mem2reg | llvm-dis in order to clean-up the spills.
[3] These things are a bit hard to grep for because of some C preprocessor hackery employed by LLVM to minimize code duplication. Take a look at the include/llvm/Instruction.def file and its usage in various places in LLVM’s source for more insight.
[4] A DAG here means Directed Acyclic Graph, which is a data structure LLVM code generator uses to represent the various operations with the values they produce and consume.
[5] Which is arguably the single scariest piece of code in LLVM.
[6] This is an example of how target-specific information is abstracted to guide the target-independent code generation algorithm.
[7] The code generator performs DAG optimizations between its major steps, such as between legalization and selection. These optimizations are important and interesting to know about, but since they act on and return selection DAG nodes, they’re out of the focus of this article.
[8] When I’m saying "machine code" at this point, I mean actual bytes in a buffer, representing encoded instructions the CPU can run. The JIT directs the CPU to execute code from this buffer once emission is over.
07 Apr 17:33

Dumping a C++ object’s memory layout with Clang

by eliben

When one wants to understand the memory layout of structures and classes, the C/C++ operators sizeof and offsetof are very useful. However, when large C++ class hierarchies are involved, using these operators becomes tedious. Luckily, Clang has a very handy command-line flag to dump object layouts in a useful manner. This flag is somewhat hidden since it’s only accepted by the Clang front-end (the one you get when you pass -cc1 to clang) and not the gcc-compatible compiler driver (the one you get when simply executing clang).

Consider this code, for example:

class Base {
protected:
  int foo;
public:
  int method(int p) {
    return foo + p;
  }
};

struct Point {
  double cx, cy;
};

class Derived : public Base {
public:
  int method(int p) {
    return foo + bar + p;
  }
protected:
  int bar, baz;
  Point a_point;
  char c;
};

int main(int argc, char** argv) {
  return sizeof(Derived);
}

To see the layout, run clang -cc1 -fdump-record-layouts myfile.cpp. It will produce a separate report for each class and struct defined, but the most interesting one is for class Derived:

*** Dumping AST Record Layout
   0 | class Derived
   0 |   class Base (base)
   0 |     int foo
   4 |   int bar
   8 |   int baz
  16 |   struct Point a_point
  16 |     double cx
  24 |     double cy
     |   [sizeof=16, dsize=16, align=8
     |    nvsize=16, nvalign=8]

  32 |   char c
     | [sizeof=40, dsize=33, align=8
     |  nvsize=33, nvalign=8]

(the above is the output of Clang 3.2 running on 64-bit Linux)

We can see the layout of Derived objects, with the offset of every field (including the fields coming from base classes) in the left-most column. Some additional information is printed in the bottom – for example, sizeof – the total size, and dsize – data size without tail padding.
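
If you want the compiler itself to confirm a couple of these numbers, a static_assert cross-check works too. This is a sketch added here; it assumes the same class definitions and the same 64-bit Itanium ABI as the dump above:

#include <cstddef>

class Base {
protected:
  int foo;
public:
  int method(int p) { return foo + p; }
};

struct Point { double cx, cy; };

class Derived : public Base {
public:
  int method(int p) { return foo + bar + p; }
protected:
  int bar, baz;
  Point a_point;
  char c;
};

static_assert(sizeof(Point) == 16, "two doubles, no padding");
static_assert(offsetof(Point, cy) == 8, "cy sits right after cx");
static_assert(sizeof(Derived) == 40, "matches sizeof=40 in the dump");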

If we make method virtual in the Base and Derived classes, the size of the virtual-table pointer is also accounted for:

*** Dumping AST Record Layout
   0 | class Derived
   0 |   class Base (primary base)
   0 |     (Base vtable pointer)
   0 |     (Base vftable pointer)
   8 |     int foo
  12 |   int bar
  16 |   int baz
  24 |   struct Point a_point
  24 |     double cx
  32 |     double cy
     |   [sizeof=16, dsize=16, align=8
     |    nvsize=16, nvalign=8]

  40 |   char c
     | [sizeof=48, dsize=41, align=8
     |  nvsize=41, nvalign=8]

I’ll wrap up with a tip about using clang -cc1. Since this isn’t the compiler driver, it won’t go look for standard headers in the expected places, so using it on realistic source files can be a pain. The easiest way to do it, IMHO, is to run it on preprocessed source. How your source gets preprocessed depends on your build process, but it’s usually something like:

clang -E [your -I flags] myfile.cpp > myfile_pp.cpp
07 Apr 17:33

Assembler relaxation

by eliben

In this article I want to present a cool and little-known feature of assemblers called "relaxation". Relaxation is cool because it’s one of those things that are apparent in hindsight ("of course this should be done"), but is non-trivial to implement and has some interesting algorithms behind it. While relaxation is applicable to several CPU architectures and more than one kind of instructions, for this article I will focus on jumps for the Intel x86-64 architecture.

And just so the nomenclature is clear, an assembler is a tool that translates assembly language into machine code, and this process is also usually referred to as assembly. That’s it, we’re good to go.

An example

Consider this x86 assembly function (in GNU assembler syntax):

  .text
  .globl  foo
  .align  16, 0x90
  .type  foo, @function
foo:
  # Save used registers
  pushq   %rbp
  pushq   %r14
  pushq   %rbx

  movl    %edi, %ebx
  callq   bar                 # eax <- bar(num)
  movl    %eax, %r14d         # r14 <- bar(num)
  imull   $17, %ebx, %ebp     # ebp <- num * 17
  movl    %ebx, %edi
  callq   bar                 # eax <- bar(num)
  cmpl    %r14d, %ebp         # if !(t1 > bar(num))
  jle     .L_ELSE             # (*) jump to return num * bar(num)
  addl    %ebp, %eax          # eax <- compute num * bar(num)
  jmp     .L_RET              # (*) and jump to return it
.L_ELSE:
  imull   %ebx, %eax
.L_RET:

  # Restore used registers and return
  popq    %rbx
  popq    %r14
  popq    %rbp
  ret

It was created by compiling the following C program with gcc -S -O2, cleaning up the output and adding some comments:

extern int bar(int);

int foo(int num) {
  int t1 = num * 17;
  if (t1 > bar(num))
    return t1 + bar(num);
  return num * bar(num);
}

This is a completely arbitrary piece of code crafted for purposes of demonstration, so don’t look too much into it. With the comments added, the relation between this code and the assembly above should be obvious.

What we’re interested in here is the translation of the jumps in the assembly code above (marked with (*)) into machine code. This can be easily done by first assembling the file:

$ gcc -c test.s

And then looking at the machine code (the jumps are once again marked):

$ objdump -d test.o

test.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <foo>:
   0: 55                      push   %rbp
   1: 41 56                   push   %r14
   3: 53                      push   %rbx
   4: 89 fb                   mov    %edi,%ebx
   6: e8 00 00 00 00          callq  b <foo+0xb>
   b: 41 89 c6                mov    %eax,%r14d
   e: 6b eb 11                imul   $0x11,%ebx,%ebp
  11: 89 df                   mov    %ebx,%edi
  13: e8 00 00 00 00          callq  18 <foo+0x18>
  18: 44 39 f5                cmp    %r14d,%ebp
  1b: 7e 04                   jle    21 <foo+0x21>        (*)
  1d: 01 e8                   add    %ebp,%eax
  1f: eb 03                   jmp    24 <foo+0x24>        (*)
  21: 0f af c3                imul   %ebx,%eax
  24: 5b                      pop    %rbx
  25: 41 5e                   pop    %r14
  27: 5d                      pop    %rbp
  28: c3                      retq

Note the instructions used for the jumping. For the JLE, the opcode is 0x7e, which means "jump if less-or-equal with a 8-bit PC-relative offset". The offset is 0x04 which jumps to the expected place. Similarly for the JMP, the opcode 0xeb means "jump with a 8-bit PC-relative offset".
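
To spell out how these 8-bit PC-relative forms resolve, here is a tiny sketch added for illustration: the displacement byte is sign-extended and added to the address of the next instruction, which is why the 0x04 at offset 0x1b lands on 0x21:

#include <cstdint>

// Target of a 2-byte short jump (0xEB disp8 or 0x7E disp8): the displacement
// is a signed byte relative to the address of the following instruction.
uint64_t shortJumpTarget(uint64_t insnAddr, uint8_t dispByte) {
  const uint64_t nextInsn = insnAddr + 2;
  return nextInsn + static_cast<int8_t>(dispByte);
}
// e.g. shortJumpTarget(0x1b, 0x04) == 0x21 and shortJumpTarget(0x1f, 0x03) == 0x24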

Here comes the crux. 8-bit PC-relative offsets are enough to reach the destinations of the jumps in this example, but what if they weren’t? This is where relaxation comes into play.

Relaxation

Relaxation is the process in which the assembler replaces certain instructions with other instructions, or picks certain encodings for instructions, that would allow it to successfully assemble the machine code.

To see this in action, let’s continue with our example, adding a twist that will make the assembler’s life harder. Let’s make sure that the targets of the jumps are too far to reach with a 8-bit PC-relative offset:

  [... same as before]
  jle     .L_ELSE             # jump to return num * bar(num)
  addl    %ebp, %eax          # eax <- compute num * bar(num)
  jmp     .L_RET              # and jump to return it
  .fill   130, 1, 0x90        # ++ added
.L_ELSE:
  imull   %ebx, %eax
.L_RET:
  [... same as before]

This is an excerpt of the assembly code with a directive added to insert a long stretch of NOPs between the jumps and their targets. The stretch is long enough so that the targets are more than 128 bytes away from the jumps referring to them [1].

When this code is assembled, here’s what we get from objdump when looking at the resulting machine code:

[... same as before]
1b:   0f 8e 89 00 00 00       jle    aa <foo+0xaa>
21:   01 e8                   add    %ebp,%eax
23:   e9 85 00 00 00          jmpq   ad <foo+0xad>
28:   90                      nop
29:   90                      nop
[... many more NOPs]
a8:   90                      nop
a9:   90                      nop
aa:   0f af c3                imul   %ebx,%eax
ad:   5b                      pop    %rbx
ae:   41 5e                   pop    %r14
b0:   5d                      pop    %rbp
b1:   c3                      retq

The jumps were now translated to different instruction opcodes. JLE uses 0x0f 0x8e, which has a 32-bit PC-relative offset. JMP uses 0xe9, which has a similar operand. These instructions have a much larger range that can now reach their targets, but they are less efficient. Since they are longer, the CPU has to read more data from memory in order to execute them. In addition, they make the code larger, which can also have a negative impact because instruction caching is very important for performance [2].

Iterating relaxation

From this point on I’m going to discuss some aspects of implementing relaxation in an assembler. Specifically, the LLVM assembler. Clang/LLVM has been usable as an industrial-strength compiler for some time now, and its assembler (based on the MC module) is an integral part of the compilation process. The assembler can be invoked directly either by calling the llvm-mc tool, or through the clang driver (similarly to the gcc driver). My description here applies to LLVM version 3.2 or thereabouts.

To better understand the challenges involved in performing relaxation, here is a more interesting example. Consider this assembly code [3]:

  .text
  jmp AAA
  jmp BBB
  .fill 124, 1, 0x90   # FILL_TO_AAA
AAA:
  .fill 1, 1, 0x90     # FILL_TO_BBB
BBB:
  ret

Since by now we know that the short form of JMP (the one with an 8-bit immediate) is 2 bytes long, it’s clear that it suffices for both JMP instructions, and no relaxation will be performed.

   0:   eb 7e                   jmp    80 <AAA>
   2:   eb 7d                   jmp    81 <BBB>
   [... many NOPs]
0000000000000080 <AAA>:
  80:   90                      nop

0000000000000081 <BBB>:
  81:   c3                      retq

If we increase FILL_TO_BBB to 4, however, an interesting thing happens. Although AAA is still in the range of the first jump, BBB will no longer be in the range of the second. This means that the second jump will be relaxed. But this will make it 5 bytes long instead of 2. This event, in turn, will cause AAA to become too far from the first jump, which will have to be relaxed as well.

To solve this problem, the relaxation implemented in LLVM uses an iterative algorithm. The layout is performed multiple times as long as changes still happen. If a relaxation caused some instruction encoding to change, it means that other instructions may have become invalid (just as the example shows). So relaxation will be performed again, until its run doesn’t change anything. At that point we can confidently say that all offsets are valid and no more relaxation is needed.

The output is then as expected:

0000000000000000 <AAA-0x86>:
   0:   e9 81 00 00 00          jmpq   86 <AAA>
   5:   e9 80 00 00 00          jmpq   8a <BBB>
   [... many NOPs]
0000000000000086 <AAA>:
  86:   90                      nop
  87:   90                      nop
  88:   90                      nop
  89:   90                      nop

000000000000008a <BBB>:
  8a:   c3                      retq

Contrary to the first example in this article, here relaxation needed two iterations over the text section to finish, due to the reason presented above.
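
To make the fixed-point nature of the algorithm concrete, here is a toy model (a sketch added here, not LLVM’s actual code): every jump starts in its short 2-byte form, and each pass widens any jump whose target has drifted out of signed 8-bit range, repeating until a pass changes nothing.

#include <cstddef>
#include <vector>

struct Insn {
  int size;           // current encoded size in bytes
  int target = -1;    // index of the target instruction, -1 if not a jump
  bool relaxed = false;
};

void relaxAll(std::vector<Insn>& insns) {
  bool changed = true;
  while (changed) {
    changed = false;
    // Recompute every instruction's offset with the current sizes.
    std::vector<int> offset(insns.size() + 1, 0);
    for (std::size_t i = 0; i < insns.size(); ++i)
      offset[i + 1] = offset[i] + insns[i].size;
    // Widen any short jump whose displacement no longer fits in 8 bits.
    for (std::size_t i = 0; i < insns.size(); ++i) {
      Insn& in = insns[i];
      if (in.target < 0 || in.relaxed) continue;
      int disp = offset[in.target] - offset[i + 1]; // relative to next insn
      if (disp < -128 || disp > 127) {
        in.size = 5;       // e.g. 0xE9 rel32 instead of 0xEB rel8
        in.relaxed = true;
        changed = true;    // sizes moved, so offsets must be recomputed
      }
    }
  }
}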

Laying-out fragments

Another interesting feature of LLVM’s relaxation implementation is the way object file layout is done to support relaxation efficiently.

In its final form, the object file consists of sections – chunks of data. Much of this data is encoded instructions, which is the kind we’re most interested in here because relaxation only applies to instructions. The most common way to represent chunks of data in programming is with some kind of byte array [4]. This representation, however, would not work very well for representing machine code sections with relaxable instructions. Let’s see why:

http://eli.thegreenplace.net/wp-content/uploads/2012/12/frag_layout_naive.png

Suppose this is a text section with several instructions (marked by line boundaries). The instructions were encoded into a byte array and now relaxation should happen. The instruction painted purple requires relaxation, growing by a few bytes. What happens next?

Essentially, the byte array holding the instructions has to be re-allocated so it can grow. Since the number of instructions needing relaxation may be non-trivial, a lot of time may be spent on such re-allocations, which tend to be very expensive. In addition, the iterative nature of the relaxation algorithm makes it hard to avoid multiple re-allocations.

A solution that immediately springs to mind in light of this problem is to keep the instructions in some kind of linked list, instead of a contiguous array. This way, an instruction being relaxed only means the re-allocation of the small array it was encoded into, but not of the whole section. LLVM MC takes a somewhat more clever approach, by recognizing that a lot of data in the array won’t change once initially encoded. Therefore, it can be lumped together, leaving only the relaxable instructions separate. In MC nomenclature, these lumps are called "fragments".

http://eli.thegreenplace.net/wp-content/uploads/2012/12/frag_layout_linked.png

So, the assembly emission process in LLVM MC has three distinct steps:

  1. Assembly directives and instructions are parsed, encoded and collected into fragments. Data and instructions that don’t need relaxation are placed into contiguous "data" fragments, while instructions that may need relaxation are placed into "instruction" fragments [5]. Fragments are linked together in a list.
  2. Layout is performed. Layout is the process wherein the offsets of all fragments in a section are computed and relaxation is performed (iteratively). If some instruction gets relaxed, all that’s required is to update the offsets of the subsequent fragments – no re-allocations (see the sketch after this list).
  3. Finally, fragments are written into a single linear buffer for object-file emission (either into memory or into a file). At this step, all instructions have final sizes so it’s safe to put them consecutively into a byte array.
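To illustrate steps 2 and 3, here is another hypothetical Python sketch. The fragment classes are invented for illustration and are far simpler than MC's real ones, but they show why relaxing one instruction never forces the rest of the section to be copied around:

class DataFragment:
    """A lump of bytes that will not change once encoded."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.offset = 0

class RelaxableFragment:
    """A single instruction that may still grow during relaxation."""
    def __init__(self, short_encoding):
        self.data = bytearray(short_encoding)
        self.offset = 0
    def relax(self, long_encoding):
        # Only this fragment's small buffer is replaced; the neighbouring
        # data fragments are untouched.
        self.data = bytearray(long_encoding)

def layout(fragments):
    # Step 2: recompute fragment offsets in one linear pass; no copying.
    pos = 0
    for frag in fragments:
        frag.offset = pos
        pos += len(frag.data)
    return pos                        # total section size

def write_section(fragments):
    # Step 3: only now are the fragments concatenated into one buffer.
    out = bytearray()
    for frag in fragments:
        out += frag.data
    return bytes(out)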

Interaction with the compiler

So far I’ve focused on the assembly part of the compilation process. But what about the compiler that emits these instructions in the first place? Once again, this interaction is highly dependent on the implementation, and I will focus on LLVM.

The LLVM code generator doesn't yet know the addresses at which instructions and labels will end up (that's the assembler's job), so it emits only the short versions of x86-64 jumps, relying on the assembler to relax those instructions that turn out not to have sufficient range. This keeps the number of relaxed instructions as small as possible.

While the relaxation process is not free, it’s a worthwhile optimization since it makes the code smaller and faster. Without this step, the compiler would have to assume no jump is close enough to its target and emit the long versions, which would make the generated code less than optimal.

Compiler writers usually prefer to sacrifice compilation time for the efficiency of the resulting code. However, since different tradeoffs matter to different programmers, this can be controlled with compiler flags. For example, when compiling with -O0, the LLVM assembler simply relaxes all jumps it encounters on first sight. This allows it to put all instructions immediately into data fragments, which means far fewer fragments overall, so the assembly process is faster and consumes less memory.

Conclusion

The main goal of this article was to document relaxation – an important feature of assemblers which doesn’t have too much written about it online. As a bonus, some high-level documentation of the way relaxation is implemented in the LLVM assembler (MC module) was provided. I hope it provides enough background to dive into the relevant sections of code inside MC and understand the smaller details.


[1] The PC-relative offset is signed, so the effective range of the 8-bit immediate is roughly ±127 bytes (7 bits of magnitude plus a sign bit).
[2] Incidentally, these instructions also have variations that accept 16-bit PC-relative immediates, but these are only available in 32-bit mode, while I’m building and running the programs in 64-bit mode.
[3] In which I give up all attempts to resemble something generated from a real program, leaving just the bare essentials required to present the issue.
[4] LLVM, like any self-respecting C++ project, has its own abstraction for this called SmallVector, which heaps a few layers of full-of-template-goodness classes on top; yet it's still an array of bytes underneath.
[5] Reality is somewhat more complex, and MC has special fragments for alignment and data fill assembly directives, but for the sake of this discussion I’ll just focus on data and instruction fragments. In addition, I have to admit that "instruction" fragments have a misleading name (since data fragments also contain encoded instructions). Perhaps "relaxable fragment" would be more self-describing. Update: I’ve renamed this fragment to MCRelaxableFragment in LLVM trunk.
07 Apr 17:32

Python – parallelizing CPU-bound tasks with concurrent.futures

by eliben

A year ago, I wrote a series of posts about using the Python multiprocessing module. One of those posts contrasted parallelizing compute-intensive tasks with threads vs. processes. Today I want to revisit that topic, this time employing the concurrent.futures module, which has been part of the standard library since Python 3.2.

First of all, what are "futures"? The Wikipedia page says:

In computer science, future, promise, and delay refer to constructs used for synchronizing in some concurrent programming languages. They describe an object that acts as a proxy for a result that is initially unknown, usually because the computation of its value is yet incomplete.

I wouldn’t say "futures" is the best name choice, but this is what we’re stuck with, so let’s move on. Futures are actually a very nice tool that helps bridge an annoying gap that always exists in concurrent execution – the gap between launching some computation concurrently and obtaining the result of that computation. As my previous post showed, one of the common ways to deal with this gap is to pass a synchronized Queue object into every worker process (or thread) and then collect the results once the workers are done. Futures make this much easier and more elegant, as we’ll see.

For completeness, here is the computation we’re going to apply in parallel over a large amount of inputs:

def factorize_naive(n):
    """ A naive factorization method. Take integer 'n', return list of
        factors.
    """
    if n < 2:
        return []
    factors = []
    p = 2

    while True:
        if n == 1:
            return factors

        r = n % p
        if r == 0:
            factors.append(p)
            n = n // p
        elif p * p >= n:
            factors.append(n)
            return factors
        elif p > 2:
            # Advance in steps of 2 over odd numbers
            p += 2
        else:
            # If p == 2, get to 3
            p += 1
    assert False, "unreachable"

And here’s the first attempt at doing that with concurrent.futures:

import math

from concurrent.futures import ProcessPoolExecutor, as_completed

def chunked_worker(nums):
    """ Factorize a list of numbers, returning a num:factors mapping.
    """
    return {n: factorize_naive(n) for n in nums}


def pool_factorizer_chunked(nums, nprocs):
    # Manually divide the task to chunks of equal length, submitting each
    # chunk to the pool.
    chunksize = int(math.ceil(len(nums) / float(nprocs)))
    futures = []

    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        for i in range(nprocs):
            chunk = nums[(chunksize * i) : (chunksize * (i + 1))]
            futures.append(executor.submit(chunked_worker, chunk))

    resultdict = {}
    for f in as_completed(futures):
        resultdict.update(f.result())
    return resultdict

The end result of pool_factorizer_chunked is a dictionary mapping numbers to lists of their factors. The most interesting thing to note here is this: the function run in a worker process (chunked_worker in this case) can simply return a value. For each such "call" (submission to the executor), a future is returned. This future encapsulates the result of the execution, which is probably not ready immediately but will be at some point. The concurrent.futures.as_completed helper lets us simply wait on all the futures and yield the results of those that are done, as they complete.

It’s easy to see that this code is conceptually simpler than manually launching the processes, passing some sort of synchronization queues to workers and collecting results. This, IMHO, is the main goal of futures. Futures aren’t there to make your code faster, they’re there to make it simpler. And any simplification is a blessing when parallel programming is concerned.

Note also that ProcessPoolExecutor is used as a context manager – this makes process cleanup automatic and reliable. For more fine-grained control, it also has a shutdown method that can be called explicitly.

There's more. Since the concurrent.futures module lets a concurrent call simply return a value, it has another tool to make computations like the above even simpler – Executor.map. Here's the same task rewritten with map:

def pool_factorizer_map(nums, nprocs):
    # Let the executor divide the work among processes by using 'map'.
    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        return {num:factors for num, factors in
                                zip(nums,
                                    executor.map(factorize_naive, nums))}

Amazingly, this is it. This small function (a potential two-liner, were it not for the wrapping added for readability) creates a process pool, submits a bunch of tasks to it, collects all the results when they're ready, puts them into a single dictionary and returns it.

As for performance, the second method is also a bit faster in my benchmarks. I think this makes sense: the manual division into chunks doesn't take into account which chunks will take longer, so the work may end up unbalanced between workers. The map method keeps a pool of workers and hands each of them a new computation as soon as it becomes free, which keeps all the workers busy until everything is done.
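For what it's worth, here is a rough, hypothetical way to run such a comparison yourself (the input range and process count are arbitrary, and absolute timings will of course vary by machine). The __main__ guard matters because ProcessPoolExecutor may start fresh interpreter processes that re-import the module:

import time

def timed(fn, nums, nprocs):
    start = time.time()
    fn(nums, nprocs)
    return time.time() - start

if __name__ == '__main__':
    nums = list(range(200000, 210000))
    for fn in (pool_factorizer_chunked, pool_factorizer_map):
        print(fn.__name__, '%.2f sec' % timed(fn, nums, 4))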

To conclude, I strongly recommend using concurrent.futures whenever the possibility presents itself. They’re much simpler conceptually, and hence less error prone, than the manual method of creating processes and keeping track of their results. In practice, a future is a nice way to convey the result of a computation from a process producing it to a process consuming it. It’s like creating a result queue manually, but with a lot of useful semantics implemented. Futures can be polled, cancelled, provide useful access to exceptions and have callbacks attached to them. My examples here are simplistic and don’t show how to use the cooler features, but if you need them – the documentation is pretty good.
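To give a small taste of those features, here is a hypothetical sketch (the worker function is invented for illustration). An exception raised in a worker is stored in its future and re-raised when result() is called, and a callback attached with add_done_callback fires once the future completes:

from concurrent.futures import ProcessPoolExecutor

def reciprocal(n):
    if n == 0:
        raise ValueError('n must be nonzero')
    return 1.0 / n

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(reciprocal, n) for n in (4, 2, 0)]
        # Runs when the first future finishes, whether it succeeded or failed.
        futures[0].add_done_callback(
            lambda f: print('first future done:', f.result()))
        for f in futures:
            try:
                print(f.result())      # re-raises the worker's exception here
            except ValueError as e:
                print('worker failed:', e)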

07 Apr 17:21

Kay Hayen: Support for portable (standalone) programs

This post is about an often requested, but so far unavailable, feature of Nuitka. Please see the page "What is Nuitka?" for clarification of what it is now and what it wants to be.

In forums and in Google searches, people look to a Python compiler also as a means of deployment: it should offer what py2exe does and allow installation independent of Python.

Well, for a long time it didn't. But thanks to recent contributions, it's upcoming for the next release, Nuitka 0.4.3, and it's in the current pre-releases.

It works by adding --portable to the command line. So this should work for you:

nuitka-python --recurse-all --portable your-program.py

Right now, it will create a folder "_python" with DLLs and a "_python.zip" with the standard library modules used, alongside "your-program.exe". Copy these to another machine without a Python installation, and it will (should) work. Making that statement fully true may need refinements, as some DLL dependencies might not be defined yet.

Note

We may improve it in the future to meld everything into one executable for even easier deployment.

You are more than welcome to experiment with it. To do so, download Nuitka from the download page and give it a roll.

Note

Of course, Nuitka is not primarily about replacing "py2exe"; that's only a side effect of what we do. Our major goal is of course to accelerate Python, but surely nobody minds achieving both things at the same time.

And while the post is labeled "Windows", this feature works for Linux too, at the very least. It's just that client systems without a Python installation are far more common on Windows.

To me, since this comes from a contributor, it's another sign of Nuitka gaining adoption for real usage. My personal "py2exe" experience is practically nonexistent; I have never used it. And I will only merge such improvements into the Nuitka project as they are provided by others. My own focus for the time to come remains, of course, compile time and run time optimization.

06 Apr 17:29

Co-EM algorithm in GraphChi

by Danny Bickson
Following the previous post about label propagation, as well as a request from a US-based startup to implement this method in GraphChi, I have decided to write a quick tutorial for the Co-EM algorithm.

Co-EM is a very simple algorithm, extensively utilized by Rosie Jones in her PhD thesis and originally proposed by Nigam and Ghani (2000). The algorithm is used for clustering text entities into categories. Here is an example dataset (NPIC500) which explains the input format. The algorithm constructs a bipartite graph
where the left nodes are noun phrases, the right nodes are sentence contexts, and the edge weight is the number of times a given noun phrase appeared within the context. The algorithm itself is very simple; it is described on page 43 of Jones' PhD thesis.

In this scheme, the noun nodes simply compute a weighted sum of their neighboring context values (weighted by the edge weights). The context nodes compute the same weighted sum over their neighboring noun values (if they are not seed nodes). Seed nodes are the initial graph nodes for which we have ground truth labels.

The output is the probability, for each noun phrase, of belonging to each of the categories.
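To make the update rule concrete, here is a minimal, hypothetical Python sketch of one Co-EM style iteration over the bipartite graph. The variable names and the normalization by total edge weight are my own illustration, not GraphChi's actual code:

def coem_iteration(edges, noun_labels, ctx_labels, seeds):
    """One alternating update: contexts from nouns, then nouns from contexts.
    edges maps (noun, context) -> co-occurrence count; labels are
    probabilities in [0, 1]; nodes in 'seeds' keep their fixed label."""
    def weighted_average(labels, neighbors):
        total = sum(w for _, w in neighbors)
        return sum(labels[v] * w for v, w in neighbors) / total if total else 0.0

    # Group the edges by each side of the bipartite graph.
    by_ctx, by_noun = {}, {}
    for (noun, ctx), w in edges.items():
        by_ctx.setdefault(ctx, []).append((noun, w))
        by_noun.setdefault(noun, []).append((ctx, w))

    new_ctx = {c: ctx_labels[c] if c in seeds
                  else weighted_average(noun_labels, nbrs)
               for c, nbrs in by_ctx.items()}
    new_nouns = {n: noun_labels[n] if n in seeds
                    else weighted_average(new_ctx, nbrs)
                 for n, nbrs in by_noun.items()}
    return new_nouns, new_ctx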

Here is a more concrete example of the input file:

[example input listing shown in the original post]

Additionally, ground truth is given for the positive and negative seeds. For example, assume we have two categories (city / not city). The seed lists classify certain nouns into their matching categories:


$ head city-seeds.txt 
^New York$
^Boston$
^Pittsburgh$
^Los Angeles$
^Houston$
^Atlanta$
^London$

$ head city-neg-seeds.txt 
^people$
^the world$
^time$
^life$
^God$
^children$
^students$
^work$
^a number$
^women$

And here is how to try it out in GraphChi
0) Install graphchi as explained here, and compile using "make ta"
1) Download the file http://graphlab.org/downloads/datasets/coem.zip and unzip it in your root graphchi folder
2) In the root graphchi folder run:

$ ./toolkits/text_analysis/coem --training=matrix.txt --nouns=nps.txt --contexts=contexts.txt --pos_seeds=city-seeds.txt --neg_seeds=city-neg-seeds.txt --D=1 

The output is generated in the file matrix.txt_U.mm:
$ cat matrix.txt_U.mm
%%MatrixMarket matrix array real general
%This file contains COEM output matrix U. In each row D probabilities for the Y labels
88322 1
4.081683456898e-01
4.162331819534e-01
4.119633436203e-01

The first three noun phrases all have a probability of around 0.4 of being a city.
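If you want to inspect the output programmatically, a small hypothetical snippet along these lines can pair each probability with its noun phrase. It assumes nps.txt lists the noun phrases one per line, in the same row order as matrix.txt_U.mm – adjust to your actual file formats:

def read_coem_output(u_file='matrix.txt_U.mm', nouns_file='nps.txt'):
    # Skip MatrixMarket comment lines, read the "<rows> <cols>" header,
    # then one probability per line.
    with open(u_file) as f:
        rows = [line.strip() for line in f
                if line.strip() and not line.startswith('%')]
    nrows = int(rows[0].split()[0])
    probs = [float(x) for x in rows[1:nrows + 1]]
    with open(nouns_file) as f:
        nouns = [line.strip() for line in f][:len(probs)]
    return dict(zip(nouns, probs))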
06 Apr 17:29

Literature survey of graph databases

by Danny Bickson
I stumbled upon this tech report: Literature survey of graph databases, by Bryan Thompson, Systap. It is a good survey of different graph database platforms, and I especially liked its extensive coverage of the GraphChi framework:

GraphChi (IO Efficient Graph Mining) GraphChi (Kyrola, 2012) is an IO efficient graph mining system that is also designed to accept topology updates based on a Parallel Sliding Window (PSW) algorithm. Each iteration over the graph requires P^2 sequential reads and P^2 sequential writes. Because all IO is sequential, GraphChi may be used with traditional disk or SSD. The system is not designed to answer ad-hoc queries and is not a database in any traditional sense – the isolation semantics of GraphChi are entirely related to the Asynchronous Parallel (ASP) versus Bulk Synchronous Parallel (BSP) processing paradigms. GraphChi does not support either vertex attributes or link attributes. The basic data layout for GraphChi is a storage model that is key-range partitioned by the link target (O) and then stores the links in sorted order (SO). This design was chosen to permit IO efficient vertex programs where the graph was larger than main memory.

 While GraphChi does not support cluster-based process, the approach could be extended to a compute cluster. Because of the IO efficient design, the approach is of interest for out-of-core processing in hybrid CPU/GPU architectures. GraphChi operates by applying a utility program to split a data set into P partitions, where the user chooses the value of P with the intention that a single partition will fit entirely into main memory. The edges are assigned to partitions in order to create partitions with an equal #of edges – this provides load balancing and compensates for data skew in the graph (high cardinality vertices). GraphChi reads one partition of P (called the memory partition) into core. This provides the in-edges for all vertices in that partition. Because the partitions are divided into target vertex key-ranges, and because partitions are internally ordered by the source vertex, out-edges for those vertices are guaranteed to lie in a contiguous range of the remaining P-1 partitions. Those key-ranges may vary in their size since the #of out-edges for a vertex is not a constant. Thus, for the current memory partition, GraphChi performs 1 full partition scan plus P-1 partial partition scans.
 In addition to the edges (network structure), GraphChi maintains the transient graph computation state for each edge and vertex. The edge and vertex each have a user assignable label consisting of some (user-defined) fixed-length data type. The vertices also have a flag indicating whether or not they are scheduled in a given iteration. The edge state is presumably read with the out-edges, though perhaps from a separate file (the paper does not specify this). The vertex state is presumably read from a separate file (again, this is not specified). After the update() function has been applied to each vertex in the current memory partition, the transient graph state is written back out to the disk. This provides one more dimension of graph computation state that is persisted on disk, presumably in a column paired with the vertex state. 

If you'd like to read the rest of the overview, as well as some proposed extensions, you should read the full paper. And of course, you can read about the collaborative filtering toolkit I am writing on top of GraphChi here.

An update: I just got a note from Bryan Thompson about additional useful resources Systap has released:
Large Scale Graph Algorithms on the GPU (Yangzihao Wang and John Owens, UC Davis)
Graph Pattern Matching, Search, and OLAP (Dr. Xifeng Yan, UCSB)