Shared posts

06 Apr 17:29

Intel Labs report on GraphLab vs. Mahout

by Danny Bickson
I have some very interesting news to report. I got from Nezih Yigitbasi, Intel Labs, the following graph:


It compares Mahout vs. Distributed GraphLab on the popular task of matrix factorization using the ALS (alternating least squares) algorithm on Netflix data. The bottom line is that GraphLab is about 20x faster than Mahout.

And here is the exact experiment setup I got from Nezih:

  • N is the number of ALS iterations, D is the number of latent factors. The experiments have been conducted on a 16 node cluster. 
  • We start GL as mpirun -hostfile ~/hostfile -x CLASSPATH ./als --ncpus=16 --matrix hdfs://host001:19000/user/netflix --D=$LATENT_FACTOR_COUNT --max_iter=$ITER_COUNT --lambda=0.065 --minval=0 --maxval=5 
  • To run mahout ALS, we use the factorize-netflix.sh script under the examples directory. It should be run as ./factorize-netflix.sh /path/to/training_set/ /path/to/qualifying.txt /path/to/judging.txt 
  • In our test cluster we have 16 machines each with 64GB of memory, 2 CPUs (Intel(R) Xeon(R) CPU E5-2670 @ 2.60GHz [8 cores each]) and 4 x 1 TB HDDs. The machines communicate over a 10Gb Ethernet interconnect. 
  • The Netflix dataset has been split into 32 equally sized chunks and then put into HDFS.
06 Apr 17:29

A nice collaborative filtering tutorial "for dummies"

by Danny Bickson
I got from M. Burhan, one of our GraphChi users from Germany, the following link to an online book called A Programmer's Guide to Data Mining.

There are two relevant chapters that may help beginners understand the basic concepts.
These are Chapter 2: Collaborative Filtering and Chapter 3: Implicit Ratings and Item Based Filtering.
06 Apr 17:27

Will Kahn-Greene: pyvideo status: April 3rd, 2013

What is pyvideo.org

pyvideo.org is an index of Python-related conference and user-group videos on the Internet. Saw a session you liked and want to share it? It's likely you can find it, watch it, and share it with pyvideo.org.

Status

  • Videos for PyCon US 2013 are still going up. There are 115 posted and live now. There are around 30 that are waiting for presenters to look at the metadata and tell Carl whether the metadata is good or not. More on that later.

  • Several new people submitted patches to richard! Several of the patches were fixes to broken things they saw on pyvideo.org. I've applied the fixes to the site directly, but have been waiting on making any non-critical updates to the site until after things have cooled off. I think I'll do a site update in the next week or so.

  • PyData 2013 was recorded. When videos are posted, they'll be in the PyData category. I don't know what the posting schedule is.

  • I was contacted a couple of times by the inimitable Montréal Python to post their videos. They're going to test out steve, which is the tool I've been writing for the last 6 months to make it possible for other folks to generate the video metadata needed by pyvideo.org.

    I eagerly look forward to their progress and to their videos getting on the site.

    If it works out well, I'll blog more about steve and look for volunteers to use steve to generate the video metadata for the ever increasing backlog.

  • Several people are gittip'ing me. It's not a lot of money, but that and the many emails I've gotten over the last few weeks about the site have been really great. I work on pyvideo.org in my free time of which I don't have a lot. It's nice to know that prioritizing pyvideo.org work over other things helps you.

That's the gist of things!

Most of the PyCon US 2013 videos that aren't live are waiting for presenters to tell Carl at NextDayVideo (carl at nextdayvideo dot com) whether the metadata is good.

  • If you see your name on this list and you've told Carl the metadata is fine already, please send him a friendly reminder.
  • If you see your name on this list and you haven't told Carl anything, please send him a "yes, this is great!" or the list of things you need corrected.
  • If you see a friend on this list, tell your friend to do one of the above.

I'll update this list as I'm aware of changes. However, I don't work for NextDayVideo, so it's entirely possible my list is not current and/or there are errors. If so, please let me know.

Here's the list (last updated 2013-04-03 9:11am -0400):

  • Digital signal processing through speech, hearing, and Python -- Mel Chua
  • IPython in-depth: high-productivity interactive and parallel python -- Fernando Perez, Brian Granger, Min RK
  • Faster Python Programs through Optimization -- Mike Müller
  • Pyramid for Humans -- Paul Everitt
  • Python beyond the CPU -- Andy Terrel, Travis Oliphant, Mark Florisson
  • Code to Cloud in under 45 minutes -- John Wetherill
  • Learn Python Through Public Data Hacking -- David Beazley
  • A Gentle Introduction to Computer Vision -- Katherine Scott, Anthony Oliver
  • Documenting Your Project in Sphinx -- Brandon Rhodes
  • Contribute with me! Getting started with open source development -- Jessica McKellar
  • Intermediate Twisted: Test-Driven Networking Software -- Itamar Turner-Trauring
  • Gittip: Inspiring Generosity -- Chad Whitacre
  • Rethinking Errors: Learning from Scala and Go -- Bruce Eckel
  • The Magic of Metaprogramming -- Jeff Rush
  • You can be a speaker at PyCon! -- Anna Ravenscroft
  • sys._current_frames(): Take real-time x-rays of your software for fun and performance -- Leonardo Rochael
  • Planning and Tending the Garden: The Future of Early Childhood Python Education -- Kurt Grandis
  • powerful pyramid features -- Carlos de la Guardia
  • Python for Robotics and Hardware Control -- Jonathan Foote
  • Copyright and You -- Frank Siler
  • Chef: Automating web application infrastructure -- Kate Heddleston
  • Numba: A Dynamic Python compiler for Science -- Travis Oliphant, Siu Kwan Lam, Mark Florisson
  • Integrating Jython with Java -- Jim Baker, Shashank Bharadwaj
  • Iteration & Generators: the Python Way -- Luciano Ramalho
  • ApplePy: An Apple ][ emulator in Python -- James Tauber
  • Distributed Coordination with Python -- Ben Bangert
  • Become a logging expert in 30 minutes -- Gavin M. Roy
  • PyNES: Python programming for Nintendo 8 bits -- Guto Maia
  • Purely Python Imaging with Pymaging -- Jonas Obrist
  • Namespaces in Python -- Eric Snow


06 Apr 17:25

A Programmer’s Guide to Data Mining

by Patrick Durusau

A Programmer’s Guide to Data Mining – The Ancient Art of the Numerati by Ron Zacharski.

From the webpage:

Before you is a tool for learning basic data mining techniques. Most data mining textbooks focus on providing a theoretical foundation for data mining, and as a result, may seem notoriously difficult to understand. Don’t get me wrong, the information in those books is extremely important. However, if you are a programmer interested in learning a bit about data mining you might be interested in a beginner’s hands-on guide as a first step. That’s what this book provides.

This guide follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. I hope you will be actively involved in trying out and programming data mining techniques. The textbook is laid out as a series of small steps that build on each other until, by the time you complete the book, you have laid the foundation for understanding data mining techniques. This book is available for download for free under a Creative Commons license (see link in footer). You are free to share the book, and remix it. Someday I may offer a paper copy, but the online version will always be free.

If you are looking for explanations of data mining that fall between the “dummies” variety and arXiv.org papers, you are at the right place!

Not new information but well presented information, always a rare thing.

Take the time to read this book.

If not for the content, to get some ideas on how to improve your next book.

05 Apr 20:24

Concurrent and Parallel Programming

by Patrick Durusau

Concurrent and Parallel Programming by Joe Armstrong.

Joe explains the difference between concurrency and parallelism to a five-year-old.

This is the type of stark clarity that I am seeking for topic map explanations.

At least the first ones someone sees. Time enough later for the gory details.

Suggestions welcome!

05 Apr 20:24

Plotting Geo Data using Basemap

by noreply@blogger.com (Jan Erik Solem)
Matplotlib has an interesting toolkit called Basemap for plotting geo data and maps. Basemap handles all the possible map projections and uses Matplotlib to do the plotting. Basemap also includes data like countries, coastlines and background satellite images from NASA Blue Marble.

Plotting geo data using Basemap is actually very simple. Here is an example with geo data from Twitter using the script in my previous post.

from mpl_toolkits.basemap import Basemap
from pylab import *
import twitter_geo

# setup Lambert Conformal basemap.
m = Basemap(width=400000,height=400000,projection='lcc',resolution='f',lat_0=55.5,lon_0=13)
m.bluemarble() #background
m.drawcoastlines() #coastlines

# get some geo data from Twitter
geo = '55.583333,13.033333,100km' #Malmo, Sweden
tweets = twitter_geo.parse(twitter_geo.search(geo=geo))

for t in tweets:
    coords = t[2]['coordinates']
    x,y = m(coords[1],coords[0]) #project to map coords
    plot(x,y,'ro')

show() #you might not need this

This example takes tweets in a 100km radius around Malmö, Sweden and plots the location with red dots on top of a NASA Blue Marble image with coastlines drawn. The result looks like this:

Pretty simple. If you had some dense data like weather or topographic data, that could be placed on the map (e.g. using Matplotlib's contour()), which opens up interesting possibilities for data visualization.
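
For instance, here is a minimal sketch of that idea, reusing the Basemap instance m from above; the gridded data is made up purely for illustration:

import numpy as np

# made-up scalar field on a lon/lat grid around the map region
lons, lats = np.meshgrid(np.linspace(11.0, 15.0, 50),
                         np.linspace(54.5, 56.5, 50))
data = np.sin(3*lons) * np.cos(3*lats) # stand-in for e.g. temperature

x, y = m(lons, lats)      # project the grid to map coordinates
m.contour(x, y, data, 15) # draw 15 contour levels on the map

Each contour line is drawn on top of whatever background the map already has.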
05 Apr 20:24

Converting video to NumPy arrays

by noreply@blogger.com (Jan Erik Solem)
Using OpenCV it is possible to read video frames from a file and convert them to NumPy arrays. Frames returned by QueryFrame() are in OpenCV's IplImage format where the pixels are stored as arrays of color in BGR order. Here's an example of converting the 100 first frames of a video to standard RGB and storing them in a NumPy array.

from numpy import *
import cv

capture = cv.CaptureFromFile('skyline.mov')
frames = []
for i in range(100):
    img = cv.QueryFrame(capture)
    tmp = cv.CreateImage(cv.GetSize(img),8,3)
    cv.CvtColor(img,tmp,cv.CV_BGR2RGB)
    frames.append(asarray(cv.GetMat(tmp)))
frames = array(frames)

Here the CvtColor() function converts from BGR to RGB and NumPy's asarray() does the final conversion. The array will have dimensions (100,height,width,3) in this case.

There is no good way to determine the end of the file; QueryFrame() will just loop around to the beginning. If you want, you can add a check to break when the first frame appears a second time, as sketched below.
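
A minimal sketch of that check, using the same old cv API as above and assuming exact equality with the first frame is good enough to detect the wrap-around:

from numpy import asarray, array
import cv

capture = cv.CaptureFromFile('skyline.mov')
first = None
frames = []
while True:
    img = cv.QueryFrame(capture)
    tmp = cv.CreateImage(cv.GetSize(img),8,3)
    cv.CvtColor(img,tmp,cv.CV_BGR2RGB)
    frame = asarray(cv.GetMat(tmp))
    if first is None:
        first = frame # remember the first frame
    elif (frame == first).all():
        break # we have looped back to the beginning
    frames.append(frame)
frames = array(frames)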

Here's the video I used:
05 Apr 20:23

RQ Factorization of Camera Matrices

by noreply@blogger.com (Jan Erik Solem)
A common operation on camera matrices is to use RQ factorization to obtain the camera calibration matrix given a 3*4 projection matrix. The simple example below shows how to do this using the scipy.linalg module. Assuming the camera is modeled as P = K [R | t], the goal is to recover K and R by factoring the first 3*3 part of P.

The scipy.linalg module actually contains RQ factorization although this is not always clear from the documentation (here is a page that shows it though). To use this version, import rq like this:

from scipy.linalg import rq

Alternatively, you can use the more common QR factorization and with some modifications write your own RQ function.

from numpy import flipud
from scipy.linalg import qr

def rq(A):
    Q,R = qr(flipud(A).T)
    R = flipud(R.T)
    Q = Q.T
    return R[:,::-1],Q[::-1,:]

RQ factorization is not unique. The sign of the diagonal elements can vary. In computer vision we need them to be positive to correspond to focal length and other positive parameters. To get a consistent result with a positive diagonal you can apply a transform that changes the sign. On a camera matrix it looks like this:

from numpy import diag, sign, dot

# factor first 3*3 part of P
K,R = rq(P[:,:3])

# make diagonal of K positive
T = diag(sign(diag(K)))

K = dot(K,T)
R = dot(T,R) #T is its own inverse
05 Apr 20:23

Adjacency matrix for image pixel graphs

by noreply@blogger.com (Jan Erik Solem)
One of my favorite NumPy functions is roll(). Here's an example of using this function to get neighborhood indices to create adjacency matrices for images.

An adjacency matrix is a way of representing a graph and shows which nodes of the graph are adjacent to which other nodes. For n nodes it is an n*n matrix A where a_ij is the number of edges from vertex i to vertex j. The number of edges can also be replaced with edge weights depending on the application.

For images, a pixel neighborhood defines a graph which is sparse since each pixel is only connected to its neighbors (usually 4 or 8). A sparse representation, for example using dictionaries, is preferable if only the edges are needed. For clustering using spectral graph theory, a full matrix is needed. The following function creates an adjacency matrix in sparse or full matrix form given an image:

from numpy import *

def adjacency_matrix(im,connectivity=4,sparse=False):
    """ Create a pixel connectivity matrix with
    4 or 8 neighborhood. If sparse then a dict is
    returned, otherwise a full array. """

    m,n = im.shape[:2]

    # center pixel index
    c = arange(m*n)
    ndx = c.reshape(m,n)

    if sparse:
        A = {}
    else:
        A = zeros((m*n,m*n))

    if connectivity==4: # 4 neighborhood
        hor = roll(ndx,1,axis=1).flatten() # horizontal
        ver = roll(ndx,1,axis=0).flatten() # vertical
        for i in range(m*n):
            A[c[i],hor[i]] = A[hor[i],c[i]] = 1
            A[c[i],ver[i]] = A[ver[i],c[i]] = 1

    elif connectivity==8: # 8 neighborhood
        hor = roll(ndx,1,axis=1).flatten() # horizontal
        ver = roll(ndx,1,axis=0).flatten() # vertical
        diag1 = roll(roll(ndx,1,axis=0),1,axis=1).flatten() # one diagonal
        diag2 = roll(roll(ndx,-1,axis=0),1,axis=1).flatten() # other diagonal
        for i in range(m*n):
            A[c[i],hor[i]] = A[hor[i],c[i]] = 1
            A[c[i],ver[i]] = A[ver[i],c[i]] = 1
            A[c[i],diag1[i]] = A[diag1[i],c[i]] = 1
            A[c[i],diag2[i]] = A[diag2[i],c[i]] = 1

    return A

Use it like this:

from pylab import *

im = random.random((10,10))
A = adjacency_matrix(im,8,False)

figure()
imshow(1-A)
gray()
show()

Which should give you an image like the one below.
05 Apr 20:23

Book draft

by noreply@blogger.com (Jan Erik Solem)
The other night I put up the first early draft of my upcoming book "Programming Computer Vision with Python".

The book is meant as an entry point to hands-on computer vision for students, researchers and enthusiasts using python. I'm thinking introductory courses in image analysis and computer vision. Also, OpenCV python users who need to do something outside what's in OpenCV will find the book useful. There is a growing python community around computer vision with projects like pythonvision and SimpleCV. The book can serve as a computer vision introduction for these users as well. And let's not forget all the hackers and enthusiasts out there.

It is not meant as a textbook; it is a manual for getting started with python and/or computer vision. I'm putting everything online in order to get early feedback, to help hammer out errors, and to improve the text and code.

Draft versions together with code and data are available from the book webpage.

I'll take all the feedback and comments I can get! Preferably via email.

UPDATE (Sept 25): Thanks so much for all the feedback and comments! I'm now on the third iteration of the book pdf since putting it online a month ago. Check the book page for updates and keep sending me your thoughts. /JES
05 Apr 20:22

Simple Panoramas with OpenCV

by noreply@blogger.com (Jan Erik Solem)
The latest 2.4 release of OpenCV contains some good stuff. The stitching module is really useful. In the OpenCV sample folder there is a great script called stitching_detailed.cpp. This does the whole pipeline for creating panoramas including feature extraction, matching, warping and blending. If you have OpenCV properly installed you can compile the file for your platform and try it out from your command line like this:

$ ./stitching_detailed Univ*.jpg

This will use the default parameters and settings and create a file result.jpg with your panorama created from all jpeg images starting with "Univ". To see the defaults and what options you can set:

$ ./stitching_detailed --help

For example, you can change the projection used (default is spherical). This will use a Mercator projection:

$ ./stitching_detailed Univ*.jpg --warp mercator

Here are some sample results using these images of the University building in Lund. This is the same example used in Chapter 3.3 of my book.


Cylindrical projection.

Planar projection.

Mercator projection.
05 Apr 15:26

Paper of the Day (Po'D): Multi-tasking with joint semantic spaces

by Bob L. Sturm
Hello, and welcome to the Paper of the Day (Po'D): Multi-tasking with joint semantic spaces edition. Today's paper is: J. Weston, S. Bengio and P. Hamel, "Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval," J. New Music Research, vol. 40, no. 4, pp. 337-348, 2011.

This article proposes and tests a novel approach (pronounced MUSCLES but written MUSLSE) for describing a music signal along multiple directions, including semantically meaningful ones. This work is especially relevant since it applies to problems that remain unsolved, such as artist identification and music recommendation (in fact the first two authors are employees of Google). The method proposed in this article models a song (or a short excerpt of a song) as a triple in three vector spaces learned from a training dataset: one vector space is created from artists, one created from tags, and the last created from features of the audio. The benefit of using vector spaces is that they bring quantitative and well-defined machinery, e.g., projections and distances.

MUSCLES attempts to learn the vector spaces together so as to preserve (dis)similarity. For instance, vectors mapped from artists that are similar (e.g., Britney Spears and Christina Aguilera) should point in nearly the same direction; while those that are not similar (e.g., Engelbert Humperdinck and The Rubberbandits) should be nearly orthogonal. Similarly, so should vectors mapped from tags that are semantically close (e.g., "dark" and "moody"), and semantically disjoint (e.g., "teenage death song" and "NYC"). For features extracted from the audio, one hopes the features themselves are comparable, and are able to reflect some notion of similarity at least at the surface level of the audio. MUSCLES takes this a step further to learn the vector spaces so that one can take inner products between vectors from different spaces --- which is definitely a novel concept in music information retrieval.

As an example, consider we have some unidentified and untagged song for which we would like to identify the artist, propose some tags, and/or find similar songs. By designing vector spaces using MUSCLES, we can retrieve a list of possible artists by taking the inner product between the feature vector of the song and our learned artist vectors. As Weston et al. define, we can judge as similar those giving the highest magnitudes. Similarly, we can retrieve a list of possible tags by taking the inner product between the feature vector of the song and our learned tag vectors. And we can retrieve a list of similar songs by taking the inner product between the feature vector of the song and those of our training set (which is a typical approach, but by no means a good approach, to judging music similarity).
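
To make the retrieval step concrete, here is a minimal NumPy sketch of ranking by magnitude of inner product; the array shapes and contents are hypothetical, not taken from the paper:

import numpy as np

d = 100                               # shared embedding dimension (assumed)
artist_vecs = np.random.randn(500, d) # one learned vector per artist
tag_vecs = np.random.randn(160, d)    # one learned vector per tag
song_feat = np.random.randn(d)        # feature vector of the unknown song

# rank artists and tags by the magnitude of their inner product
# with the song's feature vector, largest first
top_artists = np.argsort(np.abs(artist_vecs @ song_feat))[::-1][:10]
top_tags = np.argsort(np.abs(tag_vecs @ song_feat))[::-1][:10]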

To learn these vector spaces, MUSCLES uses what appears to be constrained optimization (the norms of the vectors in each space are limited) using a simple gradient descent of a cost function within a randomly selected vector space (stochastic gradient descent). Since they are interested in increasing precision, the authors optimize with respect to either margin ranking loss or weighted approximate ranked pairwise loss --- both relevant for the figure of merit precision @ k. Furthermore, when one wishes to design a system that can do all three tasks (artist prediction, tag prediction, and music similarity), MUSCLES considers all three vector spaces, and optimizes each one in turn by holding the others constant.

Weston et al. test MUSCLES using the TagATune dataset (16,289 song clips in training, 6,499 clips for testing, 160 unique tags applied to all data), as well as a private one they assemble to test artist identification and similarity (275,930 training clips, 66,072 test clips, from 26,972 artists). Their audio features are essentially histograms of codebook vectors (learned using K-means from the training set). They compare the performance of MUSCLES against a one-vs-rest SVM, and show MUSCLES has better mean p@k for several k (and with statistical significance at alpha = 0.05) when predicting tags for the TagATune dataset. For artist identification in the bigger dataset, we see MUSCLES performs better than the SVM in artist prediction, song prediction, and song similarity (judged by whether the artist is the same). Finally, Weston et al. show MUSCLES not only requires much less overhead than an SVM, but also that a nice by-product of the MUSCLES learning approach is sanity checking of the models. In other words, we can check the appropriateness of vectors learned for tags by simply taking inner products and inspecting whether those close together contain similar concepts.

Overall, this is a very nice paper with many interesting ideas. I am curious about a few things. First, how does one deal with near duplicates of tags? For instance, "hip hop" and "hip-hop" are treated separately by many tagging services, but they are essentially the same. So, does one need to preprocess the tag data before MUSCLES, or can MUSCLES automatically weed out such spurious duplicates? Second, in order to use MUSCLES, the feature dimensions of each space must be the same. What happens when we have only a few artists, but many years of their recordings --- for instance, many musicians change their style over the years. Will MUSCLES automatically find a vector for young Beethoven and old Beethoven? This restriction is necessary, but how does one set that dimensionality? I like the machinery that comes with vector spaces; however, I don't think it makes sense to think of artists or timbre spaces in the same way as tag spaces. For instance, I mention above that Britney Spears and Christina Aguilera should point in the same direction, but Engelbert Humperdinck and The Rubberbandits should be nearly orthogonal. However, should the tags "sad" and "happy" be orthogonal, or point in exactly opposite directions? What does it mean for two artists to point in opposite directions? Or two song textures? A further problem is that MUSCLES judges similarity by the magnitude of the inner product. In such a case, if "sad" and "happy" point in exactly opposite directions, then MUSCLES will say they are highly similar.
05 Apr 15:26

Paper of the Day (Po'D): Automatic classification of musical genres using inter-genre similarity edition

by Bob L. Sturm
Hello, and welcome to the Paper of the Day (Po'D): Automatic classification of musical genres using inter-genre similarity edition. Today's paper is: U. Bağci and E. Erzin, "Automatic classification of musical genres using inter-genre similarity," IEEE Signal Proc. Letters, vol. 14, pp. 521-524, Aug. 2007. Related to this work are the following four:
  1. U. Bağcı and E. Erzin, "Boosting classifiers for music genre classification," in Proc. 20th Int. Symp. Comput. Inform. Sci. (ISCIS'05), Istanbul, Turkey, Oct. 2005, pp. 575-584.
  2. U. Bagci and E. Erzin, "Inter genre similarity modeling for automatic music genre classification," in Proc. IEEE Signal Process. Comm. Apps., pp. 1-4, Apr. 2006.
  3. U. Bagci and E. Erzin, "Inter genre similarity modeling for automatic music genre classification," in Proc. DAFx 2006.
  4. U. Bagci and E. Erzin, "Inter genre similarity modelling for automatic music genre classification," arXiv:0907.3220v1, 2009.
(Is the DAFx paper an English translation of the IEEE conference paper written in Turkish?)

This work is next on my docket for reproduction, not least because it reports a classification accuracy of over 92% on the GTZAN dataset. For most of the time in our day-to-day machine learning experiences, we want more data in order to build better models. In this paper, however, the approach is to find the music excerpts of each genre that are the most difficult to classify, and then use only those to build the classifier. The idea is that by learning the difficult examples, we will also learn the easy ones. The features used in this work are 13 MFCCs (with or without the zeroth coefficient?), spectral centroid, roll-off, flux, and zero-crossing rate, all computed from each 25 ms window shifted by 10 ms (I think; they talk about "25 ms overlapping audio windows ... for each 10 ms frame." And later it is written: "The resulting ... feature vectors are extracted for each 10 ms audio frame with an overlapping window of size 25 ms." Huh?). For a set of time-ordered features, then, the system computes weighted first and second derivatives. Finally, the system creates for each analysis window (except the first few, because of the derivatives) a feature vector by concatenating all of these features into one that has 51 dimensions (17 base features times three, counting the two derivatives).
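
As a rough sketch of that feature assembly (using simple differences in place of the paper's weighted derivatives, which is my own simplification):

import numpy as np

def assemble_features(F):
    # F: (num_frames, 17) array of 13 MFCCs plus centroid,
    # roll-off, flux, and zero-crossing rate per frame
    d1 = np.diff(F, axis=0)      # first derivative, (num_frames-1, 17)
    d2 = np.diff(F, n=2, axis=0) # second derivative, (num_frames-2, 17)
    # drop the first few frames so everything lines up: 17*3 = 51 dims
    return np.hstack([F[2:], d1[1:], d2])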

The system next models each genre by mixtures of Gaussians in this 51-dimensional space using features from the training data. Then, the system classifies each "frame" (is that a 25 ms. window?) of the same training data. Taking only the misclassified frames in each genre, the system constructs a new model for these features, which represents the "inter-genre similarity" (IGS) distribution. (This kind of reminds me of a universal background model in speaker identification.) Finally, the system updates its models of each genre by using only the correctly classified frames. This leaves us with 10 genre models and 1 model of the IGS. This can be done in an iterative fashion as well, so as to tune all the models.

Now, to classify an excerpt producing a set of feature vectors (in a decision window of, say, 3 or 30 s duration), the system computes for each genre a weighted sum of log posteriors, with weights acting as a 1-0 loss function with respect to the IGS model (i.e., if the cond. prob. of observing a feature given the model of genre \(n\) is smaller than the cond. prob. of observing a feature given the IGS model, then the weight of that posterior is zero; otherwise it is one). This acts to minimize the insecurity imparted to a classifier by frames that are difficult-to-model by any genre. For comparison, the authors set all these weights to one so the IGS model is never used in classification (called the "flat classifier").

To test this approach, the article uses the GTZAN dataset and 2-fold cross-validation, and computes mean classification accuracy (no std. dev.). Testing various numbers of Gaussians and different feature sets, the article reports for the flat classifier mean accuracies of about 70% using only MFCCs and their first and second weighted derivatives, modeled by mixtures of 16-48 Gaussians (with diagonal covariances) over 30 s decision windows. For these same features and parameters, but with the 1-0 loss decision function and the IGS model (not tuned), the article reports mean accuracies around 85%. (With a mixture of 48 Gaussians we see mean accuracy above 88%.) The confusion matrix in this article shows some interesting items. First, when using the IGS model vs. the flat classifier, most confusions decrease. We go from 19 Rock excerpts confused by the flat classifier as Country to only 4 using the IGS model. But in two cases the confusions increase by a lot: we go from 4 confusions of Country as Metal by the flat classifier, to 11 by the IGS model. What are these Country Metal tunes? (I think one must be "White Lightning" by George Jones --- a real boozer of a rock star, by the way.) And we go from 9 confusions of Rock as Metal by the flat classifier to 17 by IGS, which might not be so troublesome considering excerpts by Queen appear in both categories (including the exact same excerpt). Finally, the paper reports that the approach created by tuning the genre and IGS models five times obtains a mean classification accuracy of 92.4% over a 30 s decision window.
05 Apr 15:26

Test results for inter-genre similarity, part 1

by Bob L. Sturm
Having reviewed yesterday the work on music genre classification by boosting with inter-genre similarity, I now have some results. Adapting my MFCC code to fit the description in the text, and using the excellent PRTools toolbox, it only took me a matter of minutes to create the code. Running it, however, takes hours. Essentially, the core of the algorithm looks like this:

  % create initial models (mixture of Gaussians classifier)
  W = mogc(traindataset,numGaussians);
  % classify
  results = traindataset*W;
  predictedLabels = results*labeld;
  % find those frames that are wrongly classified
  idx_wrong = ~(predictedLabels == traindataset.nlab);
  
  % create new dataset from those wrongly classified
  newlabels = traindataset.nlab;
  newlabels(idx_wrong) = 0; % this is the IGS class
  traindataset = dataset(traindataset.data,newlabels);
  traindataset = setprior(traindataset,0);
  % create final models
  W = mogc(traindataset,numGaussians);
  
  % test
  results = testdataset*W;
I am assuming the priors for any of the genre categories or the IGS model are the same, 1/11. (Nothing about this is mentioned in the paper.) Then to classify, I take the posteriors of each class for a set of 2770 frames. (Bagci and Erzin say they use 3000 frames for all excerpts, but I don't believe it since there are some excerpts in GTZAN that are shorter than 30 seconds.)

    sumlogpost = -Inf*ones(numclasses,1);
    % compare posteriors of each class to those of IGS model
    for jj=1:numclasses
      posteriorsforthisclass = posteriors(:,jj+1);
      % find those that have posteriors above that of IGS
      idx = (posteriorsforthisclass > posteriors(:,1));
      numframestoconsider(jj) = sum(idx);
      if numframestoconsider(jj)
        sumlogpost(jj) = sum(log(posteriorsforthisclass(idx)))/numframestoconsider(jj);
      end
    end
    [~,pred] = max(sumlogpost);
And that is all.

When we do not consider whether the posterior of a frame is more likely to come from the IGS model (the flat classifier), we get the following confusion table for 2-fold CV in GTZAN using GMMs of order 8 (full covariance matrices, however, instead of the diagonal ones specified in the text; and 27.7 s decision windows instead of 30 s). (Recall is along the diagonal, precision is the right-hand column, F-measure is the last row, and the element at lower right is classification accuracy.)

confusion_GMMC8_flat.png

That is a classification accuracy of about 63%. The mean classification accuracy reported by Bagci and Erzin is about 67%. I think we are certainly in the neighborhood.

Now, when we take the IGS model into account, we observe the following confusion table using the exact same folds.

confusion_GMMC8_IGS.png

That is a classification accuracy in the high-tens --- like the weather across much of the US. It apparently thinks most things are Rock and Metal. The mean classification accuracy reported by Bagci and Erzin for this IGS is about 87%. No longer are we in the neighborhood, but on the Tom Waits train to the other side of hell.

When we instead make the classifier decide based upon only those frames that have a higher posterior in the IGS model, we observe the following confusions.

confusion_GMMC8_opp.png

This is a statistically significant improvement indeed, but it is confusing why that should work better than before.

So, on the basis of what number of frames in each class is this classifier making its decision? Let's look at the distribution of the percentage of frames the classifier uses to make classifications that are incorrect.

confusion_GMMC8_IGS_incorrect.png

Here we see for Blues excerpts that the classifier is only looking at about 17% of the frames to mislabel them. For Classical, it is looking at less than 10% (277 25 ms frames). For the most part, we would expect most wrong classifications to occur on the basis of too few reliable frames. So this makes sense.

Now let's look at the distribution for when classifications are correct --- which, as we see above, this classifier rarely is.

confusion_GMMC8_IGS_correct.png

This is quite a different picture. Except for Country, Disco, Hip hop, Pop, and Rock, most of the correct decisions are made using a majority of frames. Particularly worrisome is that most of the frames of excerpts in Country, Hip hop, and Rock, and all of the frames of Pop, are too well-described by the IGS model.

I am now going to verify things are working correctly in PRtools. One problem might be the fact that I am estimating full covariance matrices, whereas Bagci and Erzin estimate diagonal covariance matrices. The multiplicity of estimates in my model may be contributing to the bad results. Another problem might be the features. I will reduce to MFCCs without the delta and delta delta coefficients to see if that makes a difference. Yet another problem is that too many frames appear to end up in the IGS model. Maybe those need to be capped somehow.
05 Apr 15:26

Test results for inter-genre similarity, part 3

by Bob L. Sturm
The genre recognition approach proposed by Bagci and Erzin is motivated by several key assumptions, which I boil down to the following:

  1. there are common properties among music genres, for instance, "similar types of instruments with similar rhythmic patterns";
  2. the nature of some of these common properties is such that bags of frames of features (BFFs) encapsulate them;
  3. this means we should expect poor automatic genre recognition using BFFs if these common properties are quite prevalent among the classes; (or, that music genres have common properties is to some extent responsible for their ambiguousness;)
  4. we can model the common properties of a set of music genres by combining the features of frames that are misclassified in each genre;
  5. we can model the non-common properties of each of a set of music genres by combining the features of frames that are correctly classified in each genre;
  6. with these models, we can thereby ascribe a confidence that any given frame encapsulates common properties, and thus whether to treat it as indicative of genre or not.
This got me thinking: let's forget about music signals for a while and consider a perfect toy problem for which the approach proposed by Bagci and Erzin would work. The assumptions above then become:

  1. there are common properties among the classes in dataset X;
  2. the nature of some of these common properties is such that our features encapsulate them;
  3. this means we should expect poor automatic recognition using these features if the common properties are quite prevalent among the classes;
  4. we can model the common properties of our classes by combining the features that are misclassified in each class;
  5. we can model the non-common properties of each of our classes by combining the features that are correctly classified in each class;
  6. with these models, we can thereby ascribe a confidence that any given set of features (an observation) encapsulates common properties, and thus whether to treat it as indicative of a class or not.
Now, let's generate data of four classes, where each class is modeled by a mixture of 2 Gaussians with different parameters.

toy1.png

In this pretty little scene, we see the features of our classes in the 2-dimensional feature space. Clearly, this dataset satisfies assumptions 1 and 2 above. With 2-fold CV, we model each class as a mixture of two Gaussians, and then classify each individual feature by maximum likelihood (equal priors). We get an error rate of 0.375, which is much better than random. Now, consider that each of our observations produces 2 features instead of a single one. When we classify each observation by summing the log conditional densities of its 2 features, and selecting the class with the maximum (aggregated classification), our error rate decreases to 0.158.
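
Here is a minimal sketch of that aggregated decision rule (my own reconstruction of the toy experiment, not its exact code):

import numpy as np
from scipy.stats import multivariate_normal

def mog_logpdf(X, components):
    # log density of a mixture of Gaussians at the rows of X;
    # components is a list of (weight, mean, cov) triples
    p = sum(w * multivariate_normal.pdf(X, mean=mu, cov=C)
            for w, mu, C in components)
    return np.log(p)

def classify_aggregated(X, class_models):
    # X holds one observation's features, one row per feature;
    # sum the per-feature log densities and pick the best class
    scores = [mog_logpdf(X, comps).sum() for comps in class_models]
    return int(np.argmax(scores))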

Instead of 2 features, let the observations consist of 5.

toy2.png

Our error rate for individual features stays the same, but now the error of the aggregated classifier plummets to 0.037.

Now, as a sanity check, let's change each class model such that with probability 0.9 a feature is taken from that part where all classes overlap. There should then be much less difference between the two approaches to classification.

toy3.png

Sanity confirmed.

Similarly, if we reduce the separation of the unique components of each class, there should be less difference as well.

toy4.png

Sanity confirmed. And this confirms assumption 3. The more probable it is that our observation of some class contains features of the common properties, the worse we are going to do.

Time to move to the next two assumptions. In our training set, we take those features we misclassify and use only those to train a model of the common properties. Then we take the features that are correctly classified in a class, and use those to train a model of that class. Here are the resulting models for the first fold.

toy5.png

The contour in magenta is of the model for the common properties. First, we can see that limiting the covariance matrix to be diagonal may not be the best way to model the non-common properties of each class. Second, we see that some features of the green and black classes in the common region have been correctly labeled, and so those classes contribute a Gaussian there.

Let's iterate this procedure now to tune the models. We take the training data, build models for each class, classify each feature, take the misclassified frames, relabel them to be the common class, and build a model for the common properties. Then we take the correctly classified frames and build new models for each class, classify the training data again, take all misclassified frames and build models again, and do it again and again. Below is an animation of 10 tuning iterations for one fold, showing how many features go from each class to the common-properties model. Eventually things settle, with about one fifth of the features from every class flitting away to the common class.
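
In pseudocode-ish Python, the tuning loop looks roughly like this, with fit and predict standing in for the mixture-model training and per-feature classification steps (both assumed helpers, not actual PRTools calls; X and y are NumPy arrays):

def tune_igs(X, y, fit, predict, iterations=10):
    # y holds the true class of each feature vector;
    # class 0 is reserved for the common-properties model
    labels = y.copy()
    for _ in range(iterations):
        models = fit(X, labels)         # one mixture per class, plus common
        wrong = predict(X, models) != y # misclassified features...
        labels = y.copy()
        labels[wrong] = 0               # ...get relabeled as common
    return fit(X, labels)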

pdf-obj5-x-y-w00t.gif

And making the data more overlapping, we see the same kind of behavior.

dsfdsf.gif

Now let's see how well it works on classification. What we do is classify each feature of an observation, then classify the observation based on the maximum of the summed log conditional probabilities of those features having a higher conditional probability in a class model than in the common-properties model. First, an easy dataset.

toy6.png

Here we see a halving of the error after ten tuning iterations! And now for a more difficult dataset.

toy7.png

So, apparently, it is working!

Now, what is the critical separation between these classes such that this tuning procedure doesn't hurt or help? I am going to run the above procedure for datasets of various separations, and compute the errors for each one. Still, we have five features in each observation. First, for a probability of a feature being drawn from the non-common set of 0.5, we see the following errors.

toy7b.png

Clearly, IGS looks good, and its error does not exceed that of the aggregated classifier.

Now for a probability of a feature being drawn from the non-common set of 0.3, we see the following errors.

toy8.png

All the errors have risen, but the cross-over still appears around the same separation as before.

Now for a probability of a feature being drawn from the non-common set of 0.1, we see the following errors.

toy9.png

Let's see what happens when we increase the number of features in each observation from 5 to 50.

toy10b.png

Very nice!

Now that I have, through this process, ironed out some critical bugs in my code, it is time to reapply to music and test the first assumptions I enumerate above.
05 Apr 15:26

From MPTK books to Wivigrams

by Bob L. Sturm
Here is a sonogram of the introduction to "Pictor Alpha" by Curtis Roads.

sonogram.png

Now, we make a dictionary of Gabor atoms of various sizes, and decompose this signal into 5000 atoms using MPTK. How can we visualize the output? One of the easiest ways is by a superposition of the Wigner-Ville distributions of the atoms, scaled by their energy in the decomposition. This produces the following picture.

wivigram.png

The code I used to generate this is here. Much more can be done to the code by way of software engineering to streamline it; but it is only meant to serve as a simple example of how to do such visualizations.

Now, time to look at doing the same but in python.
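
As a first stab in that direction, here is a minimal Python sketch of the wivigram idea: each Gabor atom contributes an energy-weighted Gaussian blob on the time-frequency plane. The atom list format is assumed for illustration, not MPTK's actual book format, and the blob widths are only qualitatively right (spread proportional to the scale in time, inversely proportional in frequency):

import numpy as np

def wivigram(atoms, T, F, grid=(512, 512)):
    # atoms: list of (energy, t0, f0, s) with center time t0 (s),
    # center frequency f0 (Hz), and scale s (s); T, F: plot extents
    t = np.linspace(0, T, grid[1])
    f = np.linspace(0, F, grid[0])
    tt, ff = np.meshgrid(t, f)
    img = np.zeros(grid)
    for energy, t0, f0, s in atoms:
        # approximate Wigner-Ville footprint of one Gabor atom:
        # a 2D Gaussian centered at (t0, f0)
        img += energy * np.exp(-np.pi*((tt - t0)/s)**2
                               - np.pi*(s*(ff - f0))**2)
    return img

Plotting the returned array with imshow() should give something in the spirit of the picture above.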
05 Apr 15:24

19 months of Reproducible Research

by Igor
When Rich reminds us that "We generated more data than there are stars in the Universe", he also points to the obvious need to make sense of them. One strategy is to come up with new ideas for better sensors (These Technologies Do Not Exist) or better estimation and detection schemes. 
Another, surer strategy is to go through the Reproducible Research route as a means of making sure that your ideas do not die. I have added an implementation tag to blog entries that featured an algorithm implementation for the past 19 months, and it looks like we have about 146 of them. That figure amounts to about 7.6 implementations released per month, or close to two releases per week.
Here is a list of all these implementations that I will update every month after every Nuit Blanche Monthly Review:


Image Credit: NASA/JPL/Space Science Institute. N00205148.jpg was taken on April 02, 2013 and received on Earth April 03, 2013. The camera was pointing toward SATURN, and the image was taken using the CL1 and CL2 filters.



Join the CompressiveSensing subreddit or the Google+ Community and post there!
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
05 Apr 15:11

Iraq president improving, now able to speak

by Patrick Osgood
Iraqi President Jalal Talabani, who is being treated in Germany after suffering a stroke, is now able to speak and is able to recognise those around him, his doctor said on Saturday. "The president's health is improving, and he is responding well to treatment," Najm al-Din Karim, who is also the governor of Iraq's northern [...]
05 Apr 15:11

Control Theory: Example 3 – Ziegler-Nichols and Tyreus-Luyben Tuning Rules

by rip

Edit 21 Jan 2013: added “Example 3″ to the title.

Introduction

I’m going to jump right in and imagine that we have a known plant which we wish to control using P, PI, or PID. I’m going to assume that you have some idea what those are. We’ve seen proportional (P) control, where the control signal is proportional to the error. Integral control has a signal which is proportional to the integral of the error, and derivative control a signal which is proportional to the rate of change (the derivative) of the error. Integral control is always used in conjunction with proportional, hence PI. Although derivative control can be used with only proportional, in process control we generally use both integral and proportional with it, hence PID. We’ll talk more about them in subsequent posts, but for now I want to show you two ways to get the required parameters. They may not always be very good values, but they’ll be in the ballpark. At the very least, they are good starting values for further investigation.

Oh, I am aware that Mathematica® version 9 has new capabilities for tuning controllers – but let me write about what I know, using version 8.

My plant is a 3rd-order transfer function… as usual, rules are a convenient way to specify things:

con 1 14 1

Bequette – this is Example 5.3 (p. 174) and Example 6.1 (p. 199) from Bequette’s “Process Control: Modeling, Design, and Simulation”, Prentice Hall 2003. He used the Routh criterion to decide that the ultimate gain was 10: that’s on the verge of instability. Let’s take a look – that is, I’m about to show you what the ultimate gain means.

I set a new parameter value for K, namely K = 10, the alleged ultimate gain.

con 1 14 2

The closed loop transfer function (y) is…. and the time-domain response to a unit step function is…

con 1 14 3

Let’s get the control effort transfer function u, too… and the time-domain behavior:

con 1 14 4

Here’s my usual first plot: step function in silver (or whatever it is), system response in black, scaled control effort (c1/K) in gold:

con 1 14 5

OK, we see steady-state oscillations in response to a unit step input. Make K slightly greater, and the oscillations will grow; make K slightly smaller and the oscillations will get smaller. We are on the edge of instability, just as advertised.

I should also show you the actual control effort:

con 1 14 6

We would like to know the period of these output oscillations. It looks like it is slightly above 6… there’s a minimum just before 7, and the next one is just after 13 – look at the previous graph… that’s why I scale the control effort, so the output is not compressed.

We could use calculus to find them…. If we were looking at a paper recording, we might be able to measure them, but in this case we have an explicit equation for the solution, so let’s go ahead and do calculus. Here’s the first derivative of the output… set it to zero.

con 1 14 7

Let’s find solutions near 7 and 13…

con 1 14 8

We could also back up, and instead of finding the ultimate gain by the Routh criterion, we could have found it by simulation. I’ve shown you simulation before, so I won’t do it this time. If, instead of having a math model, you were in the control room playing with parameters, you’d be using the system itself to do a simulation… and the operators would be planning to kill you for bringing the system to the edge of instability.

The bottom line, however, is that we now have two parameters, called the ultimate gain and the ultimate period. There are some simple tuning rules – choices of P, PI, and PID settings – which are based on these two parameters.

But first let me show you the right way to get the ultimate gain and ultimate period, if you have a quantitative model. Everything else in this post is straightforward, I hope, but this trick is not, at least to me.

We return to our initial system (K = 1):

con 1 14 9

And here’s the open loop Bode plot:

con 1 14 10

The gain margin is 10 – gee, that’s the ultimate gain Bequette found! It occurs at a frequency of 1 radian/second.

Given a frequency f in radians/second, we find the period P from \(P f = 2\pi\). This crossover frequency is 1, so the corresponding period is

con 1 14 11

… and that is the ultimate period we computed by finding two consecutive minima.

Don’t ask me why it works, I don’t understand it yet. But I have more than one book telling me it’s not an accident. This, as far as I’m concerned at this point, is the main use of the gain margin. There’s absolutely no reason to simulate the math model and estimate these numbers: they can be read off the gain margin. If you have a math model.

Now save these:

con 1 14 12

Now that we have Pu and ku, there are some rules for setting the parameters of P, PI, and PID controllers. But while I have the default K = 1, let’s see how the uncontrolled plant responds to a unit step. I get the closed loop transfer function and then the inverse Laplace transform:

con 1 14 13

Here’s the output and the input (black and silver resp.)

con 1 14 14

Well, except that it’s nowhere near 1 – i.e. we have offset – it had a pretty gentle response, without wild swings.

With a PID controller, we have plant… and controller…

con 1 14 15

We see three parameters: an overall gain kc, and two time constants, \(\tau_D\) and \(\tau_I\). The \(\tau_D\) is multiplied by s – which says that’s the Laplace transform of a derivative… and the \(\tau_I\) is coupled with a 1/s – which says that’s the Laplace transform of an integral. What I’m going to show you is two sets of rules for getting from the ultimate gain and the ultimate period to values of kc, \(\tau_D\), and \(\tau_I\).
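
In the standard ideal form, which I believe is what the screenshot above shows, the controller is

\( K(s) = k_c \left( 1 + \frac{1}{\tau_I s} + \tau_D s \right) \)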

The first set of rules is called Ziegler-Nichols; the second is called Tyreus-Luyben.

Let’s go.

Ziegler-Nichols with P, PI, PID

First, let’s try the Ziegler-Nichols rules.

P-only control

To get P-only control, we set \(\tau_D \to 0\) and \(\tau_I \to \infty\).

con 1 14 16

Here are the closed loop transfer functions for output and for control… and, since I will want it later, the open loop transfer function, too:

con 1 14 17

The Z-N rule for P-only control is to set kc = .5 ku. (Parts of the following rule are redundant, because I’ve already set \(\tau_D\) and \(\tau_I\), but I like having a complete rule anyway.)

con 1 14 18

and the closed loop for control becomes

con 1 14 19

We get the time-domain responses by taking inverse Laplace transforms…

con 1 14 20

And we plot the input (step function in “silver”), output (black), and scaled control effort (gold):

con 1 14 21

Not only does it take a long time to settle, but there is offset. As for the long settling time… K = 1 got there quickly, K = 10 was on the edge of instability, so K = 5 should have decaying oscillations. OK?

(And if you’re not used to seeing offset with P-only control… remember that the use of transfer functions requires that we use deviation variables: all our initial conditions are zero, in these variables. That is, for example, we’re not using the actual liquid level in a tank – we’re using the deviation from the set point. To hold the system at a value other than zero requires a constant nonzero control effort… so the response is nonzero, but not necessarily 1.)

For the record, here’s the unscaled control effort (gold):

con 1 14 22

Lastly, I want to see the open loop Bode plot, and compute the phase margin:

con 1 14 23

The phase margin is 25° – a little too jittery, as we saw.

PI control

To get PI control, we set \(\tau_D \to 0\).

con 1 14 24

Here are the closed loop transfer functions for output and for control… and the open loop transfer function for later:

con 1 14 25

Note that kc has changed a little: instead of .5 we have .45. What this says is that the control modes interact: they should not be set independently of each other… once we add integral control to proportional control, we change the value of the proportional control… and we’ll change both of them again when we add derivative control.

The closed loop transfer function for output becomes.. and we know the ultimate gain ku and ultimate period Pu:

con 1 14 26

The closed loop for control becomes

con 1 14 27

The open loop transfer function becomes

con 1 14 28

The time-domain responses become:

con 1 14 29

And here’s the usual plot, including scaled control effort:

con 1 14 30

It still takes a long time to settle, but the offset is gone.

After all, that’s one of the things integral control does for us: it eliminates offset. I should point out that we don’t always care about offset: level control for surge tanks is the best example where we do not care to hold the level at exactly 50%. In fact, to do so would defeat the purpose of a surge tank, and we might as well replace it by a pipe – what goes out is exactly what comes in.

Here’s the unscaled control effort:

con 1 14 31

And here’s the open loop Bode plot:

con 1 14 32

This quantifies what we saw: that the output is more jittery than P-only control.

PID control

To get PID control, we keep all three PID parameters:

con 1 14 33

Here are the closed loop transfer functions for output and for control… and, as usual, the open loop transfer function for later:

con 1 14 34

This time kc is greater than .5, and \(\tau_I\) is smaller than for PI.

The closed loop transfer function for output becomes.. and we know the ultimate gain ku and the ultimate period Pu:

con 1 14 35

and the closed loop transfer function for control becomes

con 1 14 36

The open loop transfer function becomes

con 1 14 37

The time domain responses…

con 1 14 38

… and the usual plot:

con 1 14 39

As usual, here is the unscaled control effort:

con 1 14 40

The open loop Bode plot is

con 1 14 41

That phase margin of 29° is… marginal, at best – go ahead, grimace and hate me; still too jittery.

Let’s take a look at the three output responses:

con 1 14 42

I used panels 3-5 from the following, so orange is P, brown is PI, and yellow is PID. Alternatively, the orange must be P because it has offset, so PID should be yellow.

con 1 14 43

Anyway, we see that P has offset… PI does not, but it takes longer to settle down than P did… PID settles down much faster than the other two, and has no offset, but it still swings wildly at first (that overshoot is just over 40%.)

Tyreus-Luyben with PI, PID

Now I try the Tyreus-Luyben rules. Like the Ziegler–Nichols, they use ultimate gain ku and ultimate period Pu. For P-only control, they are the same.

PI control

To get PI control, we set \(\tau_D \to 0\).

con 1 14 44

Here are the closed loops for output and for control… and, of course, the open loop transfer function:

con 1 14 45

The T-L rule for PI control is

con 1 14 46

The closed loop for control becomes

con 1 14 47

The open loop becomes

con 1 14 48

The time-domain responses are…

con 1 14 49

Here is the usual plot showing input, output (black), and scaled controller effort:

con 1 14 50

Settles faster, and the offset is gone.

Here’s the unscaled control effort:

con 1 14 51

Here’s the open loop Bode plot:

con 1 14 52

OK, we finally have a phase margin (39°) which is greater than 30°: this choice of parameters is OK by that criterion.

PID control

To get PID control, we need all three PID parameters:

con 1 14 53

The T-L rule for PID control is

con 1 14 54

and the closed loop transfer function for control becomes

con 1 14 55

The open loop transfer function becomes

con 1 14 56

Here are the time-domain responses:

con 1 14 57

Here is the usual plot…

con 1 14 58

Here’s the unscaled control effort:

con 1 14 59

Here, as usual, is the open loop Bode plot:

con 1 14 60

Even better than the PI control: a not-too-large phase margin.

Now let’s compare things. Here are the two PI controls:

con 1 14 61

The T-L has significantly less wildness than the Z-N: the overshoot is less and the settling time is less.

Here are the two PID controls:

con 1 14 62

The T-L has significantly less overshoot, but it does take a little longer to get close to its final value. That probably corresponds to a phase margin close to 60°.

I shouldn’t even think of putting all 5 controllers on one graph….

con 1 14 63

…though they look better if I shorten the time, except that it is no longer clear that P-only has offset:

con 1 14 64

Recall the colors in order, orange (3) thru dark gray (7): so, orange, brown, and yellow are Z-N P, PI, and PID; light and dark gray are T-L PI and PID.

con 1 14 65

Summary

Here are the Ziegler-Nichols rules:

con 1 14 66

Here are the alternative Tyreus-Luyben rules:

con 1 14 67
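
Since the rule tables are screenshots, here is a small Python transcription using the textbook constants for these rules (I am assuming the screenshots show the same standard values):

def ziegler_nichols(ku, Pu, mode='PID'):
    # classic closed-loop Z-N settings, returned as (kc, tau_I, tau_D)
    if mode == 'P':
        return 0.5*ku, float('inf'), 0.0
    if mode == 'PI':
        return 0.45*ku, Pu/1.2, 0.0
    return 0.6*ku, Pu/2.0, Pu/8.0 # PID

def tyreus_luyben(ku, Pu, mode='PID'):
    # T-L settings; deliberately more conservative than Z-N
    if mode == 'PI':
        return ku/3.2, 2.2*Pu, 0.0
    return ku/2.2, 2.2*Pu, Pu/6.3 # PID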

These rules require that we determine the ultimate gain and the ultimate period… and we saw how to get them from the open loop Bode plot, specifically from the gain margin and the frequency at which it occurs.

We saw the effect of these rules on a particular plant, and we also checked the phase margin for each choice of control method and parameters.

As you might expect, I would tend to go with the Tyreus-Luyben rules.


05 Apr 15:10

Control Theory – Example 3: PI, PD, and PID

by rip

There’s something I forgot to do when I looked at PID tuning: I meant to look at the Bode plots for the controllers, literally to see what they do individually. In addition, I’m going to mention real derivative control; I’ve long known that the simple PID includes “ideal” derivative control, but I only just realized how unrealistic it is.

It’s probably just as well that this end up in a separate post.

Let’s just dive in to Bode plots.

PID with Ziegler-Nichols rules

Recall our plant… first the definition, then the resulting transfer function.

con 1 21 1

We found, by looking at the gain margin for the open loop Bode plot of the plant, that the ultimate gain and ultimate period were 10 and 2 \pi\ :

con 1 21 2

Ah, I need to say – and to add a comment to the previous post – that my lack of understanding is not complete. If we accept the dictum that the gain margin tells us how much the gain can be multiplied before the system goes unstable, then to that extent, it makes sense that the gain margin tells us the ultimate gain. My lack of understanding then focuses on the question: why is the gain margin computed at a phase of -180° ? I’m not setting the phase to that when I hit the system with a unit step. Perhaps one argues that the unit step includes a component at a phase of 180°…. I’m sure I’ll understand it someday.

Anyway, we know in practice that the ultimate gain and ultimate period are found from the gain margin of the open loop Bode plot of the plant.

With PID control, we redefine K… and get the general open loop transfer function for our given plant:

con 1 21 3

The Z-N settings for PID control are:

con 1 21 4

The open loop for plant and controller becomes

con 1 21 5

Here’s the plant alone:

con 1 21 6

I also want the complete PID controller by itself. Incidentally, the last thing I did was factor the controller.

con 1 21 7

I should emphasize something that I never noticed before: the PID controller is improper, because the numerator is of higher degree (2) than the denominator (1). The problem is the “ideal” derivative term… sometime soon I will have to show you “real” derivative transfer functions (there’s more than one).
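(A preview of the standard fix, which is general practice rather than anything specific to this example: “real” derivative control rolls off the ideal derivative with a first-order filter, replacing \tau_D s\ by \frac{\tau_D s}{\alpha \tau_D s + 1}\ for some small \alpha\ , typically around .1 . The numerator and denominator then have the same degree, so the term is semi-proper, hence realizable, and at low frequencies it still looks like \tau_D s\ .)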

Here’s the open loop Bode plot for the uncontrolled plant…

con 1 21 8

We’ve seen it before, of course: the {1,10} told us the gain margin was 10 – the ultimate gain – at a frequency of 1 – hence a period of 2 \pi\ / 1 = 2 \pi\ – the ultimate period.

The open loop Bode plot for the fully controlled plant is – again…

con 1 21 9

We’ve seen this before, too: the Z-N rules lead to a phase margin of 29°, which is right on the edge of acceptability (30°).

The plant G and the controller K are in series, so we may look at all three (G, K, GK)on one open loop Bode plot. Blue is the plant G, red is the PID controller K, and gold is the controlled plant G K. We see that the controller has a minimum, and increases the amplitude plot everywhere else. Looking at the gold curve versus the blue curve in the amplitude plot, we see that at low frequencies, the controller causes the amplitude to rise; at high frequencies it causes the amplitude to fall off less rapidly.

con 1 21 10

Now we could get the P-only part of the controller K (but I won’t actually do anything with it):

con 1 21 11

We saw in an earlier post how to break down a transfer function as a product of terms. Unfortunately, a PID controller is not a product of terms in that sense, but a sum; we cannot write it as a product of P, PI, PD. (That was why I factored it earlier; just to show that it wasn’t a product of its components.)

Maybe, just for completeness, I should extract the ID part of the controller – at least, as far as possible; I can’t actually do it because ID without P would cancel the s in the numerator with the s in the denominator. That is, I can factor out the “6”, but I can’t completely eliminate the P term; all I can do is set P = 1 here (by setting ku = 1/.6). Here’s what I get if I do that:

con 1 21 12

And that’s what I want to do for PI and PD: set their P terms to 1…

Here’s the PI part of the controller, using the ZNPID settings but P = 1:

con 1 21 13

And here’s the PD part:

con 1 21 14

Incidentally, looking at the PI and PD controllers, we see that the PD controller itself is improper, while the PI controller is semi-proper (same degree in numerator and denominator). This is why I said earlier that the problem was in the “ideal” derivative control.

Let’s look at the PI and PD controllers that we extracted:

con 1 21 15

The blue curve is the PD controller: it increases the amplitude at high frequencies; we might say it reacts to noise. The red curve is the PI controller; it increases the amplitude at low frequencies.

Where are the break point frequencies? Look for the zeroes in these cases.

con 1 21 16

So, the break points are at .32 and 1.27 . The region between them is relatively unaffected. This tells us that the break point frequency for PD should be higher than the break point frequency for PI. (We’ll come back to this.)
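Those two numbers are just the inverses of the Z-N time constants; here is a quick check (a Julia sketch, not part of the original notebook):

Pu   = 2pi       # ultimate period for this plant
tauI = Pu/2      # Z-N PID integral time
tauD = Pu/8      # Z-N PID derivative time
println(1/tauI)  # 0.3183... the PI breakpoint
println(1/tauD)  # 1.2732... the PD breakpoint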

Now let us add the PID controller to them:

con 1 21 17

I want to spread out the break points of the two controllers. Let’s put the PI breakpoint at .1 and the PD break point at 10. Note that I have to invert both numbers: \tau_D\ -> .1 and \tau_I\ -> 10. (Of course I have to invert: \omega = \frac{1}{|\tau|}\ .) Here’s the resulting PID controller:

con 1 21 18

We see that there was some interaction… the PI breakpoint isn’t exactly at .1, and the PD breakpoint isn’t exactly at 10. Here’s the Bode plot for the resulting PID controller:

con 1 21 19

Now that we have been reminded that the breakpoints are approximate inverses of the time constants \tau_D\ and \tau_I\ , we can recall the Z-N PID rules:

con 1 21 20

Yes, they show \tau_D\ smaller than \tau_I\ , by a factor of 4… but that means the breakpoint from the PD portion is larger than the breakpoint for the PI portion.

I should also look at Tyreus-Luyben rules.

Recall our plant with PID control:

con 1 21 21

The T-L settings for PID control are:

con 1 21 22

We see that \tau_I\ is bigger than \tau_D\ . This is in line with what we learned looking at the Z-N rules: we end up with a smaller breakpoint frequency for PI than for PD.

The open loop for plant and controller becomes, with T-L PID settings:

con 1 21 23

The PI part of it… but I use the T-L PID settings – and divide by .454545 rather than by .6 to get P = 1:

con 1 21 24

And here’s the PD part of it:

con 1 21 25

I also want the PID controller by itself (i.e. with P = 1):

con 1 21 26

Here’s the open loop Bode plot for the controlled plant:

con 1 21 27

We’ve seen this before, too: the T-L rules lead to a phase margin of 54°.

Let’s look at the PI and PD controllers alone:

con 1 21 28

The blue curve is the PD controller: it increases the amplitude at high frequencies; the red curve is the PI controller: it increases the amplitude at low frequencies.

We see immediately that the breakpoint frequencies have been separated more than by the Z-N rules.

Exactly where are the break point frequencies?

con 1 21 29

So, the break points are at .07 and 1. The region between them is not affected.

Now let’s add the PID controller to the PD and PI curves:

con 1 21 30

As before, blue is PD and red is PI, and gold is PID.

Just for the fun of it, let’s see what happens if we interchange the values of \tau_I\ and \tau_D\ , using our values of 10 and .1 . Here’s the PD transfer function, which has a zero at -.1 .

con 1 21 31

Now set only \tau_I\ -> .1 … and we get a zero at -10 …

con 1 21 32

Here are the two of them together:

con 1 21 33

And here’s the corresponding PID controller…

con 1 21 34

And here’s the open loop Bode plot of that PID:

con 1 21 35

I’d say that’s a rather striking difference.

Let’s look at all three.

con 1 21 36

Do you begin to believe that the PID is not the product of the PI and the PD? I suppose I could simply compute the product… well, with transfer functions, I connect them in series…

con 1 21 37

And here’s the product PI PD overlaid on the actual PID:

con 1 21 38

OK? PID ≠ PI PD in series.

Summary

We’ve seen what a PI and a PD controller look like…

We’ve discovered that we want \tau_I\ > \tau_D\ in a PID controller.

We’ve discovered that although we can factor out the P ≠ 1 term from a PID controller, we cannot then factor the remainder into PI PD or into I D.

We’ve discovered that ideal derivative control leads to an improper – physically unrealizable – controller, in both PD and PID. We should look at real derivative control – even though the overwhelming majority of chemical process analyses use ideal derivative control!

But we’ll not look at that today.


05 Apr 15:10

The Economic Order Quantity – a simple calculus application

by rip

The EOQ (Economic Order Quantity) formula is a deceptively simple model. It comes from Zipkin’s “Foundations of Inventory Management” (Irwin/McGraw-Hill, 2000, 0-256-11379-3) and it is the very first model in the book. It was first published 100 years ago, in 1913 – the model, not the book!

When all is said and done, it’s a simple application of freshman calculus.

Imagine that we sell or use up one product, at a known constant rate \lambda\ . Periodically, we order more of this product, to replenish our inventory I(t). Further, there is a known constant lead time L – between when we place an order and when we receive it (actually, when we can sell or use it, so this includes unloading and storing). If our inventory will go to zero at t = T, then, at the very latest, we must place an order at T – L:

EOQ 1

Since we assume that the demand \lambda\ and the lead time L are constants, we can safely place our order at exactly T – L. A more realistic model would allow for uncertainty in \lambda\ and L. Oh, of course, I am assuming that running out of inventory is not acceptable: we want to always meet demand.

This means that if we order q, our inventory jumps from 0 to q upon delivery of our order.

Now, there are a lot of caveats for this model. For example… what if we’re only open for 8 hours a day, 5 days a week? Then this model would require us to count only the time we’re open, to assume that lead time does not include the weekend days, and that deliveries occur while we’re open…. But let me stay with this first model.

In addition to inventory on hand at time t, I(t), we also define inventory on order IO(t) and inventory position IP(t) as their sum. That is, we have

\lambda\ a constant rate of demand (quantity/time)
L a constant lead time between placing an order and receiving it
q order quantity
I(t) inventory at time t
IO(t) inventory on order at time t
IP(t) inventory position at time t, where

IP(t) = I(t) + IO(t).

What we want to know is how much to order and how often to order it. Yes, those are related questions. The key is that if we place an order for q once during the time required to completely deplete our inventory, then we are placing orders at the rate

OF = \frac{1}{\frac{q}{\lambda}} = \frac{\lambda}{q}\ .

(The time between orders, one cycle, is \frac{q}{\lambda}\ , so the frequency of orders (the order frequency OF) is the inverse.)

During that cycle, the average inventory is \bar{I} = q/2\ .
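(That is the sawtooth at work: over one cycle of length T = \frac{q}{\lambda}\ , inventory falls linearly from q to 0, so

\bar{I} = \frac{1}{T}\int_0^T (q - \lambda t)\,dt = q - \frac{\lambda T}{2} = \frac{q}{2}\ .)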

Now, what’s it cost to do this? We can break the costs into three different kinds:

k fixed cost to place an order
c variable cost to place an order
h cost to hold one unit of inventory for one unit of time.

(So the cost of inventory I(t) is h I(t), and the average cost of inventory in one cycle is h q/2.)

The total cost of an order q placed every cycle, then, is

C(q) = order cost + holding cost

C(q) = (k + c\ q) OF + h\  \bar{I}\

C(q) = (k + c\ q) \frac{\lambda}{q} + h \ \frac{q}{2}\

hence the order cost, the holding cost, and the total cost are:

EOQ 2

It’s crucial that the holding cost increases with q, while the order cost decreases with q.

I need some numbers.

Zipkin’s numerical example is of paper for the copy machine(s):

usage (demand) is 8 boxes/week at $25 per box.
handling (on our end) and shipping is $80 + $50.
the shipment takes a week to arrive.
keeping inventory organized is $1.10 per box per week.
Financing is 15% per year.
We order 2 weeks worth of supplies (16 boxes) with each order.

(That finance charge is applied to the cost of each box. At least, that’s how I think of it.)

That is,

\lambda\ = 8
k = 80 + 50
L = 1
c = 25
q = 16
h = 1.10 + 25(.15/52)

and that in turn is (omitting q):

EOQ 3

With these numbers our average order cost, holding cost, and total cost become – as functions of q

EOQ 4

We can plot them over an indicative range, q goes from 10 to 100:

EOQ 5

The red line is the linear holding cost; the black line is the always-decreasing order cost; the green-gold line is their sum.

The sum has a minimum value. Before we compute it, however, let’s see what our cost is when we order q = 16, as is our current practice:

EOQ 6

Now let’s do some freshman calculus. What value of q provides the minimum cost? We take the derivative of C(q) with respect to q, set it to zero, and solve for q:

EOQ 7
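(In symbols, since the c \lambda\ term is constant in q:

C'(q) = -\frac{k \lambda}{q^2} + \frac{h}{2} = 0\ , \quad \text{so} \quad q = \pm \sqrt{\frac{2 k \lambda}{h}}\ .)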

We want the positive (second) solution… and in our case the optimal q is

EOQ 8

At the optimal q, our total cost is

EOQ 9

and we place an order every \frac{q}{\lambda}\

EOQ 10

That is, every 5 1/4 weeks, we place an order for 42 boxes, and it looks like we save $25 per week – i.e. the cost of one box per week. (I’ll return to the units shortly.)
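Here is the whole computation as a quick numeric check (a sketch in Julia, not the Mathematica behind the screenshots). Note that h includes the financing charge applied to the $25 cost of a box, which is what makes the numbers above come out:

lambda = 8                   # boxes per week
k      = 80 + 50             # fixed cost per order
c      = 25                  # cost per box
h      = 1.10 + c*0.15/52    # holding cost per box per week
C(q)   = (k + c*q)*lambda/q + h*q/2

qstar = sqrt(2k*lambda/h)
println(qstar)               # 42.1... boxes per order
println(qstar/lambda)        # 5.26... weeks between orders
println(C(16))               # 274.4... $/week, our current practice
println(C(qstar))            # 249.4... $/week, saving about $25/week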

While we’re thinking of freshman calculus… even though we have a drawing which shows that we have found a minimum… let’s check the second derivative: for a minimum, the second derivative should be positive at the minimum. We compute the second derivative, and set q to the (positive) value where the first derivative was zero:

EOQ 10a

Yup, the second derivative is positive, at least for these values. But in general? I recall the second derivative… and simplify it:

EOQ 10b

So long as k, \lambda\ , and q are positive, the second derivative is positive: we have a minimum.

Now, we need to think about this whole thing, because we’ve changed the definition of a cycle. Our initial cost was $274 and now our cost is $249. But 16 boxes and 42 boxes cost

EOQ 11

so our computed costs must be per week, not per cycle. (Still more later.)

It’s worth noting that the cost curve is pretty flat in the neighborhood of the optimal q. It doesn’t matter much whether we order 41, 42, or 43 boxes at a time:

EOQ 12

We save $25 per week.

One last look at the units. Here’s the cost:

EOQ 13

Here is the original set of parameters… followed by a set with units attached:

EOQ 14

Now when we look at the cost, we see that units are indeed $/week:

EOQ 15

Note that qq itself is dimensionless: q = qq Box, so q has dimensions of box(es) but qq is just a number.

But I still wonder why “order” does not show up in some of those. In particular, why doesn’t q have units boxes/order? And yet, it works.


05 Apr 15:10

Control Theory: Example 3 – P-only control and offset

by rip

introduction

I want to see exactly how we get offset under P-only control, and how PI control eliminates the offset. I’m going to use Example 3 again… I’m going to look at P and PI control… and I’m going to use the Tyreus-Luyben (“T-L”) tuning rules, which we’ve seen before.

When I started this, I was wondering about two things:

  • Is the control effort nonzero when we have offset?
  • We don’t always have offset under P-only control, do we?

I can’t say I’ve become an expert in offset, but I’m a little more comfortable with it. Now what I’d like to know is why the control effort goes to zero when we do not have offset – but that’s a question still looking for an answer.

Let me tell you up front what we will find:

  • With P-only control, we may have offset: the output will not tend to the set point.
  • In that case, the use of PI-control will eliminate offset.
  • But if our plant (G) has a pole of any order at the origin, then P-only control does not lead to offset.
  • In that case, you could imagine that the plant itself includes integral control.

Let us recall Example 3. This plant is a 3rd-order transfer function… as usual, rules are a convenient way to specify things:

con 2 11 1

The plant G, then, has the following transfer function:

con 2 11 2

I’ve shown you the right way to get the ultimate gain and ultimate period, if you have a quantitative model. Let’s do it again.

We get the open loop Bode plot… start with the open loop transfer function GK:

con 2 11 3

And here’s the plot:

con 2 11 4

The first pair of numbers tells us that the gain margin is 10 and that it occurs at a frequency of 1 radian/second. Given a frequency f in radians/second, we find the period from P f = 2 \pi\ . This crossover frequency is 1, so the corresponding period is

con 2 11 5

That is, the ultimate gain is ku = 10 and the ultimate period is Pu = 2\pi\ . Now save these:

con 2 11 6

Now that we have Pu and ku, there are some rules for setting the parameters of P, PI, and PID controllers. But while I have the default K = 1, let’s see how the uncontrolled plant responds to a unit step.

con 2 11 7

con 2 11 8

If it helps, think of the uncontrolled plant as nevertheless having K = kc = 1: a P-only controller with gain = 1. It is crucial that the plant response is nonzero – but it seems to have a steady-state value of .5 instead of 1. This is offset.

We can check that directly, by taking the limit as t goes to infinity…

con 2 11 9

Or we can use the final-value theorem – but remember that the output is the inverse Laplace transform of y/s, so we use s (y/s), which is just y:

con 2 11 10

Anyway, the plant itself responds… it just doesn’t respond enough. Adding a P-only controller will increase the gain, and will move the long-term response closer to 1.
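(We can quantify “closer to 1” with the final value theorem. Under unity feedback with a P-only gain kc, a unit step input gives

y(\infty) = \lim_{s \rightarrow 0} s \cdot \frac{1}{s} \cdot \frac{kc\ G(s)}{1 + kc\ G(s)} = \frac{kc\ G(0)}{1 + kc\ G(0)}\ .

Our plant has G(0) = 1 – that is how the uncontrolled closed loop settled at 1/(1+1) = .5 – so kc = 5 should settle at 5/6 ≈ .83, and we will see that kc = 9.9 settles at 9.9/10.9 ≈ .91. No finite kc gets us all the way to 1; that is the offset.)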

P-only control

So let’s start by inserting a PID controller:

con 2 11 11

To get P-only control, we set \tau_D \rightarrow 0\ and \tau_I \rightarrow \infty\ . I have a rule for that:

con 2 11 12

The open loop transfer function, then, is…

con 2 11 13

Here are the closed loop transfer functions for output and for control…

con 2 11 14

The T-L rule for P-only control is to set kc = .5 ku – the same as the Z-N rule. (Parts of the following rule are redundant, because I’ve already set \tau_D \text{ and } \tau_I\ , but I like having a complete rule anyway.)

con 2 11 15

So the closed loop output becomes… and we know the ultimate gain ku:

con 2 11 16

and the closed loop for control becomes

con 2 11 17

We get the time-domain responses to a unit step input by taking inverse Laplace transforms of y/s and of u/s…

con 2 11 18

And we plot the input (step function in “silver” or whatever it is), output (black), and scaled control effort (gold):

con 2 11 19

Not only does it take a long time to settle, but there is offset. As for the long settling time… K = 1 got there quickly, K = 10 was on the edge of instability, so K = 5 should have decaying oscillations. OK?

For the record, here’s the unscaled control effort (gold):

con 2 11 20

Let me look at the long-time response – pick it up at t = 30 (and note the numbers on the vertical axis):

con 2 11 21

Isn’t it interesting that the steady-state control effort is not zero? The controller must be active in order to hold the output away from zero. I infer that if we were to turn off the control (that is, set kc = 1), the system would return to 0.5, its long-term uncontrolled value.

Let’s see exactly the long-term values of output…

con 2 11 22

They’re the same. We’re driving a linear constant-coefficient differential equation with a constant forcing function; because the forcing function = 1, the output is the same constant. We’ve also just seen one very good reason for not scaling the control effort: I missed that it’s equal to the output.

This is the first clue as to how offset can occur: the control effort is not strong enough to push the system to an output of 1.

Let’s look even closer. We’ve seen that the output has offset: we are not at “1”. We’ve seen that the controller does not tend to zero (“off”) in the long term. So just what does the error look like?

Well, the error is input (“reference” or set point) minus output: r – y.

con 2 11 23

Hmm. The error is very small, but nonzero.

Well, if the control effort is to be nonzero, then the error must be nonzero. Right? Under P-only control, we have that the controller has kc = 5:

con 2 11 24

and the time-domain formulation is that the control effort is proportional to the error, with a proportionality constant of 5… so let’s compute 5 e:

con 2 11 25

con 2 11 26

We need to see that over the entire range:

con 2 11 27

Compare that to a plot of the unscaled control effort c1:

con 2 11 20

The same.

OK…. both the error and the control effort tend to nonzero values over time. And the control effort is enough to hold the output significantly away from zero, and actually closer to 1 than without control – but not enough to hold it at 1.

But there is a second issue involved. The ultimate gain for this system is 10 – which means that if we set the controller gain above 10, then the closed system will be unstable.

What happens if we raise the controller gain to 9.9?

P with large gain

So let’s just set K = 9.9

con 2 11 28

The open loop transfer function is…

con 2 11 29

Here are the closed loop transfer functions for output and for control…

con 2 11 30

We get the time-domain responses to a unit step input by taking inverse Laplace transforms of y/s and of u/s…

con 2 11 31

And we plot the input (step function in “silver”) and output (black):

con 2 11 32

Well that’s not very informative. Eventually we settle down at a value just over 0.9:

con 2 11 33

So let’s go way out – start time at 2000:

con 2 11 34

So we did get closer, but, boy, does it take a long time.

Oh, what’s the control effort? As before, it’s the output value, in this case just over 0.9.

con 2 11 35

Anyway, the real point – the second issue – is that we can’t raise the gain enough to eliminate offset.

PI-control

Now let’s add integral control. Oh, I need to restore my PID controller:

con 2 11 36

To get PI control, we set \tau_D \rightarrow 0\ .

con 2 11 37

The open loop transfer function is…

con 2 11 38

Here are the closed loops for output and for control…

con 2 11 39

The T-L rules for PI control are

con 2 11 40

so the closed loop output becomes… and we use the ultimate gain ku and ultimate period Pu…

con 2 11 41

The closed loop for control becomes

con 2 11 42

The time-domain responses are…

con 2 11 43

Here is the usual plot showing input, output (black), and scaled controller effort (gold):

con 2 11 44

It settles faster, and the offset is gone.

After all, that’s one of the things integral control does for us: it eliminates offset. I should point out that we don’t always care about offset: level control for surge tanks is the best example, where we do not care to hold the level at exactly 50%. In fact, to do so would defeat the purpose of a surge tank, and we might as well replace it by a pipe – where what goes out is exactly what comes in.

Here’s the unscaled control effort:

con 2 11 45

For the record, we find the long-term limits: both the output and the control effort tend to 1:

con 2 11 46

As before, let’s look a little closer. With a PI controller, there are two error terms. One is the error itself, and that gives us the P-control. The other, however, is the integrated error, and that’s what gives us the I-control.

OK… The error is e = r – y, i.e.

con 2 11 47

(I will need a variable of integration, and I don’t want it to be t, because t will be the upper limit on the definite integral. So I need to be able to set the variable for e.)

Since the output tends to 1, we know that the error tends to zero:

con 2 11 48

What about the integrated error? We integrate from 0 to t, using \tau\ as the variable of integration:

con 2 11 49

(The only difference in those two expressions, I think, is that +0.i disappeared as a result of the Chop command.)

As t goes to infinity, the integrated error tends to a constant positive value:

con 2 11 50

Now, what does the I-term do in the controller? In the time domain, the controller is

kc (e + ie/\tau_I)\ , where ie is the integrated error,

i.e. the I-term is… and tends to 1 as time tends to infinity…

con 2 11 51

Despite all the terms, the key is the two constants at the beginning:

con 2 11 52

That’s why the integrated error tends to 4-point-whatever: multiplied by the two constants at the beginning, kc/\tau_I\ , it gives exactly the long-term control effort of 1.

Two examples without offset

First, forget that we have PI-control on that plant… imagine that we have our latest closed-loop transfer function, and that it represents an uncontrolled system. What is the G that leads to such a closed-loop transfer function?

That is, we find G such that G / (1+G) = y:

con 2 11 53

Note that G1 has a pole of order 1 at the origin (i.e. s in the denominator).

Given the warning message, let me check that answer. Use the given open-loop G1 and compute its closed-loop transfer function… and compare it to the one we started with:

con 2 11 54

And what we see is that if we had started with a plant G that had a pole at zero, then P-only control would have worked.

It’s an interesting game, and I’m sure I’ll be playing it again: usually, set the closed-loop response and the system G, and find the controller K to get the desired closed-loop.

Let me try something simpler. Let’s just take 1/s for the system, and compute the closed-loop transfer function:

con 2 11 55

Now get the output and control effort in response to a unit step input:

con 2 11 56

Interesting. Very interesting: the control effort goes to zero… as the error goes to zero… and we do not have offset.

con 2 11 57
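(This one we can do by hand. With G = 1/s and K = 1, the closed loop for output is

\frac{y}{r} = \frac{1/s}{1 + 1/s} = \frac{1}{s+1}\ ,

so the unit step response is y(t) = 1 - e^{-t} \rightarrow 1\ : no offset. The control effort is just the error, u = r - y = e^{-t} \rightarrow 0\ . The plant’s pole at the origin does the integrating that we would otherwise need the controller to do.)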

So. As I said at the beginning:

  • With P-only control, we may have offset: the output will not tend to the set point.
  • In that case, the use of PI-control will eliminate offset.
  • But if our plant (G) has a pole of any order at the origin, then P-only control does not lead to offset.
  • In that case, you could imagine that the plant itself includes integral control.

Let me be blunt. It would be wrong to say that P-only control always leads to offset.

05 Apr 15:09

A C++ Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics; Hervé Frezza-Buet, Matthieu Geist; 14(Feb):625--628, 2013.

This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good fit between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example.
05 Apr 14:25

Efficient Aggregates in Julia

We recently introduced an exciting feature that has been in planning for some time: immutable aggregate types. In fact, we have been planning to do this for so long that this feature is the subject of our issue #13 on GitHub, out of more than 2400 total issues so far.

Essentially, this feature drastically reduces the overhead of user-defined types that represent small number-like values, or that wrap a small number of other objects. Consider an RGB pixel type:

immutable Pixel
    r::Uint8
    g::Uint8
    b::Uint8
end

Instances of this type can now be packed efficiently into arrays, using exactly 3 bytes per object. In all other respects, these objects continue to act like normal first-class objects. To see how we might use this, here is a function that converts an RGB image in standard 24-bit framebuffer format to grayscale:

function rgb2gray!(img::Array{Pixel})
    for i=1:length(img)
        p = img[i]
        v = uint8(0.30*p.r + 0.59*p.g + 0.11*p.b)
        img[i] = Pixel(v,v,v)
    end
end

This code will run blazing fast, performing no memory allocation. We have not done thorough benchmarking, but this is in fact likely to be the fastest way to write this function in Julia from now on.
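As a quick usage sketch (my own example, with made-up pixel values):

img = [Pixel(0x10, 0x80, 0xff) for i = 1:100]  # 100 pixels packed into 300 bytes
rgb2gray!(img)                                 # converts in place, allocation-free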

The key to this behavior is the new immutable keyword, which means instances of the type cannot be modified. At first this sounds like a mere restriction — how come I’m not allowed to modify one? — but what it really means is that the object is identified with its contents, rather than its memory address. A mutable object has “behavior”; it changes over time, and there may be many references to the object, all of which can observe those changes. An immutable object, on the other hand, has only a value, and no time-varying behavior. Its location does not matter. It is “just some bits”.

Julia has always had some immutable values, in the form of bits types, which are used to represent fixed-bit-width numbers. It is highly intuitive that numbers are immutable. If x equals 2, you might later change the value of x, but it is understood that the value of 2 itself does not change. The immutable keyword generalizes this idea to structured data types with named fields. Julia variables and containers, including arrays, are all still mutable. While a Pixel object itself can’t change, a new Pixel can be written over an old one within an array, since the array is mutable.
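Concretely, a small sketch of the distinction:

a = [Pixel(1, 2, 3), Pixel(4, 5, 6)]
a[1] = Pixel(7, 8, 9)   # fine: the array slot is mutable
a[1].r = 0              # error: a Pixel's fields cannot be reassigned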

Let’s take a look at the benefits of this feature.

  1. The compiler and GC have a lot of freedom to move and copy these objects around. This flexibility can be used to store data more efficiently, for example keeping the real and imaginary parts of a complex number in separate registers, or keeping only one part in a register.

  2. Immutable objects are easy to reason about. Some languages, such as C++ and C#, provide “value types”, which have many of the benefits of immutable objects. However, their behavior can be confusing. Consider code like the following:

    item = lookup(collection, index)
    modify!(item)

    The question here is whether we have modified the same item that is in the collection, or if we have modified a local copy. In Julia there are only two possibilities: either item is mutable, in which case we modified the one and only copy of it, or it is immutable, in which case modifying it is not allowed.

  3. No-overhead data abstractions become possible. It is often useful to define a new type that simply wraps a single value, and modifies its behavior in some way. Our favorite modular integer example type fits this description:

    immutable ModInt{n} <: Integer
        k::Int
        ModInt(k) = new(mod(k,n))
    end

    Since a given ModInt doesn’t need to exist at a particular address, it can be passed to functions, stored in arrays, and so on, as efficiently as a single Int, with no wrapping overhead. But, in Julia, the overhead will not always be zero. The ModInt type information will “follow the data around” at compile time to the extent possible, but heap-allocated wrappers will be added as needed at run time. Typically these wrappers will be short-lived; if the final destination of a ModInt is in a ModInt array, for example, the wrapper can be discarded when the value is assigned. But if the value is only used locally inside a function, there will most likely be no wrappers at all.

  4. Abstractions are fully enforced. If a custom constructor is written for an immutable type, then all instances will be created by it. Since the constructed objects are never modified, the invariants provided by the constructor cannot be violated. At this time, uninitialized arrays are an exception to this rule. New arrays of “plain data” immutable types have unspecified contents, so it is possible to obtain an invalid value from one. This is usually harmless in practice, since arrays must be initialized anyway, and are often created through functions like zeros that do so.

  5. We can automatically type-specialize fields. Since field values at construction time are final, their types are too, so we learn everything about the type of an immutable object when it is constructed.

There are many potential optimizations here, and we have not implemented all of them yet. But having this feature in place provides another lever to help us improve performance over time.

For now though, we at least have a much simpler implementation of complex numbers, and will be able to take advantage of efficient rational matrices and other similar niceties.

Addendum: Under the hood

For purposes of calling C and writing reflective code, it helps to know a bit about how immutable types are implemented. Before this change, we had types AbstractKind, BitsKind, and CompositeKind, for separating which types are abstract, which are represented by immutable bit strings, and which are mutable aggregates. It was sometimes convenient that the type system reflected these differences, but also a bit unwarranted since all these types participate in the same hierarchy and follow the same subtyping rules.

Now, the type landscape is both simpler and more complex. The three Kinds have been merged into a single kind called DataType. The type of every value in Julia is now either a DataType, or else a tuple type (union types still exist, but of course are always abstract). To find out the details of a DataType’s physical representation, you must query its properties. DataTypes have three boolean properties abstract, mutable, and pointerfree, and an integer property size. The CompositeKind properties names and types are still there to describe fields.

The abstract property indicates that the type was declared with the abstract keyword and has no direct instances. mutable indicates, for concrete types, whether instances are mutable. pointerfree means that instances contain “just data” and no references to other Julia values. size gives the size of an instance in bytes.

What used to be BitsKinds are now DataTypes that are immutable, concrete, have no fields, and have non-zero size. The former CompositeKinds are mutable and concrete, and either have fields or are zero size if they have zero fields. Clearly, new combinations are now possible. We have already mentioned immutable types with fields. We could have the equivalent of mutable BitsKinds, but this combination is not exposed in the language, since it is easily emulated using mutable fields. Another new combination is abstract types with fields, which would allow you to declare that all subtypes of some abstract type should have certain fields. That one is definitely useful, and we plan to provide syntax for it.

Typically, the only time you need to worry about these things is when calling native code, when you want to know whether some array or struct has C-compatible data layout. This is handled by the type predicate isbits(T).

05 Apr 14:24

Videos from the Julia tutorial at MIT

We held a two day Julia tutorial at MIT in January 2013, which included 10 sessions. MIT Open Courseware and MIT-X graciously provided support for recording of these lectures, so that the wider Julia community can benefit from these sessions.

Julia Lightning Round (slides)

This session is a rapid introduction to julia, presented as a series of lightning rounds. It uses short examples to demonstrate syntax and features, and gives a quick feel for the language.

Rationale behind Julia and the Vision (slides)

The rationale and vision behind julia, and its design principles are discussed in this session.

Data Analysis with DataFrames (slides)

DataFrames is one of the most widely used Julia packages. This session is an introduction to data analysis with Julia using DataFrames.

Statistical Models in Julia (slides)

This session demonstrates Julia’s statistics capabilities, which are provided by these packages: Distributions, GLM, and LM.

Fast Fourier Transforms

Julia provides a built-in interface to the FFTW library. This session demonstrates Julia’s signal processing capabilities, such as FFTs and DCTs. Also see the Hadamard package.

Optimization (slides)

This session focuses largely on using Julia for solving linear programming problems. The algebraic modeling language discussed was later released as JuMP. Benchmarks are shown evaluating the performance of Julia for implementing low-level optimization code. Optimization software in Julia has been grouped under the JuliaOpt project.

Metaprogramming and Macros

Julia is homoiconic: it represents its own code as a data structure of the language itself. Since code is represented by objects that can be created and manipulated from within the language, it is possible for a program to transform and generate its own code. Metaprogramming is described in detail in the Julia manual.

Parallel and Distributed Computing (Lab, Solution)

Parallel and distributed computing have been an integral part of Julia’s capabilities from an early stage. This session describes existing basic capabilities, which can be used as building blocks for higher level parallel libraries.

Networking

Julia provides asynchronous networking I/O using the libuv library. Libuv is a portable networking library created as part of the Node.js project.

Grid of Resistors (Lab, Solution)

The Grid of Resistors is a classic numerical problem to compute the voltages and the effective resistance of a 2n+1 by 2n+2 grid of 1 ohm resistors if a battery is connected to the two center points. As part of this lab, the problem is solved in Julia in a number of different ways such as a vectorized implementation, a devectorized implementation, and using comprehensions, in order to study the performance characteristics of various methods.

05 Apr 14:24

Distributed Numerical Optimization

This post walks through the parallel computing functionality of Julia to implement an asynchronous parallel version of the classical cutting-plane algorithm for convex (nonsmooth) optimization, demonstrating the complete workflow including running on both Amazon EC2 and a large multicore server. I will quickly review the cutting-plane algorithm and will be focusing primarily on parallel computation patterns, so don’t worry if you’re not familiar with the optimization side of things.

Cutting-plane algorithm

The cutting-plane algorithm is a method for solving the optimization problem

where the functions \( f_i \) are convex but not necessarily differentiable. The absolute value function \( |x| \) and the 1-norm \( \|x\|_1 \) are typical examples. Important applications also arise from Lagrangian relaxation. The idea of the algorithm is to approximate the functions \( f_i \) with piecewise linear models \( m_i \) which are built up from information obtained by evaluating \( f_i \) at different points. We iteratively minimize over the models to generate candidate solution points.

We can state the algorithm as

  1. Choose starting point \( x \).
  2. For \(i = 1,\ldots,n\), evaluate \( f_i(x) \) and update corresponding model \( m_i \).
  3. Let the next candidate \( x \) be the minimizer of \( \sum_{i=1}^n m_i(x) \).
  4. If not converged, goto step 2.

If it is costly to evaluate \( f_i(x) \), then the algorithm is naturally parallelizable at step 2. The minimization in step 3 can be computed by solving a linear optimization problem, which is usually very fast. (Let me point out here that Julia has interfaces to linear programming and other optimization solvers under JuliaOpt.)

Abstracting the math, we can write the algorithm using the following Julia code.

# functions initialize, isconverged, solvesubproblem, and process implemented elsewhere
state, subproblems = initialize()
while !isconverged(state)
    results = map(solvesubproblem,subproblems)
    state, subproblems = process(state, results)
end

The function solvesubproblem corresponds to evaluating \( f_i(x) \) for a given \( i \) and \( x \) (the elements of subproblems could be tuples (i,x)). The function process corresponds to minimizing the model in step 3, and it produces a new state and a new set of subproblems to solve.

Note that the algorithm looks much like a map-reduce that would be easy to parallelize using many existing frameworks. Indeed, in Julia we can simply replace map with pmap (parallel map). Let’s consider a twist that makes the parallelism not so straightforward.

Asynchronous variant

Variability in the time taken by the solvesubproblem function can lead to load imbalance and limit parallel efficiency as workers sit idle waiting for new tasks. Such variability arises naturally if solvesubproblem itself requires solving an optimization problem, or if the workers and network are shared, as is often the case with cloud computing.

We can consider a new variant of the cutting-plane algorithm to address this issue. The key point is

  • When proportion \(0 < \alpha \le 1 \) of subproblems for a given candidate have been solved, generate a new candidate and corresponding set of subproblems by using whatever information is presently available.

In other words, we generate new tasks to feed to workers without needing to wait for all current tasks to complete, making the algorithm asynchronous. The algorithm remains convergent, although the total number of iterations may increase. For more details, see this paper by Jeff Linderoth and Stephen Wright.

By introducing asynchronicity we can no longer use a nice black-box pmap function and have to dig deeper into the parallel implementation. Fortunately, this is easy to do in Julia.

Parallel implementation in Julia

Julia implements distributed-memory parallelism based on one-sided message passing, where processes push work onto others (via remotecall) and the results are retrieved (via fetch) by the process which requires them. Macros such as @spawn and @parallel provide pretty syntax around this low-level functionality. This model of parallelism is very different from the typical SIMD style of MPI. Both approaches are useful in different contexts, and I expect an MPI wrapper for Julia will appear in the future (see also here).
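As a minimal sketch of these primitives (using the process-id-first argument order of the time, the same order as the remotecall_fetch calls below):

r = remotecall(2, sqrt, 2.0)  # start sqrt(2.0) on process 2; returns a RemoteRef
fetch(r)                      # block until the result is ready: 1.4142...
s = @spawn sqrt(3.0)          # same idea, but the scheduler picks the process
fetch(s)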

Reading the manual on parallel computing is highly recommended, and I won’t try to reproduce it in this post. Instead, we’ll dig into and extend one of the examples it presents.

The implementation of pmap in Julia is

function pmap(f, lst)
    np = nprocs()  # determine the number of processors available
    n = length(lst)
    results = cell(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    next_idx() = (idx=i; i+=1; idx)
    @sync begin
        for p=1:np
            if p != myid() || np == 1
                @spawnlocal begin
                    while true
                        idx = next_idx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(p, f, lst[idx])
                    end
                end
            end
        end
    end
    results
end

On first sight, this code is not particularly intuitive. The @spawnlocal macro creates a task on the master process (i.e. process 1). Each task feeds work to a corresponding worker; the call remotecall_fetch(p, f, lst[idx]) runs f on process p and returns the result when finished. Tasks are uninterruptible and only surrender control at specific points such as remotecall_fetch. Tasks cannot directly modify variables from the enclosing scope, but the same effect can be achieved by using the next_idx function to access and mutate i. The task idiom functions in place of using a loop to poll for results from each worker process.

Implementing our asynchronous algorithm is not much more than a modification of the above code:

# given constants n and 0 < alpha <= 1
# functions initialize and solvesubproblem defined elsewhere
np = nprocs()
state, subproblems = initialize()
converged = false
isconverged() = converged
function updatemodel(mysubproblem, result)
    # store result
    ...
    # decide whether to generate new subproblems
    state.numback[mysubproblem.parent] += 1
    if state.numback[mysubproblem.parent] >= alpha*n && !state.didtrigger[mysubproblem.parent]
        state.didtrigger[mysubproblem.parent] = true
        # generate newsubproblems by solving linear optimization problem
        ...
        if ... # convergence test
            converged = true
        else
            append!(subproblems, newsubproblems)
            push!(state.didtrigger, false)
            push!(state.numback, 0)
            # ensure that for s in newsubproblems, s.parent == length(state.numback)
        end
    end
end

@sync begin
    for p=1:np
        if p != myid() || np == 1
            @spawnlocal begin
                while !isconverged()
                    if length(subproblems) == 0
                        # no more subproblems but haven't converged yet
                        yield()
                        continue
                    end
                    mysubproblem = shift!(subproblems) # pop subproblem from queue
                    result = remotecall_fetch(p, solvesubproblem, mysubproblem)
                    updatemodel(mysubproblem, result)
                end
            end
        end
    end
end

where state is an instance of a type defined as

type State
    didtrigger::Vector{Bool}
    numback::Vector{Int}
    ...
end

There is little difference in the structure of the code inside the @sync blocks, and the asynchronous logic is encapsulated in the local updatemodel function which conditionally generates new subproblems. A strength of Julia is that functions like pmap are implemented in Julia itself, so that it is particularly straightforward to make modifications like this.

Running it

Now for the fun part. The complete cutting-plane algorithm (along with additional variants) is implemented in JuliaBenders. The code is specialized for stochastic programming where the cutting-plane algorithm is known as the L-shaped method or Benders decomposition and is used to decompose the solution of large linear optimization problems. Here, solvesubproblem entails solving a relatively small linear optimization problem. Test instances are taken from the previously mentioned paper.

We’ll first run on a large multicore server. The runals.jl (asynchronous L-shaped) file contains the algorithm we’ll use. Its usage is

julia runals.jl [data source] [num subproblems] [async param] [block size]

where [num subproblems] is the \(n\) as above and [async param] is the proportion \(\alpha\). By setting \(\alpha = 1\) we obtain the synchronous algorithm. For the asynchronous version we will take \(\alpha = 0.6\). The [block size] parameter controls how many subproblems are sent to a worker at once (in the previous code, this value was always 1). We will use 4000 subproblems in our experiments.

To run multiple Julia processes on a shared-memory machine, we pass the -p N option to the julia executable, which will start up N system processes. To execute the asynchronous version with 10 workers, we run

julia -p 12 runals.jl Data/storm 4000 0.6 30

Note that we start 12 processes. These are the 10 workers, the master (which distributes tasks), and another process to perform the master’s computations (an additional refinement which was not described above). Results from various runs are presented in the table below.

No. Workers   Sync Speed   Sync Efficiency   Async Speed   Async Efficiency
10            154          Baseline          166           Baseline
20            309          100.3%            348           105%
40            517          84%               654           98%
60            674          73%               918           92%

Table: Results on a shared-memory 8x Xeon E7-8850 server. Workers correspond to individual cores. Speed is the rate of subproblems solved per second. Efficiency is calculated as the percent of ideal parallel speedup obtained. The superlinear scaling observed with 20 workers is likely a system artifact.

There are a few more hoops to jump through in order to run on EC2. First we must build a system image (AMI) with Julia installed. Julia connects to workers over ssh, so I found it useful to put my EC2 ssh key on the AMI and also set StrictHostKeyChecking no in /etc/ssh/ssh_config to disable the authenticity prompt when connecting to new workers. Someone will likely correct me if this is not the right approach.

Assuming we have an AMI in place, we can fire up the instances. I used an m3.xlarge instance for the master and m1.medium instances for the workers. (Note: you can save a lot of money by using the spot market.)

To add remote workers on startup, Julia accepts a file with a list of host names through the --machinefile option. We can generate this easily enough by using the EC2 API Tools (Ubuntu package ec2-api-tools) with the command

ec2-describe-instances | grep running | awk '{ print $5; }' > mfile

On the master instance we can then run

julia --machinefile mfile runatr.jl Data/storm 4000 0.6 30

Results from various runs are presented in the table below.

No. Workers   Sync Speed   Sync Efficiency   Async Speed   Async Efficiency
10            149          Baseline          151           Baseline
20            289          97%               301           99.7%
40            532          89%               602           99.5%

Table: Results on Amazon EC2. Workers correspond to individual m1.medium instances. The master process is run on an m3.xlarge instance.

On both architectures the asynchronous version solves subproblems at a higher rate and has significantly better parallel efficiency. Scaling is better on EC2 than on the shared-memory server likely because the subproblem calculation is memory bound, and so performance is better on the distributed-memory architecture. Anyway, with Julia we can easily experiment on both.

Further reading

A more detailed tutorial was prepared for the Julia IAP session at MIT in January 2013.

Creative Commons License
Distributed Numerical Optimization by Miles Lubin is licensed under a Creative Commons Attribution 4.0 International License.

05 Apr 14:04

A deeper look into the LLVM code generator, Part 1

by eliben

In a previous article, I followed the various incarnations an instruction takes when it’s being compiled from the source language to machine code in LLVM. The article briefly mentioned a lot of layers within LLVM, each of which is interesting and non-trivial.

Here I want to focus on one of the most important and complex layers – the code generator, and specifically the instruction selection mechanism. A short reminder: the task of the code generator is to transform the high-level, mostly target-independent LLVM IR into low-level, target-dependent machine language. Instruction selection is the process wherein the abstract operations in IR are mapped to concrete instructions of the target architecture.

This article will follow a simple example to show the instruction selection mechanism in action (ISel in LLVM parlance).

Getting started: a DAG for simple multiplication

Here’s some sample IR:

define i64 @imul(i64 %a, i64 %b) nounwind readnone {
entry:
  %mul = mul nsw i64 %b, %a
  ret i64 %mul
}

It’s compiled with Clang (-emit-llvm option) on a x64 machine from this C code:

long imul(long a, long b) {
    return a * b;
}

The first thing done by the code generator is convert the IR into a selection DAG representation. This is the initial DAG, right after it’s built:

http://eli.thegreenplace.net/wp-content/uploads/2013/dag_imul5.png

There’s really not much interesting going on here, and all the types are legal for the target architecture; therefore, this is also the DAG that reaches the instruction selection stage.

Patterns for instruction selection

Instruction selection is arguably the most important part of the code generation phase. Its task is to convert a legal selection DAG into a new DAG of target machine code. In other words, the abstract, target-independent input has to be matched to concrete, target-dependent output. For this purpose LLVM uses an elaborate pattern-matching algorithm that consists of two major steps.

The first step happens "offline", when LLVM itself is being built, and involves the TableGen tool, which generates the pattern matching tables from instruction definitions. TableGen is an important part of the LLVM ecosystem, and it plays an especially central role in instruction selection, so it’s worthwhile to spend a couple of minutes talking about it (there’s also official documentation, starting with TableGen fundamentals).

The problem with TableGen is that some of its uses are so complex (and instruction selection, as we’ll shortly see, is one of the worst offenders) that it’s easy to forget how simple the idea is in its core. The LLVM developers realized a long time ago that a lot of repetitive code has to be written for each new target. Take a machine instruction, for instance. An instruction is being used in code generation, in the assembler, in the disassembler, in optimizers, and in many other places. Each such use results in a "table" that maps instructions to some piece of information. Wouldn’t it be nice if we could just define all instructions in one central place which collects all the interesting information we need about them and then generate all the tables automatically? This is precisely what TableGen was born to do.

Let’s examine an instruction definition relevant to this article (taken from lib/Target/X86/X86InstrArithmetic.td and reformatted a bit):

def IMUL64rr : RI<0xAF, MRMSrcReg, (outs GR64:$dst),
                                   (ins GR64:$src1, GR64:$src2),
                  "imul{q}\t{$src2, $dst|$dst, $src2}",
                  [(set GR64:$dst, EFLAGS,
                        (X86smul_flag GR64:$src1, GR64:$src2))],
                  IIC_IMUL64_RR>,
                 TB;

If this looks like gibberish, don’t worry, that’s the right first impression to have. To factor out common code and fanatically preserve DRY, TableGen grew some advanced features like multiple inheritance, a form of templating and more; all of these make definitions somewhat difficult to understand at first. If you want to see the "naked" definition of IMUL64rr, you can run this from the root of the LLVM source tree:

$ llvm-tblgen lib/Target/X86/X86.td -I=include -I=lib/Target/X86

The 13.5 MB output only contains simple defs – table entries from which TableGen backends can take what they need. The def for IMUL64rr has something like 75 fields. But we’ll only focus on the ones we need for this article, and the condensed description pasted above will do.

The most important field for our discussion is the sixth template argument in the def above:

[(set GR64:$dst, EFLAGS,
      (X86smul_flag GR64:$src1, GR64:$src2))],

This is the pattern on which the IMUL64rr can be selected. It’s essentially an s-expression describing the DAG path that will be matched. In this case it roughly means: an X86ISD::SMUL node (this is concealed behind the X86smul_flag definition) with two 64-bit GPR (General Purpose Register) arguments is invoked and returns two results – one assigned to a destination GPR and the other to the status flag register [1]. When the automatic instruction selection sees such a sequence in the DAG, it will match it to the said IMUL64rr instruction.

A careful reader will, at this point, notice that I’m cheating a little bit. If the node matched by this pattern is X86ISD::SMUL, then how did it match the DAG shown above which has an ISD::MUL node? Indeed, it didn’t. I will show the pattern that actually matches the DAG shortly, but I felt it’s important to demonstrate the instruction definitions with patterns, to enable me to discuss how all patterns are mashed together later.

So what is the difference between ISD::MUL and X86ISD::SMUL [2] ? In the former, we don’t care about the actual flags affected by the multiplication, while in the latter we do. In the case of multiplication in C, we usually don’t care about the flags affected, hence ISD::MUL is selected. But LLVM provides some special intrinsics such as llvm.smul.with.overflow in which an overflow flag can be returned from the operation. For these (and possibly other uses), the X86ISD::SMUL node exists [3].

What, then, actually matches the ISD::MUL node here? This pattern from lib/Target/X86/X86InstrCompiler.td:

def : Pat<(mul GR64:$src1, GR64:$src2),
          (IMUL64rr GR64:$src1, GR64:$src2)>;

This is an anonymous TableGen record that defines a "pattern" which is detached from any specific instruction. The pattern is simply a mapping from a DAG input to DAG output, the latter containing a selected instruction. We don’t care how this mapping is called, so TableGen lets us define anonymous instances. In this case, the pattern should be fairly straightforward. Here’s an interesting snippet from include/llvm/Target/TargetSelectionDAG.td, where the Pattern class (and its Pat specialization) is defined:

// Selection DAG Pattern Support.
//
// Patterns are what are actually matched against by the target-flavored
// instruction selection DAG.  Instructions defined by the target implicitly
// define patterns in most cases, but patterns can also be explicitly added when
// an operation is defined by a sequence of instructions (e.g. loading a large
// immediate value on RISC targets that do not support immediates as large as
// their GPRs).
//

class Pattern<dag patternToMatch, list<dag> resultInstrs> {
  dag             PatternToMatch  = patternToMatch;
  list<dag>       ResultInstrs    = resultInstrs;
  list<Predicate> Predicates      = [];  // See class Instruction in Target.td.
  int             AddedComplexity = 0;   // See class Instruction in Target.td.
}

// Pat - A simple (but common) form of a pattern, which produces a simple result
// not needing a full list.
class Pat<dag pattern, dag result> : Pattern<pattern, [result]>;

The large comment at the top of this snippet is helpful, but it describes exactly the opposite of the situation we’re observing for IMUL64rr. In our case, the pattern defined within the instruction is actually the more complex one, while the simpler pattern is defined outside it, as a standalone Pat.

The pattern matching mechanism

TableGen descriptions of target instructions support numerous pattern kinds. We’ve examined patterns implicitly defined within instruction definitions and patterns explicitly defined as stand-alones. In addition there are also "complex" patterns that specify a C++ function to be called, and "pattern fragments" that can contain arbitrary snippets of C++ code that do custom matching. If you’re interested, these pattern types are somewhat described in the comments within include/llvm/Target/TargetSelectionDAG.td.

Mixing C++ code into TableGen works because the final result of the TableGen run (with the DAG ISel backend) is a C++ method that gets embedded into a target’s implementation of the SelectionDAGISel interface.

To be more specific, the sequence is:

  • The generic SelectionDAGISel::DoInstructionSelection method calls Select per DAG node.
  • Select is an abstract method, implemented by the targets. For example X86DAGToDAGISel::Select.
  • The latter intercepts some nodes for manual matching, but delegates the bulk of the work to X86DAGToDAGISel::SelectCode.
  • X86DAGToDAGISel::SelectCode is auto-generated by TableGen [4], and contains the matcher table, followed by a call to the generic SelectionDAGISel::SelectCodeCommon, passing it the table.

So what is the matcher table? Essentially, it’s a "program" written in a sort of "bytecode" specific to instruction selection. To enable flexible pattern matching while staying efficient, TableGen munges all the patterns together and generates a program that, given a DAG node, figures out which pattern it matches. SelectionDAGISel::SelectCodeCommon serves as the interpreter for this bytecode.

Unfortunately, the bytecode language for pattern matching is not documented anywhere. To understand how it works, there’s no substitute for looking at the interpreter code and at the generated bytecode for some backend [5].
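
To get a feel for the mechanism, here’s a toy Python sketch of a table-driven matcher with scopes and backtracking. Everything in it – the opcode names, the table layout, the node representation – is invented for illustration; LLVM’s real bytecode is far richer:

def run_matcher(table, node):
    """Interpret a tiny matcher 'program' against a DAG-node stand-in."""
    scopes = []               # saved fail-offsets for backtracking
    idx = 0
    while idx < len(table):
        op = table[idx]
        kind = op[0]
        if kind == 'SCOPE':
            scopes.append(op[1])      # where to resume if this scope fails
            idx += 1
        elif kind in ('CHECK_OPCODE', 'CHECK_TYPE'):
            key = 'opcode' if kind == 'CHECK_OPCODE' else 'type'
            if node[key] == op[1]:
                idx += 1              # check passed, keep going
            elif scopes:
                idx = scopes.pop()    # check failed: backtrack
            else:
                return None           # no alternatives left
        elif kind == 'MORPH':
            return op[1]              # match complete: the selected instruction
    return None

# Two alternatives for a 'mul' node: i32 -> IMUL32rr, i64 -> IMUL64rr.
table = [
    ('CHECK_OPCODE', 'mul'),
    ('SCOPE', 4),             # on failure, resume at index 4
    ('CHECK_TYPE', 'i32'),
    ('MORPH', 'IMUL32rr'),
    ('CHECK_TYPE', 'i64'),
    ('MORPH', 'IMUL64rr'),
]

print(run_matcher(table, {'opcode': 'mul', 'type': 'i64'}))  # IMUL64rr

The real interpreter supports many more operations (operand recording, predicate checks, complex patterns and so on), but the push-state/backtrack-on-failure structure is the same one visible in the debug trace in the next section.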

Example: matching our sample DAG node

Let’s examine how the ISD::MUL node in our sample DAG is matched. For this purpose, it’s very useful to pass the -debug option to llc, which makes it dump detailed debugging information throughout the code generation process. In particular, the selection process for each DAG node can be traced. Here’s the relevant portion for our ISD::MUL node:

Selecting: 0x38c4ee0: i64 = mul 0x38c4de0, 0x38c4be0 [ORD=1] [ID=7]

ISEL: Starting pattern match on root node: 0x38c4ee0: i64 = mul 0x38c4de0, 0x38c4be0 [ORD=1] [ID=7]

  Initial Opcode index to 57917
  Match failed at index 57922
  Continuing at 58133
  Match failed at index 58137
  Continuing at 58246
  Match failed at index 58249
  Continuing at 58335
  TypeSwitch[i64] from 58337 to 58380
MatchAddress: X86ISelAddressMode 0x7fff447ca040
Base_Reg nul Base.FrameIndex 0
 Scale1
IndexReg nul Disp 0
GV nul CP nul
ES nul JT-1 Align0
  Match failed at index 58380
  Continuing at 58396
  Match failed at index 58407
  Continuing at 58516
  Match failed at index 58517
  Continuing at 58531
  Match failed at index 58532
  Continuing at 58544
  Match failed at index 58545
  Continuing at 58557
  Morphed node: 0x38c4ee0: i64,i32 = IMUL64rr 0x38c4de0, 0x38c4be0 [ORD=1]

ISEL: Match complete!
=> 0x38c4ee0: i64,i32 = IMUL64rr 0x38c4de0, 0x38c4be0 [ORD=1]

The indices mentioned here refer to the matcher table. You can see them in a comment at the beginning of each line in the generated X86GenDAGISel.inc file. Here’s the beginning of that table [6]:

// The main instruction selector code.
SDNode *SelectCode(SDNode *N) {
  // Some target values are emitted as 2 bytes, TARGET_VAL handles
  // this.
  #define TARGET_VAL(X) X & 255, unsigned(X) >> 8
  static const unsigned char MatcherTable[] = {
/*0*/     OPC_SwitchOpcode /*221 cases */, 73|128,103/*13257*/,  TARGET_VAL(ISD::STORE),// ->13262
/*5*/       OPC_RecordMemRef,
/*6*/       OPC_RecordNode,   // #0 = 'st' chained node
/*7*/       OPC_Scope, 5|128,2/*261*/, /*->271*/ // 7 children in Scope

At position 0 we have an OPC_SwitchOpcode operation, which is kind of a huge switch table on the node opcode. It’s followed by a list of cases. Each case begins with its size (so that the matcher knows where to go if matching the case fails), and then the opcode. For example, as you can see in the listing above, the first case in the table is for opcode ISD::STORE, and its size is 13257 (since the table is byte-based, the size is stored in a variable-length encoding – see the decoding sketch below).
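
Decoding such a size is simple, by the way. Here’s a small sketch (the function is mine, not LLVM code; the byte values are taken from the ISD::STORE case above):

def decode_vbr(data, idx):
    """Decode one variable-length value starting at data[idx]."""
    value, shift = 0, 0
    while True:
        byte = data[idx]
        idx += 1
        value |= (byte & 127) << shift   # low 7 bits carry the payload
        if byte < 128:                   # high bit clear: last byte
            return value, idx
        shift += 7

print(decode_vbr([73 | 128, 103], 0))    # -> (13257, 2)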

Looking at the debug output, the matching of our MUL node starts at offset 57917. Here’s the relevant part of the table:

          /*SwitchOpcode*/ 53|128,8/*1077*/,  TARGET_VAL(ISD::MUL),// ->58994
/*57917*/   OPC_Scope, 85|128,1/*213*/, /*->58133*/ // 7 children in Scope

So, as expected, this is the switch case with ISD::MUL as the opcode. The matching for this case starts with OPC_Scope, which is an instruction to the interpreter to push its current state. If something fails within the scope, the state can then be restored to proceed with matching the next cases. In the snippet above, if matching fails in the scope, it will proceed at offset 58133.

You can see this happening in the debug output:

Initial Opcode index to 57917
Match failed at index 57922
Continuing at 58133

At 57922, the interpreter tries to match the child of the node to an ISD::LOAD (meaning – multiply with an in-memory argument), fails, and jumps to 58133 as the scope dictates. The rest of the matching process can be traced similarly, following the debug output with the matching table as a reference. Something interesting happens at offset 58337, though. Here’s the relevant table part:

/*58337*/     OPC_SwitchType /*2 cases */, 38,  MVT::i32,// ->58378
/*58340*/       OPC_Scope, 17, /*->58359*/ // 2 children in Scope
/*58342*/         OPC_CheckPatternPredicate, 4, // (!Subtarget->is64Bit())
/*58344*/         OPC_CheckComplexPat, /*CP*/3, /*#*/0, // SelectLEAAddr:$src #1 #2 #3 #4 #5
/*58347*/         OPC_MorphNodeTo, TARGET_VAL(X86::LEA32r), 0,
                      1/*#VTs*/, MVT::i32, 5/*#Ops*/, 1, 2, 3, 4, 5,
                  // Src: lea32addr:i32:$src - Complexity = 18
                  // Dst: (LEA32r:i32 lea32addr:i32:$src)

This is one of the "complex" patterns described above. SelectLEAAddr is a C++ method (defined by the X86 backend’s ISel implementation); it gets invoked to try to match the node operand to a LEA [7]. The debug printout that follows comes from that method and, as we can see, the match eventually fails.

Finally, when the interpreter reaches offset 58557, the match succeeds. Here’s the relevant table part:

/*58557*/       /*Scope*/ 12, /*->58570*/
/*58558*/         OPC_CheckType, MVT::i64,
/*58560*/         OPC_MorphNodeTo, TARGET_VAL(X86::IMUL64rr), 0,
                      2/*#VTs*/, MVT::i64, MVT::i32, 2/*#Ops*/, 0, 1,
                  // Src: (mul:i64 GR64:i64:$src1, GR64:i64:$src2) - Complexity = 3
                  // Dst: (IMUL64rr:i64:i32 GR64:i64:$src1, GR64:i64:$src2)

Simply put, after it fails matching a bunch of optimizations and special cases, the matcher finally uses a generic integer-multiply between 64-bit registers, which is matched to the IMUL64rr machine instruction.

If it appears from the trace that the instruction selector works hard to find a suitable instruction, that is true. To generate good code, some work has to be done to try and match various optimized sequences before falling back to generic ones. In the next part of the article, I will show some more advanced cases of instruction selection with optimization.

The final code

This is how the DAG looks after instruction selection:

http://eli.thegreenplace.net/wp-content/uploads/2013/dag_imul_postisel.png

Since the entry DAG was pretty basic, this one is very similar; the main difference is that the multiplication and return nodes were selected to actual instructions.

If you remember from the life of an instruction in LLVM article, the instruction goes through a couple of additional incarnations after being matched by the instruction selector. The final code that gets emitted is:

imul:                                   # @imul
      imulq   %rsi, %rdi
      movq    %rdi, %rax
      ret

imulq is the assembly (GAS flavor) representation of X86::IMUL64rr. It multiplies the function’s arguments (according to the AMD64 ABI, the first two integer arguments come in %rdi and %rsi); then the result is moved to the return register – %rax.

Conclusion

This article provided an in-depth peek into the instruction selection process – a key part of the LLVM code generator. While it uses a relatively simple example, it should contain sufficient information to gain some initial understanding of the mechanisms involved. In the next part of the article, I will examine a couple of additional examples through which other aspects of the code generation process should become clearer.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] Although the status flags are "implicit" on x86 (there’s no explicit register you can work with), LLVM treats them as an explicit register to aid the code generation algorithms.
[2] X86ISD::SMUL is the X86-specific lowering of the generic ISD::SMULO node.
[3] You may have an "oh my, why is this so complex?" reaction at this point. The TL;DR answer is "compilers are hard, let’s go fishing". A longer rationale: the x86 instruction set is very large and complex. Moreover, LLVM is a compiler with many (quite different) targets, and much of its machinery is therefore engineered to be target-independent. The result is inherent complexity. To put it differently – the x86 TableGen definitions are about 20 KLOC in size. Add to that another 20 KLOC or so of custom C++ lowering code and compare to the Intel architecture manual, which contains some 3,000 pages. In terms of Kolmogorov complexity, this isn’t very bad :-)
[4] It’s generated into <BUILD_DIR>/lib/Target/X86/X86GenDAGISel.inc, a file that’s #included by lib/Target/X86/X86ISelDAGToDAG.cpp.
[5] If you want to understand how this bytecode is generated from the TableGen pattern definitions, you also need to look inside the TableGen DAG ISel backend.
[6] Note that the values in this table are relevant to the version of LLVM I have built for this example (r174056). Changes in X86 pattern definitions may result in different numbering, but the principle is the same.
[7] Some multiplications can be optimized to use the faster LEA instruction.
05 Apr 14:03

Python FFI with ctypes and cffi

by eliben

In a previous post, I demonstrated how to use libffi to perform fully dynamic calls to C code, where "fully dynamic" means that even the types of the arguments and return values are determined at runtime.

Here I want to discuss how the same task is done from Python, both with the existing stdlib ctypes package and the new cffi library, developed by the PyPy team and a candidate for inclusion into the Python stdlib in the future.

With ctypes

I’ll start with the shared object discussed before; the following code loads and runs it in Python using ctypes. I tested it on Python 3.2, but other versions should work too (including 2.7):

from ctypes import cdll, Structure, c_int, c_double, c_uint

lib = cdll.LoadLibrary('./libsomelib.so')
print('Loaded lib {0}'.format(lib))

# Describe the DataPoint structure to ctypes.
class DataPoint(Structure):
    _fields_ = [('num', c_int),
                ('dnum', c_double)]

# Initialize the DataPoint[4] argument. Since we just use the DataPoint[4]
# type once, an anonymous instance will do.
dps = (DataPoint * 4)((2, 2.2), (3, 3.3), (4, 4.4), (5, 5.5))

# Grab add_data from the library and specify its return type.
# Note: .argtypes can also be specified
add_data_fn = lib.add_data
add_data_fn.restype = DataPoint

print('Calling add_data via ctypes')
dout = add_data_fn(dps, 4)
print('dout = {0}, {1}'.format(dout.num, dout.dnum))

This is pretty straightforward. As far as dynamic language FFIs go, ctypes is pretty good. But we can do better. The main problem with ctypes is that we have to fully repeat the C declarations to ctypes, using its specific API. For example, see the description of the DataPoint structure. The return type also has to be specified explicitly. Not only is this a lot of work for wrapping non-trivial C libraries, it’s also error prone. If you make a mistake translating a C header to a ctypes description, you will likely get a segfault at runtime, which isn’t easy to debug without a debug build of Python available. ctypes lets us explicitly specify argtypes on a function for some measure of type checking, but this only works within the Python code – provided that you got the declaration right, it will help with passing the correct types of objects (see the snippet below). But if you didn’t get the declaration right, nothing will save you.
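
For example, continuing the listing above (this addition is mine, not part of the original sample):

from ctypes import POINTER

# Declare the argument types; ctypes will now check every call site.
add_data_fn.argtypes = [POINTER(DataPoint), c_uint]

dout = add_data_fn(dps, 4)    # fine: the array is accepted for the pointer
# add_data_fn('oops', 4)      # would raise ctypes.ArgumentError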

How does it work?

ctypes is a Python wrapper around libffi. The CPython project carries a version of libffi with it, and ctypes consists of a C extension module linking to libffi and Python code for the required glue. If you understand how to use libffi, it should be easy to see how ctypes works.

While libffi is quite powerful, it also has some limitations, which by extension apply to ctypes. For example, passing unions by value to dynamically-loaded functions is not supported. But overall, the benefits outweigh the limitations, which are not hard to work around when needed.

With cffi

cffi tries to improve on ctypes with an interesting approach. It lets you avoid re-writing your C declarations in ctypes notation: it can parse actual C declarations and infer the required data types and function signatures automatically. Here’s the same example implemented with cffi (tested with cffi 0.5 on Python 3.2):

from cffi import FFI

ffi = FFI()

lib = ffi.dlopen('./libsomelib.so')
print('Loaded lib {0}'.format(lib))

# Describe the data type and function prototype to cffi.
ffi.cdef('''
typedef struct {
    int num;
    double dnum;
} DataPoint;

DataPoint add_data(const DataPoint* dps, unsigned n);
''')

# Create an array of DataPoint structs and initialize it.
dps = ffi.new('DataPoint[]', [(2, 2.2), (3, 3.3), (4, 4.4), (5, 5.5)])

print('Calling add_data via cffi')
# Interesting variation: passing invalid arguments to add_data will trigger
# a cffi type-checking exception.
dout = lib.add_data(dps, 4)
print('dout = {0}, {1}'.format(dout.num, dout.dnum))

Instead of tediously describing the C declarations to Python, cffi just consumes them directly and produces all the required glue automatically. It’s much harder to get things wrong and run into segfaults.

Note that this demonstrates what cffi calls the ABI level. There’s another, more ambitious, use of cffi which uses the system C compiler to auto-complete missing parts of declarations. I’m just focusing on the ABI level here, since it requires no C compiler. How does it work? Deep down, cffi also relies on libffi to generate the actual low-level calls. To parse the C declarations, it uses pycparser.
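
To give a taste of that parsing step, here’s a standalone pycparser snippet (my own illustration, independent of cffi’s internals; the declaration is a stand-in example):

from pycparser import c_parser

# Parse a C declaration into an AST and dump it.
parser = c_parser.CParser()
ast = parser.parse('int add(int x, int y);')
ast.show()   # prints a Decl wrapping a FuncDecl, its parameters, etc.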

Another cool thing about cffi is that being part of the PyPy ecosystem, it can actually benefit from PyPy’s JIT capabilities. As I’ve mentioned in a previous post, using libffi is much slower than compiler-generated calls because a lot of the argument set-up work has to happen for each call. But once we actually start running, in practice the signatures of called functions never change. So a JIT compiler could be smarter about it and generate faster, more direct calls. I don’t know whether PyPy is already doing this with cffi, but I’m pretty sure it’s in their plans.

A more complex example

I want to show another example, which demonstrates a more involved function being called – the POSIX readdir_r (the reentrant version of readdir). This example is based on some demo/ code in the cffi source tree. Here’s code using ctypes to list the contents of a directory:

from ctypes import (CDLL, byref, Structure, POINTER, c_int,
                    c_void_p, c_long, c_ushort, c_ubyte,
                    c_char, c_char_p)

# CDLL(None) invokes dlopen(NULL), which loads the currently running
# process - in our case Python itself. Since Python is linked with
# libc, readdir_r will be found there.
# Alternatively, we can just explicitly load 'libc.so.6'.
lib = CDLL(None)
print('Loaded lib {0}'.format(lib))

# Describe the types needed for readdir_r.
class DIRENT(Structure):
    _fields_ = [('d_ino', c_long),
                ('d_off', c_long),
                ('d_reclen', c_ushort),
                ('d_type', c_ubyte),
                ('d_name', c_char * 256)]

DIR_p = c_void_p
DIRENT_p = POINTER(DIRENT)
DIRENT_pp = POINTER(DIRENT_p)

# Load the functions we need from the C library. Specify their
# argument and return types.
readdir_r = lib.readdir_r
readdir_r.argtypes = [DIR_p, DIRENT_p, DIRENT_pp]
readdir_r.restype = c_int

opendir = lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = DIR_p

closedir = lib.closedir
closedir.argtypes = [DIR_p]
closedir.restype = c_int

# opendir's path argument is char*, hence bytes.
path = b'/tmp'
dir_fd = opendir(path)
if not dir_fd:
    raise RuntimeError('opendir failed')

dirent = DIRENT()
result = DIRENT_p()

while True:
    # Note that byref() here is optional since ctypes can do it on its
    # own by observing the argtypes declared for readdir_r. I keep byref
    # for explicitness.
    if readdir_r(dir_fd, byref(dirent), byref(result)):
        raise RuntimeError('readdir_r failed')
    if not result:
        # If (*result == NULL), we're done.
        break
    # dirent.d_name is char[], hence we decode it to get a unicode
    # string.
    print('Found: ' + dirent.d_name.decode('utf-8'))

closedir(dir_fd)

Here I went one step further and actually described the required argument types for the imported functions. Once again, this only helps us avoid errors to some extent. You’ll have to agree that the code is tedious. Using cffi, we can just "copy paste" the C declarations and focus on the actual calling:

from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef void DIR;
    typedef long ino_t;
    typedef long off_t;

    struct dirent {
        ino_t          d_ino;       /* inode number */
        off_t          d_off;       /* offset to the next dirent */
        unsigned short d_reclen;    /* length of this record */
        unsigned char  d_type;      /* type of file; not supported
                                       by all file system types */
        char           d_name[256]; /* filename */
    };

    DIR *opendir(const char *name);
    int readdir_r(DIR *dirp, struct dirent *entry, struct dirent **result);
    int closedir(DIR *dirp);
""")

# Load symbols from the current process (Python).
lib = ffi.dlopen(None)
print('Loaded lib {0}'.format(lib))

path = b'/tmp'
dir_fd = lib.opendir(path)
if not dir_fd:
    raise RuntimeError('opendir failed')

# Allocate the pointers passed to readdir_r.
dirent = ffi.new('struct dirent*')
result = ffi.new('struct dirent**')

while True:
    if lib.readdir_r(dir_fd, dirent, result):
        raise RuntimeError('readdir_r failed')
    if result[0] == ffi.NULL:
        # If (*result == NULL), we're done.
        break
    print('Found: ' + ffi.string(dirent.d_name).decode('utf-8'))

lib.closedir(dir_fd)

I placed "copy paste" in quotes on purpose, because the man page for readdir_r doesn’t fully specify all the typedef declarations inside struct dirent. For example, you need to do some digging to discover that ino_t is long. cffi’s other goal, the API level, is to enable Python programmers to skip such declarations and let the C compiler complete the details (see the sketch below). But since this requires a C compiler, I see it as a very different solution from the ABI level. In fact, it’s not really an FFI at this point, but rather an alternative way of writing C extensions for Python.
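
For a taste of the API level, here’s a minimal sketch using cffi’s verify() (my own example – the C source is made up for the demo, and a C compiler must be available at run time):

from cffi import FFI

ffi = FFI()
ffi.cdef('int add(int x, int y);')

# verify() compiles a real extension module behind the scenes, so the
# declaration above is checked by the C compiler rather than trusted.
lib = ffi.verify('int add(int x, int y) { return x + y; }')

print(lib.add(2, 3))  # -> 5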

05 Apr 13:59

Which side of Kullback-Leibler divergence to optimize?

A common question is which side of the Kullback-Leibler (KL) divergence one should optimize. KL is the relative entropy: the cross-entropy minus the entropy. As such, the ideal model should be on the right-hand side and the estimated model on the left-hand side (see the formulas after the list). Now suppose you want to find the center of two distributions for mixture simplification. You can interpret the results of the left-sided and right-sided KL centroids as follows:

  • The Kullback-Leibler right-sided centroid is zero-avoiding, so its corresponding density function tries to cover the support of all input normals,
  • The Kullback-Leibler left-sided centroid is zero-forcing, so it focuses on the normal with the highest-mass mode.
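
In symbols (a sketch of the standard definitions; the p_i are the input distributions and c is the centroid being sought):

\mathrm{KL}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx = H(p, q) - H(p)

c_{\text{right}} = \arg\min_{c} \sum_{i} \mathrm{KL}(p_i \,\|\, c) \quad \text{(zero-avoiding)}

c_{\text{left}} = \arg\min_{c} \sum_{i} \mathrm{KL}(c \,\|\, p_i) \quad \text{(zero-forcing)}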

Here there is not one ideal model but two, so we need a trade-off (= the centroid). The zero-forcing/zero-avoiding property is depicted in the following figure:
[Figure: KLsidedandsymmetrized.png – zero-forcing (left-sided) vs. zero-avoiding (right-sided) KL centroids]
Here are the details (paper).