Shared posts

16 Oct 00:46

How Lisp Became God’s Own Programming Language

When programmers discuss the relative merits of different programming languages, they often talk about them in prosaic terms as if they were so many tools in a tool belt—one might be more appropriate for systems programming, another might be more appropriate for gluing together other programs to accomplish some ad hoc task. This is as it should be. Languages have different strengths and claiming that a language is better than other languages without reference to a specific use case only invites an unproductive and vitriolic debate.

But there is one language that seems to inspire a peculiar universal reverence: Lisp. Keyboard crusaders who would otherwise pounce on anyone daring to suggest that some language is better than any other will concede that Lisp is on another level. Lisp transcends the utilitarian criteria used to judge other languages, because the median programmer has never used Lisp to build anything practical and probably never will. Yet the reverence for Lisp runs so deep that it is often ascribed mystical properties. Everyone’s favorite webcomic, xkcd, has depicted Lisp this way at least twice: In one comic, a character reaches some sort of Lisp enlightenment, which appears to allow him to comprehend the fundamental structure of the universe. In another comic, a robed, senescent programmer hands a stack of parentheses to his padawan, saying that the parentheses are “elegant weapons for a more civilized age,” suggesting that Lisp has all the occult power of the Force.

Another great example is Bob Kanefsky’s parody of a song called “God Lives on Terra.” His parody, written in the mid-1990s and called “Eternal Flame”, describes how God must have created the world using Lisp. The following is an excerpt, but the full set of lyrics can be found in the GNU Humor Collection:

For God wrote in Lisp code
When he filled the leaves with green.
The fractal flowers and recursive roots:
The most lovely hack I’ve seen.
And when I ponder snowflakes,
never finding two the same,
I know God likes a language
with its own four-letter name.

I can only speak for myself, I suppose, but I think this “Lisp Is Arcane Magic” cultural meme is the most bizarre and fascinating thing ever. Lisp was concocted in the ivory tower as a tool for artificial intelligence research, so it was always going to be unfamiliar and maybe even a bit mysterious to the programming laity. But programmers now urge each other to “try Lisp before you die” as if it were some kind of mind-expanding psychedelic. They do this even though Lisp is now the second-oldest programming language in widespread use, younger only than Fortran, and even then by just one year.1 Imagine if your job were to promote some new programming language on behalf of the organization or team that created it. Wouldn’t it be great if you could convince everyone that your new language had divine powers? But how would you even do that? How does a programming language come to be known as a font of hidden knowledge?

How did Lisp get to be this way?

The cover of Byte Magazine, August 1979.

Theory A: The Axiomatic Language

John McCarthy, Lisp’s creator, did not originally intend for Lisp to be an elegant distillation of the principles of computation. But, after one or two fortunate insights and a series of refinements, that’s what Lisp became. Paul Graham—we will talk about him some more later—has written that, with Lisp, McCarthy “did for programming something like what Euclid did for geometry.”2 People might see a deeper meaning in Lisp because McCarthy built Lisp out of parts so fundamental that it is hard to say whether he invented it or discovered it.

McCarthy began thinking about creating a language during the 1956 Dartmouth Summer Research Project on Artificial Intelligence. The Summer Research Project was in effect an ongoing, multi-week academic conference, the very first in the field of artificial intelligence. McCarthy, then an assistant professor of Mathematics at Dartmouth, had actually coined the term “artificial intelligence” when he proposed the event.3 About ten or so people attended the conference for its entire duration.4 Among them were Allen Newell and Herbert Simon, two researchers affiliated with the RAND Corporation and Carnegie Mellon who had just designed a language called IPL.

Newell and Simon had been trying to build a system capable of generating proofs in propositional calculus. They realized that it would be hard to do this while working at the level of the computer’s native instruction set, so they decided to create a language—or, as they called it, a “pseudo-code”—that would help them more naturally express the workings of their “Logic Theory Machine.”5 Their language, called IPL for “Information Processing Language”, was more of a high-level assembly dialect than a programming language in the sense we mean today. Newell and Simon, perhaps referring to Fortran, noted that other “pseudo-codes” then in development were “preoccupied” with representing equations in standard mathematical notation.6 Their language focused instead on representing sentences in propositional calculus as lists of symbolic expressions. Programs in IPL would basically leverage a series of assembly-language macros to manipulate and evaluate expressions within one or more of these lists.

McCarthy thought that having algebraic expressions in a language, Fortran-style, would be useful. So he didn’t like IPL very much.7 But he thought that symbolic lists were a good way to model problems in artificial intelligence, particularly problems involving deduction. This was the germ of McCarthy’s desire to create an algebraic list processing language, a language that would resemble Fortran but also be able to process symbolic lists like IPL.

Of course, Lisp today does not resemble Fortran. Over the next few years, McCarthy’s ideas about what an ideal list processing language should look like evolved. His ideas began to change in 1957, when he started writing routines for a chess-playing program in Fortran. The prolonged exposure to Fortran convinced McCarthy that there were several infelicities in its design, chief among them the awkward IF statement.8 McCarthy invented an alternative, the “true” conditional expression, which returns sub-expression A if the supplied test succeeds and sub-expression B if the supplied test fails and which also only evaluates the sub-expression that actually gets returned. During the summer of 1958, when McCarthy worked to design a program that could perform differentiation, he realized that his “true” conditional expression made writing recursive functions easier and more natural.9 The differentiation problem also prompted McCarthy to devise the maplist function, which takes another function as an argument and applies it to all the elements in a list.10 This was useful for differentiating sums of arbitrarily many terms.

None of these things could be expressed in Fortran, so, in the fall of 1958, McCarthy set some students to work implementing Lisp. Since McCarthy was now an assistant professor at MIT, these were all MIT students. As McCarthy and his students translated his ideas into running code, they made changes that further simplified the language. The biggest change involved Lisp’s syntax. McCarthy had originally intended for the language to include something called “M-expressions,” which would be a layer of syntactic sugar that made Lisp’s syntax resemble Fortran’s. Though M-expressions could be translated to S-expressions—the basic lists enclosed by parentheses that Lisp is known for—S-expressions were really a low-level representation meant for the machine. The only problem was that McCarthy had been denoting M-expressions using square brackets, and the IBM 026 keypunch that McCarthy’s team used at MIT did not have any square bracket keys on its keyboard.11 So the Lisp team stuck with S-expressions, using them to represent not just lists of data but function applications too. McCarthy and his students also made a few other simplifications, including a switch to prefix notation and a memory model change that meant the language only had one real type.12

In 1960, McCarthy published his famous paper on Lisp called “Recursive Functions of Symbolic Expressions and Their Computation by Machine.” By that time, the language had been pared down to such a degree that McCarthy realized he had the makings of “an elegant mathematical system” and not just another programming language.13 He later wrote that the many simplifications that had been made to Lisp turned it “into a way of describing computable functions much neater than the Turing machines or the general recursive definitions used in recursive function theory.”14 In his paper, he therefore presented Lisp both as a working programming language and as a formalism for studying the behavior of recursive functions.

McCarthy explained Lisp to his readers by building it up out of only a very small collection of rules. Paul Graham later retraced McCarthy’s steps, using more readable language, in his essay “The Roots of Lisp”. Graham is able to explain Lisp using only seven primitive operators, two different notations for functions, and a half-dozen higher-level functions defined in terms of the primitive operators. That Lisp can be specified by such a small sequence of basic rules no doubt contributes to its mystique. Graham has called McCarthy’s paper an attempt to “axiomatize computation.”15 I think that is a great way to think about Lisp’s appeal. Whereas other languages have clearly artificial constructs denoted by reserved words like while or typedef or public static void, Lisp’s design almost seems entailed by the very logic of computing. This quality and Lisp’s original connection to a field as esoteric as “recursive function theory” should make it no surprise that Lisp has so much prestige today.

Theory B: Machine of the Future

Two decades after its creation, Lisp had become, according to the famous Hacker’s Dictionary, the “mother tongue” of artificial intelligence research. Early on, Lisp spread quickly, probably because its regular syntax made implementing it on new machines relatively straightforward. Later, researchers would keep using it because of how well it handled symbolic expressions, important in an era when so much of artificial intelligence was symbolic. Lisp was used in seminal artificial intelligence projects like the SHRDLU natural language program, the Macsyma algebra system, and the ACL2 logic system.

By the mid-1970s, though, artificial intelligence researchers were running out of computer power. The PDP-10, in particular—everyone’s favorite machine for artificial intelligence work—had an 18-bit address space that increasingly was insufficient for Lisp AI programs.16 Many AI programs were also supposed to be interactive, and making a demanding interactive program perform well on a time-sharing system was challenging. The solution, originally proposed by Peter Deutsch at MIT, was to engineer a computer specifically designed to run Lisp programs. These Lisp machines, as I described in my last post on Chaosnet, would give each user a dedicated processor optimized for Lisp. They would also eventually come with development environments written entirely in Lisp for hardcore Lisp programmers. Lisp machines, devised in an awkward moment at the tail of the minicomputer era but before the full flowering of the microcomputer revolution, were high-performance personal computers for the programming elite.

For a while, it seemed as if Lisp machines would be the wave of the future. Several companies sprang into existence and raced to commercialize the technology. The most successful of these companies was called Symbolics, founded by veterans of the MIT AI Lab. Throughout the 1980s, Symbolics produced a line of computers known as the 3600 series, which were popular in the AI field and in industries requiring high-powered computing. The 3600 series computers featured large screens, bit-mapped graphics, a mouse interface, and powerful graphics and animation software. These were impressive machines that enabled impressive programs. For example, Bob Culley, who worked in robotics research and contacted me via Twitter, was able to implement and visualize a path-finding algorithm on a Symbolics 3650 in 1985. He explained to me that bit-mapped graphics and object-oriented programming (available on Lisp machines via the Flavors extension) were very new in the 1980s. Symbolics was the cutting edge.

Bob Culley’s path-finding program.

As a result, Symbolics machines were outrageously expensive. The Symbolics 3600 cost $110,000 in 1983.16 So most people could only marvel at the power of Lisp machines and the wizardry of their Lisp-writing operators from afar. But marvel they did. Byte Magazine featured Lisp and Lisp machines several times from 1979 through to the end of the 1980s. In the August, 1979 issue, a special on Lisp, the magazine’s editor raved about the new machines being developed at MIT with “gobs of memory” and “an advanced operating system.”17 He thought they sounded so promising that they would make the two prior years—which saw the launch of the Apple II, the Commodore PET, and the TRS-80—look boring by comparison. A half decade later, in 1985, a Byte Magazine contributor described writing Lisp programs for the “sophisticated, superpowerful Symbolics 3670” and urged his audience to learn Lisp, claiming it was both “the language of choice for most people working in AI” and soon to be a general-purpose programming language as well.18

I asked Paul McJones, who has done lots of Lisp preservation work for the Computer History Museum in Mountain View, about when people first began talking about Lisp as if it were a gift from higher-dimensional beings. He said that the inherent properties of the language no doubt had a lot to do with it, but he also said that the close association between Lisp and the powerful artificial intelligence applications of the 1960s and 1970s probably contributed too. When Lisp machines became available for purchase in the 1980s, a few more people outside of places like MIT and Stanford were exposed to Lisp’s power and the legend grew. Today, Lisp machines and Symbolics are little remembered, but they helped keep the mystique of Lisp alive through to the late 1980s.

Theory C: Learn to Program

In 1985, MIT professors Harold Abelson and Gerald Sussman, along with Sussman’s wife, Julie Sussman, published a textbook called Structure and Interpretation of Computer Programs. The textbook introduced readers to programming using the language Scheme, a dialect of Lisp. It was used to teach MIT’s introductory programming class for two decades. My hunch is that SICP (as the title is commonly abbreviated) about doubled Lisp’s “mystique factor.” SICP took Lisp and showed how it could be used to illustrate deep, almost philosophical concepts in the art of computer programming. Those concepts were general enough that any language could have been used, but SICP’s authors chose Lisp. As a result, Lisp’s reputation was augmented by the notoriety of this bizarre and brilliant book, which has intrigued generations of programmers (and also become a very strange meme). Lisp had always been “McCarthy’s elegant formalism”; now it was also “that language that teaches you the hidden secrets of programming.”

It’s worth dwelling for a while on how weird SICP really is, because I think the book’s weirdness and Lisp’s weirdness get conflated today. The weirdness starts with the book’s cover. It depicts a wizard or alchemist approaching a table, prepared to perform some sort of sorcery. In one hand he holds a set of calipers or a compass, in the other he holds a globe inscribed with the words “eval” and “apply.” A woman opposite him gestures at the table; in the background, the Greek letter lambda floats in mid-air, radiating light.

The cover art for SICP.

Honestly, what is going on here? Why does the table have animal feet? Why is the woman gesturing at the table? What is the significance of the inkwell? Are we supposed to conclude that the wizard has unlocked the hidden mysteries of the universe, and that those mysteries consist of the “eval/apply” loop and the Lambda Calculus? It would seem so. This image alone must have done an enormous amount to shape how people talk about Lisp today.

But the text of the book itself is often just as weird. SICP is unlike most other computer science textbooks that you have ever read. Its authors explain in the foreword to the book that the book is not merely about how to program in Lisp—it is instead about “three foci of phenomena: the human mind, collections of computer programs, and the computer.”19 Later, they elaborate, describing their conviction that programming shouldn’t be considered a discipline of computer science but instead should be considered a new notation for “procedural epistemology.”20 Programs are a new way of structuring thought that only incidentally get fed into computers. The first chapter of the book gives a brief tour of Lisp, but most of the book after that point is about much more abstract concepts. There is a discussion of different programming paradigms, a discussion of the nature of “time” and “identity” in object-oriented systems, and at one point a discussion of how synchronization problems may arise because of fundamental constraints on communication that play a role akin to the fixed speed of light in the theory of relativity.21 It’s heady stuff.

All this isn’t to say that the book is bad. It’s a wonderful book. It discusses important programming concepts at a higher level than anything else I have read, concepts that I had long wondered about but didn’t quite have the language to describe. It’s impressive that an introductory programming textbook can move so quickly to describing the fundamental shortfalls of object-oriented programming and the benefits of functional languages that minimize mutable state. It’s mind-blowing that this then turns into a discussion of how a stream paradigm, perhaps something like today’s RxJS, can give you the best of both worlds. SICP distills the essence of high-level program design in a way reminiscent of McCarthy’s original Lisp paper. The first thing you want to do after reading it is get your programmer friends to read it; if they look it up, see the cover, but then don’t read it, all they take away is that some mysterious, fundamental “eval/apply” thing gives magicians special powers over tables with animal feet. I would be deeply impressed in their shoes too.

But maybe SICP’s most important contribution was to elevate Lisp from curious oddity to pedagogical must-have. Well before SICP, people told each other to learn Lisp as a way of getting better at programming. The 1979 Lisp issue of Byte Magazine is testament to that fact. The same editor who raved about MIT’s new Lisp machines also explained that the language was worth learning because it “represents a different point of view from which to analyze problems.”22 But SICP presented Lisp as more than just a foil for other languages; SICP used Lisp as an introductory language, implicitly making the argument that Lisp is the best language in which to grasp the fundamentals of computer programming. When programmers today tell each other to try Lisp before they die, they arguably do so in large part because of SICP. After all, the language Brainfuck presumably offers “a different point of view from which to analyze problems.” But people learn Lisp instead because they know that, for twenty years or so, the Lisp point of view was thought to be so useful that MIT taught Lisp to undergraduates before anything else.

Lisp Comes Back

The same year that SICP was released, Bjarne Stroustrup published the first edition of The C++ Programming Language, which brought object-oriented programming to the masses. A few years later, the market for Lisp machines collapsed and the AI winter began. For the next decade and change, C++ and then Java would be the languages of the future and Lisp would be left out in the cold.

It is of course impossible to pinpoint when people started getting excited about Lisp again. But that may have happened after Paul Graham, Y Combinator co-founder and Hacker News creator, published a series of influential essays pushing Lisp as the best language for startups. In his essay “Beating the Averages,” for example, Graham argued that Lisp macros simply made Lisp more powerful than other languages. He claimed that using Lisp at his own startup, Viaweb, helped him develop features faster than his competitors were able to. Some programmers at least were persuaded. But the vast majority of programmers did not switch to Lisp.

What happened instead is that more and more Lisp-y features have been incorporated into everyone’s favorite programming languages. Python got list comprehensions. C# got LINQ. Ruby got… well, Ruby is a Lisp. As Graham noted even back in 2001, “the default language, embodied in a succession of popular languages, has gradually evolved toward Lisp.”23 Though other languages are gradually becoming like Lisp, Lisp itself somehow manages to retain its special reputation as that mysterious language that few people understand but everybody should learn. In 1980, on the occasion of Lisp’s 20th anniversary, McCarthy wrote that Lisp had survived as long as it had because it occupied “some kind of approximate local optimum in the space of programming languages.”24 That understates Lisp’s real influence. Lisp hasn’t survived for over half a century because programmers have begrudgingly conceded that it is the best tool for the job decade after decade; in fact, it has survived even though most programmers do not use it at all. Thanks to its origins and use in artificial intelligence research and perhaps also the legacy of SICP, Lisp continues to fascinate people. Until we can imagine God creating the world with some newer language, Lisp isn’t going anywhere.

If you enjoyed this post, more like it come out every two weeks! Follow @TwoBitHistory on Twitter or subscribe to the RSS feed to make sure you know when a new post is out.

Previously on TwoBitHistory…

  1. John McCarthy, “History of Lisp”, 14, Stanford University, February 12, 1979, accessed October 14, 2018, http://jmc.stanford.edu/articles/lisp/lisp.pdf

  2. Paul Graham, “The Roots of Lisp”, 1, January 18, 2002, accessed October 14, 2018, http://languagelog.ldc.upenn.edu/myl/llog/jmc.pdf

  3. Martin Childs, “John McCarthy: Computer scientist known as the father of AI”, The Independent, November 1, 2011, accessed on October 14, 2018, https://www.independent.co.uk/news/obituaries/john-mccarthy-computer-scientist-known-as-the-father-of-ai-6255307.html

  4. Lisp Bulletin History. http://www.artinfo-musinfo.org/scans/lb/lb3f.pdf 

  5. Allen Newell and Herbert Simon, “Current Developments in Complex Information Processing,” 19, May 1, 1956, accessed on October 14, 2018, http://bitsavers.org/pdf/rand/ipl/P-850_Current_Developments_In_Complex_Information_Processing_May56.pdf

  6. ibid. 

  7. Herbert Stoyan, “Lisp History”, 43, Lisp Bulletin #3, December 1979, accessed on October 14, 2018, http://www.artinfo-musinfo.org/scans/lb/lb3f.pdf 

  8. McCarthy, “History of Lisp”, 5. 

  9. ibid. 

  10. McCarthy “History of Lisp”, 6. 

  11. Stoyan, “Lisp History”, 45. 

  12. McCarthy, “History of Lisp”, 8. 

  13. McCarthy, “History of Lisp”, 2. 

  14. McCarthy, “History of Lisp”, 8. 

  15. Graham, “The Roots of Lisp”, 11. 

  16. Guy Steele and Richard Gabriel, “The Evolution of Lisp”, 22, History of Programming Languages 2, 1993, accessed on October 14, 2018, http://www.dreamsongs.com/Files/HOPL2-Uncut.pdf

  17. Carl Helmers, “Editorial”, Byte Magazine, 154, August 1979, accessed on October 14, 2018, https://archive.org/details/byte-magazine-1979-08/page/n153

  18. Patrick Winston, “The Lisp Revolution”, 209, April 1985, accessed on October 14, 2018, https://archive.org/details/byte-magazine-1985-04/page/n207

  19. Harold Abelson, Gerald Jay Sussman, and Julie Sussman, Structure and Interpretation of Computer Programs (Cambridge, Mass: MIT Press, 2010), xiii. 

  20. Abelson, xxiii. 

  21. Abelson, 428. 

  22. Helmers, 7. 

  23. Paul Graham, “What Made Lisp Different”, December 2001, accessed on October 14, 2018, http://www.paulgraham.com/diff.html

  24. John McCarthy, “Lisp—Notes on its past and future”, 3, Stanford University, 1980, accessed on October 14, 2018, http://jmc.stanford.edu/articles/lisp20th/lisp20th.pdf

02 Jul 23:27

A Gift Whore’s Mouth #017

by totempole

 


The comic updates twice a week thanks to the support of its readers – on Patreon.

I’m active on HentaiFoundry, Twitter and Instagram.
I also draw an SFW webcomic: Stick in the Mud

23 Dec 18:19

Topic Suggestions for Millions of Repositories

by Kavita Ganesan

We recently launched Topics, a new feature that lets you tag your repositories with descriptive words or phrases, making it easy to discover projects and explore GitHub.com. Topic suggestions on public repositories provide a quick way to add tags to your repositories.


These suggestions are the result of recent data science work at GitHub. We applied concepts from text mining, natural language processing (NLP), and machine learning to build a topic extraction framework.

Solving the cold-start problem for topic suggestions

Because Topics is a brand new concept at GitHub, we started with no cues from users on what defined a topic and what type of topics they would typically add to their repositories. Given our focus on improving discoverability, internally we defined Topics as any “word or phrase that roughly describes the purpose of a repository and the type of content it encapsulates”. These can be words such as “data science”, “nlp”, “scikit-learn”, “clustering algorithm”, “jekyll plugin”, “css template”, or “python”.

While no tag or label-type feature existed prior to the start of this project, we did have a rich set of textual information to start from. At its core, GitHub is a platform for sharing software with other people, and some of the data typically found in a repository provides information to humans rather than instructions for computers. Repository names, descriptions, and READMEs are text that communicate functionality, use case, and features to human readers. That’s where we started.

Repo-Topix: our topics extraction framework

We developed a topic extraction framework, called repo-topix, that learns from the human-readable text developers provide in repository names, descriptions, and READMEs, incorporating methods from text mining, NLP, and supervised machine learning. At a high level, repo-topix does three things:

  1. Generates candidate topics from natural language text by incorporating data from millions of other repositories
  2. Selects the best topics from the set of candidates
  3. Finds similarities and relationships in topics to facilitate discoverability

Below, we describe each step of the repo-topix framework in greater technical detail.


Generating candidate topics from natural language text

Preprocessing of READMEs

While README files on GitHub.com tend to use fairly lightweight Markdown or reStructuredText formatting, certain sections, such as code blocks, tables, and image links, are not useful for topic suggestions. For example, month names from within a table would not be useful to a user.

To extract text sections of interest, we developed a heuristics-based README tagger that marks sections in the README file as relevant or non-relevant. This simple tagger uses common formatting cues such as indentation, spacing, and use of backticks to determine “noise sections” and “valid text sections”. The use of a grammar-based parser was unnecessary as we only care about useful text sections and regard everything else in a README as noise.

Once we extract text sections of interest, we perform basic cleanup to remove file extensions, HTML tags, paths, and hosts from further processing, as these are more distracting than useful. Finally, the remaining text gets segmented into coarse-grained units using punctuation marks as well as README section markers such as contiguous hash symbols.
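
As a rough illustration of this kind of cleanup, the sketch below strips code blocks, HTML tags, paths, and file extensions with regular expressions and then segments the remaining text on punctuation and on runs of hash symbols. The clean_readme helper and its patterns are illustrative stand-ins, not the heuristics repo-topix actually uses.

import re

# Hypothetical cleanup pass: drop noise, then segment into coarse-grained units.
CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)            # fenced code blocks
HTML_TAG = re.compile(r"<[^>]+>")                           # inline HTML tags
URL_OR_PATH = re.compile(r"\S*[/\\]\S+")                    # hosts, URLs, file paths
FILE_EXT = re.compile(r"\b\w+\.(?:py|js|rb|md|txt|exe)\b")  # common file extensions

def clean_readme(text: str) -> list[str]:
    for pattern in (CODE_BLOCK, HTML_TAG, URL_OR_PATH, FILE_EXT):
        text = pattern.sub(" ", text)
    # Split on punctuation and on runs of '#' used as README section markers.
    units = re.split(r"[.!?;:\n]+|#{1,6}", text)
    return [u.strip() for u in units if u.strip()]

print(clean_readme("# My scraper\nFetches pages. See docs/usage.md for details."))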

Finding candidate topics

We use the cleaned text from the previous step to generate candidate topics by eliminating low-information words and breaking the remaining text into strings of one or multiple consecutive words, called n-grams. Like any text, our sources contain many words that are so common that they do not contain distinguishing information. Called stop words, these typically include function words like “is”, “the”, and “are”, conjunctions like “and”, “but”, and “yet”, and so on. Given our specialized domain, we created a custom stop word list that included words that are practically ubiquitous in our source text; for example, “push”, “pull”, “software”, “tool”, “var”, “val”, and “package.” The custom stop word list provides an efficient way to find potential topics, as we simply take the resulting phrases left between eliminated words. For example, “this open source software is used for web scraping and search” produces three candidate topics: 1. “open source,” 2. “web scraping,” 3. “search”. This process eliminates the need for brute-force n-gram generation, which could end up producing a large number of n-grams depending on the length of the README files being processed. After testing internally among GitHub staff, we found that n-grams made up of many words tended to be too specific (e.g. “machine-learning-tutorial-part-1-intro”), so we limit candidate topics to n-grams of size 4 or less.
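
A minimal sketch of this stop-word splitting, using a tiny assumed stop list rather than the custom one described above, reproduces the example: the text left between eliminated words becomes the candidate phrases, capped at four words.

import re

# Assumed, abbreviated stop list; the real list is custom-built from GitHub data.
STOP_WORDS = {"this", "is", "used", "for", "and", "the", "a", "software", "tool"}
MAX_NGRAM = 4  # longer candidates tended to be too specific

def candidate_topics(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    candidates, phrase = [], []
    for token in tokens + [None]:  # None acts as a sentinel to flush the last phrase
        if token is None or token in STOP_WORDS:
            if 0 < len(phrase) <= MAX_NGRAM:
                candidates.append(" ".join(phrase))
            phrase = []
        else:
            phrase.append(token)
    return candidates

print(candidate_topics("this open source software is used for web scraping and search"))
# ['open source', 'web scraping', 'search']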

Selecting best topics

Eliminating low quality topics

While some of the generated word units in the previous step would be meaningful as topics, some could also be plain noise. We have a few strategies for pruning noise and unpromising candidate topics. The first is simply to eliminate phrases with low frequency counts. For example, if we had the following candidates with their corresponding counts, we could eliminate some of the low frequency topics:

machine learning tutorial part 1 (1)
machine learning (5)
beginners data science course (1)
topic modeling (3)

From the above, we could easily eliminate topics that don’t satisfy a minimum frequency count threshold. However, this method doesn’t prune out topics with unwanted grammatical structure or word composition. For example, words and phrases like “great”, “cool”, “running slowly”, “performing operations”, “install database” and “@kavgan” (a GitHub handle) are not great topics for a repository. To aggressively prune out these keywords, we developed a supervised logistic regression model, trained to classify a topic as “good” (positive) or “bad” (negative). We call this the keyword filtering model. We manually gathered about 300 training examples balanced across the positive (good topics) and negative (bad topics) categories. Because of this manual process with no input from users, our training data is actually fairly small. While it’s possible to learn from the actual words that make up a topic when you have a very large training set, with limited training data we used features that draw on the meta-information of the training examples so that our model does not just memorize specific words. For instance, one of the features we used was the part-of-speech usage within topics. If the model learns that single word verbs are often considered bad topics, the next time it sees such an occurrence, it would help eliminate such words from further consideration. Other features we used were occurrence of user names, n-gram size of a phrase, length of a phrase, and numeric content within a phrase. Our classifier is tuned for high recall in order to keep as many phrases as possible and prune obviously incorrect ones.

With time, we plan to include feedback from users to update the keyword filtering model. For example, highly accepted topics can serve as positive training examples and highly rejected topics can either be used as stop words or used as negative examples in our model. We believe that this incremental update would help weed out uninteresting topics from the suggestions list.

Scoring candidate topics

Instead of treating all remaining candidate topics with equal importance, we rank the candidates by score so that we return only the top-N most promising topics rather than a large list of arbitrary ones. We experimented with several scoring schemes. The first scoring approach measures the average strength of association of words in a phrase using pointwise mutual information (PMI) weighted by the frequency count of the phrases. The second approach we tried uses the average tf-idf scores of individual words in a phrase weighted by the phrase frequency (if it’s more than one word long) and n-gram size.

We found that the first scoring strategy favored topics that were unique in nature because of the way PMI works when data is fairly sparse: unique phrases tend to get very high scores. While some highly unique phrases can be interesting, some unique phrases can just be typos or even code snippets that were not formatted as code. The second approach favored phrases that were less unique and relatively frequent. We ended up using the tf-idf based scoring as it gave us a good balance between uniqueness of a topic and relevance of a topic to a repository. While our tf (term frequency) scoring is based on local counts, our idf (inverse document frequency) weighting is based on a large dictionary of idf scores built using the unstructured content from millions of public READMEs. The idf weights essentially tell us how common or unique a term is globally. The intuition is that the more common a term, the less information it carries and should thus have a lower weight. For example, in the GitHub domain, the term “application” is much more common than terms such as “machine”, “learning”, or “assignment” and this is clearly reflected by their idf weights as shown below:

word idf
application 4.300
machine 7.169
learning 7.818
assignment 8.480

If a phrase has many words with low idf weighting, then its overall score should be lower compared to a phrase with more significant words - this is the intuition behind our tf-idf scoring strategy. As an example, assuming that the normalized tf of each word above is 0.5, the average tf-idf score for “machine-learning-application” would be 3.21 and the average tf-idf score for “machine-learning-assignment” would be 3.91. The former has a lower score because the term “application” is more ubiquitous and has a lower idf score than the term “assignment”.
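
A small sketch of this scoring, using the idf weights from the table above and the assumed normalized tf of 0.5, reproduces the two averages in the example.

# idf weights from the table above; tf is the assumed normalized value of 0.5.
IDF = {"application": 4.300, "machine": 7.169, "learning": 7.818, "assignment": 8.480}

def average_tfidf(phrase: str, tf: float = 0.5) -> float:
    words = phrase.split("-")
    return sum(tf * IDF[word] for word in words) / len(words)

print(round(average_tfidf("machine-learning-application"), 2))  # 3.21
print(round(average_tfidf("machine-learning-assignment"), 2))   # 3.91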

In addition to the base tf-idf scoring, we are also experimenting with some additional ideas such as boosting scores of those phrases that tend to occur earlier in a document and aren’t unique to a few repositories. These minor tweaks are subject to change based on our internal evaluation.

Discovering similar topics

Canonicalizing topics to address character-level topic differences and inflections

Because different users can express similar phrases in different ways, the generated topics can also vary from repository to repository. For example, we have commonly seen these variations of topics across different repositories:

neural-network
neural-networks
neuralnetwork
neuralnetworks
topic-models
topic-modelling
topic-modeling
topicmodel
topicmodeling
machinelearning-algorithms
machine-learning-algorithm

To keep topic suggestions fairly consistent, we use a dictionary to canonicalize suggested topics. Instead of suggesting the original topics discovered, we suggest a canonicalized version of the topic if present in our dictionary. This in-house dictionary was built using all non-canonicalized topics across public repositories. The non-canonicalized topics give us cues as to which topics are most commonly used and which ones can be grouped together as being equivalent. We currently use a combination of edit-distance, stemming, and word-level Jaccard similarity to group similar topics together. Jaccard similarity in our case estimates how similar two phrases are by comparing members of two sets to see which members are shared and which are distinct. With this, phrases that share many words can be grouped together.
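
The snippet below is only a rough stand-in for that grouping step: it assumes a crude suffix-stripping normalization and character-level similarity from the standard library in place of the production combination of edit distance, stemming, and word-level Jaccard similarity. The normalize and same_group helpers and the 0.85 threshold are illustrative.

from difflib import SequenceMatcher

def normalize(topic: str) -> str:
    """Crude normalization: drop hyphens and a couple of common inflections."""
    words = topic.replace("-", " ").split()
    return " ".join(w.removesuffix("s").removesuffix("ing") for w in words)

def same_group(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two topics as variants if their normalized forms are nearly identical."""
    na, nb = normalize(a), normalize(b)
    return na == nb or SequenceMatcher(None, na, nb).ratio() >= threshold

print(same_group("neural-networks", "neuralnetwork"))    # True
print(same_group("topic-modelling", "topic-models"))     # True
print(same_group("machine-learning", "deep-learning"))   # False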

Near similar topics

While it’s possible to suggest all top-scoring topics, some topics may be fairly repetitive, and the set of topics returned may not provide enough variety for labeling a repository. For example, the following top-scoring topics (from an actual repository), while valid and meaningful, are not interesting and lack variety, as they capture different granularities of similar topics:

machine learning
deep learning
general-purpose machine learning
machine learning library
machine learning algorithms
distributed machine learning
machine learning framework
deep learning library
support vector machine
linear regression

We use a greedy topic selection strategy that starts with the highest-scoring topic. If the topic is similar to other lower-scoring topics, the lower-scoring topics are dropped from consideration. We repeat this process iteratively using the next highest-scoring topic until all candidate topics have been accounted for. For the example above, the final set of topics returned to the user would be as follows:

machine learning
deep learning
support vector machine
linear regression

We use word-level Jaccard similarity when computing similarity between phrases, because it’s known to work well for short phrases. It also produces a score between 0 and 1, making it easy to set thresholds.
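
As a sketch of this selection step, the code below combines word-level Jaccard similarity with the greedy pass described above; with an assumed similarity threshold of 0.4 it reproduces the four topics kept in the example. The select_diverse helper and the threshold value are illustrative assumptions.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two phrases."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def select_diverse(ranked_topics: list[str], threshold: float = 0.4) -> list[str]:
    """Greedy pass over topics ordered from highest to lowest score."""
    kept: list[str] = []
    for topic in ranked_topics:
        if all(jaccard(topic, k) < threshold for k in kept):
            kept.append(topic)
    return kept

ranked = [
    "machine learning", "deep learning", "general-purpose machine learning",
    "machine learning library", "machine learning algorithms",
    "distributed machine learning", "machine learning framework",
    "deep learning library", "support vector machine", "linear regression",
]
print(select_diverse(ranked))
# ['machine learning', 'deep learning', 'support vector machine', 'linear regression']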

Evaluation

As topic labels were not available during the development of repo-topix, we needed to get a rough approximation of how well the suggested topics describe a repository. For this rough approximation, we used the description text for repositories since descriptions often provide insights into the function of a repository. If indeed the auto-suggested topics are not completely arbitrary, there should be some amount of overlap between suggested topics and the description field. For this evaluation, we computed ROUGE-1 precision and recall. ROUGE is an n-gram overlap metric that counts the number of overlapping units between a system summary (suggested topics in our case) and a gold standard summary (description in our case). We performed this evaluation on roughly 127,000 public repositories with fairly long descriptions. These are our most recent results:

Repos with no topic suggestions: ~28%
Average ROUGE-1 Recall: 0.259
Average ROUGE-1 Precision: 0.372
F-Measure: 0.306

The ROUGE recall above tells us quantitatively how much of the description is being captured by topic suggestions, and precision tells us what proportion of the suggested words also appear in the description. Based on the results, we see some overlap, as expected. We are not looking for perfect overlap, just a reasonable level of it after disregarding stop words.
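
A minimal sketch of this ROUGE-1 computation over unique unigrams (a simplification; the full metric counts n-gram occurrences) looks like the following. The rouge_1 helper, the small stop list, and the example strings are assumptions for illustration.

import re

STOP_WORDS = {"a", "an", "and", "the", "for", "of", "in", "to", "is"}

def unigrams(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def rouge_1(suggested_topics: list[str], description: str) -> tuple[float, float]:
    """Return (precision, recall) of topic unigrams against the description."""
    system = unigrams(" ".join(suggested_topics))
    reference = unigrams(description)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

print(rouge_1(["machine learning", "topic modeling"],
              "A library for topic modeling and machine learning experiments"))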

Summary

Our topics extraction framework is capable of discovering promising topics for any public repository on GitHub.com. Instead of applying heavy NLP and complex parsing algorithms within our framework (e.g. grammar-based markdown parsing, dependency parsing, chunking, lemmatization), we focused on using lightweight methods that would easily scale as GitHub.com’s repository base grows over the years. For many of our tasks, we leverage the volume of available data to build out reusable resources: an IDF dictionary built using all public README files, a custom stop-word list, and a canonicalization dictionary for topics. While we currently depend on the presence of README files to generate suggestions, in the future we hope to make suggestions by looking at any available content within a repository. Most of the core topics extraction code was developed using Java and Python within the Spark framework.

Future plans

Our plan for the near future is to evaluate the usage of suggested topics as well as manually created topics to continuously improve suggestions shown to users. Some of the rejected topics could feed into our topics extraction framework as stop words or as negative examples to our keyword filtering model. Highly accepted topics could add positive examples to our keyword filtering model and also provide lessons on the type of topics that users care about. This would provide cues as to what type of “meta-information” users add to their repositories in addition to the descriptive terms found within README files. We also plan to explore topic suggestions on private repositories and with GitHub Enterprise in a way that fully respects privacy concerns and eliminates certain data dependencies.

Beyond these near term goals, our vision for using topics is to build an ever-evolving GitHub knowledge graph containing concepts and mapping how they relate to each other and to the code, people, and projects on GitHub.

Work with us

Want to work on interesting problems like code analysis, social network analysis, recommendation engines, and improving search relevance? Apply here!

Topics Data Team

  • Kavita Ganesan, @kavgan, Senior Data Scientist - Machine Learning
  • Rafer Hazen, @rafer, Engineering Manager - Data Engineering
  • Frances Zlotnick, @franniez, Senior Data Scientist - Analytics

25 Apr 05:11

Ubuntu 12.04 Precise Build Image End Of Life Warning

At the end of April 2017, Ubuntu 12.04 will no longer be supported by Canonical. Additionally, CircleCI is no longer updating the 12.04 build image with new versions of packages.

We encourage all customers to upgrade to our 14.04 build image, or even better, sign up for our 2.0 platform for complete control and flexibility over your build images.

12 Mar 14:51

I need 50,000 comments on a government website.

by Matthew Inman

27 Feb 06:51

Sinfest for February 27, 2017: Pig Person 4

26 Jul 05:41

"Stop The Genocide" - Thu, 25 Jul 2013

Stop The Genocide