Shared posts

11 Oct 17:41

Flexible Neo4j Batch Import with Groovy

by Patrick Durusau

Flexible Neo4j Batch Import with Groovy by Michael Hunger.

From the post:

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.
It might be a lot of data, like many tens of million lines.
Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

Your Requirements

There are some things you probably want to do differently than the batch-importer does by default:

  • not create legacy indexes
  • not index properties at all that you just need for connecting data
  • create schema indexes
  • skip certain columns
  • rename properties from the column names
  • create your own labels based on the data in the row
  • convert column values into Neo4j types (e.g. split strings or parse JSON)

Michael helps you avoid the defaults of batch importing into Neo4j.
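
As a rough illustration of the kind of per-row control Michael describes, here is a minimal sketch in Python (Michael’s post uses Groovy against the batch-inserter API); the file name, column names, conversions and label rule below are all made up:

import csv
import json

# Hypothetical input: people.csv with columns  id,name,age,tags,internal_code
# Goals: skip 'internal_code', rename 'tags' to 'interests' (parsed from JSON),
# convert 'age' to an int, and derive a label from the data in the row.

SKIP_COLUMNS = {"internal_code"}
RENAME = {"tags": "interests"}

def convert(column, value):
    """Convert raw CSV strings into typed values."""
    if column == "age":
        return int(value)
    if column == "tags":
        return json.loads(value)  # e.g. '["graphs", "csv"]' -> Python list
    return value

def label_for(row):
    """Derive a label from the row itself rather than from a fixed schema."""
    return "Adult" if int(row["age"]) >= 18 else "Minor"

def rows_to_nodes(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            props = {}
            for column, value in row.items():
                if column in SKIP_COLUMNS:
                    continue
                props[RENAME.get(column, column)] = convert(column, value)
            yield {"labels": ["Person", label_for(row)], "properties": props}

if __name__ == "__main__":
    for node in rows_to_nodes("people.csv"):
        print(node)  # hand the cleaned rows to the importer of your choice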

11 Oct 17:40

No Query Language Needed: Using Python with an Ordered Key-Value Store

by Patrick Durusau

No Query Language Needed: Using Python with an Ordered Key-Value Store by Stephen Pimentel.

From the post:

FoundationDB is a complex and powerful database, designed to handle sharding, replication, network hiccups, and server failures gracefully and automatically. However, when we designed our Python API, we wanted most of that complexity to be hidden from the developer. By utilizing familiar features, such as generators, itertools, and comprehensions, we tried to make FoundationDB’s API as easy to use as a Python dictionary.

In the video below, I show how FoundationDB lets you query data directly using Python language features, rather than a separate query language.

Most applications have back-end data stores that developers need to query. This talk presents an approach to storing and querying data that directly employs Python language features. Using the Key-Value Store, we can make our data persistent with an interface similar to a Python dictionary. Python then gives us a number of tools “out of the box” that we can use to form queries:

  • generators for memory-efficient data retrieval;
  • itertools to filter and group data;
  • comprehensions to assemble the query results.

Taken together, these features give us a query capability using straight Python. The talk walks through a number of example queries using the Enron email dataset.

For code and the details, see https://github.com/stephenpiment/object-store.

More motivation to take a look at FoundationDB!
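
To make the “plain Python as a query language” idea concrete, here is a minimal sketch that uses an ordinary dictionary to stand in for the ordered key-value store; the keys and the toy data are invented, and the real FoundationDB API differs in its details:

from itertools import groupby
from operator import itemgetter

# A plain dict standing in for an ordered key-value store.
# Keys are (collection, id, field) tuples; values are strings.
store = {
    ("email", 1, "from"): "alice@example.com",
    ("email", 1, "subject"): "quarterly numbers",
    ("email", 2, "from"): "bob@example.com",
    ("email", 2, "subject"): "lunch?",
    ("email", 3, "from"): "alice@example.com",
    ("email", 3, "subject"): "re: quarterly numbers",
}

# Generator: stream (id, field, value) triples without building a big list.
def emails():
    for (collection, mail_id, field), value in sorted(store.items()):
        if collection == "email":
            yield mail_id, field, value

# itertools.groupby: regroup the flat triples into one record per email.
def email_records():
    for mail_id, fields in groupby(emails(), key=itemgetter(0)):
        yield mail_id, {field: value for _, field, value in fields}

# Comprehension: the "query" itself -- subjects of mail sent by alice.
alice_subjects = [
    record["subject"]
    for _, record in email_records()
    if record["from"] == "alice@example.com"
]

print(alice_subjects)  # ['quarterly numbers', 're: quarterly numbers']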

I do wonder about the “no query language needed.” Users, despite their poor results, appear to be committed to querying and query languages.

Whether it is the illusion of “empowerment” of users, the current inability to measure the cost of ineffectual searching, or acceptance of poor search results, search and search operators continue to be the preferred means of interaction. Plan accordingly.

I first saw this in a tweet by Hari Kishan.

09 Oct 00:06

A look at Cayley

by Patrick Durusau

A look at Cayley by Tony.

From the post:

Recently I took the time to check out Cayley, a graph database written in Go that’s been getting some good attention.

https://github.com/google/cayley

A great introduction to Cayley. Tony has some comparisons to Neo4j, but for beginners with graph databases, those comparisons may not be very useful. Come back for those comparisons once you have moved beyond example graphs.

08 Oct 21:24

Incremental Classification, concept drift and Novelty detection (IClaNov)

by Patrick Durusau

Incremental Classification, concept drift and Novelty detection (IClaNov)

From the post:

The development of dynamic information analysis methods, like incremental clustering, concept drift management and novelty detection techniques, is becoming a central concern in a bunch of applications whose main goal is to deal with information which is varying over time. These applications relate themselves to very various and highly strategic domains, including web mining, social network analysis, adaptive information retrieval, anomaly or intrusion detection, process control and management recommender systems, technological and scientific survey, and even genomic information analysis, in bioinformatics. The term “incremental” is often associated to the terms dynamics, adaptive, interactive, on-line, or batch. The majority of the learning methods were initially defined in a non-incremental way. However, in each of these families, were initiated incremental methods making it possible to take into account the temporal component of a data stream. In a more general way incremental clustering algorithms and novelty detection approaches are subjected to the following constraints:

  • Possibility to be applied without knowing as a preliminary all the data to be analyzed;
  • Taking into account of a new data must be carried out without making intensive use of the already considered data;
  • Result must be available after insertion of all new data;
  • Potential changes in the data description space must be taken into consideration.

This workshop aims to offer a meeting opportunity for academics and industry-related researchers, belonging to the various communities of Computational Intelligence, Machine Learning, Experimental Design and Data Mining to discuss new areas of incremental clustering, concept drift management and novelty detection and on their application to analysis of time varying information of various natures. Another important aim of the workshop is to bridge the gap between data acquisition or experimentation and model building.

ICDM 2014 Conference: December 14, 2014

The agenda for this workshop has been posted.

Does your ontology support incremental classification, concept drift and novelty detection? All of those exist in the ongoing data stream of experience if not within some more limited data stream from a source.

You can work from a dated snapshot of the world as it was, but over time will that best serve your needs?

Remember that for less than $250,000 (est.) the attacks on 9/11 provoked the United States into spending $trillions based on a Cold War snapshot of the world. Probably the highest return on investment for an attack in history.

The world is constantly changing and your data view of it should be changing as well.

05 Oct 22:40

Gödel for Goldilocks…

by Patrick Durusau

Gödel for Goldilocks: A Rigorous, Streamlined Proof of Gödel’s First Incompleteness Theorem, Requiring Minimal Background by Dan Gusfield.

Abstract:

Most discussions of Gödel’s theorems fall into one of two types: either they emphasize perceived philosophical “meanings” of the theorems, and maybe sketch some of the ideas of the proofs, usually relating Gödel’s proofs to riddles and paradoxes, but do not attempt to present rigorous, complete proofs; or they do present rigorous proofs, but in the traditional style of mathematical logic, with all of its heavy notation and difficult definitions, and technical issues which reflect Gödel’s original exposition and needed extensions by Gödel’s contemporaries. Many non-specialists are frustrated by these two extreme types of expositions and want a complete, rigorous proof that they can understand. Such an exposition is possible, because many people have realized that Gödel’s first incompleteness theorem can be rigorously proved by a simpler middle approach, avoiding philosophical discussions and hand-waving at one extreme; and also avoiding the heavy machinery of traditional mathematical logic, and many of the harder details of Gödel’s original proof, at the other extreme. This is the just-right Goldilocks approach. In this exposition we give a short, self-contained Goldilocks exposition of Gödel’s first theorem, aimed at a broad audience.

Proof that even difficult subjects can be explained without “hand-waving” or the “heavy machinery of traditional mathematical logic.”

I first saw this in a tweet by Lars Marius Garshol.

05 Oct 03:11

General Theory of Natural Equivalences [Category Theory - Back to the Source]

by Patrick Durusau

General Theory of Natural Equivalences by Samuel Eilenberg and Saunders MacLane. (1945)

While reading the Stanford Encyclopedia of Philosophy entry on category theory, I was reminded that despite seeing the citation Eilenberg and MacLane, General Theory of Natural Equivalences, 1945 uncounted times, I have never attempted to read the original paper.

Considering I had a graduate seminar on tracing biblical research back to original sources (as nearly as possible), that was a severe oversight on my part. An article comes to mind that proposed inserting several glyphs into a particular inscription. Plausible, until you look at the tablet in question and realize perhaps one glyph could be restored, but not two or three.

It has been my experience that was not a unique case nor is it limited to biblical studies.

05 Oct 03:11

Category Theory (Stanford Encyclopedia of Philosophy)

by Patrick Durusau

Category Theory (Stanford Encyclopedia of Philosophy)

From the entry:

Category theory has come to occupy a central position in contemporary mathematics and theoretical computer science, and is also applied to mathematical physics. Roughly, it is a general mathematical theory of structures and of systems of structures. As category theory is still evolving, its functions are correspondingly developing, expanding and multiplying. At minimum, it is a powerful language, or conceptual framework, allowing us to see the universal components of a family of structures of a given kind, and how structures of different kinds are interrelated. Category theory is both an interesting object of philosophical study, and a potentially powerful formal tool for philosophical investigations of concepts such as space, system, and even truth. It can be applied to the study of logical systems in which case category theory is called “categorical doctrines” at the syntactic, proof-theoretic, and semantic levels. Category theory is an alternative to set theory as a foundation for mathematics. As such, it raises many issues about mathematical ontology and epistemology. Category theory thus affords philosophers and logicians much to use and reflect upon.

Several tweets contained “category theory” and links to this entry in the Stanford Encyclopedia of Philosophy. The entry was substantially revised as of October 3, 2014, but I don’t see a mechanism that allows discovery of changes to the prior text.

For a PDF version of this entry (or other entries), join the Friends of the SEP Society. The cost is quite modest and the SEP is an effort that merits your support.

As a reading/analysis exercise, treat the entries in SEP as updates to Copleston’s History of Philosophy:

A History of Philosophy 1: Greece and Rome

A History of Philosophy 2: Medieval

A History of Philosophy 3: Late Medieval and Renaissance

A History of Philosophy 4: Modern: Descartes to Leibniz

A History of Philosophy 5: Modern British, Hobbes to Hume

A History of Philosophy 6: Modern: French Enlightenment to Kant

A History of Philosophy 7: Modern Post-Kantian Idealists to Marx, Kierkegaard and Nietzsche

A History of Philosophy 8: Modern: Empiricism, Idealism, Pragmatism in Britain and America

A History of Philosophy 9: Modern: French Revolution to Sartre, Camus, Lévi-Strauss

Enjoy!

03 Oct 23:43

Latency Numbers Every Programmer Should Know

by Patrick Durusau

Latency Numbers Every Programmer Should Know by Jonas Bonér.

Latency numbers from “L1 cache reference” up to “Send packet CA->Netherlands->CA” and many things in between!

Latency will be with you always. ;-)

I first saw this in a tweet by Julia Evans.

03 Oct 03:39

The Early Development of Programming Languages

by Patrick Durusau

The Early Development of Programming Languages by Donald E. Knuth and Luis Trabb Pardo.

A survey of the first ten (10) years of “high level” computer languages. Ends in 1947 and was written based on largely unpublished materials.

If you want to find a “new” idea, there are few better places to start than with this paper.

Enjoy!

I first saw this in a tweet by JD Maturen.

03 Oct 03:39

Readings in Databases

by Patrick Durusau

Readings in Databases by Reynold Xin.

From the webpage:

A list of papers essential to understanding databases and building new data systems. The list is curated and maintained by Reynold Xin (@rxin)

Not a comprehensive list but it is an annotated one, which should enable you to make better choices.

Concludes with reading lists from several major computer science programs.

02 Oct 05:21

Lingo of Lambda Land

by Patrick Durusau

Lingo of Lambda Land by Katie Miller.

From the post:

Comonads, currying, compose, and closures
This is the language of functional coders
Equational reasoning, tail recursion
Lambdas and lenses and effect aversion
Referential transparency and pure functions
Pattern matching for ADT deconstructions
Functors, folds, functions that are first class
Monoids and monads, it’s all in the type class
Infinite lists, so long as they’re lazy
Return an Option or just call it Maybe
Polymorphism and those higher kinds
Monad transformers, return and bind
Catamorphisms, like from Category Theory
You could use an Either type for your query
Arrows, applicatives, continuations
IO actions and partial applications
Higher-order functions and dependent types
Bijection and bottom, in a way that’s polite
Programming of a much higher order
Can be found just around the jargon corner

I posted about Katie Miller’s presentation, Coder Decoder: Functional Programmer Lingo Explained, with Pictures but wanted to draw your attention to the poem she wrote to start the presentation.

In part because it is an amusing poem, but also so that you can attempt an experiment on the interpretation of poems that Stanley Fish reports.

Stanley’s experiment is recounted in “How to Recognize a Poem When You See One,” which appears as chapter 14 in Is There A Text In This Class? The Authority of Interpretive Communities by Stanley Fish.

As functional programmers or wannabe functional programmers, you are probably not the “right” audience for this experiment. (But, feel free to try it.)

Stanley’s experiment came about from a list of authors given to one class, centered on a blackboard (yes, many years ago). For the second class, Stanley drew a box around the list of names and inserted “p. 43” on the board. Those were the only changes between the classes.

The second class was one on the interpretation of religious poetry; the students were told that the list was a religious poem and that they should apply the techniques learned in the class to its interpretation.

Stanley’s account of this experiment is masterful and I urge you to read his account in full.

At the same time, you will learn a lot about semantics if you ask a poetry professor to have one of their classes produce an interpretation of this poem. You will discover that “not knowing the meaning of the terms” is no barrier to the production of an interpretation. Sit in the back of the classroom and don’t betray the experiment by offering explanations of the terms.

The question to ask yourself at the end of the experiment is: Where did the semantics of the poem originate? Did Katie Miller imbue it with semantics that would be known to all readers? Or do the terms themselves carry semantics and Katie just selected them? If either answer is yes, how did the poetry class arrive at its rather divergent and colorful explanation of the poem?

Hmmm, if you were scanning this text with a parser, whose semantics would your parser attribute to the text? Katie’s? Any programmer’s? The class’s?

Worthwhile to remember that data processing chooses “a” semantic, not “the” semantic in any given situation.

02 Oct 05:21

Coder Decoder: Functional Programmer Lingo Explained, with Pictures

by Patrick Durusau

Coder Decoder: Functional Programmer Lingo Explained, with Pictures by Katie Miller.

From the description:

For the uninitiated, a conversation with functional programmers can feel like ground zero of a jargon explosion. This talk will help you to defend against the blah-blah blast by demystifying several terms commonly used by FP fans with bite-sized Haskell examples and friendly pictures. The presentation will also offer a glimpse of how some of these concepts can be applied in a simple Haskell web application. Expect appearances by Curry, Lens, and the infamous M-word, among others.

Slides: http://decoder.codemiller.com/#/

Haskell demo source code: https://github.com/codemiller/wubble

Informative and entertaining presentation on functional programming lingo.

Not all functional programming lingo, but enough to make you wish your presentations were this clear.

02 Oct 00:45

FOAM (Functional Ontology Assignments for Metagenomes):…

by Patrick Durusau

FOAM (Functional Ontology Assignments for Metagenomes): a Hidden Markov Model (HMM) database with environmental focus by Emmanuel Prestat, et al. (Nucl. Acids Res. (2014) doi: 10.1093/nar/gku702 )

Abstract:

A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.

Aside from its obvious importance for genomics and bioinformatics, I mention this because the authors point out:

A caveat of this approach is that we did not consider the quality of the tree in the tree-splitting step (i.e. weakly supported branches were equally treated as strongly supported ones), producing models of different qualities. Nevertheless, we decided that the approach of rational classification is better than no classification at all. In the future, the groups could be recomputed, or split more optimally when more data become available (e.g. more KOs). From each cluster related to the KO in process, we extracted the alignment from which HMMs were eventually built.

I take that to mean that this “ontology” represents no unchanging ground truth but rather an attempt to enhance the “…screening of environmental metagenomic and metatranscriptomic sequence datasets for functional genes.”

As more information is gained, the present “ontology” can and will change. Those future changes create the necessity to map those changes and the facts that drove them.

I first saw this in a tweet by Jonathan Eisen

25 Sep 15:45

Twitter open sourced a recommendation algorithm for massive datasets

by Patrick Durusau

Twitter open sourced a recommendation algorithm for massive datasets by Derrick Harris.

From the post:

Late last month, Twitter open sourced an algorithm that’s designed to ease the computational burden on systems trying to recommend content — contacts, articles, products, whatever — across seemingly endless sets of possibilities. Called DIMSUM, short for Dimension Independent Matrix Square using MapReduce (rolls off the tongue, no?), the algorithm trims the list of potential combinations to a reasonable number, so other recommendation algorithms can run in a reasonable amount of time.

Reza Zadeh, the former Twitter data scientist and current Stanford consulting professor who helped create the algorithm, describes it in terms of the famous handshake problem. Two people in a room? One handshake; no problem. Ten people in a room? Forty-five handshakes; still doable. However, he explained, “The number of handshakes goes up quadratically … That makes the problem very difficult when x is a million.”

Twitter claims 271 million active users.

DIMSUM works primarily in two different areas: (1) matching promoted ads with the right users, and (2) suggesting similar people to follow after users follow someone. Running through all the possible combinations would take days even on a large cluster of machines, Zadeh said, but sampling the user base using DIMSUM takes significantly less time and significantly fewer machines.

The “similarity” of two or more people or bits of content is a variation on the merging rules of the TMDM.

In recommendation language, two or more topics are “similar” if:

  • at least one equal string in their [subject identifiers] properties,
  • at least one equal string in their [item identifiers] properties,
  • at least one equal string in their [subject locators] properties,
  • an equal string in the [subject identifiers] property of the one topic item and the [item identifiers] property of the other, or
  • the same information item in their [reified] properties.

TMDM 5.3.5 Properties

The TMDM says “equal” and not “similar,” but the point is that you can decide for yourself how “similar” two or more topics must be in order to trigger merging.

That realization opens up the entire realm of “similarity” and “recommendation” algorithms and techniques for application to topic maps.
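
A minimal sketch of the two ends of that spectrum, the TMDM-style equality test and a relaxed threshold test (the identifiers and the threshold are invented, and the [reified] clause is left out):

def tmdm_should_merge(a, b):
    """TMDM-style test: merge iff the topics share at least one identifier,
    or a subject identifier of one equals an item identifier of the other."""
    return bool(
        set(a["subject_identifiers"]) & set(b["subject_identifiers"])
        or set(a["item_identifiers"]) & set(b["item_identifiers"])
        or set(a["subject_locators"]) & set(b["subject_locators"])
        or set(a["subject_identifiers"]) & set(b["item_identifiers"])
        or set(a["item_identifiers"]) & set(b["subject_identifiers"])
    )

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0

def similar_enough(a, b, threshold=0.5):
    """Relaxed test: score identifier overlap and merge above a threshold."""
    score = max(
        jaccard(a["subject_identifiers"], b["subject_identifiers"]),
        jaccard(a["item_identifiers"], b["item_identifiers"]),
        jaccard(a["subject_locators"], b["subject_locators"]),
    )
    return score >= threshold

topic_a = {"subject_identifiers": ["http://example.org/si/dimsum"],
           "item_identifiers": [], "subject_locators": []}
topic_b = {"subject_identifiers": ["http://example.org/si/dimsum",
                                   "http://example.org/si/dim-sum"],
           "item_identifiers": [], "subject_locators": []}

print(tmdm_should_merge(topic_a, topic_b))  # True: one equal subject identifier
print(similar_enough(topic_a, topic_b))     # True: Jaccard overlap of 0.5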

Which brings us back to the algorithm just open sourced by Twitter.

With DIMSUM, you don’t have to do a brute force topic by topic comparison for merging purposes. Some topics will not meet a merging “threshold” and so will not be considered by merging routines.

Of course, with the TMDM, merging being either true or false, you may be stuck with brute force. Suggestions?

But if you have other similarity measures, you may be able to profit from DIMSUM.

BTW, I would not follow #dimsum on Twitter because it is apparently a type of dumpling. ;-)

25 Sep 00:12

One or More Taxonomies

by Heather Hedden

In the various definitions of taxonomy, one aspect of the definition that is often missing is what constitutes a single taxonomy (or thesaurus) versus multiple related taxonomies (or thesauri). If you hire a taxonomy consultant, they won’t tell you because they will defer to their client’s terminology. If you are designing a taxonomy/taxonomies for your own organization, however, this is often an issue of concern.

Hierarchies and other relationships

In simple hierarchical taxonomies, a single hierarchy could be a single taxonomy. Not all terms on the same subject, however, may fit neatly in one hierarchy while complying with ANSI/NISO hierarchical relationship guidelines. So, more often than not, a hierarchical taxonomy may have multiple top terms. For example, a taxonomy on health care might have top terms for hierarchies on conditions and diseases, diagnostic procedures, treatments, medical equipment and supplies. If for some reason you needed a single hierarchy, then you would bend the hierarchical-relationship rules to make such top terms narrower to the term that is the name of the taxonomy. Thus, whether there is one top term or multiple top terms, it is still considered one taxonomy.

Facets are a special case. Each facet consists of its own hierarchy of terms, or may even have multiple top-term hierarchies of similar-type terms on the same subject, and there are no relationships between terms in different facets. So, you might consider each facet to be a taxonomy. However, the facets are intended to be used only in combination, not in isolation. In fact, we often speak of a “faceted taxonomy,” implying a single taxonomy comprised of multiple facets. So, a single facet is not a taxonomy.

A more thesaurus-like structure may have fewer large hierarchies and more small hierarchies with more numerous top terms, but it will also have associative relationships that link terms across hierarchies. So, a possible definition of a taxonomy or thesaurus is a set of terms where there is at least some kind of relationship between every term and at least one other in the same set. However, you could end up with a situation of just a couple of terms related to each other, with none of them related/linked to any other terms in the taxonomy. So, additional criteria are needed to define a single taxonomy so as to include such terms.

Thus, at a minimum, a taxonomy comprises one or more hierarchies, but what about at a maximum? The question came up in my online course, in an assignment to create polyhierarchies, in which I suggest that the broader terms are from different hierarchies. A student asked: “Are the different hierarchies supposed to be within the same Taxonomy, or merely two different hierarchies from two different Taxonomies?” Generally, standard hierarchical and associative relationships do not transcend multiple taxonomies. An exception would be instance-type hierarchical relationships between topics in a taxonomy and named entities (proper nouns) maintained in a separate controlled vocabulary. Other types of relationships may link terms across multiple taxonomies, but they would likely be special-purpose relationships, such as equivalency mappings or translations.

Subject scope and purpose

In addition to considering the relationships between terms, another determining factor of what constitutes a single taxonomy is the subject area scope. One taxonomy is for one subject area, although that subject area could be very broad, especially if the taxonomy’s purpose is to support indexing of the topics in a daily national newspaper. More often, a taxonomy is more limited in scope, such as just technology topics or health topics.

Related to subject scope is how the taxonomy will be used in both indexing/tagging and retrieval. Generally, a single taxonomy is utilized in a single indexing/tagging method and with its own indexing policy. Policy, comprising both editorial style for terms and indexing rules, is often a defining factor for a single taxonomy. Different taxonomies will have different policies. For the end-user, a retrieval function is served by a single taxonomy, such as supporting a search function or providing a set of browse categories. If you want to enable multiple unrelated methods of retrieval (such as type-ahead for the search box, dynamic filtering facets, and a navigational browse), then you will need to create separate taxonomies for each. At a former employer I built taxonomies for SharePoint, and it turned out that I had to build three completely separate taxonomies: (1) the consistently labeled hierarchy of libraries and folders, (2) terms and their variants to support search with a third-party auto-classification tool, and (3) controlled vocabularies of terms for consistent tagging and metadata management of uploaded documents.

There is also the question of whether the content to be accessed by the taxonomy is together in one set or separated out for different purposes or different audiences. A taxonomy should be designed to suit its own content. This was the case in a current project I am working on. There are two distinct sets of content available on a web site. The content sets have many similarities, so could be browsed via the same one hierarchical taxonomy, but they are for potentially different audiences. If the content sets were to remain separate, we would have created two separate taxonomies, each customized to best suit its own set of content. But the site owners decided that the two sets of content would be presented together, “blended,” to cross-sell content, in addition to standing on their own elsewhere on the site. Thus, a single taxonomy was the chosen option. The use of two content categories for terms within the taxonomy will enable the additional, separate content set option.

Conclusions


In sum, a single taxonomy:

  • Has standard relationships (BT/NT, RT, USE/UF) confined within it. Cross-taxonomy links, if any, are of non-standard types.
  • Has a defined, restricted subject scope.
  • Has its own indexing/tagging policy.
  • Could function in isolation, unlike a single facet (although may be supplemented by other controlled vocabularies/metadata).
  • Has its own implementation, function, and purpose (although taxonomies can be reused and repurposed).

It’s important for a taxonomist to determine what constitutes a single taxonomy versus multiple taxonomies, not so much for communicating with stakeholders, but rather to plan the initial design of the taxonomy within a taxonomy management tool. Taxonomy/thesaurus software allows for the designation of one or more taxonomies/thesauri that may be linked to each other or not. The use of multiple so-called files, thesauri, vocabularies, objects, classes, categories, etc. are different ways that the various software tools allow the taxonomist to control the divisions between and within taxonomies.

24 Sep 23:02

Growing a Language

by Patrick Durusau

Growing a Language by Guy L. Steele, Jr.

The first paper in a new series of posts from the Hacker School blog, “Paper of the Week.”

I haven’t found a good way to summarize Steele’s paper but can observe that a central theme is the growth of programming languages.

While enjoying the Steele paper, ask yourself how would you capture the changing nuances of a language, natural or artificial?

Enjoy!

24 Sep 23:01

You can be a kernel hacker!

by Patrick Durusau

You can be a kernel hacker! by Julia Evans.

From the post:

When I started Hacker School, I wanted to learn how the Linux kernel works. I’d been using Linux for ten years, but I still didn’t understand very well what my kernel did. While there, I found out that:

  • the Linux kernel source code isn’t all totally impossible to understand
  • kernel programming is not just for wizards, it can also be for me!
  • systems programming is REALLY INTERESTING
  • I could write toy kernel modules, for fun!
  • and, most surprisingly of all, all of this stuff was useful.

I hadn’t been doing low level programming at all – I’d written a little bit of C in university, and otherwise had been doing web development and machine learning. But it turned out that my newfound operating systems knowledge helped me solve regular programming tasks more easily.

Post by the same name as her presentation at Strange Loop 2014.

Another reason to study the Linux kernel: The closer to the metal your understanding, the more power you have over the results.

That’s true for the Linux kernel, machine learning algorithms, NLP, etc.

You can have a canned result prepared by someone else, which may be good enough, or you can bake something more to your liking.

I first saw this in a tweet by Felienne Hermans.

24 Sep 22:51

Learn Datalog Today

by Patrick Durusau

Learn Datalog Today by Jonas Enlund.

From the homepage:

Learn Datalog Today is an interactive tutorial designed to teach you the Datomic dialect of Datalog. Datalog is a declarative database query language with roots in logic programming. Datalog has similar expressive power as SQL.

Datomic is a new database with an interesting and novel architecture, giving its users a unique set of features. You can read more about Datomic at http://datomic.com and the architecture is described in some detail in this InfoQ article.

You have been meaning to learn Datalog but it just hasn’t happened.

Now is the time to break that cycle and do the deed!

This interactive tutorial should ease you on your way to learning Datalog.

It can’t learn Datalog for you but it can make the journey a little easier.

Enjoy!

24 Sep 22:51

From Frequency to Meaning: Vector Space Models of Semantics

by Patrick Durusau

From Frequency to Meaning: Vector Space Models of Semantics by Peter D. Turney and Patrick Pantel.

Abstract:

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

At forty-eight (48) pages with a thirteen (13) page bibliography, this survey of vector space models (VSMs) of semantics should keep you busy for a while. You will have to fill in VSM developments since 2010, but mastery of this paper will certainly give you the foundation to do so. Impressive work.
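
If you want something concrete to poke at while reading, here is a minimal sketch of the first of the three matrix types the survey covers, a term–document matrix queried with cosine similarity; the toy corpus is made up and a real system would use tf-idf or similar weighting:

import math
from collections import Counter

# Toy corpus; a real term-document VSM would use far larger collections.
docs = {
    "d1": "graphs model relationships between entities",
    "d2": "vector space models measure similarity between documents",
    "d3": "graph databases store relationships",
}

# Term-document matrix as one sparse term-frequency vector per document.
vectors = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

for a in docs:
    for b in docs:
        if a < b:
            print(a, b, round(cosine(vectors[a], vectors[b]), 3))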

I do disagree with the authors when they say:

Computers understand very little of the meaning of human language.

Truth be told, I would say:

Computers have no understanding of the meaning of human language.

What happens with a VSM of semantics is that we as human readers choose a model we think represents semantics we see in a text. Our computers blindly apply that model to text and report the results. We as human readers choose results that we think are closer to the semantics we see in the text, and adjust the model accordingly. Our computers then blindly apply the adjusted model to the text again and so on. At no time does the computer have any “understanding” of the text or of the model that it is applying to the text. Any “understanding” in such a model is from a human reader who adjusted the model based on their perception of the semantics of a text.

I don’t dispute that VSMs have been incredibly useful and like the authors, I think there is much mileage left in their development for text processing. That is not the same thing as imputing “understanding” of human language to devices that in fact have none at all. (full stop)

Enjoy!

I first saw this in a tweet by Christopher Phipps.

PS: You probably recall that VSMs are based on creating a metric space for semantics, which has no preordained metric space. Transitioning from a non-metric space to a metric space isn’t subject to validation, at least in my view.

24 Sep 22:51

Convince your boss to use Clojure

by Patrick Durusau

Convince your boss to use Clojure by Eric Normand.

From the post:

Do you want to get paid to write Clojure? Let’s face it. Clojure is fun, productive, and more concise than many languages. And probably more concise than the one you’re using at work, especially if you are working in a large company. You might code on Clojure at home. Or maybe you want to get started in Clojure but don’t have time if it’s not for work.

One way to get paid for doing Clojure is to introduce Clojure into your current job. I’ve compiled a bunch of resources for getting Clojure into your company.

Take these resources and do your homework. Bringing a new language into an existing company is not easy. I’ve summarized some of the points that stood out to me, but the resources are excellent so please have a look yourself.

Great strategy and list of resources for Clojure folks.

How would you adapt this strategy to topic maps and what resources are we missing?

I first saw this in a tweet by Christophe Lalanne.

24 Sep 22:49

Getting Started with S4, The Self-Service Semantic Suite

by Patrick Durusau

Getting Started with S4, The Self-Service Semantic Suite by Marin Dimitrov.

From the post:

Here’s how S4 developers can get started with The Self-Service Semantic Suite. This post provides you with practical information on the following topics:

  • Registering a developer account and generating API keys
  • RESTful services & free tier quotas
  • Practical examples of using S4 for text analytics and Linked Data querying

Ontotext is up front about the limitations on the “free” service:

  • 250 MB of text processed monthly (via the text analytics services)
  • 5,000 SPARQL queries monthly (via the LOD SPARQL service)

The number of pages in a megabyte of text varies depending on the content, but assuming a working average of one (1) megabyte = five hundred (500) pages of text, you can analyze up to one hundred and twenty-five thousand (125,000) pages of text a month. Chump change for serious NLP but it is a free account.

The post goes on to detail two scenarios:

  • Annotate a news document via the News analytics service
  • Send a simple SPARQL query to the Linked Data service

Learn how effective entity recognition and SPARQL are with data of interest to you, at a minimum of investment.
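
A hedged sketch of what a call to the text analytics service might look like from Python; the endpoint URL, field names and authentication scheme below are placeholders rather than the real S4 API, so check the S4 documentation before use:

import requests

# Everything marked "hypothetical" must be replaced from the S4 docs.
S4_ENDPOINT = "https://api.example.com/s4/news"   # hypothetical URL
API_KEY, API_SECRET = "your-key", "your-secret"   # issued with a developer account

document = {
    "document": "Ontotext releases the Self-Service Semantic Suite.",
    "documentType": "text/plain",                 # hypothetical field names
}

response = requests.post(
    S4_ENDPOINT,
    json=document,
    auth=(API_KEY, API_SECRET),                   # assumes HTTP basic auth
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Hypothetical response shape: a list of recognized entities with offsets.
for entity in response.json().get("entities", []):
    print(entity)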

I first saw this in a tweet by Tony Agresta.

24 Sep 22:48

User Onboarding

by Patrick Durusau

User Onboarding by Samuel Hulick.

From the webpage:

Want to see how popular web apps handle their signup experiences? Here’s every one I’ve ever reviewed, in one handy list.

I have substantially altered Samuel’s presentation to fit the list onto one screen and to open new tabs, enabling quick comparison of onboarding experiences.

Asana iOS Instagram OkCupid Slingshot
Basecamp InVision Optimizely Snapchat
Buffer LessAccounting Pinterest Trello
Evernote LiveChat Pocket Tumblr
Foursquare Mailbox for Mac Quora Twitter
GetResponse Meetup Shopify Vimeo
Gmail Netflix Slack WhatsApp

Writers become better by reading good writers.

Non-random good onboarding comes from studying previous good onboarding.

Enjoy!

I first saw this in a tweet by Jason Ziccardi.

24 Sep 22:48

New Directions in Vector Space Models of Meaning

by Patrick Durusau

New Directions in Vector Space Models of Meaning by Edward Grefenstette, Karl Moritz Hermann, Georgiana Dinu, and Phil Blunsom. (video)

From the description:

This is the video footage, aligned with slides, of the ACL 2014 Tutorial on New Directions in Vector Space Models of Meaning, by Edward Grefenstette (Oxford), Karl Moritz Hermann (Oxford), Georgiana Dinu (Trento) and Phil Blunsom (Oxford).

This tutorial was presented at ACL 2014 in Baltimore by Ed, Karl and Phil.

The slides can be found at http://www.clg.ox.ac.uk/resources.

Running time is 2:45:12 so you had better get a cup of coffee before you start.

Includes a review of distributional models of semantics.

The sound isn’t bad but the acoustics are, so you will have to listen closely. Having the slides in front of you helps as well.

The semantics part starts to echo topic map theory with the realization that having a single token isn’t going to help you with semantics. Tokens don’t stand alone but in a context of other tokens. Each of which has some contribution to make to the meaning of a token in question.
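
As a toy illustration of that point, here is a sketch that approximates a token’s meaning by the tokens that occur around it; the text and window size are invented, and any real distributional model would use far more data plus dimensionality reduction:

from collections import Counter, defaultdict

text = ("the cat sat on the mat the dog sat on the rug "
        "the cat chased the dog").split()

WINDOW = 2  # tokens to the left and right that count as context

# Word-context co-occurrence counts: a token's meaning approximated by
# the distribution of tokens that appear near it.
contexts = defaultdict(Counter)
for i, word in enumerate(text):
    lo, hi = max(0, i - WINDOW), min(len(text), i + WINDOW + 1)
    for j in range(lo, hi):
        if j != i:
            contexts[word][text[j]] += 1

def overlap(a, b):
    """Crude similarity: shared context mass between two words."""
    return sum(min(contexts[a][c], contexts[b][c]) for c in contexts[a])

# 'cat' and 'dog' share contexts ('the', 'sat', 'on', 'chased'), so they
# score higher with each other than 'cat' does with 'mat'.
print(overlap("cat", "dog"), overlap("cat", "mat"))  # 5 3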

Topic maps function in a similar way with the realization that identifying any subject of necessity involves other subjects, which have their own identifications. For some purposes, we may assume some subjects are sufficiently identified without specifying the subjects that in our view identify it, but that is merely a design choice that others may choose to make differently.

Working through this tutorial and the cited references (one advantage to the online version) will leave you with a background in vector space models and the contours of the latest research.

I first saw this in a tweet by Kevin Safford.

24 Sep 22:46

A schemaless computer database in 1965

by Patrick Durusau

A schemaless computer database in 1965 by Bob DuCharme.

From the post:

To enable flexible metadata aggregation, among other things.

I’ve been reading up on America’s post-war attempt to keep up the accelerated pace of R&D that began during World War II. This effort led to an infrastructure that made accomplishments such as the moon landing and the Internet possible; it also led to some very dry literature, and I’m mostly interested in what new metadata-related techniques were developed to track and share the products of the research as they led to development.
… (emphasis in original)

I won’t spoil the surprise. Go read Bob’s post to see the answer.

His post does prompt me to ask: What early computing “dry” literature have you read lately?

24 Sep 21:58

Where Does Scope Come From?

by Patrick Durusau

Where Does Scope Come From? by Michael Robert Bernstein.

From the post:

After several false starts, I finally sat down and watched the first of Frank Pfenning’s 2012 “Proof theory foundations” talks from the University of Oregon Programming Languages Summer School (OPLSS). I am very glad that I did.

Pfenning starts the talk out by pointing out that he will be covering the “philosophy” branch of the “holy trinity” of Philosophy, Computer Science and Mathematics. If you want to “construct a logic,” or understand how various logics work, I can’t recommend this video enough. Pfenning demonstrates the mechanics of many notions that programmers are familiar with, including “connectives” (conjunction, disjunction, negation, etc.) and scope.

Scope is demonstrated during this process as well. It turns out that in logic, as in programming, the difference between a sensible concept of scope and a tricky one can often mean the difference between a proof that makes no sense, and one that you can rest other proofs on. I am very interested in this kind of fundamental kernel – how the smallest and simplest ideas are absolutely necessary for a sound foundation in any kind of logical system. Scope is one of the first intuitions that new programmers build – can we exploit this fact to make the connections between logic, math, and programming clearer to beginners? (emphasis in the original)

Michael promises more detail on the treatment of scope in future posts.

The lectures run four (4) hours so it is going to take a while to do all of them. My curiosity is whether “scope” in this context refers to variables in programming, or whether “scope” here extends in some way to scope as used in topic maps.

More to follow.

24 Sep 21:58

ETL: The Dirty Little Secret of Data Science

by Patrick Durusau

ETL: The Dirty Little Secret of Data Science by Byron Ruth.

From the description:

“There is an adage that given enough data, a data scientist can answer the world’s questions. The untold truth is that the majority of work happens during the ETL and data preprocessing phase. In this talk I discuss Origins, an open source Python library for extracting and mapping structural metadata across heterogeneous data stores.”

More than your usual ETL presentation, Byron makes several points of interest to the topic map community:

  • “domain knowledge” is necessary for effective ETL
  • “domain knowledge” changes and fades from dis-use
  • ETL isn’t transparent to consumers of data resulting from ETL, a “black box”
  • Data provenance is the answer to transparency, changing domain knowledge and persisting domain knowledge
  • “Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing.”
  • Project Origins, captures metadata and structures from backends and persists it to Neo4j

Great focus on provenance but given the lack of merging in Neo4j, the collation of information about a common subject, with different names, is going to be a manual process.
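
Not Origins itself, but a minimal sketch of recording that kind of provenance for a single ETL step, loosely following the people/activities/entities wording quoted above; the step, agent name and data are invented:

import hashlib
import json
from datetime import datetime, timezone

def checksum(rows):
    """Stable fingerprint of the data a step consumed or produced."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def run_step(name, rows, transform, agent):
    """Run one ETL step and return (output, provenance record)."""
    started = datetime.now(timezone.utc).isoformat()
    output = transform(rows)
    record = {
        "activity": name,               # what was done
        "agent": agent,                 # who or what did it
        "used": checksum(rows),         # entity consumed
        "generated": checksum(output),  # entity produced
        "started": started,
        "ended": datetime.now(timezone.utc).isoformat(),
    }
    return output, record

raw = [{"name": " Alice ", "dept": "ENG"}, {"name": "Bob", "dept": "ops"}]
clean, prov = run_step(
    "normalize-names",
    raw,
    lambda rows: [{**r, "name": r["name"].strip(), "dept": r["dept"].lower()}
                  for r in rows],
    agent="etl-pipeline v0.1",
)
print(json.dumps(prov, indent=2))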

Follow @thedevel.

24 Sep 21:56

Demystifying The Google Knowledge Graph

by Patrick Durusau

Demystifying The Google Knowledge Graph by Barbara Starr.

Barbara covers:

  • Explicit vs. Implicit Entities (and how to determine which is which on your webpages)
  • How to improve your chances of being in “the Knowledge Graph” using Schema.org and JSON-LD.
  • Thinking about “things, not strings.”

Is there something special about “events?” I remember the early Semantic Web motivations including setting up tennis matches between colleagues. The examples here are of sporting and music events.
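
For what it is worth, here is a minimal sketch of the kind of Schema.org/JSON-LD event markup Barbara describes, built as a Python dict and serialized for embedding in a page; the concert details are invented:

import json

# A made-up concert, marked up with Schema.org types so that a crawler
# sees a "thing" (a MusicEvent) rather than an unadorned string.
event = {
    "@context": "https://schema.org",
    "@type": "MusicEvent",
    "name": "Example Quartet Live",
    "startDate": "2014-11-01T20:00",
    "location": {
        "@type": "Place",
        "name": "Example Hall",
        "address": "123 Main Street, Anytown",
    },
    "offers": {
        "@type": "Offer",
        "url": "https://example.com/tickets/quartet",
        "price": "25.00",
        "priceCurrency": "USD",
    },
}

# Embed the output in the page inside <script type="application/ld+json"> ... </script>
print(json.dumps(event, indent=2))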

If your users don’t know how to use TicketMaster, repeating delivery of that data on your site isn’t going to help them.

On the other hand, this is a good reminder to extract from Schema.org all the “types” that would be useful for my blog.

PS: A “string” doesn’t become a “thing” simply because it has a longer token. Having an agreed upon “longer token” from a vocabulary such as Schema.org does provide more precise identification than an unadorned “string.”

Having said that, the power of having several key/value pairs and a declaration of which ones must, may or must not match, should be readily obvious. Particularly when those keys and values may themselves be collections of key/value pairs.

07 Sep 21:32

Elastic Search: The Definitive Guide

by Patrick Durusau

Elastic Search: The Definitive Guide by Clinton Gormley and Zachary Tong.

From “why we wrote this book:”

We wrote this book because Elasticsearch needs a narrative. The existing reference documentation is excellent… as long as you know what you are looking for. It assumes that you are intimately familiar with information retrieval concepts, distributed systems, the query DSL and a host of other topics.

This book makes no such assumptions. It has been written so that a complete beginner — to both search and distributed systems — can pick it up and start building a prototype within a few chapters.

We have taken a problem based approach: this is the problem, how do I solve it, and what are the trade-offs of the alternative solutions? We start with the basics and each chapter builds on the preceding ones, providing practical examples and explaining the theory where necessary.

The existing reference documentation explains how to use features. We want this book to explain why and when to use various features.

An important guide/reference for Elastic Search but the “why” for this book is important as well.

Reference documentation is absolutely essential but so is documentation that eases the learning curve in order to promote adoption of software or a technology.

Read this both for Elastic Search and as one model for writing a “why” and “when” book for other technologies.

26 Aug 23:38

Probabilistic Topic Maps?

by Patrick Durusau

Probabilistic Soft Logic

From the webpage:

Probabilistic soft logic (PSL) is a modeling language (with accompanying implementation) for learning and predicting in relational domains. Such tasks occur in many areas such as natural language processing, social-network analysis, computer vision, and machine learning in general.

PSL allows users to describe their problems in an intuitive, logic-like language and then apply their models to data.

Details:

  • PSL models are templates for hinge-loss Markov random fields (HL-MRFs), a powerful class of probabilistic graphical models.
  • HL-MRFs are extremely scalable models because they are log-concave densities over continuous variables that can be optimized using the alternating direction method of multipliers.
  • See the publications page for more technical information and applications.

This homepage lists three introductory videos and has a set of slides on PSL.

Under entity resolution, the slides illustrate rules that govern the “evidence” that two entities represent the same person. You will also find link prediction, mapping of different ontologies, discussion of mapreduce implementations and other materials in the slides.

Probabilistic rules could be included in a TMDM instance but I don’t know of any topic map software that supports probabilistic merging. Would be a nice authoring feature to have.
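
Purely as a sketch of the idea (this is not PSL, and the rules and weights below are invented), probabilistic merging might look something like this:

# Hypothetical rule weights: how strongly each kind of evidence suggests
# that two topics represent the same subject.
RULES = [
    ("shared subject identifier",
     lambda a, b: bool(set(a["subject_identifiers"]) & set(b["subject_identifiers"])),
     0.9),
    ("same normalized name",
     lambda a, b: a["name"].strip().lower() == b["name"].strip().lower(),
     0.6),
    ("overlapping associations",
     lambda a, b: bool(set(a["associations"]) & set(b["associations"])),
     0.4),
]

def merge_score(a, b):
    """Combine the weights of the rules that fire, treating each as
    independent evidence: score = 1 - prod(1 - w_i)."""
    score = 0.0
    for _name, fires, weight in RULES:
        if fires(a, b):
            score = 1.0 - (1.0 - score) * (1.0 - weight)
    return score

t1 = {"subject_identifiers": [], "name": "Dim Sum",
      "associations": ["served-at:tea-house"]}
t2 = {"subject_identifiers": [], "name": "dim sum ",
      "associations": ["served-at:tea-house", "origin:canton"]}

score = merge_score(t1, t2)
print(round(score, 3), "merge" if score >= 0.7 else "keep separate")  # 0.76 merge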

The source code is on GitHub if you want to take a closer look.

25 Aug 22:54

Exploring a SPARQL endpoint

by Patrick Durusau

Exploring a SPARQL endpoint by Bob DuCharme.

From the post:

In the second edition of my book Learning SPARQL, a new chapter titled “A SPARQL Cookbook” includes a section called “Exploring the Data,” which features useful queries for looking around a dataset that you know little or nothing about. I was recently wondering about the data available at the SPARQL endpoint http://data.semanticweb.org/sparql, so to explore it I put several of the queries from this section of the book to work.

An important lesson here is how easy SPARQL and RDF make it to explore a dataset that you know nothing about. If you don’t know about the properties used, or whether any schema or schemas were used and how much they were used, you can just query for this information. Most hypertext links below will execute the queries they describe using semanticweb.org’s SNORQL interface.

Bob’s ease at using SPARQL reminds me of a story about an ex-spy who was going through customs for the first time in years. As part of that process, he accused a customs officer of having memorized print that was too small to read easily. To which the officer replied, “I am familiar with it.” ;-)

Bob’s book on SPARQL and his blog will help you become a competent SPARQL user.
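
A minimal sketch of one of those exploratory queries run from Python, using the SPARQLWrapper library; it assumes the endpoint is still live and accepts SPARQL 1.1 aggregates:

from SPARQLWrapper import SPARQLWrapper, JSON

# The endpoint Bob explores in the post.
endpoint = SPARQLWrapper("http://data.semanticweb.org/sparql")

# One of the "explore a dataset you know nothing about" queries:
# which properties are used, and how often?
endpoint.setQuery("""
    SELECT ?p (COUNT(*) AS ?uses)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?uses)
    LIMIT 20
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["uses"]["value"], row["p"]["value"])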

I don’t suppose SPARQL is any worse off semantically than SQL, which has been in use for decades. It is troubling that I can discover dc:title but have no way to investigate how it was used by a particular content author.

Oh, to be sure, the term dc:title makes sense to me, but that is a smoothing function on my part as a reader, and it may or may not be the same “sense” the person who supplied that term had in mind.

You can read data sets using your own understanding of tokens but I would do so with a great deal of caution.