Domain-Specific Language and SEDNA’s Elasticsearch

As the first in the series, this week’s blog comes from the incredible engineering team that makes SEDNA’s intelligent communication system possible. As SEDNA continues to grow and develop our product, we aim to pull back the curtain and share with you our knowledge along the way—from the building blocks that make up SEDNA, to the clever tools that allow businesses everywhere to break free of email overload. This week, we’re looking at DSL and how that relates to SEDNA’s ultra-fast search functionality. 

A Domain-Specific Language (DSL) is a computer language specialized in a specific domain. An example of DSL is HTML for web pages. DSLs are useful because they are one level higher in the hierarchy of complexity and let you express things for a specific domain in a simpler syntax. Domain-Specific Languages allow solutions to be expressed at the level of abstraction of the problem domain. They also allow input validation at the domain level and are easier to learn because of their limited scope.

Hierarchy of Programming Languages

First generation: machine language.

Second generation: low-level programming languages such as assembly language.

Third generation: structured high-level programming languages such as C, COBOL, and FORTRAN.

Fourth generation: domain-specific high-level programming languages 

As you go up the hierarchy, it becomes more abstract and easier to understand.

DSLs are not always advantageous, they come with a cost too. It is costly to design, implement, and maintain them. As they are implemented to specifically fit a particular business domain, there are not many code examples available online.

Therefore, it’s better to use language parsers to create a Domain-Specific Language. In this article, we will be looking at how at SEDNA we used a parser tool called PEG to create a grammar that could translate human-readable complex searches into ElasticSearch queries.

What Is Grammar ?

Grammar is a set of rules by which the words in a statement are interpreted into meaningful instructions.

What Is PEG ?

PEG or Parsing Expression Grammar is a type of grammar that describes a formal language in terms of a set of rules for recognizing strings in the language. Peg.js is a javascript library that takes a parsing expression grammar and generates a parser program that can be called directly from javascript.

Why and When to Use PEG ?

You want to process complex data. You think about using Regex, but the expressions that you want to process are so complex that Regex can’t perform well. In addition to that, complex Regex expressions become almost impossible to read and understand by other developers. Regex is not always bad; for example, when validating whether text entered by a user is a valid email address. Regex becomes hard or impossible to write only when you want to parse complex expressions involving complex rules. In those scenarios, context-free grammars (CFGs) are helpful. PEGs are very similar to CFGs, except for the fact that they are non-ambiguous.   

At SEDNA, we wanted to allow users to form human-readable complex searches which we would parse and convert into Elasticsearch queries to fetch results from Elasticsearch. Instead of using Regex, we thought of creating our own language parser. We used javascript’s PEG.js library to do so.

How Do We Use PEG in SEDNA ?

Users can form simple searches like:

hasAttachment:true

They could also combine these to form complex searches like: 

(hasAttachment:true OR hasComment:true) AND (inInbox:true AND comment:test)

Above you can see a complex search in SEDNA. These searches then have to be converted into an ElasticSearch query to be able to fetch results from ElasticSearch.

Let’s start forming the grammar for a simple search like:

hasAttachment:true

Every grammar begins with a “start” rule:

start = _ s:sentence { return s; }

In its simplest form, a PEG rule looks more or less like a Regex rule. However, we can form complex rules utilizing one of the features of PEG; using javascript adjacent to a rule to control its return value. 

We define a sentence as:

sentence

  = left:(primarySentence conj)* right:primarySentence

  { return leftAssociative(left, right); }

  / primarySentence

Where primarySentence can be any text field or an operand

We have defined many operands like “hasAttachment”, “hasComment”, “comment”, “subject” etc.

hasAttachment operand is followed by a “:” and then a boolean field:

/ field : hasAttachment’:’ _ value:bool_field {

Where bool_field can be “true”, “false”, “1”, “0”, “yes”, “no”

Which is defined as follows:

false_field = (“no”i / “false”i / “0”i / “n”i / “f”i) { return false; }

true_field = (“yes”i / “true”i / “1”i / “y”i / “t”i){ return true; }

bool_field = false_field / true_field

So when we have hasAttachment followed by a true field or a false field we convert it into ES format and return:

/ field : hasAttachment’:’ _ value:bool_field {

  return {

    “token” : field,

    “selector”: {

      match: {

        has_attachment: value

      }

    }

  }

}

Now we can see that our primarySentence either has one primarySentence or a primarySentence followed by a conjunction which is then followed by another primarySentence

Where conjunction is defined as: 

conj

  = __ “AND” __ { return “must”; }

  / __ “OR”  __ { return “should”; }

  / __          { return “must”; }

This will accept searches like: hasAttachment:true OR hasComment:true

Recursion Using PEG

In the above definition of sentence, we can also see another wonderful feature of PEG; recursion. A sentence can be a primary sentence or a primary sentence followed by a conj (‘OR’, ‘AND’, ‘NOT’) followed by another primary sentence. This allows any number of searches combined with a `OR’, ‘AND’ or ‘NOT’.

Defining hasComment is similar to hasAttachment, but for that we needed to return an Elasticsearch query based on the boolean value.

If value is true then return the ES query with a must, otherwise, return with a must_not

/ field : hasComment’:’ _ value:bool_field {

  if (value) {

    return {

      “token” : field,

      “selector”: {

        “bool”: {

          “must”: [{

            terms: {

              ‘tags.tag_id’: [ 15 ]

            }

          }]

        }

      }

    }

  } else {

    return {

      “token” : field,

      “selector”: {

        “bool”: {

          “must_not”: [{

            terms: {

              ‘tags.tag_id’: [ 15 ]

            }

          }]

        }

      }

    }

  }

}

All the possible searches in SEDNA have a corresponding rule in our grammar as above, which translates the search into an ElasticSearch query.

As we can see, using a grammar and a tool like PEG makes it very easy to create your own DSL while also saving a lot of implementation time. The grammar code is also easier to read than if this translation was made directly using code.For more details on how to use PEG grammar please refer to their documentation.

To learn more about SEDNA’s intelligent communication software, take a look at our Product Overview.