Recently, my team worked on a clustered web-based project with a No-SQL backend; MongoDB, to be specific. Having come from a SQL era, I found this to be the first big No-SQL project I had to deal with, and I learned many things in the process. One of the outcomes was Mango, a schema inference engine, which I describe below.
But first,
What is a No-SQL database?
There is no precise definition of the term “No-SQL” database. It is used to describe a number of recent database engines that break away from the SQL mould and share some common characteristics:
- Scalability is the primary objective
- ACID compliance is not a necessity
- No pre-specified or enforced schema for tables / collections
- Support for nested documents
- Joins are typically unsupported, or very slow
For these reasons, full support for SQL queries is not possible; hence the term “No-SQL”.
Challenges
While working with such a No-SQL database, I found that the above characteristics bring new features, and along with them new challenges:
- There are no transactions in the database, so if you need them (occasionally), you need to code that logic in the application.
- Since documents can be nested, queries are easier to write.
- It is easy to mix documents with different properties in a collection. For example, if only few users in the database need addresses, you can specify the address field for only those users’ documents.
However, because of the latter two points (nested documents and varying properties), over time it becomes difficult to know the overall schema of a collection.
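The address case above can be pictured with plain Scala Maps standing in for BSON documents (a hypothetical "users" collection; the field names are purely illustrative):

```scala
// Two documents in the same collection: no schema forces them to
// share fields, and values can themselves be nested documents.
val userWithAddress: Map[String, Any] = Map(
  "name" -> "Alice",
  "address" -> Map("city" -> "Pune", "zip" -> "411001") // nested document
)

val userWithoutAddress: Map[String, Any] = Map(
  "name" -> "Bob",
  "age" -> 30 // a field the first document does not have
)

val users = Seq(userWithAddress, userWithoutAddress)
```

Both documents live happily in the same collection, which is convenient when writing, but it means the collection's "schema" exists only implicitly, spread across its documents.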
To address this problem, my team has been working on a new project that tries to infer the schema of an existing No-SQL database. We have named the tool Mango.
Mango features at a glance
- Data exploration
- Schema inference
- Relation inference
Using Mango
When a user first runs Mango, they are presented with connection options (currently, only MongoDB on localhost is supported). Once the user confirms them, Mango connects to the database backend and presents a list of databases that are available to explore.
After selecting the database, the user can choose from a list of collections in that database.
After choosing a collection, the inference engine starts reading and processing the entries in that collection. It does a recursive analysis of each field in each document in the collection, and then infers the schema from this.
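The recursive walk can be sketched like this (a simplified illustration under assumptions about the types involved, not Mango's actual code): for each field we record its type name, and for nested documents we recurse into their sub-fields.

```scala
// One inferred field: its name, its type, and (for nested
// documents) the inferred sub-schema.
case class Field(name: String, typeName: String, children: Seq[Field] = Nil)

// Recursively analyse a document (modelled here as a Map):
// nested Maps are treated as sub-documents and recursed into.
def inferFields(doc: Map[String, Any]): Seq[Field] =
  doc.toSeq.map {
    case (name, nested: Map[_, _]) =>
      Field(name, "document", inferFields(nested.asInstanceOf[Map[String, Any]]))
    case (name, value) =>
      Field(name, value.getClass.getSimpleName)
  }

val doc = Map("name" -> "Alice", "address" -> Map("city" -> "Pune"))
val schema = inferFields(doc)
```

Running this on the sample document yields a top-level `name` field of type `String` and an `address` sub-document containing a `city` field.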
If more than one collection is chosen, Mango will also try to infer the relationships between the collections. This information can be quite complex, and hence it is presented as a graph.
Illustration 3: A graph showing inferred relations
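To give a rough idea of what relation inference can look for, here is one naive heuristic (purely illustrative, not Mango's actual algorithm): a field named `user_id` in one collection is treated as a likely reference to a `users` collection.

```scala
// Collection name -> field names observed in its schema (toy data).
val collections = Map(
  "users"  -> Seq("_id", "name"),
  "orders" -> Seq("_id", "user_id", "total")
)

// Naive heuristic: a field "<x>_id" likely references collection "<x>s".
val relations = for {
  (from, fields) <- collections.toSeq
  field <- fields
  if field.endsWith("_id") && field != "_id"
  target = field.stripSuffix("_id") + "s"
  if collections.contains(target)
} yield (from, field, target)
```

On the toy data this finds a single edge, from `orders` to `users` via `user_id`; real databases need fuzzier matching and value sampling, which is why the results end up as a graph rather than a simple list.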
For those who would like to view the inference results as a static document, we have implemented an HTML report generator as well.
Developing Mango
Our natural choice for developing Mango was Scala, because it is fast and portable, and expressing complex algorithms in it is easier thanks to its extensive collection library and functional programming features.
For example, the core inference engine is just about 25 lines of code! We first defined a function that merges any two given schemas and then to merge all schemas in a collection we folded over the sequence of rows like this:
val inferredSchema =
collectionSchema.rowFields
.foldLeft[Seq[Field]](Nil)(mergeSchemas)
.sortBy(-_.count)
Here we are folding over the schemas, merging them, and then sorting them by their repeat count. Four lines of code which would have taken reams of code in an imperative-style language!
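For readers who want to run that snippet, here is one possible shape for `Field` and `mergeSchemas` (an illustrative reconstruction under assumptions about Mango's types, not its actual definitions):

```scala
// A field observed in documents, with how many rows it appeared in.
case class Field(name: String, typeName: String, count: Int = 1)

// Merge the accumulated schema with one row's fields: a field already
// seen with the same name and type has its count bumped; a new field
// is appended to the schema.
def mergeSchemas(acc: Seq[Field], row: Seq[Field]): Seq[Field] =
  row.foldLeft(acc) { (fields, f) =>
    fields.indexWhere(g => g.name == f.name && g.typeName == f.typeName) match {
      case -1 => fields :+ f
      case i  => fields.updated(i, fields(i).copy(count = fields(i).count + f.count))
    }
  }

// Toy input: two rows, only the first of which has an "age" field.
val rowFields: Seq[Seq[Field]] = Seq(
  Seq(Field("name", "String"), Field("age", "Int")),
  Seq(Field("name", "String"))
)

val inferredSchema =
  rowFields
    .foldLeft[Seq[Field]](Nil)(mergeSchemas)
    .sortBy(-_.count)
```

With this toy input, `name` ends up with a count of 2 and `age` with a count of 1, so the most common fields sort to the front of the inferred schema.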
Of course, writing those few lines of code requires some time and expertise, but it pays off in the long term, in time spent on testing, debugging and maintenance.
Scala will also enable us to easily run tasks in the background, freeing the GUI thread. We intend to use Actors to implement multi-threaded analysis algorithms.
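Until the Actor-based version exists, the idea can be sketched with plain `scala.concurrent.Future` (a simplified stand-in; the function name and the analysis body are hypothetical):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run (hypothetical) analysis work off the calling thread, so a GUI
// event loop stays responsive; a callback fires when it completes.
def analyzeInBackground(rows: Seq[Int])(onDone: Int => Unit): Future[Int] = {
  val result = Future(rows.sum) // stands in for the real schema analysis
  result.foreach(onDone)
  result
}

// The caller returns immediately; we block here only to demonstrate.
val total = Await.result(analyzeInBackground(Seq(1, 2, 3))(sum => ()), 5.seconds)
```

Actors would add supervision and message-passing on top of this, but the core benefit is the same: the GUI thread never waits on the analysis.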
We have been using good design patterns while developing Mango, such as immutable constructs and layered classes, and we have been writing unit tests as well. This has led to a robust application.
Future plans
Support for more databases and more f