Architecture Research and Design

This page details the research we carried out while designing the architecture for our system.

Important Decisions

Issue #1: Where will processing operations take place?
Options considered: on the user's browser, locally; on a backend, remotely
Final Decision: on a backend, remotely
Rationale

Why not on the user's browser (locally)?

Performing the data processing locally would make for a simpler architecture, but it is a worse option in every other regard. The performance of the end product would be bottlenecked by the capabilities of the user's computer and browser. It would also limit our development options, as we would most likely only be able to code in JavaScript. Lastly, we would end up with an architecture whose computational capability cannot scale beyond the user's browser.

Why on a backend, remotely?

Although introducing a backend component for data processing will make the architecture of our product more complex, the benefits outweigh this minor disadvantage.

Moving data processing to a backend instead of performing it locally will overcome all the shortcomings described above. The end product will have higher performance and an architecture that can scale much more easily. It also gives us greater freedom in terms of development, as there are many more choices of platforms for a backend.

There is also little point in avoiding a backend: one of our "could have" requirements is a notebook style interface, which would most likely necessitate some form of backend anyway.

Issue #2: Which data processing tools/libraries/frameworks will be used in the backend?
Options considered: Apache Spark; Pandas; R (language) + its built-in libraries; no library/a custom implementation
Final Decision: Pandas
Rationale

Why not no library/a custom implementation?

It would be a waste of time re-inventing the wheel when there are readily available open-source libraries that we can integrate into our project.

Why not Apache Spark?

Apache Spark is the most scalable option and would therefore afford the best performance, since it is a cluster computing framework designed for scale and speed. However, our client advised us that it would be overkill and that such a level of performance and scalability is beyond the scope of this project.

We are also aware, through our investigation of Zeppelin, that Apache Spark does not have much built-in functionality for our purposes and is essentially a bare-bones distributed computing framework. Choosing Spark would be little different from implementing everything ourselves, except for the added bonus of running on top of a scalable platform.

Why not R + its built-in libraries?

Our research revealed that there is actually not much difference in the functionality provided by Pandas and R + its built-in libraries. In fact, R actually has an upper hand in terms of existing functionality over Pandas, because the whole language was created for statistics and data science.

However, we chose not to use R because none of us are familiar with the language, so development would be set back while we familiarized ourselves with the language and its libraries.

Why Pandas?

Pandas, being a data science library for Python, has a lot of built-in functionality (comparable to R) that we can use to very easily fulfill some of our basic functional requirements. However, unlike R, it has an added advantage of being a Python library, a language all of us know.

Our client also requested that we use Pandas because their machine learning platform uses Pandas and Python as feature extraction pipelines, and it would be ideal if our solution was compatible with Seldon's platform.

Finally, the fact that Pandas is a Python package makes it more suitable for use with a web backend. Python has many options for web application frameworks such as Django and Flask, and it would be more convenient for development if the web application and the data processing logic were both written in the same language.

Issue #3: Which web application framework will be used for the backend?
Options considered: Django; Flask; non-Python frameworks e.g. Express.js, Ruby on Rails
Final Decision: Flask
Rationale

Why not a non-Python framework?

As we have chosen Pandas, a Python package, as our data processing library, it does not make sense to use a web framework that is not Python-based.

Why not Django?

Django is actually better than Flask in some ways, as it is a complete framework that enforces and encourages proper code structure with the MVC paradigm and provides many features that Flask does not, such as an object-oriented wrapper over databases and built-in administration tools. However, none of this additional functionality will be necessary for our backend, making Django a bloated framework that gets in the way of performance in the context of our project.

Why Flask?

In contrast to the feature-completeness of Django, Flask advertises itself as a micro web application framework, only providing bare essentials such as routing. This makes Flask extremely lightweight and since our backend only needs the functionality Flask already provides, it is the perfect choice.

Issue #4: Which web framework will be used for the frontend?
Options considered: Angular.js; Backbone.js; Ember.js; React.js
Final Decision: Angular.js
Rationale

Why not Backbone.js?

Backbone.js does not compare to Ember.js or Angular.js in terms of functionality. For example, it lacks routing, dynamic data binding and a built-in templating engine. Its popularity has dwindled since its release due to the growth in popularity of more feature-filled frameworks such as Angular.js and Ember.js.

Why not React.js?

React.js is a new library that is gaining popularity among developers, but it is not a complete MVC framework like Angular.js or Ember.js; it only provides the view ('V') aspect of the paradigm. Therefore, using React.js would require us to spend additional time defining the MVC structure ourselves, or risk ending up with a poorly structured app.

Why Angular.js over Ember.js?

Ember.js and Angular.js were the only suitable choices for a front-end framework for our project. They are nearly identical with regard to feature set, but we chose Angular.js over Ember.js because some of us have experience using it, which should streamline the development process. Furthermore, Angular.js is the more popular of the two and has a much larger user base in the developer community, meaning that it would be easier for us to get support if we run into problems.

Issue #5: How will the notebook interface be created?
Options considered: Jupyter + iPython kernel; a custom implementation
Final Decision: Jupyter + iPython kernel
Rationale

Why not a custom implementation?

It would be a waste of time re-inventing the wheel when there are readily available open-source libraries that we can integrate into our project. Too much development time will be wasted dealing with the complexity of implementing a secure and robust Python execution engine as well as communication protocol.

Why Jupyter + iPython?

Jupyter + iPython kernel is the de facto standard notebook software for Python. It is fair to say that there are no competing solutions worth considering, and the fact that it is also included in Seldon's machine learning platform makes it a great choice.

Issue #6: How will the backend expose functionality to the frontend?
Options considered: RESTful HTTP API; WebSocket API
Final Decision: WebSocket API
Rationale

Why not a RESTful HTTP API?

This was a tough decision to make as the REST architecture has become so prevalent among Web services that it is the de facto standard. Providing a RESTful service would make it easy for advanced users and developers to create their own custom frontends to communicate with our backend.

However, one of the architectural constraints of REST is that the web service does not store the state of the client, and the nature of our project forces us to break that constraint. For our backend to be efficient, it must store the client's state (i.e. its data) between requests, instead of complying with the stateless constraint and forcing the client to send its data with every request.

A RESTful API is also based around HTTP, and the plain HTTP request/response model does not allow servers to push data to clients. Our architecture must handle the situation where a client requests a data processing job that takes a long time. A RESTful API is a poor fit here, as requests would hang and might even time out before the backend completes the task.

Why a WebSocket API?

Our previous assessment concluded that we ultimately need a protocol that supports pushing of updates to the client and does not enforce the constraint of statelessness, which is an inappropriate constraint for our application.

Server push will most likely be a critical capability to enable synchronization of changes made using the notebook interface to the graphical/spreadsheet interface. It would also be required for the live collaboration functionality, although that is a "would have" requirement.

A WebSocket API would fulfill all these requirements.
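The two properties argued for above, server-held state and server push, can be illustrated with a minimal sketch. It uses only the standard library: a pair of `asyncio` queues stands in for the WebSocket connection, and the event names (`upload`, `sum`, `result`) are hypothetical rather than part of our actual API.

```python
import asyncio


async def handle_connection(inbox, outbox):
    """Serve one client over a persistent connection (hypothetical protocol)."""
    session_data = []  # state the server keeps between messages

    while True:
        message = await inbox.get()
        if message["event"] == "upload":
            session_data = list(message["values"])
            await outbox.put({"event": "ack", "rows": len(session_data)})
        elif message["event"] == "sum":
            # The request carries no data: the server already holds it.
            await asyncio.sleep(0.05)  # pretend this is a long-running job
            await outbox.put({"event": "result", "value": sum(session_data)})
        elif message["event"] == "close":
            return


async def demo():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(handle_connection(inbox, outbox))

    await inbox.put({"event": "upload", "values": [1.0, 2.0, 3.5]})
    await inbox.put({"event": "sum"})
    await inbox.put({"event": "close"})
    await server

    # Every message in the outbox was sent by the server when it was ready.
    return [outbox.get_nowait() for _ in range(outbox.qsize())]


pushed = asyncio.run(demo())
```

The client uploads its data once; the later `sum` request is answered from the server's copy, and the result arrives as a separate message whenever the job finishes rather than as the response to a blocked request.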

Issue #7: Which WebSocket library will be used in the front end and backend?
Options considered: none/a custom implementation; Socket.IO
Final Decision: Socket.IO
Rationale

Why not none/a custom implementation?

It would be a waste of time re-inventing the wheel when there are readily available open-source libraries that we can integrate into our project.

Why Socket.IO?

Socket.IO seems to be the only WebSocket library that has an implementation for Flask, our backend framework. Team members also have prior experience using Socket.IO. Lastly, Socket.IO is backwards compatible with browsers that don't support the WebSocket protocol introduced alongside HTML5, so using it also adds a layer of robustness and usability to our application.

Issue #7: How will the backend perform data operations?
Options considered: synchronously within Flask; asynchronously outside Flask
Final Decision: asynchronously outside Flask
Rationale

Why not synchronously within Flask?

Although this option would make for a simpler architecture, it is definitely not good practice for the web app to be directly calling intensive data cleaning functions. It is also an approach that does not scale and performs poorly when the backend is under heavy load. Even in a production setting when Flask is deployed with a WSGI server with multiple server processes, a single blocking data processing operation would render an entire worker process unresponsive, and if enough users cause this scenario, the entire backend would become unresponsive.

Why asynchronously outside Flask?

This approach will provide scalability and robustness in the end product, overcoming the shortcomings of running the tasks synchronously within Flask. It should also boost performance, as multiple cores and even multiple CPUs can be utilized to perform tasks concurrently.

Running the tasks outside Flask asynchronously also enables the infrastructure to support features such as canceling operations before they complete.
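As a rough illustration of this decision, the sketch below hands a slow job to a worker pool so that the request handler returns immediately, and cancels a job that is still queued. It is a stand-in built on the standard library's `ThreadPoolExecutor`; the design described here runs the work in separate processes outside Flask, and the function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the external worker pool (the real design uses separate processes).
workers = ThreadPoolExecutor(max_workers=2)


def clean_dataset(rows):
    """Hypothetical slow cleaning job: drop missing values."""
    time.sleep(0.2)  # simulate heavy processing
    return [row for row in rows if row is not None]


def handle_request(rows):
    """Request handler: enqueue the job and return without waiting for it."""
    return workers.submit(clean_dataset, rows)


job = handle_request([1, None, 2, None, 3])
done_at_return = job.done()  # the handler came back before the job finished

# With both workers busy, later jobs wait in the queue...
queued = [workers.submit(clean_dataset, [i]) for i in range(3)]
# ...and a job that has not started yet can still be cancelled.
cancelled = queued[-1].cancel()

result = job.result()  # blocking here is only for the demo
workers.shutdown(wait=True)
```

A synchronous handler would instead be unavailable for the full duration of `clean_dataset`, which is the failure mode described above.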

Issue #8: How will backend operations be performed asynchronously?
Options considered: threading/multiprocessing within Flask; service process outside Flask; Celery (background task queue manager)
Final Decision: Celery (background task queue manager)
Rationale

Why not threading/multiprocessing within Flask?

Although this looked promising initially, testing with Flask revealed that a web app which spawns new child processes and/or uses multiple threads does not work behind a server due to restrictions in WSGI. The application only behaves as expected when run in development mode as a single-threaded, single-process server, but not when behind a WSGI server such as Gunicorn. Using multiple threads or spawning child processes within a Flask app is also strongly discouraged by the developer community. Furthermore, the Socket.IO library we have chosen to use with Flask does not support multithreaded web apps.

Why not a service process outside Flask?

This approach does adhere to best practices, unlike multithreading/multiprocessing, but it is quite complicated to implement. Flask would spawn a new service process for every user that connects to the backend, have that service perform all the data processing, and communicate with it using some form of IPC in the server OS, such as a socket. There are libraries that could help with messaging, such as ZeroMQ, but this approach does not compare to the simplicity of using an existing task queue solution such as Celery.

Why Celery?

After looking through developer forums, we saw that Celery is consistently the recommended solution for asynchronous background tasks in Python. Celery handles all aspects of background tasks, such as creating, managing and communicating with worker processes.

Using Celery would allow us to queue all data processing tasks and not worry about the implementation of asynchronous background tasks. Its task management features will also simplify the development of features such as cancelling operations.

Issue #9: How will persistence of user data during sessions be achieved?
Options considered: disk cache (HDF files); in Flask web app memory
Final Decision: disk cache (HDF files)
Rationale

Why not in Flask web app memory?

In a production environment, the Flask web app will be served by a WSGI server that uses multiple worker processes, so there is no guarantee that requests from a user's frontend are always handled by the same process. Every worker process has its own independent memory, so storing the user data in the memory of the Flask web app, for example as a global variable, would not be a working solution.

Why disk cache (HDF files)?

The only other option is to cache the user's data on disk in the HDF file format (our research indicated that HDF provides the fastest read and write speeds for Pandas data objects).
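The idea of a per-session disk cache can be sketched as follows. To keep the example self-contained it stores a plain dictionary with `pickle` as a stand-in; in the design described here the object would be a Pandas DataFrame written with `DataFrame.to_hdf` and read back with `pandas.read_hdf`. The cache location, session ID and helper names are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp(prefix="session-cache-"))  # hypothetical location


def cache_path(session_id):
    return CACHE_DIR / f"{session_id}.pkl"


def save_session(session_id, table):
    """Write the session's dataset to disk so that any worker process can read it."""
    tmp = cache_path(session_id).with_suffix(".tmp")
    with open(tmp, "wb") as fh:
        pickle.dump(table, fh)
    # Rename into place so readers never see a half-written file.
    tmp.replace(cache_path(session_id))


def load_session(session_id):
    with open(cache_path(session_id), "rb") as fh:
        return pickle.load(fh)


# One request (possibly handled by worker A) stores the data...
save_session("abc123", {"age": [31, None, 45], "city": ["Leeds", "York", None]})

# ...and a later request (possibly handled by worker B) reads and updates it.
table = load_session("abc123")
table["age"] = [0 if age is None else age for age in table["age"]]
save_session("abc123", table)

reloaded = load_session("abc123")
```

Because the only shared state is the file on disk, it does not matter which worker process serves a given request.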



First Iteration

The user interacts with an Angular.js front-end whose HTML and other assets are served by the Flask web app. The front-end provides both spreadsheet/graphical and notebook style interfaces.

When the user launches the web app, Socket.IO (also running on Flask) establishes a two-way WebSocket connection between the Angular.js application and the Flask backend. Interactions with the spreadsheet/graphical style interface will be communicated as data retrieval, processing or cleaning requests that will be defined in our WebSocket API. When the backend receives these requests, it asynchronously launches background tasks using Celery. These background tasks will load the server's model of the user data, stored as a cached HDF file on disk, and then run functions from the standalone data cleaning Python package that we will develop on top of Pandas. Once the Celery tasks complete and return results, the backend pushes the results of the operation to the frontend over the WebSocket connection. Angular.js then updates the view in the front-end accordingly.
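One possible shape for the request handling described above is a table that maps event names to handlers. The sketch below is illustrative only: the event names, payload fields and the `_done` reply convention are hypothetical, and the handlers run inline, whereas in the design described here each handler would enqueue a Celery task and the reply would be pushed once that task completes.

```python
import json

HANDLERS = {}


def on(event):
    """Register a handler for a named event (event names are hypothetical)."""
    def register(func):
        HANDLERS[event] = func
        return func
    return register


@on("fill_missing")
def fill_missing(payload):
    column, value = payload["column"], payload["value"]
    return [value if cell is None else cell for cell in column]


@on("drop_duplicates")
def drop_duplicates(payload):
    seen, unique = set(), []
    for cell in payload["column"]:
        if cell not in seen:
            seen.add(cell)
            unique.append(cell)
    return unique


def dispatch(raw_message):
    """Decode a request, run its handler, and build the reply to push back."""
    message = json.loads(raw_message)
    handler = HANDLERS.get(message["event"])
    if handler is None:
        return json.dumps({"event": "error", "reason": "unknown event"})
    result = handler(message["payload"])
    return json.dumps({"event": message["event"] + "_done", "result": result})


reply = dispatch(json.dumps(
    {"event": "fill_missing", "payload": {"column": [1, None, 3], "value": 0}}
))
```

Keeping the handlers in a separate, framework-independent table mirrors the plan to keep the data cleaning logic in a standalone package.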

Figure: Outline of the first iteration of our application architecture.

Interactions with the notebook style interface, however, do not follow this process. The Angular.js frontend will simply display what is being served by the Jupyter notebook server running alongside Flask in the backend. Socket.IO is not used, as Jupyter handles all communication of user interactions under the hood, sending the user's Python code snippets to the iPython kernel and receiving the results of the code evaluation. The Jupyter notebook server in the backend does not connect to an arbitrary iPython kernel, but rather to a kernel instance that our Flask application initializes and configures to have our standalone data cleaning package imported and the server model of the user's data loaded from the HDF file store.

Testing

To ensure that the architecture was designed well, we created a skeleton with all the components in place for testing. We discovered that the design was solid except for one issue: the lack of support for asynchronous callbacks in Flask. We had assumed that Flask would queue tasks in Celery and pass in functions to be called back when the tasks complete, a common software pattern in frameworks such as Node.js. However, we realized that Celery runs functions in a separate process altogether and is unable to call functions inside Flask. Furthermore, our research showed that Flask, like most Python web applications, does not run an event loop and thus cannot support asynchronous callbacks. In the end, we were able to simulate asynchronous behavior with Celery: instead of calling a function in Flask, the task submits a web request containing the results data to an endpoint in the Flask app.
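The workaround can be sketched with the standard library alone. Here a small `http.server` endpoint stands in for the Flask app and a thread stands in for the Celery worker process; the `/task-result` path and the payload shape are hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # results that have arrived at the "web app"


class ResultEndpoint(BaseHTTPRequestHandler):
    """Stand-in for the endpoint in the web app that accepts task results."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


def background_task(callback_url, values):
    """Stand-in for a task in a separate worker: compute, then POST the result."""
    body = json.dumps({"task": "mean", "result": sum(values) / len(values)}).encode()
    request = urllib.request.Request(
        callback_url, data=body, headers={"Content-Type": "application/json"}
    )
    # Bypass any system proxy settings for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.open(request, timeout=5).close()


server = HTTPServer(("127.0.0.1", 0), ResultEndpoint)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/task-result"
worker = threading.Thread(target=background_task, args=(url, [2, 4, 9]))
worker.start()
worker.join()

server.shutdown()
server.server_close()
```

The worker never calls into the web app's code directly; the only coupling between the two is the URL of the endpoint, which is what makes the pattern work across process boundaries.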



Alterations During Development

Figure: Outline of the final version of our application architecture.

The architecture of the system underwent some changes during development. The most notable change was the withdrawal of the requirement that the system should provide both a pure GUI and a notebook style interface for performing operations on datasets. Our client felt that the UI of the system is good enough that a notebook style interface would not add much to the user experience. To reflect this change, the parts around the notebook style interface, including the iPython notebook, the iPython kernel and the web server, were removed. A more detailed description of the final architecture design can be found in our technical documentation.

