Bi-Weekly Report 7

February 12, 2016

2/01/16

Time 13-15

This was a meeting held during our lab session. We started by reviewing the work completed over the weekend: the redesigned sidebar consisting of tabs, data cleaning features such as standardisation and normalisation, column analysis and data visualisation. We also went through some of the initial unit tests we had written for the back-end Python modules, which helped us identify some minor bugs that we were able to fix quite quickly. Once we were happy with the individual features, we merged them together and began testing the application with sample datasets. One problem we noticed was that performance slowed down significantly once a dataset was sufficiently large (more than 200,000 rows).
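
For reference, the two cleaning operations mentioned above follow the standard definitions. The sketch below uses pandas; the function names are illustrative rather than our actual module API.

    import pandas as pd

    def standardise(column: pd.Series) -> pd.Series:
        # Rescale to zero mean and unit variance (z-scores).
        return (column - column.mean()) / column.std()

    def normalise(column: pd.Series) -> pd.Series:
        # Rescale into the [0, 1] range (min-max scaling).
        return (column - column.min()) / (column.max() - column.min())

    df = pd.DataFrame({"age": [18, 25, 32, 47, 60]})
    df["age_std"] = standardise(df["age"])
    df["age_norm"] = normalise(df["age"])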

2/01/16

Time 16-18

Straight after the lab session we had a meeting with our client. During this meeting we demonstrated the newly implemented features of the application, such as data visualisation and column analysis. Our clients were very impressed with our progress and were happy to note that we had completed all of our ‘Must Have’ requirements and had begun implementing the ‘Should Haves’. One piece of feedback was to add a feature allowing them to visualise word frequencies for columns containing string types. They also suggested letting users upload only a small portion of a dataset using a seed, which would be useful given that some of the datasets they receive contain more than 1,000,000 rows. We then created a new two-week sprint, which we populated with the following tasks:

  • Implement find and replace feature with regular expressions

  • Implement the Date variable type, and integrate it with existing features

  • Add visualisations – Scatter plots, Time series, Pie charts

  • Show the unique values and their counts for every column

  • Refactor code for data cleaning

  • Implement sorting of rows by column

  • Find and view duplicate rows

  • Find and show outliers for numeric columns (sketched below)
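
To illustrate the last of these tasks: a common way to find outliers in a numeric column is the interquartile-range (Tukey) rule. The helper below is a hypothetical pandas sketch, not our final implementation.

    import pandas as pd

    def find_outliers(column: pd.Series, k: float = 1.5) -> pd.Series:
        # Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences).
        q1, q3 = column.quantile(0.25), column.quantile(0.75)
        iqr = q3 - q1
        return column[(column < q1 - k * iqr) | (column > q3 + k * iqr)]

    prices = pd.Series([10, 12, 11, 13, 12, 250])
    print(find_outliers(prices))  # flags the 250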

9/01/16

Time 14-18

Before this meeting we had completed several of the tasks listed in our two-week sprint, including the find-and-replace feature, word frequency visualisation and the implementation of the Date variable type. We also spent a considerable amount of time optimising our front-end code to improve its efficiency and speed; as a result it is now possible to upload larger datasets and perform cleaning operations with significantly less lag. We therefore spent this meeting reviewing our progress and merging the completed tasks together. We used the remainder of the session to update our back-end test code and to implement a new feature for one-hot encoding. We then decided to meet on Thursday to organise and distribute the remaining tasks before we break for reading week.
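
One-hot encoding replaces a categorical column with one binary indicator column per category. A minimal pandas sketch of the idea (not our exact implementation):

    import pandas as pd

    df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

    # Expands 'colour' into colour_blue, colour_green and colour_red,
    # each marking the rows where that category applied.
    encoded = pd.get_dummies(df, columns=["colour"])
    print(encoded)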

Plan

Our plan is to maintain our pace and complete all the tasks listed in our sprint before our next client meeting on 16/01/16. Completing these tasks would place us ahead of schedule, as it would mean we are very close to finishing our ‘Should Have’ requirements. We also plan to add more tests to our test suite to ensure good code coverage.

Shivam Dhall

Over the course of the first week I spent most of my time creating a test suite for our back-end Python modules for data cleaning, loading and analysis. While developing the test suite I made sure to test every function in each module with a variety of different input parameters. I also tried to make every test atomic and self-documenting, as this makes it easier to identify the particular regions of code where bugs may exist. During the second week I was tasked with implementing word frequency visualisation, which involved getting a list of the most common words in a particular column and plotting them on a bar chart along with their frequencies. I also implemented a feature that displays a list of the most frequent values in the analysis tab. Finally, as we develop more features, I am continually adding tests to the suite and carrying out regression testing to check whether any new bugs have been introduced.
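
At its core, the word-frequency feature tokenises every value in a string column and counts the tokens. The helper and sample data below are illustrative only.

    import re
    from collections import Counter

    import pandas as pd

    def word_frequencies(column: pd.Series, top_n: int = 10):
        # Collect lower-cased words from every non-null value, then count them.
        words = []
        for value in column.dropna():
            words.extend(re.findall(r"[a-z']+", str(value).lower()))
        return Counter(words).most_common(top_n)

    comments = pd.Series(["great product", "great service", "slow delivery"])
    print(word_frequencies(comments))
    # [('great', 2), ('product', 1), ('service', 1), ('slow', 1), ('delivery', 1)]

The resulting (word, count) pairs can then be fed straight into the bar chart on the front end.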

Gordon Cheng

Over the course of the last two weeks, our team continued with the development of our system. I continued to work on implementing various data cleaning and transformation features, including data transformation by match and replace, and encoding categorical features using one-hot encoding. For both of these features I implemented the front end and the back end. For the match-and-replace feature, I also devised extended functionality whereby the user can build up a queue of replacements and execute them in one go. In addition, I added extra options to the upload screen that allow the user to import a random sample of a dataset. This means that while the system cannot practically handle very large datasets, due to limitations of the hardware and architecture, it can still be used on a random sample of such a dataset. Our client suggested this feature in the meeting as a possible solution for dealing with the datasets they typically work with, which are often very large.
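
A sketch of the seeded-sampling idea, assuming a CSV upload and pandas (the function is hypothetical and simplified compared to our actual upload path). Passing the same seed always selects the same rows, so a sample can be reproduced later.

    import random
    import pandas as pd

    def load_sample(path: str, fraction: float = 0.1, seed: int = 42) -> pd.DataFrame:
        rng = random.Random(seed)
        # read_csv calls the lambda with each row index and skips the row
        # when it returns True; row 0 is the header, so it is always kept.
        return pd.read_csv(
            path,
            skiprows=lambda i: i > 0 and rng.random() > fraction,
        )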

Bandi Enkh-Amgalan

During the last two weeks, our team continued developing the features set out in our functional requirements, and also worked on improving the stability of the application and optimising its performance on large datasets. I worked mainly on the latter, finding and removing bottlenecks and bugs in the system. I discovered and eliminated a major performance bottleneck in the front end that was caused by over-reliance on Angular scope. I also fixed several front-end bugs that resulted from mistimed initialisation and digest function calls. Lastly, I began optimising performance on large datasets even further by re-engineering the way the front end receives data from the back end. As for new features, I implemented support for the Date datatype and integrated it with the existing datatype conversion and analysis features.
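
On the back end, the essence of the Date conversion can be sketched with pandas (illustrative only; the real module handles more formats and edge cases):

    import pandas as pd

    df = pd.DataFrame({"joined": ["2016-01-02", "2016-01-09", "not a date"]})

    # Convert the string column to a datetime column; unparseable values
    # become NaT, so later analysis steps can detect and report them.
    df["joined"] = pd.to_datetime(df["joined"], errors="coerce")
    print(df.dtypes)  # 'joined' is now datetime64[ns]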