Bi-Weekly Report 2

October 30, 2015

Meeting 4

Date 15/10/2015 Time 11-13

This meeting was held during a lab session. At the start of the meeting, each team member presented and discussed their findings from researching the ML ETL features of Amazon Web Services and Microsoft Azure. We found that both AWS and Azure offer some data cleaning features, such as highlighting missing values, and data exploration features, such as averages. AWS additionally offers data insights such as unique word counts.
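
To make these feature types concrete, the following is a minimal sketch of a missing-value count, a column average and a unique word count. It is written in plain Python on a made-up toy dataset; the records, column names and function names are our own assumptions for illustration and do not reflect how AWS or Azure implement these features.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "comment": "good service good price"},
    {"age": None, "comment": "slow delivery"},
    {"age": 29, "comment": None},
    {"age": 41, "comment": "good price"},
]


def missing_counts(records):
    """Count missing (None) values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def column_mean(records, column):
    """Average of a numeric column, skipping missing values."""
    values = [r[column] for r in records if r[column] is not None]
    return mean(values)


def unique_word_count(records, column):
    """Number of distinct lower-cased words in a text column."""
    words = set()
    for r in records:
        if r[column] is not None:
            words.update(r[column].lower().split())
    return len(words)


print(missing_counts(rows))
print(column_mean(rows, "age"))
print(unique_word_count(rows, "comment"))
```

Skipping missing values when averaging is one possible choice; a real tool might instead report them separately or impute them.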

During this meeting we also worked on some of the tasks that were set during the previous meeting with our client. Specifically, we wanted to work together on tasks that would benefit all team members. One such task was researching data analysis methods, covering both statistical analysis methods and data visualisation methods.

We looked online for the most commonly used data visualisation methods and researched when they should be used, which data they should be used with, and their limitations. The visualisation methods we researched include the histogram, bar chart, pie chart, line chart and box plot. One of the things we discovered is that while the pie chart is a popular data visualisation method, it may not always be an appropriate choice, even for the data it is meant to work with. For example, data with many categories, when represented as a pie chart, may appear as many thin slices, which makes it difficult for the viewer to see the difference in area between the slices. We found that histograms are much more suitable for this kind of data.
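
The thin-slice problem can be illustrated with a small calculation. The sketch below computes the angle each category would occupy in a pie chart and lists the categories whose slice falls under a cut-off. The function names, the example data and the 15-degree threshold are our own assumptions, chosen only for illustration.

```python
def slice_angles(counts):
    """Return the pie-slice angle in degrees for each category."""
    total = sum(counts.values())
    return {label: 360 * n / total for label, n in counts.items()}


def thin_slices(counts, min_degrees=15):
    """List categories whose slice would be narrower than min_degrees.

    The default cut-off is an arbitrary illustrative threshold.
    """
    angles = slice_angles(counts)
    return sorted(label for label, angle in angles.items() if angle < min_degrees)


few = {"A": 50, "B": 30, "C": 20}
many = {f"cat{i}": 5 for i in range(1, 31)}  # 30 equally sized categories

print(thin_slices(few))        # no category falls under the cut-off
print(len(thin_slices(many)))  # every slice is only 12 degrees wide
```

With three categories every slice is easy to compare, whereas thirty equal categories each receive 12 degrees, which is hard to judge by eye.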

We scheduled our next meeting, which will be with our client, for 20/10/2015. Before this meeting, each team member will continue to complete the tasks that are part of the weekly sprint.

Meeting 5

Date 20/10/2015 Time 16-17

This was our third meeting with our client. During this meeting, we reported our findings from the research we had done on AWS, Azure, Apache Zeppelin, JASP and data analysis methods. Our client was pleased with our findings and we swiftly moved on to planning the next sprint. Collectively, we decided on a number of tasks that we estimated would amount to about the same workload as the previous week's. These tasks included:

Requirements questionnaire for Seldon users
Document the options / choices to be made for the final solution
Find another cleaner dataset for testing purposes
Look at outcomes of the Springleaf competition
Create preliminary content and load onto project website for university
GitHub setup and sandbox

We decided to split some of these tasks between the team members. As Gordon is the Chief Editor of the team, he was assigned to create content for the project website. Bandi will work on a document setting out the options and choices to be made for the final solution. Shivam will look at the outcomes of the Springleaf competition. We also decided to come up with questions for the requirements questionnaire together during our next meeting, scheduled for 22/10/2015.

Meeting 6

Date 22/10/2015 Time 11-13

This meeting was held during a lab session. In this meeting, we focused on coming up with questions for the requirements questionnaire. This questionnaire is very important to us as it will give us valuable information about what potential users may want in the system we will be developing. We decided to create the questionnaire using Google Forms, as it offers various question types and its forms are easy to distribute.

We wanted to know which features users desire the most, so we started the questionnaire by asking about the tools they currently use and why they prefer them. Not only will this give us an idea of what other similar systems offer, it will also tell us which features to focus on when we develop our system. We also decided to ask about the specific data visualisation and data cleaning methods they prefer and why. We hope that this information will narrow the scope of the project, as we can focus on the features users care about most.

A challenge we encountered when coming up with questions for this questionnaire was deciding on the type of answers (e.g. paragraph text, checkboxes, radio buttons) we wanted to receive from the users. We wanted to minimize the use of paragraph text, as we thought the answers might be too vague to be helpful, since different people answer the same question differently. However, as we tried to come up with more questions, we found it very difficult to write limited-choice questions for the information we wanted. This was partly because we could not possibly list all the possible choices for some of these questions. In the end, we decided to offer some choices on such questions but include a text box for any other answers respondents may want to give.

We scheduled the next meeting for 27/10/2015. We hope that by then we will have discussed these questions with our client, and that the questionnaire can be distributed soon.

Meeting 7

Date 27/10/2015 Time 14-15

This was our fourth meeting with our client. We started the meeting by reporting to our client on the potential feature set for the system, including the data cleaning, data visualisation and data analysis features. Our client was satisfied with this feature set and suggested that we publish it and get feedback from the community.

We also finalised the questions for the requirements questionnaire with our client. Our client was satisfied with the questions we came up with but suggested some changes to the wording to make them clearer to the people answering them. We then established that the questionnaire will be trialled within the building that Seldon is in before it is released to a wider audience. Our client also told us that, aside from the questionnaire, they will set up an interview for us with another company that has expressed interest in this project.

We also came up with tasks for next week’s sprint, which are:

Interview the company about potential features and data cleaning process
Send the questionnaire and collate results
Try some of the techniques and tools on the sample dataset
Set up a dev server for collaboration and demos

We hope that in the next couple of weeks, through the questionnaire and further exploration of existing tools and techniques, we will gain a much better understanding of the specific requirements of the system and of the techniques and methods for achieving them.

Gordon Cheng

During these two weeks, I researched some data cleaning methods, which involved looking at existing tools for data cleaning. I researched JASP, a statistical analysis tool, and discovered that it offers features like missing value counts and averages, but relatively few tools for cleaning messy data, as it is mainly oriented towards statistical analysis. I then researched other, more robust tools for data cleaning and found an open-source program named OpenRefine. OpenRefine offers many more tools for cleaning data, such as clustering and text transformation. This proved valuable, as some of these features were eventually included in our list of potential features for the system. I also continued updating our project website, which is now live and accessible, with more content such as sections on the project background and the research we have done so far.
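
As an illustration of the kind of clustering mentioned above, the following is a simplified sketch in the spirit of key-collision ("fingerprint") clustering: values are reduced to a normalised key, and values sharing a key are grouped as probable variants of the same entry. This is not OpenRefine's own code. The function names and sample values are hypothetical, and the normalisation here is deliberately reduced to lower-casing, punctuation removal and token sorting.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Simplified key: lower-case, strip punctuation, sort unique tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(table)
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with several spellings."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical messy column with inconsistent spellings.
names = ["Acme Inc.", "acme inc", "Inc Acme", "Seldon", "seldon ", "Other Co"]
print(cluster(names))
```

Each resulting group could then be shown to the user, who decides whether to merge the variants into a single value.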

Shivam Dhall

At the beginning of week 3, my teammates and I created a questionnaire for our clients in order to gain specific information about the system requirements, enabling us to prioritise them according to MoSCoW. Once that was completed, we decided to research some of the features available in existing software packages used for data analysis. I did this by familiarising myself with a statistical analysis tool known as JASP. I was also able to download some sample datasets online, which I then used for testing within the tool. During our third meeting, our client pointed out a specific dataset available on Kaggle, which came with various scripts for cleaning the data. I therefore spent a considerable amount of time reading and working through these scripts. As most of them are written in Python, I have simultaneously had to expand my Python knowledge. Finally, in preparation for the forthcoming weeks, we have been trying to identify a ‘messy’ dataset that we can all use for testing in different tools.

Bandi Enkh-Amgalan

During the last two weeks, we continued our research on the problem. Each of us did further research on data exploration and cleaning tools and techniques. I looked at scripts that people posted on the forums for the Springleaf Kaggle challenge that our clients referred us to at the start of the project. I noticed the prevalence of the R language for data analytics and took notes on interesting techniques people used with R. Besides research, I took on the task of consolidating all of our research on existing tools so far and compiling a list of potential features that could be included in our final product. I also worked with the team on creating the requirements questionnaire that will be sent out by Seldon to survey developers involved with data science. The results of this questionnaire, along with an interview that might be arranged by Seldon, will allow us to finalize a list of requirements.