< Bi-Weekly Reports

Bi-Weekly Report 3

November 13, 2015

Summary

The last two weeks coincided with a week-long scenario week followed by reading week, so our group was unfortunately not able to get as much work done as we had hoped. Besides a Skype call with the CEO of a partner at Seldon, we did not hold any group meetings. All of us essentially focused on the individual research tasks that were decided on, as mentioned in the previous report.

Meetings

November 16th

In the morning we had a Skype call with Jon Clarke, CEO of Cyance and Joseph Cavalla, digital development manager at Cyance. Cyance is a business-to-business database marketing company that Seldon is currently partnering with, and has lots of experience analyzing large and messy datasets from clients.

After introducing themselves, they talked about what exactly their company does and what kind of data they usually receive. They stressed the point of “garbage in, garbage out” and how most clients are uneducated in data cleanliness and are not aware of the quality of the data they provide.

We then asked them to go over their current workflow for cleaning data, and in summary, it is a very manual process, undertaken by skilled experts, that cannot be generalized because it varies so much from dataset to dataset. They pointed us to the tools that they currently use, such as Tableau and DQ Match, and briefly stated the advantages and disadvantages of each. They also expressed their interest in a comprehensive data cleaning and exploration solution that can be used by technical and non-technical users alike.

Finally, we showed them our current feature list, and their feedback was that our feature list is very comprehensive and that they would be actually interested in a solution that had all those features. However, when we informed them about our deadlines, they raised concerns about whether or not we would be able to implement the entire list of features in time. We quickly explained that the list is not finalized and asked if they could help narrow our focus and prioritize the specifications, but they said they were unfortunately not in the position to help us, as that would be the role of Seldon, our client.

The meeting ended with them wishing us luck on our project and promising to provide us with support in the future if we need it.

Progress

Completed Tasks

  • Individual research and write-up on existing data cleaning and exploration solutions
  • Update website with all the research done so far

Outlook

Although we made reasonable progress in initial research and requirements analysis, the fact that we haven’t locked down the requirements yet and started designing prototypes with less than a month to go means we are behind schedule.

Plan

We need to narrow our feature list and categorize them into the MoSCoW requirements framework before the next meeting with Seldon, and then finalize it with them during the meeting. After that, we need to begin researching into implementation and designing prototype UIs (wireframes) and back-ends.

Individual Contributions

Bandi Enkh-Amgalan

During the last two weeks, all of us took on individual research assignments into existing data cleaning and exploration solutions. I looked into Apache Zeppelin again but conducted more in-depth research. I then wrote up a couple of paragraphs summarizing my notes, which included an overview of the product, advantages and disadvantages. Besides researching the product, I also attempted to conduct some data analysis on the data set we chose for testing, in order to gain insight into the workflows for data cleaning and analysis in Zeppelin. I found it quite challenging because I had to perform all tasks manually and programmatically with the Spark backend using a functional language called Scala to use the Spark backend, and I am not that experienced in functional programming nor do I know Scala. There was a Python backend but I had issues configuring Zeppelin to use Python for Spark. I struggled to even get basic functionality out of Zeppelin, beyond data loading and some basic analysis. However, I think I have an adequate understanding of the methods and workflows that one would take when using Zeppelin, for a research task, and would be able to resume my investigation in the event that we use Spark as a backend.

Gordon Cheng

At the end of the last meeting with our client, we presented a list of potential features of the product to our client. Our client was satisfied by the feature list and asked us to try out some of the techniques mentioned in existing software and programs such as JASP, Apache Zepplin, Azure, and OpenRefine, with the aim to understand how they work in these programs. I was assigned to experiment with JASP and OpenRefine. To make sure that our findings can be easily compared, we decided to use the same example dataset for each of the programs we will be experimenting with. I found one such dataset that I thought will be good for experimentation, this was a dataset by the Wellcome Trust which has a lot of text based data, for example names of publishers. Moreover, this dataset seemed quite messy, unlike many other I looked at online, with spelling errors and trailing whitespaces in a significant number of records. I experimented the various features of both JASP and OpenRefine, including JASP’s data descriptive, and OpenRefine’s cluster and text transformation features. I also continued to make progress on our project website, including continuing to add our research outcomes. We also gave the link to the website to our client and the initial response was positive.

Shivam Dhall

Over the course of the last two weeks, I have carried on researching on ETL features available on software’s such as JASP, and have also tried to identify new tools and programs that may be useful to our project. I focused mainly on JASP, which is an open source stats package that provides both standard and advanced techniques for data analysis. I used a sample data-set which consisted of messy data and used the appropriate tools within JASP to try and analyse and clean the data-set. At first I found it quiet tricky due to the large array of analyses offered, however it didn’t take very long to get used to it as it has a very intuitive user interface. I have also been reading up on Beaker which is a notebook style development environment for interactively working with large data-sets. It is particularly useful as it supports multiple languages within the same notebook, thus allowing one to use the language best suited for solving a problem. Finally on Monday we had a meeting with a company that have a vast amount of experience with the data cleaning process, as a team we are working together to come up with a few interview questions which will help finalise our user requirements.