CeBIT Australia recently interviewed Eugene Dubossarsky, Chief Data Scientist of our Gold DataCon sponsor Contexti, for his thoughts on Big Data.
1. Can you tell us a little about your background and your current role?
As Chief Data Scientist of Contexti, I have two main jobs, each of which is a conversation. The first, and most important, is the conversation with the client. This is a delicate and ongoing conversation in which I need to do many things: help the client figure out what they want, which may also require gently educating them in the process, introducing them to the power of data analytics and, most importantly, to their own role and power in the process.
As a Data Scientist, my role is to let the data tell its story. A good engineer gives data its voice, and my job is to listen to that voice. We thus each have our role to play. My job is to explore the undiscovered country that is data, grab the nuggets of value, identify the hidden risks to avoid, and try to see how it all fits into a big picture that allows a view of the future: a crystal ball that can help chart a way forward. Machine learning, data visualisation, “big data” and the other cool buzzwords you hear about all feature too. But these are just tools.
People, data, and the stories they tell – this is the main part of the job.
2. As we know the term Big Data can be an ambiguous term and mean a lot of different things to people, what is your take on Big Data?
It certainly did the job! This is the one buzz phrase that put “it” on the map, whatever we choose to call “it”. My only point is that “it” needs to be big enough to include “small data” (if we think of “big data” as terabytes), and what I call “tacit data”, which is usually the most important data, but lives in people’s heads rather than in electronic databases. Of course, getting tacit data out is possible and desirable, but that is an entirely different story for another time…
So, with regard to Big Data: for me, the “Big” is not just “Size” or “Speed”. Far more importantly, it is “Big” in terms of “Value”, “Credibility” and “Transformational Potential”. It may also be “small” in size, but require “Big” tools in terms of sophistication, computational power and effort on the part of engineers and scientists to realise this value.
My own term for the latter category is “Big Crunch” – the data itself may be small or medium, but “Big Data” tools make extracting the value possible. These are actually the techniques I find myself using the most.
Volume-wise, actual “Big Data” (terabytes in size) is not the most valuable kind of analysis for most Australian companies. But “Big Crunch” certainly is.
3. Which industries do you see this type of analytics benefiting the most and why?
Name one that doesn’t. I am most excited about the growth of analytics in SMEs. In practice, analytics is vital for organisations facing real competition, real ongoing, disruptive change in their industry, real risks and real uncertainty. This is true for most privately-owned SMEs. Ironically, many organisations that can afford analytics most in Australia probably need it least, but this is where most of the buzz is in the industry.
For me, the real question is: “can your company truly afford to survive without analytics?” The answer for most quantitative hedge funds is “of course not”. I leave it for the reader to identify areas where this might not be the case.
4. What do you think are organisations’ biggest problems when trying to start a big data project?
Most people don’t realise what they are getting into with data science: it is, if anything, even more powerful than they thought, but they underestimate the amount of personal investment and change required to realise value.
The biggest misconceptions are around the very nature of analytics, and specifically this thing called “data science”, which is a far more helpful term than “big data”.
In essence, analytics should be about exploration, with engineering/building playing a supportive role, albeit a vital one.
Analytics is about exploring, not building. For a scientist, data is a rich land of mysteries, and the process is a conversation. A scientist welcomes the unknown. For an engineer, data is a commodity, and the focus is on the tools that move and process it. The unknown is to be shied away from and controlled; things must work perfectly. And this is necessary too, so that the scientist may play their part.
Engineers work on “projects” by the way. This model is less appropriate for scientists, as are conventional project management methodologies. Those are also a great way to kill the value of an analytics project, and I have seen this tragedy unfold more than once.
Organisations that get this achieve enormous benefits. Organisations that don’t will fail, dissolve the analytics function and start again, only to fail again because the key misconception has not been addressed.
The other major problem is executives underestimating how much personal investment they must make in analytics. Investing in analytics is like investing in a gym membership or an education: you are not paying to make something go away; you are paying to get a whole lot busier at something that will transform you fundamentally. The executive suite can no more outsource the analytics function than I can outsource my gym workout. I wish I could…
Nevertheless, the view persists that analytics is an IT function, primarily concerned with engineering (building, maintaining), that data is a commodity, and that the whole thing has little or nothing to do with the lives of important people in the organisation – these are the biggest challenges to organisations coming to terms with big data.
5. How should an organisation go about even starting a Big Data project?
- It isn’t a project, it is an exploration.
- Invest in experimentation, not fixed projects. Accept that there may be no value at all in the first six months.
- The executive sponsor is the number one fan, supporter, client and leader of the analytics team.
- “Invest in smarts” – hire smart people, bring in smart advisers, consultants, trainers.
- Don’t waste a cent on software until you know exactly what you need and why, having tried a great many things with open source. Open source is good enough to begin with, especially when you are still trying to figure out what to do with your data.
- Don’t be embarrassed that you have no idea what to actually do with your data, or how it leads to value. Just about everyone else is in the same boat.
6. What are some of the tools and technologies that can be employed for a big data project?
There are three levels to this, only one of which is actual “tools”, i.e. IT products. The three levels are:
- Business applications – these often require very significant customisation, although they may also be quite similar to applications in other industries/organisations. Or they may be relatively well-known things, like customer retention for telcos or insurance claims analysis.
- Conceptual tools – these are the things missing in the toolkits of most people with an IT background who make the transition to “data science”. This includes the whole kit bag of machine learning, statistics, visualisation, network analysis and lots of other mathematical/conceptual/computational tools, tricks, methods and maps. While these are indeed embodied in specific software, the conceptual/mathematical understanding of these tools, their applicability to real-world problems, and the ability to use them broadly and in new scenarios sets the true data scientist apart from a hacker.
- Software – this is the least important layer. If people are not sure what they need, they should start with something like R. And of course there are many other open source tools out there. If they can see for themselves where open source tools are not up to the job, then they are “educated buyers” with a clear need and agenda, and should consider commercial tools. I have never seen a new data analytics function that did not need to come to terms with its own needs and data first, or for which R was not a sufficient starting point.
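To make the “start with open source” point concrete, here is a minimal sketch of the kind of first exploratory pass that needs no commercial tooling at all. It uses Python’s standard library rather than R purely for illustration; the toy customer-churn records and the field names are invented, not taken from any real Contexti engagement.

```python
# A toy exploratory pass over in-memory data using only the standard
# library. The "customer" records below are invented for illustration:
# the point is that a first look at data needs no commercial software.
import statistics

customers = [
    {"tenure_months": 3,  "monthly_spend": 40.0, "churned": True},
    {"tenure_months": 26, "monthly_spend": 55.0, "churned": False},
    {"tenure_months": 2,  "monthly_spend": 35.0, "churned": True},
    {"tenure_months": 48, "monthly_spend": 80.0, "churned": False},
    {"tenure_months": 14, "monthly_spend": 60.0, "churned": False},
]

def summarise(records):
    """Return count and mean tenure, split by churn status."""
    by_churn = {}
    for status in (True, False):
        tenures = [r["tenure_months"] for r in records
                   if r["churned"] == status]
        by_churn[status] = {
            "count": len(tenures),
            "mean_tenure": statistics.mean(tenures),
        }
    return by_churn

summary = summarise(customers)
# In this toy sample, churned customers have much shorter tenures --
# exactly the kind of pattern an exploratory first pass surfaces,
# and a prompt for the next round of questions to ask of the data.
print(summary)
```

The same ten-minute exercise in R (or with pandas, or any other open source stack) is usually enough to reveal whether the data can answer the question at hand, long before any purchasing decision needs to be made.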
About Contexti | Big Data Analytics
Contexti is a premier Big Data Analytics company.
We help customers drive growth, accelerate innovation and create competitive advantage.
With expertise in data-driven strategy, Hadoop and NoSQL technologies and advanced Data Science methods, we provide specialist consulting, training and managed services.
In short, we Create Value from Data™.