Rosaria has been a researcher in applications of Data Mining and Machine Learning for over a decade. Application fields include biomedical systems and data analysis, financial time series (including risk analysis), and automatic speech processing. She is currently based in Zurich, Switzerland.
What project that you have worked on do you wish you could go back to and do better?
There is no such thing as a perfect project! However close you get to perfection, at some point you need to stop, either because time is up, the money has run out, or because you simply need a productive solution. I am sure I could go back to all my past projects and find something to improve in each of them!
This is actually one of the biggest issues in a data analytics project: when do we stop? Of course, you need to identify some basic deliverables in the project's initial phase, without which the project is not satisfactorily completed. But once you have passed these deliverable milestones, when do you stop? What is the right compromise between perfection and resource investment?
In addition, every few years some new technology becomes available that could help re-engineer your old projects, for speed or accuracy or both. So even the most perfect project solution can surely be improved after a few years thanks to new technologies. This is, for example, the case with the new big data platforms. Most of my old projects would now benefit from a big data based speed-up. This could help speed up the training and deployment of old models, create more complex data analytics models, and optimize model parameters better.
What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?
Use your time to learn! Data Science is a relatively new discipline that combines old knowledge, such as statistics and machine learning, with newer wisdom, like big data platforms and parallel computation. Not many people know everything here, really! So take your time to learn what you do not know yet from the experts in that area. Combining a few different pieces of data science knowledge probably already makes you unique in the data science landscape. The more pieces of different knowledge, the bigger the advantage for you in the data science ecosystem!
One way to get easy hands-on experience in a range of different application fields is to explore the Kaggle challenges. Kaggle has a number of interesting challenges up every month, and who knows, you might even win some money!
What do you wish you knew earlier about being a data scientist?
This answer is related to the previous one, since my advice to young data scientists sprouts from my earlier experience and failures. My early background is in machine learning. So, when I took my first steps in the data science world many years ago, I thought that knowledge of machine learning algorithms was all I needed. I wish!
I had to learn that data science is the sum of many different skills, including data collection and data cleaning and transformation. The latter, in particular, is highly underestimated! In all data science projects I have seen (not only mine), the data processing part takes well over 50% of the resources! The same goes for data visualization and data presentation. A brilliant solution is worth nothing if the executives and stakeholders do not understand the results through a clear and compact representation. And so on. I guess I wish I had taken more time early on to learn from colleagues with a different set of skills than mine.
How do you respond when you hear the phrase ‘big data’?
Do you really need big data? Sometimes customers ask for a big data platform just because. Then, when you investigate more deeply, you realize that they really do not have, and do not want to have, such a large amount of data to take care of every day. A nice traditional DWH solution is definitely enough for them.
Sometimes, though, a big data solution is really needed, or at least it will be needed at some point to keep all of the company's internal and external data up to date. In these cases, we can work together on a big data solution for their project. Even when this is the case, I often warn data analysts not to underestimate the power of small data! A clean, well-designed statistical sample might produce higher accuracy in terms of prediction, classification, and clustering than a large, messy, and noisy data lake (a data swamp, really)! In some projects, a few dimensionality selection algorithms were run a posteriori, just to see whether all those input dimensions contained useful, unique pieces of information or just obnoxious noise. You would be surprised! In some cases we easily went from more than 200 input variables to fewer than 10 while keeping the same accuracy.
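To illustrate that last point, here is a minimal sketch of dimensionality reduction through feature selection, assuming Python with scikit-learn. The synthetic dataset and the 200-to-10 reduction are purely illustrative, not the data or the specific algorithms from the projects mentioned above.

```python
# A minimal sketch: compare model accuracy with all inputs vs. a small,
# selected subset. Dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 input variables, only a handful truly informative.
X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=8, n_redundant=4, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validated accuracy using all 200 inputs.
full_score = cross_val_score(model, X, y, cv=5).mean()

# Keep only the 10 inputs with the highest univariate F-scores.
X_small = SelectKBest(f_classif, k=10).fit_transform(X, y)
small_score = cross_val_score(model, X_small, y, cv=5).mean()

print(f"200 inputs: {full_score:.3f}  |  10 inputs: {small_score:.3f}")
```

On data like this, the two scores typically end up very close, which is the point: many of the original dimensions carry little unique information.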
There is another phrase, though, that triggers my inner warning signals: “We absolutely need real time execution”. When I hear this phrase, I usually wonder: “Do you need effective real time responses, or would perceived real time responses be enough?”. Perceived real time, for me, is a few seconds, something a user can wait for without getting impatient. A few seconds, however, is NOT real time! Any data analytics tool, any deployment workflow can produce a response in a few seconds or even less. Real time is a much faster response, usually meant to trigger some kind of consequent action. In most cases, “perceived real time” is good enough for the human reaction time.
What is the most exciting thing about your field?
Probably, the variety of applications. The whole knowledge of data collection, data warehousing, data analytics, data visualization, results inspection and presentation is transferable to a number of fields. You would be surprised at how many different applications can be designed using a variation of the same data science technique! Once you have the data science knowledge and a particular application request, all you need is imagination to make the two match and find the best solution.
How do you go about framing a data problem? In particular, how do you avoid spending too long, and how do you manage expectations? How do you know what is good enough?
I always propose a first pilot/investigation mini-project at the very beginning. This is for me to get a better idea of the application specs, of the data set, and, yes, also of the customer. This is a crucial phase, though a short one. During this part, in fact, I can take the measure of the project in terms of the time and resources needed, and the customer and I can study each other and adjust our expectations about input data and final results. This initial phase usually involves a sample of the data, an understanding of the data update strategy, some visual investigation, and a first tentative analysis to produce the requested results. Once this part is successful and expectations have been adjusted on both sides, the real project can start.
You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?
Ah … I am really not a very good example for dealing with stakeholders and executives and successfully managing cultural challenges! Usually, I rely on external collaborators to handle this part for me, also because of time constraints. I see myself as a technical professional, with little time for talking and convincing. Unfortunately, this is a big part of every data analytics project. However, when I have to deal with it myself, I let the facts speak for me: final or intermediate results of current and past projects. This is the easiest way to convince stakeholders that the project is worth the time and the money. Just in case, though, I always have at hand a set of slides with previous accomplishments to present to executives if and when needed.
Tell us about something cool you’ve been doing in Data Science lately.
My latest project was about anomaly detection. I found it a very interesting problem to solve, one where skills and expertise have to meet creativity. In anomaly detection you have no historical records of anomalies, either because they rarely happen or because they are too expensive to let happen.
What you have is a data set of records of normal functioning of the machine, transactions, system, or whatever it is you are observing. The challenge then is to predict anomalies before they happen and without previous historical examples. That is where the creativity comes in. Traditional machine learning algorithms need a twist in application to provide an adequate solution for this problem.
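One possible illustration of such a twist (a sketch only, assuming Python with scikit-learn, not the specific approach used in the project above) is to train a one-class model on normal records alone and flag anything that falls outside what the model has learned as normal. The synthetic sensor-like data below is purely for demonstration.

```python
# A minimal sketch of anomaly detection trained on "normal" records only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Historical data: only normal operating conditions (e.g., sensor readings).
normal_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))

# Fit a one-class model that learns the boundary of "normal" behaviour.
detector = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
detector.fit(normal_train)

# New observations: mostly normal, plus a few that drift far from normal.
new_normal = rng.normal(loc=0.0, scale=1.0, size=(5, 4))
new_anomalous = rng.normal(loc=6.0, scale=1.0, size=(2, 4))
new_data = np.vstack([new_normal, new_anomalous])

# predict() returns +1 for inliers (normal) and -1 for suspected anomalies.
print(detector.predict(new_data))
```

The key design choice is that no anomalous examples are needed at training time; the model only describes normality, and anything sufficiently far from it is reported as a potential anomaly.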