Interview Peadar Coyle – Dataconomy
Bridging the gap between technology and business

“Spend time learning the basics” – An Interview with Data Scientist Thomas Wiecki
16 November 2015

Thomas is Data Science Lead at Quantopian Inc., a crowd-sourced hedge fund and algorithmic trading platform. Thomas is a cool guy and came to give a great talk in Luxembourg last year – one I found so fascinating that I decided to learn some PyMC3.

Follow Peadar’s series of interviews with data scientists here.


What project have you worked on that you wish you could go back to, and do better?

While I was doing my masters in CS I got a stipend to develop an object recognition framework. This was before deep learning dominated every benchmark data set, and bag-of-features was the way to go. I am proud of the resulting software, called Pynopticon, even though it never gained any traction. I spent a lot of time developing a streamed data-piping mechanism that was pretty general and flexible, in anticipation of large data sets. In retrospect, though, it was overkill: I should have spent less time coming up with the best solution and instead spent time improving usability! Resources are limited, and a great core is not worth a whole lot if the software is difficult to use. The lesson I learned is to make something useful first, place it into the hands of users, and then worry about performance.

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Spend time learning the basics. This will make more advanced concepts much easier to understand, as they are merely extensions of core principles and integrate much better into an existing mental framework. Moreover, things like math and stats, at least for me, require time and continuous focus to dive into deeply. The benefit of taking that time, however, is a more intuitive understanding of the concepts. So if possible, I would advise people to study these things while still in school, as that’s where you have the time and mental space. Things like the new data science tools or languages are much easier to learn and carry a greater risk of being ‘out-of-date’ soon. More concretely, I’d start with Linear Algebra (the Strang lectures are a great resource) and Statistics (for something applied I recommend Kruschke’s Doing Bayesian Data Analysis; for fundamentals, “The Elements of Statistical Learning” is a classic).

What do you wish you knew earlier about being a data scientist?

How important non-technical skills are. Communication is key, but so are understanding business requirements and constraints. Academia does a pretty good job of training you for the former (verbal and written), although mostly it is assumed that you are communicating to an expert audience. This will certainly not be the case in industry, where you have to communicate your results (as well as how you obtained them) to people with much more diverse backgrounds. This I find very challenging.


As to general business skills, the best way to learn is probably to just start doing it. That’s why my advice for grad-students who are looking to move to industry would be to not obsess over their technical skills (or their Kaggle score) but rather try to get some real-world experience.

How do you respond when you hear the phrase ‘big data’?

As has been said before, it’s quite an overloaded term. On one side, it’s a buzzword in business where I think the best interpretation is that ‘big data’ actually means that data is a ‘big deal’ — i.e. the fact that more and more people realize that by analyzing their data they can have an edge over the competition and make more money.

Then there’s the more technical interpretation, where it means that data is increasing in size and some data sets no longer fit into RAM. I’m still undecided on whether this is actually more of a data engineering problem (i.e. the infrastructure to store the data, like Hadoop) or an actual data science problem (i.e. how to actually perform analyses on large data). A lot of the time, as a data scientist, I think you can get by by sub-sampling the data (Andreas Müller has a great talk on how to do ML on large data sets, here).
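That sub-sampling workflow is simple enough to sketch in a few lines of Python; the function name, the fixed seed and the 1% fraction below are illustrative choices, not something from the talk:

```python
import random

def subsample(rows, fraction, seed=0):
    """Return a uniform random subset of `rows` small enough to fit in RAM.

    For many model-fitting tasks a uniform sample preserves the patterns a
    learner needs, at a fraction of the memory cost of the full data set.
    """
    rng = random.Random(seed)  # a fixed seed keeps the analysis reproducible
    return [r for r in rows if rng.random() < fraction]

# Pretend each element is one row of a large log file.
big = list(range(1_000_000))
small = subsample(big, fraction=0.01)
print(len(small))  # roughly 10,000 rows
```

Fixing the seed makes the sub-sample reproducible, which matters when colleagues need to re-run the same analysis.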

Then again, more data also has the potential to allow us to build more complex models that capture reality more accurately, but I don’t think we are there yet. Currently, if you have little data, you can only do very simple things. If you have medium data, you are in the sweet spot where you can do more complex analyses like Probabilistic Programming. However, with “big data”, the advanced inference algorithms fail to scale so you’re back to doing very simple things. This “big data needs big models” narrative is expressed in a talk by Michael Betancourt, here.

What is the most exciting thing about your field?

The fast pace at which the field is moving. It seems like every week another cool tool is announced. Personally I’m very excited about the Blaze ecosystem, including dask, which has a very elegant approach to distributed analytics that relies on existing functionality in well-established packages like pandas instead of trying to reinvent the wheel. Data visualization is also coming along quite nicely; the current frontier seems to be interactive web-based plots and dashboards, as worked on by bokeh, plotly and pyxley.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I try to keep the loop between data analysis and communication to consumers very tight. This also extends to any software that performs certain analyses, which I try to place into the hands of others even if it’s not perfect yet. That way there is little chance of veering too far off track, and there is a clearer sense of how usable something is. I suppose it’s borrowing from the agile approach and applying it to data science.


(image credit: Lauren Manning, CC2.0)

“Justify the business value behind your work.” – Interview with Data Scientist Ian Ozsvald
13 November 2015

Ian is an Entrepreneurial Geek, 30-late-ish, living in London (after 10 years in Brighton and a year in Latin America).

Ian is the owner of an Artificial Intelligence consultancy and author of ‘The Artificial Intelligence Cookbook’, which teaches you how to add clever algorithms to your software to make it smarter! One of his mobile products is SocialTies (built with RadicalRobot).



What project have you worked on that you wish you could go back to, and do better?

My most frustrating project was (thankfully) many years ago. A client gave me a classification task for a large number of ecommerce products involving NLP. We defined an early task to derisk the project, and the client provided representative data according to the specification that I’d laid out. I built a set of classifiers that performed as well as a human, and we felt that the project was derisked sufficiently to push on. Upon receiving the next data set I threw up my arms in horror – as a human I couldn’t solve the task on this new, very messy data, so I couldn’t imagine how the machine would solve it. The client explained that they wanted the first task to succeed, so they gave me the best data they could find; since we’d solved that problem, now I could work on the harder stuff. I had tried my best to explain the requirements of the derisking phase, but fear I didn’t give a deep enough explanation of why I needed fully-representative dirty data rather than cherry-picked good data. After this I got *really* tough when explaining the needs of a derisking phase.

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

You probably want an equal understanding of statistics, linear algebra and engineering, with multiple platforms and languages plus visualisation skills. You probably want 5+ years’ experience in each industrial domain you’ll work in. None of this, however, is realistic. Instead, focus on some areas that interest you and that pay well enough, and deepen your skills so that you’re valuable. Next, go to open source conferences and speak, talk at meetups and generally try to share your knowledge – this is a great way of firming up all the dodgy corners of your knowledge. By speaking at open source events you’ll be contributing back to the ecosystem that’s provided you with lots of high-quality free tools. For me, I speak, teach and keynote at conferences like PyData, PyCon, EuroSciPy and EuroPython around the world and co-run London’s most active data community at PyDataLondon. Also get involved in supporting the projects you use – by answering questions and submitting new code you’ll massively improve the quality of your knowledge.

What do you wish you knew earlier about being a data scientist?

I wish I’d known how much I’d regret not paying attention in my statistics and linear algebra classes! I also wish I’d appreciated how much easier conversations with clients are if you have lots of diagrams from past projects and projects related to their data – people tend to think visually; they don’t work well from lists of numbers.

How do you respond when you hear the phrase ‘big data’?

Most clients don’t have a Big Data problem and even if they’re storing huge volumes of logs, once you subselect the relevant data you can generally store it on a single machine and probably you can represent it in RAM. For many small and medium sized companies this is definitely the case (and it is definitely-not-the-case for a company like Facebook!). With a bit of thought about the underlying data and its representation you can do things like use sparse arrays in place of dense arrays, use probabilistic counting and hashes in place of reversible data structures and strip out much of the unnecessary data. Cluster-sized data problems can be made to fit into the RAM of a laptop and if the original data already fits on just 1 hard-drive then it almost certainly only needs a single machine for analysis. I co-wrote O’Reilly’s High Performance Python and one of the goals of the book was to show that many number-crunching problems work well using just 1 machine and Python, without the complexity and support-cost of a cluster.
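As a rough, stdlib-only illustration of the first trick Ian mentions (in practice you would reach for something like scipy.sparse), storing only the non-zero entries of a mostly-zero vector shrinks it dramatically:

```python
import sys

n = 100_000

# Dense: one slot per index, even though almost everything is zero.
dense = [0.0] * n
dense[42] = 3.5
dense[99_999] = 1.2

# Sparse: store only the non-zero entries, keyed by index.
sparse = {42: 3.5, 99_999: 1.2}

# Count the list's pointer array plus its distinct float objects,
# versus the dict plus its keys and values.
dense_bytes = sys.getsizeof(dense) + sum(sys.getsizeof(x) for x in set(dense))
sparse_bytes = sys.getsizeof(sparse) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in sparse.items()
)
print(dense_bytes, sparse_bytes)  # the sparse form is orders of magnitude smaller
```

The same idea underlies probabilistic counters and hashes: give up a reversible, element-by-element representation in exchange for a footprint that fits on one machine.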

What is the most exciting thing about your field?

We’re stuck in a world of messy, human-created data. Cleaning it and joining it is currently a human-level activity, I strongly suspect that we can make this task machine-powered using some supervised approaches so less human time is spent crafting regular expressions and data transformations. Once we start to automate data cleaning and joining I suspect we’ll see a new explosion in the breadth of data science projects people can tackle.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

To my mind the trick is figuring out a) how good the client’s data is and b) how valuable it could be to their business if put to work. You can justify any project if the value is high enough but first you have to derisk it and you want to do that as quickly and cheaply as possible. With 10 years of gut-feel experience I have some idea about how to do this but it feels more like art than science for the time being. Always design milestones that let you deliver lumps of value, this helps everyone stay confident when you hit the inevitable problems.

You spent some time as a consultant in data analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

Justify the business value behind your work and make lots of diagrams (stick them on the wall!) so that others can appreciate what you’re doing. Make bits of it easy to understand and explain why it is valuable and people will buy into it. Don’t hide behind your models, instead speak to domain experts and learn about their expertise and use your models to backup and automate their judgement, you’ll want them on your side.


You have a cool startup can you comment on how important it is as a CEO to make a company such as that data-driven or data-informed?

My consultancy (ModelInsight.io) helps companies to exploit their data so we’re entirely data-driven! If a company has figured out that it has a lot of data and it could steal a march on its competitors by exploiting this data, that’s where we step in. A part of the reason I speak internationally is to help companies think about the value in their data based on the projects we’ve worked on previously.


(image credit: Carmen Escobar Carrio, CC2.0)

“Write a ton of code. Don’t watch TV.” – An Interview with Data Scientist Erik Bernhardsson
9 November 2015

Erik likes to work with smart people and deliver great software. After 5+ years at Spotify, he recently left for a new and exciting startup in NYC, where he is leading the engineering team.

At Spotify, Erik built up and led the team responsible for music recommendations and machine learning. They designed and built many of the large-scale machine learning algorithms used to power Spotify’s recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. He also authored Luigi, a workflow manager in Python with 3,000+ stars on GitHub – used by Foursquare, Quora, Stripe, Asana, and more.



What project have you worked on that you wish you could go back to, and do better?

Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time doing something that turns out to be nothing. But research is that way. Learning stuff is what matters, and kind of by definition you have to do stupid shit before you’ve learned it. Sorry for a super unclear answer 🙂

The main thing I did wrong for many years was I built all this cool stuff but never really made it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

Write a ton of code. Don’t watch TV. 🙂


I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.

Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that motivation by focusing on it for 80% of the time you’re awake.

I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in industry don’t care about, and things that don’t necessarily correlate with a successful career. It’s easy to fall down a rabbit hole and become extremely good at, say, deep learning (or anything), but at a company that means you’re just some expert who will have a hard time getting impact beyond your field. Looking back on my own situation, I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to the last question).

What do you wish you knew earlier about being a data scientist?

I don’t consider myself a data scientist so not sure 🙂

There are a lot of definitions floating around of what a data scientist does. I had held this theory for a long time but just ran into this blog post the other day; I think it summarizes my own impression pretty well. There are two camps: one is the “business insights” side, the other is the “production ML engineer” side. I managed teams at Spotify on both sides. It’s very different.

If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think with ML you can generate these super interesting insights about a business but in my experience it’s very rare. Sometimes we had people coming in writing a master’s thesis about churn prediction and you can get a really high AUC but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.
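Those “highly informed A/B tests” typically reduce to something like a two-proportion z-test on conversion rates. Here is a minimal stdlib sketch; the function name and the conversion numbers are invented for illustration:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # std. error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Variant B converts 550 of 10,000 users vs 500 of 10,000 for control A.
z, p = two_proportion_z(500, 10_000, 550, 10_000)
print(round(z, 2), round(p, 3))
```

In practice the “highly informed” part is the hard bit: picking the sample size before the test and deciding which hypotheses about user behavior are worth testing at all.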

For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s an “ML research team” and an “implementation team” with a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x longer and incentives just get misaligned. So I think anyone who wants to build cool ML algos should also learn backend and data engineering.

How do you respond when you hear the phrase ‘big data’?

Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of RAM for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There are a lot of those problems in the industry for sure. Unfortunately, sampling doesn’t always work.

The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10,000 daily jobs and things rarely crash – especially if you use Luigi. 🙂
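The contract Luigi provides – each task declares its requirements and an output target, and a task whose output already exists is skipped, so crashed pipelines resume cheaply – can be sketched without the library itself. This is a toy illustration of the pattern, not Luigi’s actual API:

```python
import os
import tempfile

class Task:
    """A task declares requirements and an output; done-ness is 'output exists'."""
    def requires(self):
        return []
    def output(self):
        raise NotImplementedError
    def run(self):
        raise NotImplementedError
    def complete(self):
        return os.path.exists(self.output())

def build(task):
    """Run a task's dependency graph depth-first, skipping completed tasks."""
    for dep in task.requires():
        build(dep)
    if not task.complete():  # idempotence: re-running a pipeline is cheap
        task.run()

workdir = tempfile.mkdtemp()

class Extract(Task):
    def output(self):
        return os.path.join(workdir, "raw.txt")
    def run(self):
        with open(self.output(), "w") as f:
            f.write("3 1 2")

class Summarise(Task):
    def requires(self):
        return [Extract()]
    def output(self):
        return os.path.join(workdir, "summary.txt")
    def run(self):
        with open(Extract().output()) as f:
            total = sum(int(x) for x in f.read().split())
        with open(self.output(), "w") as f:
            f.write(str(total))

build(Summarise())
with open(Summarise().output()) as f:
    print(f.read())  # prints "6"
```

The real library adds parameters, richer target types and a central scheduler on top of this same requires/output/run shape.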

But I’m sure there’s a fair amount of snake oil Hadoop consultants who convince innocent teams they need it.

The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.

It’s good enough when the opportunity cost outweighs the benefit 🙂 I.e. the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog full of 100s of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer but prioritization is probably the hardest problem to solve and it really just boils down to having some rules of thumb.


(image credit: Jer Thorp, CC2.0)

“Be a profit center” – An Interview with Data Artisan J.D. Long
2 November 2015

J.D. Long is currently AVP of Risk Management at RenaissanceRe and has a 15-year history of working as an analytics professional.



What project have you worked on that you wish you could go back to, and do better?

I’ve been asked this question before.

Longer answer: Interestingly, what I find myself thinking about when asked this question is not analytics projects where I wish I could redo the analysis, but rather instances where I felt I did good analysis but did a bad job explaining the implications to those who needed the info. Which brings me to #2…

What advice do you have for younger analytics professionals?

Learn technical skills and enjoy learning new things, naturally. But 1) always plot your data to visualize relationships, and 2) remember that at the end of the analysis you have to tell a story. Humans are hard-wired to remember stories, not numbers. Throw away your slide-deck pages with a table of p-values and instead put up a picture of someone’s face and tell their story. Or possibly show a graph that illustrates the story. But don’t forget to tell the story.

What do you wish you knew earlier about being a data artisan?

Inside of a firm, cost savings of $1mm seems like it should be the same as generating income of $1mm. It’s not. As an analyst you can kick and whine and gripe about that reality, or you can live with it. One rational reason for the inequality is that income is often more reproducible than cost savings. However, the real reason is psychological. Once a cost savings happens it’s the new expectation. So there’s no ‘credit’ for future years. Income is a little different in that people who can produce $1mm in income every year are valued every year. That’s one of the reasons I listed “be a profit center” in the post John referenced. There are many more reasons, but that alone is a good one.


How do you respond when you hear the phrase ‘big data’?

I immediately think, “buzz word alert”. The phrase is almost meaningless. I try to listen to what comes next to see if I’m interested.

What is the most exciting thing about your field?

Everybody loves a good “ah-ha!” moment. Analytics is full of those. I think most of us get a little endorphin drop when we learn or discover something. I’ve always been very open about what I like about my job. I like being surrounded by interesting people, working on interesting problems, and being well compensated. What’s not to love!


(image credit: T Young, CC2.0)

“Playing in Everyone’s Back Yard” – An Interview with Data Scientist David Hand
13 October 2015

David Hand is Senior Research Investigator and Emeritus Professor of Mathematics at Imperial College, London, and Chief Scientific Advisor to Winton Capital Management. He is a Fellow of the British Academy, and a recipient of the Guy Medal of the Royal Statistical Society. He has served (twice) as President of the Royal Statistical Society, and is on the Board of the UK Statistics Authority. He has published 300 scientific papers and 26 books. He has broad research interests in areas including classification, data mining, anomaly detection, and the foundations of statistics. His applications interests include psychology, physics, and the retail credit industry – he and his research group won the 2012 Credit Collections and Risk Award for Contributions to the Credit Industry. He was made OBE for services to research and innovation in 2013.



What projects have you worked on that you wish you could go back to and do better?

I think I always have this feeling about most of the things I have worked on – that, had I been able to spend more time on it, I could have done better. Unfortunately, there are so many things crying out for one’s attention that one has to do the best one can in the time available. Quality of projects probably also has a diminishing-returns aspect: spend another day/week/year on a project and you reduce the gap between its current quality and perfection by a half, which means you never achieve perfection.

What advice do you have for younger analytics professionals and in particular PhD students in the Sciences?

I generally advise PhD students to find a project which interests them, which is solvable or on which significant headway can be made in the time they have available, and which other people (but not too many) care about. That last point means that others will be interested in the results you get, while the qualification means that there are not also thousands of others working on the problem (because that would mean you would probably be pipped to the post).

What do you wish you knew earlier about being a statistician? What do you think industrial data scientists have to learn from this?

I think it is important that people recognise that statistics is not a branch of mathematics. Certainly statistics is a mathematical discipline, but so are engineering, physics, and surveying, and we don’t regard them as parts of mathematics. To be a competent professional statistician one needs to understand the mathematics underlying the tools, but one also needs to understand something about the area in which one is applying those tools. And then there are other aspects: it may be necessary, for example, to use a suboptimal method if this means that others can understand and buy in to what you have done. Industrial data scientists need to recognise the fundamental aim of a data scientist is to solve a problem, and to do this one should adopt the best approach for the job, be it a significance test, a likelihood function, or a Bayesian analysis. Data scientists must be pragmatic, not dogmatic. But I’m sure that most practicing data scientists do recognise this.


How do you respond when you hear the phrase ‘big data’?

Probably a resigned sigh. ‘Big data’ is proclaimed as the answer to humanity’s problems. However, while it’s true that large data sets, a consequence of modern data capture technologies, do hold great promise for interesting and valuable advances, we should not fail to recognise that they also come with considerable technical challenges. The easiest of these lie in the data manipulation aspects of data science (the searching, sorting, and matching of large sets) while the toughest lie in the essentially statistical inferential aspects. The notion that one nowadays has ‘all’ of the data for any particular context is seldom true or relevant. And big data come with the data quality challenges of small data along with new challenges of its own.

What is the most exciting thing about your field?

Where to begin! The eminent statistician John Tukey once said ‘the great thing about statistics is that you get to play in everyone’s back yard’, meaning that statisticians can work in medicine, physics, government, economics, finance, education, and so on. The point is that data are evidence, and to extract meaning, information, and knowledge from data you need statistics. The world truly is the statistician’s oyster.

Do you feel universities will have to adapt to ‘data science’? What do you think will have to be done in say mathematical education to keep up with these trends?

Yes, and you can see that this is happening, with many universities establishing data science courses. Data science is mostly statistics, but with a leavening of relevant parts of computer science – some knowledge of databases, search algorithms, matching methods, parallel processing, and so on.


(image credit: John Morgan, CC2.0)

“Tactical Empathy” – An Interview with Data Scientist Peadar Coyle
7 October 2015

Peadar Coyle is a Data Analytics professional based in Luxembourg. His intellectual background is in Mathematics and Physics, and he currently works for Vodafone in one of their Supply Chain teams. He is passionate about data science and is the lead author of this project. He also contributes to Open Source projects and speaks at EuroSciPy, PyData and PyCon. His expertise is largely on the statistical side of Data Science.

Several of Peadar’s interviewees asked him to share his own interview, so he humbly obliges.

Follow Peadar’s series of interviews with data scientists here.


What projects have you worked on that you wish you could go back to and do better?

I agree that it is better to look forward rather than look backward. And my skills have frankly improved since I first started doing what we could call professional data analysis (which was probably just before starting my Masters a few years ago).

One project that springs to mind (naming no names) is one where there was a huge breakdown in communication and misaligned incentives. The project needed far more communication and overran its initially allotted time. I also didn’t spend enough time communicating the risks and opportunities to stakeholders up front.

The data was a lot messier than expected, and management had committed to delivering results in 2 weeks. This was impossible: the data cleaning and exploration phase alone took longer than that. Now I would focus on quicker wins. I also rushed to the ‘modelling’ phase without really understanding the data. Terms like ‘understanding the data’ can sound a bit academic to some stakeholders, but you need to explain clearly and articulately how important the data generation process is, and the uncertainty in that data.

Some of this comes from experience – now I focus on adding value as quickly as possible and keeping things simple. Back then I fell for the siren call of ‘do more analysis’ rather than thinking about how the analysis is conveyed.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I don’t have a PhD but I have recently been giving advice to people in that situation.

My advice is that having a portfolio of work if possible is great, or at least move towards doing an online course on Machine Learning or something cool like that.

The PyData videos are a good place to start too. If you can, I’d recommend taking outreach or communication skills courses; many universities around the world offer them, and they’ll help you understand the needs of others.

I think frankly that the most important skill for a Data Scientist is the ‘tactical application of empathy’, and that is something that working in a team really helps you develop. One thing I feel my Masters let me down on – as is common in Pure Mathematics – was a shortage of experience working in a team.

What do you wish you knew earlier about being a data scientist?

The focus on communication skills, the need to add value every day. The fact that a budget or a project can be terminated at any moment.

Adding value every day means showing results and sharing them, talking to people about stuff. Share visualizations, and share results – a lot of data science is about relationships and empathy. In fact I think that the tactical application of empathy is the greatest skill of our times.

[bctt tweet=”‘The tactical application of empathy is the greatest skill of our times’ – @springcoil”]

You need to get out there and speak to the domain specialist, and understand what they understand. I believe that the best algorithms incorporate human as well as machine intelligence.

How do you respond when you hear the phrase ‘big data’?

I do like the distinction of small, medium and big data. I don’t worry so much about the terminology, and I focus on understanding exactly what my stakeholder wants from it.

I think, though, that it is often a distraction. I did one proof of concept as a consultant that was an operational disaster. We didn’t have the resources to support a DevOps culture, nor the capability to support a Hadoop cluster. Even worse, the problem could really have been solved more intelligently in RAM. But I got excited by the new tools without understanding what they were really for.

I think this is a challenge, part of myself maturing as an engineer/data scientist is appreciating the limits of tools and avoiding the hype. Most companies don’t need a cluster, and the mean size of a cluster will remain one for a long time. Don’t believe the salesmen, and ask the experts in your community about what is needed.

In short: I feel it is strongly misleading but it is certainly here to stay.

How did you end up being a data analyst? What is the most exciting thing about your field?

My academic and professional career has taken a bit of a weird path. I started at Bristol in a Physics and Philosophy program. It was a really exciting time, and I learned a lot (some of it non-academic). I went into that program because I wanted to learn everything. At various points – especially in 2009–2010 – the term ‘data science’ began to pick up, and when I went to grad school in 2010 I was ‘aware’ of the discipline. I took a lot of financial maths classes at Luxembourg just to keep that option open, yet in my heart I still wanted to be an academic.

I eventually realized (after some soul-searching) that academic opportunities were going to be too difficult to get, and that I could earn more in industry. So I did a few industrial internships, including one at import.io, and towards the end of my Masters I did a 6-month internship at a ‘small’ e-commerce company called Amazon.

I learned a lot at Amazon, and it was there that I realized I needed to work a lot harder on my software engineering skills. I’ve been working on them in my day job, through contributing to open source software, and through my various speaking engagements. I strongly recommend any wannabe data geeks come along to these and share your own knowledge 🙂

The most exciting thing about my field relates to that first statement about physics and philosophy – we truly are drowning in data, and with the computational resources we now have, we really can answer or simulate certain questions in a business context. The web is a microscope, and your ERP system tells you more about your business than you can actually imagine – I’m very excited to help companies exploit their data.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I like the OSEMIC framework (which I developed myself) and the CoNVO framework (which comes from Thinking with Data by Max Shron – I recommend the following video for an intro, and the book itself).

Let me explain: at the beginning of an ‘engagement’ I look for the Context, Need, Vision and Outcome of the project. Outcome means the delivery, and asking these questions in conversation with stakeholders is a really good way to get to the ‘business problem’ you are actually solving.

[bctt tweet=”‘I look for the Context, Need, Vision and Outcome of the project’ – @springcoil #datascience”]

A lot of this after a few years in the business still feels like an art rather than a science.

I like explaining to people the Data Science process – obtain data, scrub data, explore, model, interpret and communicate.
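As a sketch only – every function, field name, and figure below is invented for illustration – that obtain/scrub/explore/model/interpret-and-communicate loop can be laid out as a chain of small steps:

```python
# A minimal sketch of obtain → scrub → explore → model → communicate.
# All names and data are illustrative, not from any real project.

def obtain():
    # In practice: query a database or API; here, inline records.
    return [{"region": "A", "sales": 120.0},
            {"region": "A", "sales": None},
            {"region": "B", "sales": 80.0}]

def scrub(rows):
    # Drop records with missing values (one of many possible policies).
    return [r for r in rows if r["sales"] is not None]

def explore(rows):
    # Summarise before modelling: total sales per region.
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["sales"]
    return totals

def model(totals):
    # The 'model' here is simply each region's share of total sales.
    grand = sum(totals.values())
    return {k: v / grand for k, v in totals.items()}

def communicate(shares):
    # Turn numbers into a sentence a stakeholder can act on.
    top = max(shares, key=shares.get)
    return f"Region {top} drives {shares[top]:.0%} of sales"

print(communicate(model(explore(scrub(obtain())))))
# → Region A drives 60% of sales
```

The point of the sketch is the shape, not the contents: each stage is small, testable, and replaceable as the stakeholder conversation evolves.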

I think a lot of people get these kinds of notions, and a lot of my conversations at work recently have been about data quality – and data quality really needs domain knowledge. It is amazing how easy it is to misinterpret a number – especially around things like unit conversion.
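A toy example of the unit-conversion trap (all figures, field names, and source systems are invented): two systems report the same shipment weight in different units, and comparing the raw numbers gives nonsense until they are normalised.

```python
# Two hypothetical source systems record the same shipment.
record_erp = {"shipment": "S1", "weight": 1200, "unit": "kg"}
record_wms = {"shipment": "S1", "weight": 1.2, "unit": "t"}

CONVERSIONS_TO_KG = {"kg": 1.0, "t": 1000.0, "lb": 0.45359237}

def weight_kg(record):
    # Normalise to a single unit before any comparison or join.
    return record["weight"] * CONVERSIONS_TO_KG[record["unit"]]

# The raw numbers disagree wildly; the normalised ones agree.
assert record_erp["weight"] != record_wms["weight"]
assert weight_kg(record_erp) == weight_kg(record_wms) == 1200.0
```

This is exactly where domain knowledge earns its keep: knowing which unit each system reports in is not something the data itself will tell you.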

You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

A lot of what I said above applies here. One challenge is that some places aren’t ready for a data scientist, nor do they know how to use one. I would avoid such places and look for work elsewhere.

Some of this is a lack of vision, and one reason I do a lot of talks is ‘educated selling’: spreading the gospel of data-informed decision making, and showing how new tools such as the PyData stack and R are helping us extract more and more value from data.

I’ve also found that visualizations help a lot, humans react to stories and pictures more than to numbers.

My advice to new starters is to over-communicate and to learn some soft skills. The frameworks I mentioned help a bit in structuring and explaining a project to stakeholders. I recommend also reading this interview series – I learned a lot from it too. 🙂


(image credit: USASOC News Service, CC2.0)

“Constantly Ask Why” – Interview with Data Scientist Vanessa Sabino (Thu, 17 Sep 2015) https://dataconomy.ru/2015/09/17/constantly-ask-why-interview-with-data-scientist-vanessa-sabino/

Vanessa Sabino started her career as a system analyst in 2000, and in 2010 she jumped at the opportunity to start working with Digital Analytics, which brought together her educational background in Business, Applied Mathematics, and Computer Science. She gained experience from Internet companies in Brazil before moving to Canada, where she is now a data analysis lead for Shopify, transforming data into Marketing insights.

Follow Peadar’s series of interviews with data scientists here.


What projects have you worked on that you wish you could go back to and do better?

Working as a practitioner in a company, as opposed to consulting, means I always have the option of going back and improving past projects, as long as the time spent on this task can be justified. There are always new ideas to try and new libraries being published, so as a team lead I try to balance the time spent on higher priority tasks – which for my team currently is ETL work to improve our data warehouse – with exploratory analysis of our data sets and with creating and improving models that add value to our business users.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

My advice is to not underestimate the importance of communication skills, which goes from listening, in order to understand exactly what the data means and the context in which it is used, to presenting your results in a way that demonstrates impact and resonates with your audience.

What do you wish you knew earlier about being a data scientist?

I wish I knew 20 years ago how to be a data scientist! When I was finishing high school and had to decide what to study at university, I had some interest in Computer Science, but I had no idea what a career in that area would be like. The World Wide Web was just starting, and living in Brazil, I had the impression that all software development companies were north of the Equator. So I decided to study Business, imagining I’d spend my days using spreadsheets to optimize things. During the course I learned about data warehouses, business intelligence, statistics, data mining and decision science, but when it was over it was not clear how to get a job where I could apply this knowledge.

I went to work at an IT consulting company, where I had the opportunity to improve my software development skills, but I missed working with numbers, so after two years I left to start a new undergrad in Applied Mathematics, followed by a Masters in Computer Science. Then I continued working as a software developer, now in web companies, and that’s when I started learning about the vast amount of online behavior data they were collecting and the techniques being used to leverage its potential. “Data scientist” is a new name for something that covers many different traditional roles, and a better understanding of the related terms would have allowed me to make this career move sooner.

How do you respond when you hear the phrase ‘big data’?

I prefer to work closer to data analysis than to data engineering, so in an ideal world I’d have a small data set with a level of detail just right to summarize everything that I can extract from that data. Whatever size the data is, if someone is calling it big data it probably means that the tool they are using to manipulate it is no longer meeting certain expectations, and they are struggling with the technology in order to get their job done. I find it a little frustrating when you write correct code that should be able to transform a certain input to the desired output, but things don’t work as expected due to a lack of computing resources, which means you have to do extra work to get what you want. And the new solution only lasts until your data outgrows it again. But that’s just the way it is, and being in the boundary of what you can handle means you’ll be learning and growing in order to overcome the next challenges.
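One common way out of the resource trap Vanessa describes is to rewrite a correct-but-memory-hungry transformation as a single streaming pass, holding only running aggregates in memory. A generic sketch (the comma-separated input format is invented):

```python
def stream_totals(lines):
    # Aggregate in one pass, keeping only the running totals in
    # memory, instead of materialising every row at once.
    totals = {}
    for line in lines:
        key, value = line.strip().split(",")
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

# Works the same whether `lines` is a small in-memory list or an
# open file iterated lazily, e.g. stream_totals(open("events.csv")).
print(stream_totals(["a,1.5", "b,2.0", "a,0.5"]))
# → {'a': 2.0, 'b': 2.0}
```

The extra work she mentions is real, though: restructuring code this way takes effort, and the streaming version only lasts until the aggregates themselves outgrow memory.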

What is the most exciting thing about your field?

I’m excited about the opportunities to collaborate in a wide range of projects. Nowadays everyone wants to improve things with data informed decisions, so you get to apply your skills to many areas and you learn a lot in the process.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I always like to start with a simple proof of concept and iterate from there, using feedback from stakeholders to identify where the biggest gains are so that I can pivot the project in the right direction. But the most important thing in this process is to constantly ask “why”, in particular when dealing with requests. This helps you validate your understanding of the problem and enables you to offer better alternatives that the business user might not be aware of when they make a request.

[bctt tweet=”‘The most important thing in this process is to constantly ask why’ – @bani #datascience”]


(image credit: Britany G, CC2.0)

“Get the basics right first.” – Interview with Data Scientist Shane Lynn (Wed, 09 Sep 2015) https://dataconomy.ru/2015/09/09/get-the-basics-right-first-interview-with-data-scientist-shane-lynn/

Shane is the Co-Founder of KillBiller, a company that helps mobile operators to gain new customers. They provide a mobile phone plan comparison service in Ireland that allows people to use their own call, text, and data usage information to find the best value mobile tariff for their individual needs. In this position, he’s finding his way as a tech-startup founder, learning the actual ropes of creating a profitable business, and stretching his tech muscles on a complex and scalable Python backend on the Amazon cloud.

Follow Peadar’s series of interviews with data scientists here.


What projects have you worked on that you wish you could go back to and do better?

Maybe every one?! I think that data science projects always have a bit of unfinished business. It’s a key part of the trade to be able to identify when enough is enough, and when extra time would actually lead to tangible results. Is 4 hours tuning a model worth an extra 0.01% in accuracy? Maybe in some cases, but not most. Unfortunately, a huge number of real data science business cases leave you with a little “ooh, I could have tried…” or “oh, we might have optimised…”.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

“The more I learn, the more I realise how much I don’t know.” There seems to be a never-ending list of new technologies and new techniques to get your head around. I would say to budding professionals that if you get a solid understanding of the basic key techniques in your repertoire to start with, you’ll do better than learning buzzwords about the latest trends. While the headline-grabbing bleeding edge research will always seem to sparkle, the reality of data science in business is that people are still using proven techniques that work reliably and simply – think regression and k-means over deep learning and natural language processing. Get the basics right first.
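In that spirit, the proven techniques Shane names are small enough to write out by hand. Here is a bare-bones one-dimensional k-means (Lloyd's algorithm) as a sketch of the basics, not production code; the data and starting centres are invented:

```python
def kmeans_1d(points, centers, iterations=10):
    # Lloyd's algorithm on 1-D data: assign each point to its nearest
    # centre, then move each centre to the mean of its assigned points.
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(pts) / len(pts) if pts else c
                   for c, pts in clusters.items()]
    return sorted(centers)

# Two obvious groups around 1.0 and 9.0 are recovered.
print(kmeans_1d([0.5, 1.0, 1.5, 8.5, 9.0, 9.5], centers=[0.0, 5.0]))
# → [1.0, 9.0]
```

Knowing the mechanics at this level makes it much easier to judge when a library implementation (or a fancier method) is actually warranted.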

What do you wish you knew earlier about being a data scientist?

Data preparation. I know you see it written down, but there is no exaggeration at all in the phrase – you’ll spend 80% of your time preparing the data. I’m sure everyone says it, and everyone should know it, but it’s a key part of the work and a very important step in the information discovery process.

[bctt tweet=”‘you’ll spend 80% of your time preparing the data’ – @shane_a_lynn”]
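What that 80% looks like in miniature (a wholly invented example): deduplication, whitespace and locale handling, and missing-value decisions, all before any analysis happens.

```python
raw = [
    {"id": "1", "amount": " 42.50 "},
    {"id": "1", "amount": " 42.50 "},   # duplicate record
    {"id": "2", "amount": ""},          # missing amount
    {"id": "3", "amount": "17,00"},     # locale-specific decimal comma
]

def prepare(rows):
    cleaned, seen = [], set()
    for r in rows:
        if r["id"] in seen:             # drop duplicate ids
            continue
        seen.add(r["id"])
        text = r["amount"].strip().replace(",", ".")
        if not text:                    # skip unparseable records
            continue
        cleaned.append({"id": r["id"], "amount": float(text)})
    return cleaned

print(prepare(raw))
# → [{'id': '1', 'amount': 42.5}, {'id': '3', 'amount': 17.0}]
```

Each of these small decisions (drop or impute? which locale?) is exactly the kind of judgement call that eats the 80%.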

How do you respond when you hear the phrase ‘big data’?

That depends on where it comes from. At a business conference, from a salesman – sometimes with rolling eyes. At a tech meetup in Dublin – maybe with some interest. I think Big Data has been hyped to death, and the reality is that, for now, very few companies actually require a large-scale Hadoop deployment. I’ve worked with some of the largest companies on data science projects and, to date, have been able to process the data required on a single machine. I’m aware that this is an Ireland-specific viewpoint, where our population and market size naturally reduce the volume of data in many fields. However, I do think that Big Data is ultimately a function of the IT department; data scientists will simply leverage the tools to extract meaningful excerpts or subsets for analysis.

What is the most exciting thing about your field?

It’s ever-changing, ever-growing, and moving quickly. While it’s sometimes daunting to think of the speed of progress, it’s also extremely exciting to be involved in a world where new ideas, tools, and techniques spread on a weekly basis. There’s a huge amount of enthusiasm out there in the community and a plethora of new opportunities to be explored.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I tend to start tackling each problem after I’ve had a good look at the data behind it. Perhaps an extract, perhaps an MVP-type model, but just enough to grasp the state of the data and the amount of cleansing required, and to identify potential problems and benefits. It’s extremely difficult to accurately estimate the outcome of a data science problem before you start working, so a few hours of exploration are very worthwhile. The effort is usually limited naturally by time and budget, and you can relatively quickly reach a point where negligible gains are being made for additional time investment.

You spent some time as a Consultant in Data Analytics. How did you manage cultural challenges, dealing with stakeholders and executives? What advice do you have for new starters about this?

There’s a political landscape in every company that you’ll join. Take the time to learn the ropes and learn how your company deals with these items. I find that frequent and realistic updates on progress and expectations are key to managing the various parties. Don’t hide the dirty bits or the issues. And probably budget three times the time that you initially think for each task – there’s always hidden issues!

You have a cool startup – can you comment on how important it is, as a CEO, to make such a company data-driven or data-informed?

I’m working on KillBiller, an Irish startup that makes difficult decisions easy. KillBiller automatically audits your mobile phone usage and works out exactly what you would spend on every mobile network and tariff. We’ve saved almost 20,000 people money on their phone bills!

In our case, we’re all about data – processing people’s mobile usage securely, accurately, and quickly, and presenting the results in a meaningful way. In addition, a data-driven approach to the startup world has its advantages: a solid understanding of your marketing effectiveness, website traffic, user retention, and route to revenue lets us make decisions backed by evidence rather than intuition.

More information about Shane can be found on his blog.


(image credit: Ricymar Photography)

“Those who rely on gut instinct or trivial analyses will be out-competed” – Interview with Data Scientist Jon Sedar (Wed, 19 Aug 2015) https://dataconomy.ru/2015/08/19/those-who-rely-on-gut-instinct-or-trivial-analyses-will-be-out-competed-an-interview-with-data-scientist-jon-sedar/

Jon is a consulting data scientist, trained in physics and machine learning, with 10 years professional background in data analysis and management consulting. He co-manages a niche data science consultancy called Applied AI, operating primarily in the insurance sector throughout UK, Ireland and Europe. He’s also an organiser and volunteer within data-for-good social movements, and occasional speaker at tech and industry events.

Follow Peadar’s series of interviews with data scientists here.


1. What project have you worked on do you wish you could go back to, and do better?

I won’t name names, but throughout my career I’ve encountered projects – and indeed full-time jobs – where major issues have popped up not due to technologies or analysis, but due to ineffective communication, either institutional or interpersonal. To pick an example, one particular job was an analyst’s nightmare due to overbearing senior management and too-rapid engineering: the task was to produce KPIs of the company’s health, but the entire software and hardware stack changed so frequently that getting even the most basic information out was extremely hard work. That could have been fixed by stronger communication and pushback on my part – but my opinions weren’t accepted and it wasn’t to be. Another large project (of which I was only a very minor part) was scuppered due to mishandled client expectations and caused no end of overwork for the consulting team. Every project needs better communication, always.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

I’ll deal with these separately, since there are (or should be) different reasons why people are in each group.

To PhD candidates, I simply hope that they truly love their subject and are careful to gain commercially useful skills along the way. I’ve friends who have completed PhDs, some who’ve quit midway, and some like me who considered it but instead returned to industry after an MSc. You might not plan to go into industry, but gaining the following skills is vital for academia too:

  • Reproducible research (version control, data management, robust / testable / actually maintainable code).
  • Lightweight programming (learn Python, it’s easy, able to do most things, always available, the packages are very well maintained and the community is very strong).
  • Statistics (Bayesian, frequentist, whatever – make sure you have a really solid grasp of the fundamentals).
  • Finally ensure you have proven capability in high-quality communication – and a dead-tree LaTeX publication doesn’t count. Get yourself blogging, answering questions on Stack Overflow, presenting at meetups and conferences, working with others, consulting in industry etc. As you improve upon this you’ll really distinguish yourself from the herd.

Also some flamebait: Whilst I love the idea of improving humanity’s body of knowledge in the hard sciences, I’m not convinced that a PhD in the soft sciences is worthwhile nowadays, at least not straight out of school. If you want to research the humanities just take your degree and go work for a giant search engine / social network / online retailer; you’ll get real-world issues and massive study sizes from day one.

To the younger analytics professionals: regardless of the company or industry in which you find yourself, build up your skills as per the PhD advice above, polish your external profile (blogs, talks, research papers etc.) and don’t ever be afraid to jump ship and try a few things out. Try to have 3 months’ pay in your savings account, maintain your friendships local and international, and set up a basic vehicle for doing independent contracting / consulting work.

Over the years I’ve tried a lot of different jobs in a few different locations. I felt happiest once I’d set up my own company and knew that I would always have a method to market my skills independent of anyone else. Data science skills are likely to be important for a good few years yet, so if you’re well-connected, well-respected and mobile, you can try a lot of things, find what you love, and will never be out of work for long.

3. What do you wish you knew earlier about being a data scientist?

Lots to unpack in that question! If I can call myself a scientist at all, then it’s an empiricist rather than a theoretician. As such I consider data to be the record of things that happen(ed), and science the formalisation and generalisation of our understanding of those things. ‘Data scientist’ is thus a useful shorthand for someone who specialises in learning from data, communicating insights and taking or recommending reasoned actions accordingly.

With that in mind, I’d advise my younger self to never forget that it’s that final step that matters most – allowing decision makers to take reasoned actions according to your well-communicated insights. That decision maker may be your client, your boss or even simply yourself, but without an effective application, ‘data science’ is really just research & development – and chances are you’re not being paid to do R&D.

4. How do you respond when you hear the phrase ‘big data’?

I think we’re far enough along the hype cycle now that nearly all data science practitioners recognise both the possibilities and the constraints of performing large-scale analyses. Proper problem-definition and product-market fit are the most important to get right, and hopefully even your typical non-technical business leader is no longer bedazzled by the term and instead wants to see actionable insights that don’t require a major engineering project.

That said, I’m still happy to see experts in the field continue to preach that whilst gathering reams of ‘big’ data (which I take here to be primarily commercially-related data including interface interactions, system log files, audio, images, video feeds, positional info, live market movements etc.) can lead to something immensely powerful, it can easily become a giant waste of everyone’s time and resources.

Truly understanding the behaviour of a system/process, and properly cleaning, reducing and sub-sampling datasets are practices long-understood by the statistics community. A reasoned hypothesis tested with ‘small-medium’ data on a modest desktop machine beats blind number crunching any day.
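As a hedged illustration of that last point, a permutation test of a reasoned hypothesis on invented 'small data' needs nothing beyond the standard library and a modest machine:

```python
import random

def permutation_test(a, b, trials=10_000, seed=0):
    # Null hypothesis: a and b come from the same distribution.
    # Shuffle the pooled values many times and count how often a
    # difference in means at least as large arises by chance.
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            hits += 1
    return hits / trials

control = [12.1, 11.8, 12.4, 12.0, 11.9]
variant = [12.9, 13.1, 12.8, 13.3, 12.7]
p = permutation_test(control, variant)
print(p)  # a small p-value: the gap is unlikely to be chance alone
```

Ten observations and a few lines of code, yet the conclusion is defensible – which is rather the point about reasoned hypotheses beating blind number crunching.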

5. What is the most exciting thing about your field?

Well, the tools for applying the analysis techniques, and the techniques themselves are certainly moving at a hell of a pace, but science & technology always does. I really enjoy having the opportunity to research and apply novel techniques to client problems.

More widely I’m excited to see the principles of gathering, maintaining and learning from data permeate all aspects of businesses and organizations. There are well-developed data science platforms popping up every day, new software packages to use, heavily over-subscribed meetup groups and conferences everywhere, and it’s great to see the formalisation and commoditization of certain technical aspects. Just as it’s unlikely that anyone would try today to run an enterprise without a website, a telephone or even an accountant, I expect that a data science capability will be at the core of most businesses in future.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I assume you mean an analytical problem rather than a data management problem or something else.

I think it’s quite simple really, and just common sense to ensure that you define well the analytical problem, and the inputs and outputs of your work. What question are we trying to answer? How should the answer be presented and how will it be used? What analysis and what data will let us provide insights based on that question? What data do we have and what analysis is possible / acceptable within our organisational and technical constraints? Then prototype, develop, communicate and iterate until baked.

7. Do you feel ‘Data Science’ is a thing – or do you feel it is just some Engineering functions rebranded? Do you think we could do more of the hypothesis driven scientific enquiry?

As above, I think that in future the practice of gathering, maintaining and learning from data will be core to nearly all commercial and social enterprises. Bringing academic research to bear on real-world problems is just too useful, and those who rely on gut instinct or trivial analyses will be out-competed.

That said, I think we’re already seeing a definite split between data science (statistics, experimentation, prediction), data processing (large-scale systems development), and data engineering (acquiring, maintaining and making available high-quality data sources), and no doubt in future there will be more spin-out skills that take on a life of their own. The veritable zoo of job titles spawned from web development is a good example: UI designers, UX designers, javascript engineers, mobile app engineers, hosting and replication engineers etc etc.

Finally I’d just like to thank you for putting this series of interviews / blogposts together, it’s a really interesting resource, particularly as the data science industry is maturing.


Peadar Coyle is a Data Analytics Professional based in Luxembourg. He has helped companies solve problems using data relating to Business Process Optimization, Supply Chain Management, Air Traffic Data Analysis, Data Product Architecture and Commercial Sales teams. He is always excited to evangelize about ‘Big Data’ and the ‘Data Mentality’, which comes from his experience as a Mathematics teacher and his Masters studies in Mathematics and Statistics. His recent speaking engagements include PyCon Sei in Florence, and he will soon be speaking at PyData in Berlin and London. His expertise includes Bayesian Statistics, Optimization, Statistical Modelling and Data Products.


(image credit: R∂lf Κλενγελ)

“I strongly prefer looking forward. There’s so much to build!” – An Interview with AI Researcher Trent McConaghy (15 Jul 2015)

Trent McConaghy has been doing AI and ML research since the mid-90s. He co-founded ascribe GmbH, which enables copyright protection via internet-scale ML and the blockchain. Before that, he co-founded Solido, where he applied ML to circuit design; most of the big semiconductor companies now use Solido. Before that, he co-founded ADA, also doing ML + circuits; it was acquired in 2004. Before that, he did ML research at the Canadian Department of National Defence. He has written two books and 50 papers and patents on ML. He co-organizes the Berlin ML meetup. He keynoted Data Science Day Berlin 2014, PyData Berlin 2015, and more. He holds a PhD in ML from KU Leuven, Belgium.

Follow Peadar’s series of interviews with data scientists here.


At PyData in Berlin I chaired a panel – one of the guests was Trent McConaghy, so I reached out to him to hear his views on analytics. I liked his emphasis on shipping, and the challenges he’s run into in his own work.

What project have you worked on do you wish you could go back to, and do better?

Before I answer this I must say: I strongly prefer looking forward. There’s so much to build!
I’ve made many mistakes! One is having rose-colored glasses for criteria that ultimately mattered little. For example, for my first startup, I hired a professor who’d written 100+ papers and textbooks. Sounds great, right? Well, he’d optimized his way of thinking for academia, but was not terribly effective on the novel ML problems in my startup. It was no fun for anyone. We had to let him go.

What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Do something that you are passionate about, and that matters to the future. It starts with asking interesting scientific questions, and ends (ideally) with results that make a meaningful impact on the world’s knowledge.

What do you wish you knew earlier about being a data scientist?

As an AI researcher and an engineer: one thing that I didn’t know, but served me well because I did it anyway, was voracious reading of the literature. IEEE Transactions for breakfast:) That foundation has served me well my whole career.

How do you respond when you hear the phrase ‘big data’?

Marketing alert!!

That said: I like how unreasonably effective large amounts of data can be. And that it’s shifted some of the focus away from algorithmic development on toy problems.

What is the most exciting thing about your field?

AI as a field has been around since the 50s. Some of the original aims of AI are still the most exciting! Getting computers to do tasks in superhuman fashion is amazing. These days it’s routine in narrow settings. When the world hits AI that can perform at the cognitive levels of humans or beyond, it changes everything. Wow! It’s my hope to help shepherd those changes in a way that is not catastrophic for humanity.

How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

I follow steps, along the lines of the following.

  • Write down the goals and the question(s) I’m trying to answer, and give myself a time limit.
  • Get benchmark data and measure(s) of quality. Draw mockups of graphs I might plot.
  • Test the dumbest possible initial off-the-shelf algorithm and problem framing (including where I get the data).
  • Is it good enough compared to the goals? Great, stop! (Yes, linear regression will solve some problems:)
  • Try the next highest bang-for-the-buck algorithm & problem framing. Ideally, it’s off the shelf too. Benchmark / plot / etc. Repeat. Stop as soon as successful, or when the time limit is hit.
  • Ship!
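The baseline-first loop above can be sketched in a few lines of Python. The synthetic dataset, the R² metric, and the "good enough" threshold below are illustrative assumptions for the sketch, not part of Trent's actual workflow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative benchmark data (an assumption, not a real dataset):
# y is roughly linear in X plus small noise, so a linear model should suffice.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def r2(y_true, y_pred):
    """Coefficient of determination -- the measure of quality chosen up front."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

GOAL_R2 = 0.9  # the "good enough" bar, written down before any modelling

# Dumbest possible framing: always predict the mean.
baseline = np.full_like(y, y.mean())
print("mean predictor R^2:", round(r2(y, baseline), 3))  # 0.0 by construction

# Next bang-for-the-buck, still off the shelf: ordinary least squares.
design = np.c_[X, np.ones(len(X))]          # add an intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
score = r2(y, design @ coef)
print("linear regression R^2:", round(score, 3))

if score >= GOAL_R2:
    print("Good enough -- ship it!")
```

The point of the sketch is the stopping rule, not the models: each rung of the ladder is benchmarked against the same pre-committed goal, and the loop halts the moment a model clears it.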



Peadar Coyle is a Data Analytics Professional based in Luxembourg. He has helped companies solve problems using data relating to Business Process Optimization, Supply Chain Management, Air Traffic Data Analysis, Data Product Architecture and Commercial Sales. He is always excited to evangelize about ‘Big Data’ and the ‘Data Mentality’, which comes from his experience as a Mathematics teacher and his Masters studies in Mathematics and Statistics. His recent speaking engagements include PyCon Sei in Florence, and he will soon be speaking at PyData in Berlin and London. His expertise includes Bayesian Statistics, Optimization, Statistical Modelling and Data Products.


(Image Credit: Tris Linnell / Turing Bombe / CC BY-SA 2.0)
