Jumping from PhD to Data Scientist: 3 Tips for Success
By William Chen, Dataconomy (https://dataconomy.ru), Thu, 18 Dec 2014
https://dataconomy.ru/2014/12/18/jumping-from-phd-to-data-scientist-3-tips-for-success/

William is a data scientist at Quora, interested in data-driven decision making to improve both product and business. He is always interested in learning new things and exploring the ubiquity of data in everyday life.


The Interviewees

DJ Patil is VP of Product at RelateIQ (acquired by Salesforce). He has held many other roles, including Chief Scientist at LinkedIn. DJ co-coined the term “Data Scientist” and co-authored “Data Scientist: The Sexiest Job of the 21st Century.” DJ transitioned into data science following a research scientist position at the University of Maryland. Follow him at @dpatil (Twitter) or at Dj Patil (Quora).

Michelangelo D’Agostino is currently Senior Data Scientist at Civis Analytics. Formerly, he was head of data science at Braintree and a senior analyst on the 2012 Obama Campaign’s analytics engine. He transitioned into data science following a PhD in Astrophysics from Berkeley. Follow him at @MichelangeloDA (Twitter) or at Michelangelo D’Agostino (Quora).

1. Seek fast, collaborative environments

During graduate school, Michelangelo did his PhD on IceCube, a neutrino physics experiment at the South Pole. There, they measured cosmic neutrinos via sensors buried in the polar ice cap. One big transition for him as a physicist was the opportunity to learn in a fast, collaborative environment. Michelangelo explains:

All of a sudden, I was working with a couple hundred people all around the world, half in Europe, half in the US in all these different time zones.

It felt like I wasn’t working on something by myself. I was working on really interesting problems with other smart people and doing really hard work. I think that was what kept me in grad school – knowing that I was working with other smart people in a collaborative environment.

Michelangelo goes on to explain that fast, collaborative environments are what distinguish doing data science in industry from doing research in academia. After working for the 2012 Obama Re-election Campaign, Michelangelo briefly contemplated going back to finish his post-doc, but decided to stay in data science because the environment suited him better.

I like working with people a lot more than I like working by myself. I like to work on things that have more impact. You see a lot more of it in industry, in data science, than you do in research.

I like the pace a lot more. I think research can often be very slow, especially particle physics. It takes 10 years to build an experiment now. You have to have a monastic personality to be a physicist nowadays.

Unfortunately, this kind of environment can be rare for any PhD student working in a research environment. DJ Patil explains the culture shock that many PhD students get when they get their first job in data science:

In academia, the first thing you do is sit at your desk and then close the door. There’s no door anywhere in Silicon Valley; you’re out on the open floor. These people are very much culture shocked when people tell them, “No you must be working, collaborating, engaging, fighting, debating, rather than hiding behind the desk and the door.”

I think that’s just lacking in the training, and where academia fails people. They don’t get a chance to work in teams; they don’t work in groups.

Ultimately, DJ says that forgetting that data science is collaborative is a common mistake people make when considering jumping into data science.

People make a mistake by forgetting that Data Science is a team sport. People might point to people like me or Hammerbacher or Hilary or Peter Norvig and they say, oh look at these people! It’s false, it’s totally false, there’s not one single data scientist that does it all on their own.

Data science is a team sport, somebody has to bring the data together, somebody has to move it, someone needs to analyze it, someone needs to be there to bounce ideas around.

It’s a common trap during one’s PhD to end up only focusing on one’s dissertation and research. Seek out opportunities to further your research in fast, collaborative environments. This includes getting involved with large collaborative projects in your department, team-based competitions in your field, working with others on side or related projects, actively speaking about your research, and attending various conferences, events, and activities!

2. Delve deeply into hard, dirty problems

Not everything you learn in graduate school is specialized domain knowledge. In fact, the experience of working on difficult problems, and the strategies you use to approach them, are among the most valuable things Michelangelo picked up during his astrophysics PhD. To get the kind of experience that will ultimately become relevant to data science, Michelangelo suggests:

Work on a hard problem for a long time and figure out how to push through and not be frustrated when something doesn’t work, because things just don’t work most of the time. You just have to keep trying and keep having faith that you can get a project to work in the end. Even if you try many, many things that don’t work, you can find all the bugs, all the mistakes in your reasoning and logic and push through to a working solution in the end.

Specifically, you should always be looking for applications of your research on real, live datasets. This teaches you the nuances of dealing with large, messy datasets, and lets you understand much more than just the theory of your research. Michelangelo explains:

You can read about it, and people can teach you techniques, but until you’ve actually dealt with a nasty data set that has a formatting issue or other problems, you don’t really appreciate what it’s like when you have to merge a bunch of data sets together or make a bunch of graphs to sanity check something and all of a sudden nothing makes sense in your distributions and you have to figure out what’s going on.

3. Do things beyond your academic specialty

During graduate school, the research that you are doing with your advisor might seem all-consuming. However, it is useful to step back, look at the bigger picture, and pursue the other skills that may serve to augment your experience as a PhD student. DJ offers a reminder:

Many people who come out of academia are very one-dimensional. They haven’t proven that they can make anything, all they’ve proven is that they can study something that nobody (except maybe your advisor and your advisor’s past two students) cares about. That’s a mistake in my opinion.

During that time, you can solve that hard PhD caliber problem AND develop other skills. For example, giving talks, coding in hackathons, etc. Do things in parallel and you’ll get much more out of your academic experience.

A traditional academic curriculum is actually lacking in teaching all of the skills one needs to even become a data scientist. Michelangelo notes this in his interview and says:

You can’t finish a degree and know all the things you need to know to be a data scientist. You have to be willing to constantly teach yourself new techniques.

Michelangelo elaborates that not constantly teaching yourself new things sends a negative signal to companies looking to hire data scientists.

From a hiring perspective, when I talk to PhD students who say they want to be data scientists, I become skeptical if they haven’t taken any active steps.

“Hey, I participated in these Coursera courses or these Kaggle competitions.” or “I’ve gone to the Open Government Meetup and have done these data visualizations.”

Things like that demonstrate that you can work on problems outside your academic specialty, and they show that you really have initiative.

One of the largest dangers of coming out of academia is that you constrain yourself to an environment that rewards an intensely narrow focus on one thing. To expand and to be able to tackle the challenges of becoming a data scientist, you must continue developing your other skills in parallel, and always be on the lookout for new challenges and opportunities.

More Resources

The Data Science Handbook features interviews from 25 amazing data scientists, including DJ and Michelangelo. Sign up at Data Science Handbook to get 3 free interviews (including the full interviews with DJ and Michelangelo).

For more concrete advice on making the transition from academia to data science, check out the answers at

Subscribe to Storytelling with Statistics to get updates on more posts like these!

(This post was originally published on Quora)

(Image Credit: Eric Fischer)

7 Ways Data Scientists use Statistics
By William Chen, Dataconomy, Fri, 28 Nov 2014
https://dataconomy.ru/2014/11/28/7-ways-data-scientists-use-statistics/



1. Design and interpret experiments to inform product decisions

Observation: Advertisement variant A has a 5% higher click-through rate than variant B.

Data Scientists can help determine whether or not that difference is significant enough to warrant increased attention, focus, and investment.
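As a rough sketch of what "significant" means here, the snippet below runs a two-proportion z-test on two click-through rates. The click and impression counts are invented for illustration (the original observation gives no sample sizes), and `two_proportion_z_test` is a hypothetical helper name, not a library function.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one CTR."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled CTR under the null hypothesis of no difference
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: A's CTR (10.5%) is 5% higher, in relative terms, than B's (10.0%)
z_small, p_small = two_proportion_z_test(1050, 10_000, 1000, 10_000)
print(f"10k impressions each:  z = {z_small:.2f}, p = {p_small:.3f}")

# The same rates with ten times the traffic
z_large, p_large = two_proportion_z_test(10_500, 100_000, 10_000, 100_000)
print(f"100k impressions each: z = {z_large:.2f}, p = {p_large:.5f}")
```

With these made-up counts, the same 5% relative lift is not distinguishable from noise at 10,000 impressions per variant, but it is at 100,000. Whether a difference deserves attention depends on the sample size as much as on the size of the gap.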

They can also help you understand experimental results. This is especially useful when you’re measuring many metrics, running experiments that affect each other, or seeing Simpson’s Paradox in your results.
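To make Simpson’s Paradox concrete, here is a small worked example. All counts are hypothetical, and the split into "desktop" and "mobile" segments is only an assumption for illustration.

```python
# Hypothetical (conversions, visitors) per segment for two variants
results = {
    "A": {"desktop": (81, 87), "mobile": (192, 263)},
    "B": {"desktop": (234, 270), "mobile": (55, 80)},
}

def rate(conversions, visitors):
    return conversions / visitors

def overall_rate(segments):
    """Conversion rate after pooling every segment together."""
    total_conversions = sum(c for c, _ in segments.values())
    total_visitors = sum(v for _, v in segments.values())
    return total_conversions / total_visitors

for segment in ("desktop", "mobile"):
    a = rate(*results["A"][segment])
    b = rate(*results["B"][segment])
    print(f"{segment}: A = {a:.1%}, B = {b:.1%}")

print(f"overall: A = {overall_rate(results['A']):.1%}, "
      f"B = {overall_rate(results['B']):.1%}")
```

Variant A wins within each segment, yet B wins once the segments are pooled, because most of A’s traffic came from the segment that converts worse. Reading only the pooled number would lead to the wrong decision.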

Let’s say you’re a national retailer and you’re trying to test the effect of a new marketing campaign. Data Scientists can help you decide which stores you should assign to the experimental group to get a good balance between the experimental and control groups, what sample size you should assign to the experimental group to get clear results, and how to run the study spending as little money as possible.

Statistics Used: Experimental Design, Frequentist Statistics (Hypothesis Tests and Confidence Intervals)

2. Build models that predict signal, not noise

Observation: Sales in December increased by 5%.

Data Scientists can tell you potential reasons why sales have increased by 5%. Data scientists can help you understand what drives sales, what sales could look like next month, and potential trends to pay attention to.

See “What is an intuitive explanation of overfitting?” to understand why it’s important to fit only on signal.

Statistics Used: Regression, Classification, Time Series Analysis, Causal Analysis
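One way to see the difference between fitting signal and fitting noise is to compare a simple model with one that memorizes its training data. The sketch below is only an illustration under assumed conditions: the data is synthetic (a straight line plus Gaussian noise), and the "memorizer" is a 1-nearest-neighbour lookup chosen because it overfits in an obvious way.

```python
import random

random.seed(0)

def make_data(n):
    """Signal is y = 2x; everything else is Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: reproduces every training point, noise included."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(50)
test_x, test_y = make_data(500)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name}: train MSE = {mse(model, train_x, train_y):.2f}, "
          f"test MSE = {mse(model, test_x, test_y):.2f}")
```

The memorizer looks perfect on the data it was trained on, with a training error of zero, but does worse than the plain line on fresh data, because part of what it learned was noise.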

3. Turn big data into the big picture

Observation: Some customers only buy healthy food, while others only buy when there’s a sale.

Anyone can observe that your grocery store has 100,000 customers buying 10,000 items.

Data Scientists can help you label each customer, group them with similar customers, and understand their buying habits. This allows you to see how business developments can affect certain groups of the population, instead of looking at everyone as a whole or looking at everyone individually.

Dunnhumby breaks down grocery shoppers into groups including Shoppers On A Budget, Finest, Family Focused, Watching the Waistline, and Splurge and Save [1].

Statistics Used: Clustering, Dimensionality Reduction, Latent Variable Analysis
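Grouping similar customers can be sketched with k-means clustering. The example below is a toy version under stated assumptions: the six customers and their two features are invented, and the starting centroids are fixed by hand so the run is reproducible. A real segmentation would use many more features and a library implementation.

```python
import math

# Hypothetical customers: (share of basket that is healthy food,
#                          share of basket bought on sale)
customers = {
    "ana":   (0.90, 0.10),
    "ben":   (0.85, 0.20),
    "chloe": (0.95, 0.15),
    "dev":   (0.20, 0.90),
    "elif":  (0.30, 0.85),
    "femi":  (0.25, 0.95),
}

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iterations=100):
    """Plain Lloyd's algorithm starting from the given centroids."""
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep an empty cluster's centroid where it was
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

points = list(customers.values())
# Deterministic start: seed the two clusters with the first and last customer
labels, centroids = k_means(points, [points[0], points[-1]])

for name, label in zip(customers, labels):
    print(name, "-> segment", label)
```

Here the algorithm separates the healthy-food shoppers from the sale-driven shoppers without being told which is which. Naming the segments, as in the Dunnhumby example, is still a human judgment made after looking at each cluster.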

4. Understand user engagement, retention, conversion, and leads

Observation: A lot of people are signing up for our site and never coming back.

Why do your customers buy items from your site? How do you keep your clients coming back? Why are users dropping out of your funnel? When will they come out next? What kinds of emails from your company are most successfully engaging users? What are some leading indicators of engagement, activity, or success? What are some good sales leads?

Statistics Used: Regression, Causal Effects Analysis, Latent Variable Analysis, Survey Design

5. Give your users what they want

Given a matrix of users (customers, clients, users) and their interactions (clicks, purchases, ratings) with your company’s items (ads, goods, movies), can you suggest what items your users will want next?

Statistics Used: Predictive Modeling, Latent Variable Analysis, Dimensionality Reduction, Collaborative Filtering, Clustering
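A minimal sketch of one collaborative filtering approach: score each item a user has not bought by how similar the users who did buy it are to that user. The purchase data is invented, Jaccard similarity is just one reasonable choice of similarity measure, and `recommend` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical purchase history: user -> set of items bought
purchases = {
    "u1": {"pasta", "tomato sauce", "parmesan"},
    "u2": {"pasta", "tomato sauce"},
    "u3": {"granola", "yogurt"},
    "u4": {"granola", "yogurt", "pasta"},
    "u5": {"yogurt", "granola", "berries"},
}

def jaccard(a, b):
    """Overlap between two sets of items, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, purchases, top_n=2):
    """Rank unseen items by the summed similarity of the users who bought them."""
    seen = purchases[user]
    scores = defaultdict(float)
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = jaccard(seen, items)
        for item in items - seen:
            scores[item] += similarity
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

print(recommend("u2", purchases))
```

For `u2`, who bought pasta and tomato sauce, parmesan ranks first because the most similar user (`u1`) bought it. Production recommenders typically rely on the latent-variable and dimensionality-reduction methods listed above rather than raw set overlap, but the underlying idea is the same.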

6. Estimate intelligently

Observation: We have a banner with 100 impressions and 0 clicks.

Is 0% a good estimate of the click-through rate?

Data Scientists can combine this banner’s data with global data and prior knowledge to get a better estimate, tell you the properties of that estimate, and summarize what the estimate means.

If you’re interested in a better approach to estimating the click-through rate, check out “What are the advantages of Bayesian methods over frequentist methods in web data?”

Statistics Used: Bayesian Data Analysis
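One common way to do this is a Beta-Binomial model, sketched below. The prior is an assumption made up for the example: that banners on the site average a 1% click-through rate, and that this belief carries as much weight as 200 impressions. `posterior_ctr` is a hypothetical helper name.

```python
def posterior_ctr(clicks, impressions, prior_ctr, prior_strength):
    """Posterior mean CTR under a Beta prior.

    prior_strength is the number of pseudo-impressions the prior is worth.
    """
    alpha = prior_ctr * prior_strength        # prior pseudo-clicks
    beta = (1 - prior_ctr) * prior_strength   # prior pseudo-non-clicks
    return (alpha + clicks) / (alpha + beta + impressions)

# Assumed prior: site-wide CTR of 1%, weighted like 200 impressions
estimate = posterior_ctr(clicks=0, impressions=100,
                         prior_ctr=0.01, prior_strength=200)
print(f"estimated CTR: {estimate:.2%}")
```

Under these assumptions the estimate is 2/300, or about 0.67%. That is below the site-wide 1% because the banner has underperformed so far, but it is not the implausible 0% that the raw data suggests. As impressions accumulate, the banner’s own data outweighs the prior.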

7. Tell the story with the data

The Data Scientist’s role in the company is to serve as the ambassador between the data and the company. Communication is key, and the Data Scientist must be able to explain their insights in a way that the company can get on board with, without sacrificing the fidelity of the data.

The Data Scientist does not simply summarize the numbers, but explains why the numbers are important and what actionable insights one can get from them.

The Data Scientist is the storyteller of the company, communicating the meaning of the data and why it is important to the company.

The success of the previous six points can be measured and quantified, but this one cannot. I’d say this role is the most important.

Statistics Used: Presenting and Communicating Data, Data Visualization

Follow my blog at Storytelling with Statistics

Chinese Translation: 数据科学家的7种统计学使用场景 | 大数据观察


TL;DR: With statistics, data scientists derive insights to encourage decisions that improve product or business, distilling the data into actionable insights that promote the vision of the company.

(This post was originally published on Quora)

(Image Credit: Simon Cunningham)
