Key skills for aspiring data scientists: Problem solving and the scientific method
This blog is part two of our ‘Data science skills’ series, which takes a detailed look at the skills aspiring data scientists need to ace interviews, get exciting projects, and progress in the industry. You can find the other blogs in our series under the ‘Data science career skills’ tag.
One of the things that attracts a lot of aspiring data scientists to the field is a love of problem solving, and more specifically problem solving using the scientific method. The method itself has been around for hundreds of years, but the vast volume of data available today offers new and exciting ways to test all manner of hypotheses. It is called data science, after all.
If you’re a PhD student, you’ll probably be fairly used to using the scientific method in an academic context, but problem solving means something slightly different in a commercial context. To succeed, you’ll need to learn how to solve problems quickly, effectively and within the constraints of your organisation’s structure, resources and time frames.
Why is problem solving essential for data scientists?
Problem solving is involved in nearly every aspect of a typical data science project, from start to finish. Indeed, almost all data science projects can be thought of as one long problem-solving exercise.
To make this clear, let’s consider the following case study: you have been asked to help optimise a company’s direct marketing, which consists of weekly catalogues.
Defining the right question
The first aim of most data science projects is to properly specify the question or problem you wish to tackle. This might sound trivial, but it can often be one of the most challenging parts of any project, and how successful you are at this stage can come to define how successful you are by the finish.
In an academic context, your problem is usually very clearly defined. But as a data scientist in industry it’s rare for your colleagues or your customer to know exactly which problem they’re trying to solve.
In this example, you have been asked to “optimise a company’s direct marketing”. There are numerous translations of this problem statement into the language of data science. You could create a model which helps you contact customers who would get the biggest uplift in purchase propensity or spend from receiving direct marketing. Or you could simply work out which customers are most likely to buy and focus on contacting them.
While most marketers and data scientists would agree that the first approach is better in theory, whether or not you can answer this question through data depends on what the company has been doing up to this point. A robust analysis of the company’s data and previous strategy is therefore required, even before deciding on which specific problem to focus on.
This example makes clear the importance of properly defining your question up front; both options here would lead you on very different trajectories and it is therefore crucial that you start off on the right one. As a data scientist, it will be your job to help turn an often vague direction from a customer or colleague into a firm strategy.
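To see how far apart the two framings above can land, here is a minimal sketch in Python. The segment names and purchase rates are invented purely for illustration, and the sketch assumes the company has historical data on comparable customers who were not mailed, which is exactly what may or may not exist in practice.

```python
# Hypothetical purchase rates per customer segment (all numbers are made up).
# "mailed" is the purchase rate among customers who received a catalogue,
# "not_mailed" is the rate among comparable customers who did not.
segments = {
    "loyal": {"mailed": 0.30, "not_mailed": 0.28},
    "occasional": {"mailed": 0.15, "not_mailed": 0.05},
    "lapsed": {"mailed": 0.04, "not_mailed": 0.01},
}


def rank_by_propensity(rates):
    """Order segments by how likely they are to buy when mailed."""
    return sorted(rates, key=lambda s: rates[s]["mailed"], reverse=True)


def rank_by_uplift(rates):
    """Order segments by how much mailing changes their purchase rate."""
    return sorted(
        rates,
        key=lambda s: rates[s]["mailed"] - rates[s]["not_mailed"],
        reverse=True,
    )


propensity_order = rank_by_propensity(segments)
uplift_order = rank_by_uplift(segments)
print("By propensity:", propensity_order)
print("By uplift:", uplift_order)
```

In this made-up data the "loyal" segment tops the propensity ranking but sits at the bottom of the uplift ranking, because those customers would mostly have bought anyway. Without the "not_mailed" column, only the propensity question can be answered.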
Formulating and evaluating hypotheses
Once you’ve decided on the question that will deliver the best results for your company or your customer, the next step is to formulate hypotheses to test. These can come from many places, whether it be the data, business experts, or your own intuition.
Suppose in this example you’ve had to settle for finding customers who are most likely to buy. Clearly you’ll want to check that your new process is better than the company’s old one. If you’re making better data-driven decisions than the previous process did, you would expect this to be the case.
There is a challenge here, though: you can’t directly test the effect of changing historical mailing decisions, because those decisions have already been made. You can, however, test it indirectly by looking at the people who were mailed and separating those who bought something from those who didn’t. If your new process is superior to the previous one, it should recommend mailing most of the people who bought, as any buyers it misses could indicate lost revenue. It should also omit most of the people who didn’t buy, as mailing this group was wasted marketing spend.
While these metrics don’t prove that your new process is better, they do provide some evidence that you’re making improvements over what went before.
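As a rough sketch of that indirect check, the Python below takes customers who were mailed in the past, together with a score from the new process, and reports what share of buyers it would still mail and what share of non-buyers it would drop. The function name, the scores, the threshold and the data are all hypothetical, chosen only to illustrate the idea.

```python
def historical_check(mailed, threshold):
    """Summarise how a new scoring process treats historically mailed customers.

    mailed: list of (score, bought) pairs, one per customer who was mailed.
    threshold: the new process mails anyone with score >= threshold.
    Returns (share of buyers kept, share of non-buyers dropped).
    """
    buyer_scores = [score for score, bought in mailed if bought]
    non_buyer_scores = [score for score, bought in mailed if not bought]
    buyers_kept = sum(s >= threshold for s in buyer_scores) / len(buyer_scores)
    non_buyers_dropped = (
        sum(s < threshold for s in non_buyer_scores) / len(non_buyer_scores)
    )
    return buyers_kept, non_buyers_dropped


# Made-up history: (model score, whether the customer bought after being mailed).
history = [
    (0.9, True), (0.8, True), (0.3, True),
    (0.7, False), (0.4, False), (0.2, False), (0.1, False),
]

kept, dropped = historical_check(history, threshold=0.5)
print(f"Buyers the new process would still mail: {kept:.0%}")
print(f"Non-buyers the new process would drop: {dropped:.0%}")
```

Both numbers only describe people the old process chose to mail, so they say nothing about customers who were never contacted. That is why they count as evidence rather than proof.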
This example is typical of applied data science projects: you often can’t test your model on historical data to the extent that you would like, so you have to use the data you have available as best you can to gather as much evidence as possible for the validity of your hypotheses.
Testing and drawing conclusions
The ultimate test of any data science algorithm is how it performs in the real world. Most data science projects will end by attempting to answer this question, as ultimately this is the only way that data science can truly deliver value to people.
In our example, this might mean comparing your algorithm against the company’s current process in a randomised controlled trial (RCT) and comparing the response rates across the two groups. Of course, one would expect some random variation, and being able to explain the significance (or lack thereof) of any difference between the two groups would be essential to solving the company’s original problem.
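One common way to judge whether a gap in response rates is more than random variation is a two-proportion z-test. The sketch below uses only the Python standard library. The group sizes and response counts are invented, and the test itself is just one reasonable choice rather than the only valid analysis.

```python
import math


def two_proportion_z_test(responses_a, n_a, responses_b, n_b):
    """Two-sided z-test for a difference between two response rates.

    Returns (z statistic, p-value), using the pooled response rate to
    estimate the standard error under the null hypothesis of no difference.
    """
    rate_a = responses_a / n_a
    rate_b = responses_b / n_b
    pooled = (responses_a + responses_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical trial: 1,000 customers mailed by each process.
z, p = two_proportion_z_test(responses_a=120, n_a=1000, responses_b=100, n_b=1000)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

A small p-value suggests the difference is unlikely to be chance alone. A large one means the trial cannot distinguish the two processes at this sample size, which is not the same as showing they perform equally.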
How successfully you test and draw your final conclusions, and how well you take into account the limitations of the evaluation, will ultimately decide how impactful the end result of the project is. When addressing a business problem, there can be massive consequences to getting the answer wrong. Formulating this final test in a way that is scientifically robust but also addresses the original problem statement is therefore paramount, and it is a skill that any data scientist needs to possess.
How to develop your problem solving skills
There are certainly ways you can develop your applied data science problem solving skills. The best advice, as is so often true in life, is to practise. Indeed, one of the reasons that so many employers look for data scientists with PhDs is that a PhD demonstrates that the individual in question can solve hard problems.
Websites like Kaggle can be a great starting point for learning how to tackle data science problems, and winners of old competitions often write good posts about how they came to build their winning model. It’s also important to learn how to translate business problems into a clear data science problem statement. Data science problems found online have often solved this bit for you, so try to focus on those that are vague and ill-defined. Whilst it might be tempting to stick to those that are more concrete, real life is seldom as accommodating.
As the best way to develop your skills is to practise them, Faculty’s Fellowship programme can be a fantastic way to improve your problem solving skills. The fellowship gives you an opportunity to tackle a real business problem for a real customer and to take the problem through from start to finish, and there are not many better ways to develop, and prove, your skills in this area.
Head to the Faculty Fellowship page to find out more.