How can we encourage collaboration in data science teams?

The more time I’ve spent in the industry, the more it’s become clear that collaboration in data science teams is vital to the success of any project. The best data science work is done by teams of data scientists working closely together – exchanging discoveries, offering advice, and spotting opportunities to improve.

2020-11-16 | Data Science

As machine learning and artificial intelligence gain ground in almost every industry, the stereotypical lone data scientist – squirrelled away in a darkened corner and writing equations that no one else understands – has, in many cases, become a whole team of people. 

But it’s taken businesses some time to abandon the trope of the ‘unicorn’ data scientist and realise that close collaboration is just as important for data science teams as it is for any other department. 

Of course, for many team leaders the question isn’t so much why we should collaborate, but how to overcome a host of historic barriers and entrenched ways of thinking that preclude collaboration. 

How can managers encourage collaboration in data science teams?

At Faculty, we’ve been building data science teams – both within our organisation and for our clients – for some time now, so we’re highly aware that providing support and tools for collaboration is vital to the success of our projects. That experience has revealed three major problem areas that arise from lack of collaboration in data science teams – and three methodologies for addressing them. 

In our experience, managers need to focus on these key areas in order to create strong, collaborative data science teams.

  1. Get data scientists and engineers working together by building cross-functional teams.

  2. Build shared knowledge into code through shared libraries, feature stores and templates.

  3. Agree on a set of tools that your whole team will use – and stick to that set.

Let’s take a look at these focus areas in more detail.

Get data scientists and engineers working together

If you are – or even work closely with – a data scientist or engineer, you’ll be well aware that the two job roles are often set up with partial or complete separation of responsibilities, priorities, and processes. The only point of real ‘collaboration’ comes when the data scientist hands over a completed model to an engineer. Then the project can go one of two ways: either the data scientists see their models cut up, rearranged, or thrown out entirely; or the engineers are left to wrestle with models that aren’t suitable for deployment and lose valuable time trying to find solutions. It will typically take at least three months from the moment the model is handed over to the moment it is deployed in production. By this point, the data scientist has moved on to solving other problems. This long feedback loop and lack of collective ownership can noticeably erode the quality of deployed models.

Too many data science teams have accepted this as an inherent problem – a feature of the industry’s processes that must be worked around, rather than a bug that can and must be fixed. It’s entirely possible to create a culture of collaboration and dual responsibility; we just have to back it up with the right processes. 

At Faculty, we accomplish this in two ways. First, we make teams cross-functional from the very beginning. Every project team is made up of a mix of data scientists and machine learning engineers, so that both can shape our strategy, monitor progress, remain up-to-date with new developments, and build a relationship with our client. This allows us to, for example, avoid prematurely optimising our models for metrics that suit data scientists (like accuracy) but harm the overall product by compromising on latency or inference cost. 

Second, we gradually add structure to the data science process as the project progresses. The exploratory phase at the beginning of every project is the perfect environment for data scientists, who have the freedom to investigate new avenues and play with the data to answer questions as quickly as possible. The exploratory phase is essential, but it’s just as essential to ensure that the team switches to a more structured workflow once the problem space has been sufficiently explored, adopting software best practices to make a production-quality end product. 

As components of the project move closer to solution development, we therefore encourage data scientists to add structure to their investigations and their codebase – move training out of notebooks and into scripts, automate those scripts, and share models via a versioned model registry. As a result, it’s much easier for the engineer to pick up where the data scientist left off. 
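As an illustration of what that hand-off can look like, here is a minimal sketch of a training script once it has left the notebook, assuming scikit-learn and MLflow's model registry as the tooling; the dataset, parameters, and the 'churn-classifier' name are purely illustrative stand-ins, not a prescription.

```python
# train.py – a minimal, automatable training script (illustrative sketch).
# Assumes scikit-learn plus an MLflow tracking server with a model registry;
# the dataset and the "churn-classifier" name are stand-ins.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def main() -> None:
    # In a real project this would come from the team's shared data-access layer.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    mlflow.set_experiment("churn-model")  # hypothetical experiment name

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=200, random_state=42)
        model.fit(X_train, y_train)

        accuracy = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_param("n_estimators", 200)
        mlflow.log_metric("accuracy", accuracy)

        # Registering the model gives engineers a versioned artefact to deploy,
        # rather than a notebook to reverse-engineer.
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="churn-classifier",  # hypothetical name
        )


if __name__ == "__main__":
    main()
```

Because the run is scriptable and the resulting model is registered with a version, an engineer can pull a specific model version for deployment instead of untangling an exploratory notebook.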

The exploratory phase and productionisation phase aren’t usually separate and sequential: they co-exist, to varying degrees, all the way through the project. Data scientists have to periodically switch between the two operational modes. It is therefore important to provide a development environment that accommodates both exploration and more structured development, and reduces the effort of switching between the two.

Build shared knowledge into code

It’s clear that a consistent approach to solving problems would be a huge asset for data science teams. Creating and enforcing one ‘best practice’ way of doing things substantially reduces the cognitive energy required to learn a new task, and ensuring that everyone’s code looks the same (or similar) makes it much easier to transfer knowledge across the team. It’s also an important protection against data scientists wasting hours writing a pipeline that, unbeknownst to them, has already been written by a colleague. 

But creating consistency, as any data science team leader knows, is easier said than done. Data scientists are usually contending with dirty, siloed, messy data that needs to be wrangled into submission before progress can be made. No wonder that data science team processes often end up looking like Frankenstein’s monster, with different approaches and code styles cobbled together untidily.

To collaborate effectively, data science teams need to build shared knowledge into their code. This can be done through shared libraries that automate common processes, feature stores that show how to build features from existing datasets, or templates that automate the setup of a new project. The foundation for all of these is a single place where knowledge can be published, stored, and accessed.
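To make that concrete, here is a hypothetical example of the kind of function a shared feature library might expose, assuming pandas; the module path and column names are assumptions for illustration, not part of any particular codebase.

```python
# features/spend.py – a hypothetical module in a team's shared feature library.
# Column names ("customer_id", "date", "amount") are illustrative assumptions.
import pandas as pd


def add_rolling_spend(transactions: pd.DataFrame, window_days: int = 30) -> pd.DataFrame:
    """Add a per-customer rolling spend total over the last `window_days` days.

    Expects a `date` column of dtype datetime64 and returns a copy of the
    input with an extra `rolling_spend` column.
    """
    out = transactions.sort_values(["customer_id", "date"]).copy()
    rolled = (
        out.set_index("date")
        .groupby("customer_id")["amount"]
        .rolling(f"{window_days}D")
        .sum()
    )
    # The rolled Series is ordered by (customer_id, date), matching `out`,
    # so the values can be assigned positionally.
    out["rolling_spend"] = rolled.to_numpy()
    return out
```

Because the logic lives in one tested, documented function, every project computes the feature the same way, and a new team member can discover it by browsing the library rather than rewriting it from scratch.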

Curation is key here: if you create a knowledge store, you’ll need to review it regularly to ensure that the information remains relevant, and make depositing new information there a non-negotiable step in your team’s workflow. 

Agree on a set of tools – and stick to them

Given the plethora of data science tools available today, it’s hardly surprising that most data scientists develop a personal preferred infrastructure in the course of their career. This is understandable, but hardly conducive to collaboration. Without an agreed-upon set of tools, sharing code, updating a process, or handing over a project always means wasting precious time understanding that project’s idiosyncratic tooling. 

If your organisation can agree on a set of tools that suits your needs, collaboration becomes much easier. Not only is it simpler to hand over work and build organisational knowledge about best practice for your infrastructure, but it’s also much easier for every team member to develop a good understanding of each project’s codebase. 

It’s unlikely that these tools will remain the same forever – some software will inevitably become defunct or unsuited to your needs over time – but resist the temptation to tinker with your infrastructure unless you’re sure the change will deliver a tangible benefit. 

Making collaboration standard 

Overcoming the cultures, processes, and relationships that have blocked collaboration for years is no mean feat. Achieving it requires time, dedication, and a certain stubborn refusal to accept a flawed status quo. 

But doing so is vital. If we want AI and machine learning to take on more complex projects, analyse bigger datasets, and justify larger teams of data scientists and engineers, then it’s vital that the industry learns to value and support collaboration.

If you’d like to learn more about our approach to collaboration at Faculty and find out how to build strong, capable and collaborative data science teams in your own organisation, check out our page on technical training.