Pascal Bugnion works on the Faculty Platform team.
COVID-19 has been the catalyst for a major shift for data scientists, forcing thousands to rapidly embrace remote working. This is unlikely to be a temporary change: although COVID-19 has made working from home mandatory for many, the trend towards a growing fraction of the workforce working remotely is likely to continue once normality has resumed.
Of course, as data science teams have grown larger and more complex, facilitating remote working has become more complex too. Data science is no longer the work of a lone unicorn who works furiously on a laptop to eventually present a few plots or a web application. We now expect data science to be done by cross-functional teams that bring together subject-matter experts, modellers, data visualisation experts, machine learning engineers, product managers and designers.
Unsurprisingly, when these complex, cross-functional teams begin to shift toward remote working, effective collaboration often becomes a huge challenge. Communication is just harder when teams are not co-located. In this post, we’ll show that the right tooling is essential to helping data science teams perform to the best of their ability while working remotely.
Avoid frustrations with a shared data science environment
High-functioning data science teams strive for consistent tooling and infrastructure across the team. It is much easier to jump on a call and help a colleague if you both work with similar tools. It is also much easier to write reusable, shareable code if the environment in which that code runs is more constrained.
There are many components that make up a data science infrastructure. We have found the following processes to be effective:
- Give everyone the same hardware and operating system. If everyone uses macOS, for instance, it is easier to share documents, scripts, requirements files and so on.
- Enforce common tooling. If everyone in the team uses Python, sharing code and models is greatly simplified. Similarly, if everyone uses the same text editor, the organisation can develop processes or even software (e.g. editor plugins) to facilitate collaboration.
- Enforce common environment management. If everyone in the team uses the same environment management system (e.g. Conda environments or Docker containers), data scientists can collaborate both on the code and on the environment the code runs in.
- Always store data online. If data is always stored in databases or in object storage that is accessible by everyone, it is much easier to share code or to debug issues together.
- Have an easy way to declare reproducible workflows that other team members can run. For instance, having a well-documented store of Docker containers that can run particular parts of the team’s data processing pipeline means that not everyone in the team needs to know every part of that pipeline at the same level of detail. In turn, this reduces the need for long phone calls explaining how to run the pipeline.
- As much as possible, deploy models behind documented APIs. Having a clear interface allows more people to leverage the team’s work.
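As an illustration of the environment management point above, here is a minimal sketch of a check that a team member could run to see whether their local Python environment has drifted from the team's pinned versions. It assumes the team keeps exact pins in `name==version` form; the function names (`parse_pins`, `installed_version`, `find_drift`) are hypothetical and not part of any particular tool. Conda and Docker solve this problem more thoroughly, so treat this only as a way to make the idea concrete.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, separator, version = line.partition("==")
        if not separator:
            raise ValueError(f"Expected an exact pin (name==version), got: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the locally installed version of a package, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Return {package: (pinned, installed)} for every package that differs."""
    drift = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            drift[name] = (pinned, found)
    return drift
```

Calling `find_drift(parse_pins(text))` on the contents of the team's shared requirements file returns an empty dict when the local environment matches, which makes it easy to rule out version mismatches before starting a debugging call.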
Avoid isolation by making work visible
One of the most motivating elements of teamwork is the feeling of a shared goal. When people aren’t sitting in the same physical space, it can be hard to even know what other team members are working on, let alone generate a sense of shared purpose. This can lead to team members feeling isolated.
The best way to mitigate this is to make work visible. Having a good source code manager like GitHub, GitLab or Bitbucket helps with this: the steady flow of PRs, comments, and reviews gives a sense of motion to the team. Continuous integration pipelines running from this source code manager also increase visibility.
Besides exposing activity through a source code manager, high-functioning organisations also make other sources of activity visible. For instance, we have seen companies use Airflow to show data processing or model retraining pipelines. Ongoing experimentation and hyperparameter tuning can be exposed to the entire team with systems of record like MLflow.
Avoid silos by sharing best practices
When teams are remote, there is less space for ad-hoc knowledge sharing. That hallway conversation that sparks new ideas is less likely to happen. Teams therefore need to be much more deliberate about knowledge sharing.
We have seen teams build knowledge repositories in Confluence or in the open-source Knowledge Repo. In Faculty Platform, our data science workbench, we are working on building a way to easily share blueprints for common data science tasks. This allows data scientists to gradually build an organisation-specific knowledge centre that gives a single, consistent view of best practices.
Building cross-functional teams to deliver on the promise of machine learning is hard. It is hard to recruit people and, once you have, it is hard to get them to speak the same language, to pull in the same direction. Trying to do this with a remote team is even harder.
Good tools will not guarantee success, but they will make your team members more productive and foster a feeling of collaboration and shared goals.
Faculty Platform, our workbench built for data scientists, gives remote data science teams a shared infrastructure for collaborating on model development and deployment – and all the best practices from this post are built into it.