Common mistakes we Data Scientists make

DISCLAIMER: I am a data scientist and have made all these mistakes myself, but I have had the privilege of sitting on the managerial, project lead and developer sides of the fence. Here are some tips for getting your stakeholders (i.e. anyone in the project team, or with an interest in the project’s success) on board and delivering a successful data science project.

#1 Technical language is a no-no

Both when managing data scientists and when working as one, I have seen statisticians be too technical about their results, data scientists get too excited about their newest ML model, and data engineers get so caught up in the data and in technical acronyms and technologies that the stakeholders, normally the requestors of the information or the project leads, are turned off straight away.

These managers and leads normally have busy schedules, so my rule is: if you cannot describe what a model, pipeline, data analysis or metric is doing in under one minute, the person you are communicating with will disengage. This is the concept of the elevator pitch.

#2 Don’t skip the fundamentals

This relates to really understanding the data. I break this down into the following steps:

  • Get someone with domain knowledge involved in the inception process
  • Explore the data graphically
  • Have strategies for imbalanced data and missing values, and make sure these are factored into the work plan (see the sketch after this list)
  • If you are developing a dashboard, establish the key things the stakeholders need from it
  • ML pipeline – make sure the right people from DevOps, MLOps and ML engineering are involved to help you build it. I have never met a data scientist who can do everything well, so specialise in what you are good at and do it well, but not to perfection
  • Know the statistics or modelling concepts before deploying them
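
A minimal sketch of what some of these fundamentals can look like in practice, using pandas and scikit-learn; the file name, column names and imputation choice are all hypothetical:

```python
# Hypothetical EDA sketch: explore, quantify missing values, check class balance
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

df = pd.read_csv("churn.csv")                        # hypothetical dataset

# Explore the data graphically and numerically before any modelling
print(df.describe())
df.hist(figsize=(10, 8))                             # quick distribution check per numeric column

# Missing values: quantify first, then pick a deliberate strategy
print(df.isna().mean().sort_values(ascending=False))
df["age"] = df["age"].fillna(df["age"].median())     # simple median imputation

# Imbalanced target: check the class split and derive weights for the model
print(df["churned"].value_counts(normalize=True))
weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(df["churned"]),
    y=df["churned"],
)
```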

#3 Don’t move too quickly

This relates to responding to pressured deadlines. Make sure you have enough time to sense-check all the outputs, whether that is the output of a predictive model, data accuracy, an analysis, domain input, etc.

During my time as Head of AI at one company, I delivered some model predictions and I had not checked if the predictions made sense at a more granular level. Needless to say – I learned from it and so should you.
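
A granular sanity check can be as simple as comparing predictions with actuals by segment and flagging anything that looks off. A minimal sketch with made-up numbers and hypothetical column names:

```python
# Hypothetical sanity check: do the predictions make sense within each segment?
import pandas as pd

results = pd.DataFrame({
    "segment":   ["retail", "retail", "wholesale", "wholesale"],
    "actual":    [120, 135, 40, 38],
    "predicted": [118, 140, 95, 90],   # wholesale predictions look suspicious
})

check = results.groupby("segment")[["actual", "predicted"]].mean()
check["abs_pct_error"] = (check["predicted"] - check["actual"]).abs() / check["actual"]
print(check)   # flag any segment whose error is far above the overall level
```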

If you are not failing then you are not learning.

#4 Too small a data sample and high variation

Sometimes the dataset is just too small. In that case, focus more on data collection rather than whipping up a neural network that won’t have enough observations to detect the underlying patterns and correlations in the data.

Variation is a key issue here as well: with a small sample, estimates of the underlying distribution are much noisier, since (per the central limit theorem) the standard error only shrinks as the sample grows. Consider running an F test to compare variances before drawing any conclusions.
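
A minimal sketch of a two-sample F test for equality of variances, using simulated data; note the F test assumes roughly normal data, and `scipy.stats.levene` is a more robust alternative:

```python
# Two-sample F test for equality of variances, on simulated (hypothetical) data
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=0.0, scale=1.0, size=25)    # small sample
sample_b = rng.normal(loc=0.0, scale=1.5, size=25)    # small sample, wider spread

var_a = np.var(sample_a, ddof=1)
var_b = np.var(sample_b, ddof=1)
f_stat = max(var_a, var_b) / min(var_a, var_b)        # larger variance on top
df1 = df2 = 25 - 1                                    # degrees of freedom
p_value = 2 * stats.f.sf(f_stat, df1, df2)            # two-sided p-value

print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```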

Make sure this is communicated to managers: the sample is too small to do anything meaningful. I know you wanted a fancy, shiny neural net, but that model will perform much better on a larger set of observations.

#5 I have a hammer, therefore everything looks like a nail

I have a friend, Chris, whose favourite saying is “when all you have is a hammer, everything looks like a nail.” How true is that?

This is meant as a critique of not keeping up with new tools and techniques. I know you cannot always be playing with the newest shiny toy, but some technologies move quickly, especially machine learning and DevOps infrastructure.

This means you should scan what is out there every couple of months to make sure you are still competitive and doing things in an optimal way.

#6 Lack of documentation and commenting of code

Data scientists and data professionals are always spinning plates, and without documentation there is a danger that new findings never become lasting insights, because the information is never written down and shared. I personally have to work hard at documenting, because it is not my favourite task, but at times it has really saved my proverbial bacon.

The same goes for code. I have worked with many developers over the years, and without good, consistent code comments it is very difficult to understand another person’s thought process, or the steps taken to achieve a specific output.

Top tip: get better at documenting and make the best use of markdown technologies for automating the bits you can.
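
One way to automate some of this is to write docstrings that documentation generators (e.g. Sphinx, or MkDocs with mkdocstrings) can render into pages for you. A minimal sketch; the function and its details are hypothetical:

```python
def churn_rate(customers: int, churned: int) -> float:
    """Return the fraction of customers lost during a period.

    Args:
        customers: Total customers at the start of the period.
        churned: Customers lost during the period.

    Returns:
        Churn rate as a fraction between 0 and 1.

    Raises:
        ValueError: If ``customers`` is not positive.
    """
    if customers <= 0:
        raise ValueError("customers must be positive")
    return churned / customers
```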

#7 Continuous Learning

The game in data science is always changing and you need to be comfortable with learning new techniques, modelling concepts, coding styles and technologies.

This should not be a chore and should be fun. To make it fun, managers need to realise that protected learning time is essential. Without the ability to learn in company time, people will do it in their own time (normally at night), and this can leave them burnt out if sustained over long periods.

The note here is to make effective use of appraisals and documentation, and to allow yourself time to do this during company hours.

#8 Don’t wait until the next century to share results

Be prepared to share results more frequently with managers and project leads. This way they know you are working towards answering their question or hypothesis, and it will give you more time in the long run to deliver the project or product.

I have seen analysts and developers taking an age waiting until everything is perfect, but perfect to you is not perfect to someone else. You can have the best results, models, analysis and dashboards, but if it takes 10 years to build them it will be to no avail.

Working out loud allows project leads, and members of the wider stakeholder group, to share their input. The key to the success of any project is that everyone feels included and involved.

#9 Including your bias

There have been examples of this throughout the ML community: computer vision researchers forgetting to train their person classification/detection models on a range of ethnicities and genders, analysts interpreting results and adding narrative that is not supported by the data, and no doubt many more.

Be careful about how the analysis or results are obtained and be self-aware of any bias before starting a project.

#10 Expectation management

Don’t always sell the shiniest toy to the stakeholder, or overpromise on a deadline. Make sure you find out what the stakeholder needs, run it through a MoSCoW prioritisation exercise and then rule out what cannot be delivered.

Make sure this is documented as a project plan, or as backlog items in an Agile process, and that the stakeholders are kept aware through regular update channels.

It is very important that the stakeholder, as the requestor, and you, as the deliverer, know exactly what is to be delivered. Give stakeholders access to any Kanban boards they require, or get them involved in sprint reviews and project meetings.

#11 Building tools from scratch

Is there a tool already on the market? A package you can use when coding? An existing dashboard internally?

If there are gaps, then go ahead and build; but most of the time a tool already exists to solve the problem, and all that is required is some customisation.

Ultimately, having to start from a blank slate will take much more time than utilising existing packages, code, tools, etc.

Have an easy life and use what is already available, and if it isn’t available, fill your boots.

#12 Assuming prior subject knowledge of the stakeholders

I assume no prior knowledge before presenting findings, or I find out in advance what the aim of the meeting is and what skillsets the attendees have.

However, if you don’t know this, then always assume no prior knowledge. There are a million things in data science parlance we could confuse stakeholders with, e.g. epochs, variation, multicollinearity, regression to the mean, standard error, Gini index, gradient descent, entropy, convolution, and the list goes on. This also plays to the point of how much technical jargon you should use. My manager always used to say “keep it sufficiently simple”, or Keep It Simple, Stupid! I have always tried to add a layer of simplicity to complex subjects, without being reductionist.

In terms of explanation, and whether it will be received well, I have a technique called “explain it to your family”. That is not to say that my family is not intelligent, but it is more likely that they are less interested in the background domain and even less interested in data science. If you cannot simplify things enough for them to understand the presentation of results, then you can guarantee that many of the people you are presenting to on the day won’t understand what is going on either.

#13 Data Storytelling

This links back to the fundamentals section: it is very important that the recipient of the data understands what it shows.

Exploratory data analysis is key here. It allows you to know, at a more micro and granular level, what is happening with the data. You need to get to know the data like a relationship: you know the best stories about your closest friends and family, but little or nothing about strangers you have never met.

Data is exactly the same: the best stories are told when you know the area and have a passion for conveying that information.
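
Turning that familiarity into a story often comes down to simple summaries and clearly labelled charts. A minimal sketch, again with a hypothetical sales DataFrame, where the chart title carries the headline:

```python
# Hypothetical example: a simple, clearly labelled chart carries the story for you
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "month":   pd.period_range("2023-01", periods=6, freq="M").astype(str),
    "revenue": [110, 115, 98, 102, 150, 162],
})

ax = sales.plot(x="month", y="revenue", kind="bar", legend=False)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (£k)")
ax.set_title("Revenue jumped ~50% after the May campaign")   # the headline is the story
plt.tight_layout()
plt.show()
```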

Conclusion

These are some of the mistakes I have observed, and personally been guilty of, while working as a data scientist. I am sure there are more, and I would be keen to hear of any additions to this list.

I hope you found this interesting, and remember: you are doing an awesome job, but we can always improve. A quote from the famous author Mark Twain resonates here:

Continuous Improvement is better than delayed perfection.

Mark Twain
