4 Common Machine Learning Pitfalls and How to Avoid Them
Machine learning is one of the hottest topics in technology today, and for good reason.
It has great potential to automate or semi-automate some of the most tedious tasks faced by knowledge workers, and leading technology companies have begun to realize most of this potential.
For example, machine learning can help reduce manual labor by 50% or more for the following tasks:
- Contract review.
- Human resource service management.
- Meeting transcription.
- Financial forecasting.
As machine learning becomes more widely adopted, we are on the verge of unlocking this value. A study by Algorithmia found that 76% of companies would prioritize artificial intelligence (AI) and machine learning (ML) over other IT initiatives in 2021.
However, most machine learning projects fail.
Although there are many reasons why ML pilots never take off, the most pressing problems can be traced back to four main pitfalls:
- Lack of business alignment.
- Bad machine learning training practices.
- Data quality issues.
- Deployment complexity.
Let’s explore each of them and propose some solutions for data teams and organizations to avoid them.
1. Lack of business alignment
The original sin of machine learning lies in how most of these projects are born.
Too often, a group of data scientists conceives a machine learning project by thinking: “This data is very interesting; wouldn’t it be cool if…?”
It is this kind of thinking that turns an ML project into a science experiment.
The model in such a project may still produce something valuable, but if the project does not address an urgent and painful need, it will not get the time or attention it needs from business stakeholders. Or worse, it may end up like blockchain: a cool technology in search of a problem.
Machine learning projects should start by looking at the most pressing business priorities and then evaluating the resources needed to address them, rather than starting with whatever clean data is at hand and then looking for problems it could solve.
Good questions to ask before starting a machine learning project include:
- Is this problem urgent? According to whom?
- Why is machine learning the right solution to this problem?
- How will we define success?
2. Bad machine learning training
Suppose your project has a really difficult and valuable business problem. The next step is to collect enough clean data to train the model.
This is the paradox of data science: in order to eliminate tedious work for others, data scientists must immerse themselves in it.
According to Anaconda, data scientists spend about 45% of their time on data preparation tasks, including loading and cleaning data.
Even after all this work, there may simply not be enough suitable or representative training data. Moreover, like any other manual task, data preparation carries a risk of human error.
Fine-tuning your ML model can also be challenging. The model can overfit, learning the training data too well, or it can underfit, learning too little.
How can a machine learning model learn too well, you ask?
There is a well-known example of a model trained to distinguish between huskies and wolves. It was very accurate during training, but began to fail in production. The problem? Every wolf photo had snow in the background, while the husky photos did not. The team had built a snow detector, not a wolf detector.
Unfortunately, machine learning training may be the one test where you don’t want a 100% score.
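One common way to catch a model that has "learned too well" is to compare its accuracy on the training data with its accuracy on data it has never seen. The sketch below is only an illustration on synthetic data, not a method described by the teams mentioned here: the helper names (`make_rows`, `fit_memorizer`, `fit_threshold`, `overfit_report`) and the `max_gap` threshold are hypothetical, and a real project would likely use an established library and cross-validation instead.

```python
import random


def accuracy(predict, rows):
    """Fraction of (x, label) rows that predict() labels correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def make_rows(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise the model should NOT memorize
        rows.append((x, y))
    return rows


def fit_memorizer(train):
    """1-nearest-neighbour: memorizes every training point, noise included."""
    def predict(x):
        return min(train, key=lambda row: abs(row[0] - x))[1]
    return predict


def fit_threshold(train):
    """A deliberately simple model: one cut-off chosen on the training set."""
    candidates = [i / 20 for i in range(21)]
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x > t), train))
    return lambda x: int(x > best)


def overfit_report(fit, train, holdout, max_gap=0.1):
    """Train on `train`, then compare accuracy on seen and unseen data."""
    model = fit(train)
    train_acc = accuracy(model, train)
    holdout_acc = accuracy(model, holdout)
    gap = train_acc - holdout_acc
    # A large gap suggests the model memorized quirks of the training set.
    return {"train": train_acc, "holdout": holdout_acc,
            "gap": gap, "suspect": gap > max_gap}


rng = random.Random(0)
train, holdout = make_rows(200, rng), make_rows(200, rng)
memorizer_report = overfit_report(fit_memorizer, train, holdout)
threshold_report = overfit_report(fit_threshold, train, holdout)
print(memorizer_report)
print(threshold_report)
```

The memorizing model scores perfectly on the training rows because each point is its own nearest neighbour, yet it repeats the label noise on held-out rows. A perfect training score paired with a much lower holdout score is the warning sign to look for.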
3. Data quality issues
Whether in training or in deployment, a machine learning model cannot be effective if it is fed bad data. As the saying goes, garbage in, garbage out.
The challenge is that machine learning models require a lot of data. They always want more of it, as long as it is reliable.
However, bad data can enter even a good data pipeline in an almost unlimited number of ways. Sometimes it is a noisy anomaly, and the error is caught quickly; other times it is gradual data drift that erodes the accuracy of the model over time. Either way, the result is bad.
That’s because you built the model to automate or inform a painful business problem, so when accuracy declines, trust declines too, and the consequences are serious. For example, a colleague of mine talked to a financial company that was using a machine learning model to buy bonds that met specific criteria. Bad data took the model offline, and it was weeks before it was trusted enough to return to production.
The data infrastructure that supports machine learning models needs to be continuously tested and observed, preferably in a scalable and automated way.
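As a rough illustration of what an automated check might look like, the sketch below tests an incoming batch of one numeric feature for the two failure modes described above: an obvious anomaly (too many missing values) and drift away from the training baseline. The function names, the sample numbers, and the thresholds (`max_null_rate`, `max_shift`) are assumptions made for this example; production systems typically track many more signals, such as freshness, volume, and schema changes.

```python
from statistics import mean, stdev


def null_rate(values):
    """Share of entries in the batch that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def mean_shift(baseline, batch):
    """Distance between batch mean and baseline mean, in baseline standard deviations.

    Assumes the baseline has some variation (stdev > 0).
    """
    return abs(mean(batch) - mean(baseline)) / stdev(baseline)


def check_batch(baseline, batch, max_null_rate=0.05, max_shift=0.5):
    """Return a list of human-readable issues; an empty list means the batch passed."""
    present = [v for v in batch if v is not None]
    issues = []
    if null_rate(batch) > max_null_rate:
        issues.append("too many missing values")
    if present and mean_shift(baseline, present) > max_shift:
        issues.append("feature mean drifted from training baseline")
    return issues


# Hypothetical values of one feature seen at training time.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]
healthy = [101.0, 99.0, 100.0, 102.0]
drifted = [110.0, 112.0, None, 111.0]

print(check_batch(baseline, healthy))  # → []
print(check_batch(baseline, drifted))
# → ['too many missing values', 'feature mean drifted from training baseline']
```

Running a check like this on every batch, before the data reaches the model, lets you catch problems at the pipeline rather than discovering them through a drop in model accuracy.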
4. Deployment complexity
It turns out that deploying and maintaining machine learning in production requires a lot of resources. Who knew?
Well, Gartner did. It expects that by 2025, artificial intelligence will become the top category driving infrastructure decisions, because the maturing AI market will lead to a tenfold increase in compute requirements.
This requires a lot of support from business stakeholders, which is why business alignment is so important. For example, former Uber data product manager Atul Gupte led a project to improve the organization’s data science workbench, which data scientists use to simplify collaboration.
At the time, data scientists were automating the process of validating and verifying the documents that workers must submit to join the Uber platform. This was a great project for machine learning and deep learning, but the data scientists kept hitting the limits of the available compute.
Gupte studied a variety of solutions and identified virtual GPUs (an emerging technology at the time) as a possible answer. Although the price was high, Gupte justified the expenditure to leadership. The project not only saved the company millions of dollars, but also supported a key competitive advantage.
Another example is Netflix, which never put its award-winning recommendation algorithm into production and instead chose a simpler solution that was easier to integrate.
How to avoid these pitfalls
Don’t let these challenges prevent you from launching a machine learning program.
Mitigate these risks in the following ways:
- Get stakeholder support as early as possible and realign often.
- Iterate in a DevOps manner.
- Make sure you have the right training data, and monitor its quality before and after the model goes into production.
- Keep the limits of production resources in mind.






