Data-Centric vs. Model-Centric AI: The Key to Improved Algorithms
Nowadays, no matter what artificial intelligence (AI) project we want to build, we need two main components:
- A model.
- Data.
A lot of progress has been made in developing effective models, which has led to many breakthroughs in AI. However, apart from making data sets larger, no comparable work has been done on the data side.
Although progress in traditional model-centric AI is yielding diminishing gains, Andrew Ng and many other leading scientists and scholars are arguing for the adoption of data-centric AI, which is concerned with developing new paradigms to systematically improve data quality.
Data-centric and model-centric AI
Data-centric AI differs from model-centric AI in that the main focus of the latter is to develop and improve models and algorithms to achieve better performance on a given task. In other words, model-centric AI treats data as a fixed artifact and focuses on improving AI models, while data-centric AI treats the model as a static artifact and focuses on improving data quality.
Data is essential in artificial intelligence, and so is a method for obtaining high-quality data, because useful data is often error-prone, limited in quantity, and very expensive to obtain.
The key idea of data-centric AI is to treat data the way we would treat high-quality materials when building a house: we spend relatively more time labeling, augmenting, managing, and curating the data.
Why we need data-centric artificial intelligence
The "mantra" of traditional model-centric AI is to optimize highly parameterized models with ever larger data sets to achieve performance improvements.
Although this maxim works in many industries, such as media and advertising, it runs into many challenges in industries such as healthcare and manufacturing. These include:
- Lack of training data examples. This usually leads to poor optimization and disappointing results.
- High cost. Existing model-centric AI requires huge data sets and expensive computing resources to deliver performance improvements. In contrast, data-centric AI focuses on data quality rather than quantity, and does not require expensive computing resources.
- Less reliable and fair results. A data-centric approach that prioritizes data quality makes it more likely that bias in the data is found and eliminated through careful analysis.
- A proliferation of complex models. Model-centric AI requires specialized models for different tasks, which leads to an accumulation of many data sets and many models within an organization. This also increases the cost of AI, because it can be difficult to provide enough data for each narrow problem (such as defect detection across several different manufactured products).
A data-centric approach to artificial intelligence can help alleviate these challenges, thereby helping organizations gain more from data.
How to achieve data-centric AI
The essence of data-centric artificial intelligence is to treat data as a key asset when deploying artificial intelligence infrastructure.
Unlike model-centric AI, which largely treats data as something to be archived in repositories, this paradigm emphasizes developing a shared understanding of the data in order to maintain a uniform description of it.
So what should we do? Which aspects matter most when implementing this approach? It turns out that adopting data-centric AI means following a few guidelines. They are:
1. Correct data labeling
As the name suggests, data labeling assigns labels to data, for example, disease labels to medical images.
Labels provide important information about the data set, and AI algorithms use this information for learning. Therefore, the labels must be correct and consistent. In addition, it has been shown that a smaller number of well-labeled examples (for example, images) can produce better results than a larger number of examples that include mislabeled data.
Data-centric AI places strong emphasis on the quality of data labeling, which requires dealing with labeling inconsistencies and working on the labeling instructions. The best way to find these inconsistencies is to use multiple labelers. When labels are found to be inconsistent or ambiguous, the labelers should decide how to correct them and record their decision in the labeling instructions. It is also helpful to include examples of correct and incorrect labels in the labeling instructions.
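As a rough illustration of using multiple labelers to surface inconsistencies, the sketch below flags items on which labelers disagree. The `annotations` structure, the item ids, and the `min_agreement` parameter are assumptions made for this example, not part of any particular labeling tool.

```python
from collections import Counter


def find_inconsistent_labels(annotations, min_agreement=1.0):
    """Return the items whose labelers do not agree strongly enough.

    annotations: dict mapping an item id to the list of labels assigned
        by different labelers (hypothetical structure for this sketch).
    min_agreement: required fraction of labelers that share the majority label.
    """
    flagged = {}
    for item_id, labels in annotations.items():
        counts = Counter(labels)
        majority_label, majority_count = counts.most_common(1)[0]
        agreement = majority_count / len(labels)
        if agreement < min_agreement:
            flagged[item_id] = {
                "majority": majority_label,
                "agreement": agreement,
                "labels": dict(counts),
            }
    return flagged


# Three labelers per medical image; only scan_002 shows a disagreement.
annotations = {
    "scan_001": ["pneumonia", "pneumonia", "pneumonia"],
    "scan_002": ["pneumonia", "normal", "pneumonia"],
    "scan_003": ["normal", "normal", "normal"],
}
flagged = find_inconsistent_labels(annotations)
for item_id, info in flagged.items():
    print(item_id, info)
```

Flagged items are the ones worth discussing among labelers, and the resolution can then be written down in the labeling instructions.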
Andrew Ng describes examples of inconsistent labels in an iguana detection task, where labelers mark the same iguanas in inconsistent ways.
2. Removing noisy data examples
You can eliminate noisy data instances by discarding them. This improves the model's ability to generalize to new data.
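The text does not say how noisy instances should be identified. One possible heuristic, shown here purely as an assumption, is to discard an example when its label disagrees with the majority label of its nearest neighbors. The one-dimensional feature, the sample values, and the choice of `k` are all illustrative.

```python
from collections import Counter


def drop_noisy_examples(examples, k=3):
    """Keep only examples whose label matches the majority of their k nearest neighbors.

    examples: list of (feature_value, label) pairs with a single numeric
        feature (a simplification chosen for this sketch).
    """
    kept = []
    for i, (x, label) in enumerate(examples):
        others = [e for j, e in enumerate(examples) if j != i]
        neighbors = sorted(others, key=lambda e: abs(e[0] - x))[:k]
        votes = Counter(lbl for _, lbl in neighbors)
        majority_label = votes.most_common(1)[0][0]
        if majority_label == label:
            kept.append((x, label))
    return kept


# A "defect" label sitting in the middle of "ok" measurements looks suspicious.
examples = [
    (1.0, "ok"), (1.1, "ok"), (1.2, "defect"), (1.3, "ok"),
    (5.0, "defect"), (5.1, "defect"), (5.2, "defect"), (5.3, "defect"),
]
cleaned = drop_noisy_examples(examples, k=3)
print(len(examples), "->", len(cleaned))
```

In practice, a domain expert would review suspicious examples before they are discarded, since an unusual example is not necessarily a wrong one.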
3. Data augmentation
This task involves generating more data examples from existing examples by interpolation or extrapolation.
Although data-centric AI focuses on data quality rather than quantity, some AI models need a lot of data to work well, so data augmentation can help you find a middle ground.
However, it is important to note that if the data contains noisy instances, generating more data from them will not help.
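A minimal sketch of augmentation by interpolation, under the assumption that each example is a numeric feature vector with a class label: a new example is created at a random point on the line between two existing examples of the same class. The function name, data layout, and sample values are hypothetical.

```python
import random


def interpolate_examples(examples, n_new, seed=0):
    """Create synthetic examples by linear interpolation within a class.

    examples: list of (features, label) pairs, where features is a list of floats.
    n_new: number of synthetic examples to generate.
    """
    rng = random.Random(seed)
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    # Only classes with at least two examples can be interpolated.
    candidates = [lbl for lbl, rows in by_label.items() if len(rows) >= 2]
    if not candidates:
        return []
    synthetic = []
    for _ in range(n_new):
        label = rng.choice(candidates)
        a, b = rng.sample(by_label[label], 2)
        t = rng.random()  # position between a (t=0) and b (t=1)
        new_features = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        synthetic.append((new_features, label))
    return synthetic


examples = [
    ([1.0, 2.0], "defect"), ([3.0, 4.0], "defect"),
    ([10.0, 10.0], "ok"), ([12.0, 14.0], "ok"),
]
synthetic = interpolate_examples(examples, n_new=5)
for features, label in synthetic:
    print(label, features)
```

Because each synthetic point is built from existing examples, any label noise in those examples carries over, which is why cleaning the data comes before augmenting it.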
4. Feature engineering
Feature engineering uses prior knowledge or algorithms to represent raw data in terms of its most relevant variables (i.e., features).
The idea is to use domain knowledge to construct features that improve the quality of the predictive model, rather than feeding raw data to the model. Feature engineering is essential for adding features that may not exist in the original data but can significantly improve performance and reduce the need to collect large data sets.
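As one small example of turning domain knowledge into a feature, the sketch below derives body mass index (weight in kilograms divided by the square of height in meters) from raw height and weight. The record layout and field names are assumptions chosen for illustration.

```python
def add_bmi(records):
    """Return copies of the records with an added 'bmi' feature.

    records: list of dicts with 'height_cm' and 'weight_kg' keys
        (hypothetical field names for this sketch).
    """
    enriched = []
    for record in records:
        height_m = record["height_cm"] / 100
        bmi = record["weight_kg"] / height_m ** 2
        enriched.append({**record, "bmi": round(bmi, 1)})
    return enriched


patients = [
    {"height_cm": 180, "weight_kg": 81},
    {"height_cm": 150, "weight_kg": 45},
]
enriched = add_bmi(patients)
for row in enriched:
    print(row)
```

The derived feature encodes a relationship between two raw columns that a model would otherwise have to learn from many examples.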
5. Error analysis
After training the model on a given data set, error analysis can help you find the subset of the data set that needs improvement. By repeating this process, you can gradually improve the quality of your data, and with it the performance of your model.
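One simple way to carry out such an analysis, sketched here under assumed field names, is to group evaluation results by a slice of interest (for example, product type) and rank the slices by error rate. The slice with the highest error rate is a natural candidate for relabeling or additional data collection.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """Return (slice, error_rate) pairs sorted from worst to best.

    records: iterable of dicts with 'slice', 'label', and 'prediction' keys
        (hypothetical field names for this sketch).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        totals[r["slice"]] += 1
        if r["prediction"] != r["label"]:
            errors[r["slice"]] += 1
    rates = {s: errors[s] / totals[s] for s in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


records = (
    [{"slice": "product_a", "label": "ok", "prediction": "ok"}] * 3
    + [{"slice": "product_a", "label": "defect", "prediction": "ok"}]
    + [{"slice": "product_b", "label": "defect", "prediction": "ok"}] * 3
    + [{"slice": "product_b", "label": "ok", "prediction": "ok"}]
    + [{"slice": "product_c", "label": "ok", "prediction": "ok"}] * 2
)
ranking = error_rate_by_slice(records)
for name, rate in ranking:
    print(f"{name}: {rate:.2f}")
```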
6. Domain knowledge
In model-centric AI, domain experts are usually not involved, because the data is treated as a given artifact.
In data-centric AI, however, domain knowledge plays a vital role: domain experts can often detect subtle inconsistencies in the data, and correcting these can lead to better results.
The future of data-centric artificial intelligence
Although most data-centric AI ideas already exist as the conventional wisdom of AI engineers, data-centric AI aims to build systematic methods and tools to support this process.
Data-centric AI is an iterative process: the results of training, error analysis, and deployment may send you back to the data collection and model training stages to identify and correct problems in the data.
In order to help AI engineers adopt data-centric AI in their projects, the AI community has developed various tools. These include:
- LandingLens. LandingLens is a product developed by Landing AI, the company founded by Andrew Ng. It helps AI engineers develop and deploy consistent, iterative inspection systems for a variety of tasks in production environments. The tool consists of data, model, and deployment modules for managing data, speeding up troubleshooting, and scaling deployment.
- Cleanlab. This data-centric AI package helps clean up labels, perform error analysis, and learn from data sets that contain label errors.
- Snorkel. Snorkel is a data-centric platform that helps label and prepare training data programmatically, to speed up the process of building and deploying machine learning models.
- AutoAugment. This reinforcement-learning-based algorithm was developed by Google Brain to help increase the quantity and diversity of data in existing training data sets. It is also available as a Python package.
- Albumentations. This is another Python library, for fast and flexible image augmentation.
- HoloClean. HoloClean aims to let domain experts express their domain knowledge in a declarative way. This helps produce accurate predictions, analyses, and insights from noisy, incomplete, and erroneous data.
Conclusion
Data-centric AI prioritizes data quality over quantity. Compared with model-centric AI, which seeks to improve performance by expanding data sets, a data-centric approach can help alleviate many of the challenges that may arise when deploying AI infrastructure.