In this three-part series, Dr. Raminderpal Singh discusses the challenges that come with limited data quality and some practical solutions. This second article covers the problems that arise when using poor-quality data.
The first article in this series, published on Wednesday the 14th of August, discussed the important role that data quality plays in the effectiveness of machine learning (ML) and AI data analysis. The characteristics that define data quality were listed as:
- Completeness
- Consistency
- Lack of bias
- Accuracy
Understanding the impact of these characteristics requires an awareness of how sensitive algorithms are to them, and of the importance of choosing the right algorithm for the planned analysis. These are complex topics that require detailed discussion, which we will provide in future articles.
Below are some of the main problems caused by poor data quality:
- If data is incomplete, ML models may miss patterns and relationships. For example, if important bioactivity data is missing for a particular compound, the model cannot fully consider the structure-activity relationship, which can lead to inaccurate predictions.
- Inconsistencies such as variations in how data is recorded (e.g., different units of measurement or naming conventions) can confuse the model and lead to erroneous predictions. For example, if the same compound is labeled differently in different datasets, the model may treat it as a different entity, skewing the results.
- Data bias can cause a model to fail to generalize or perform poorly on certain subsets of data. For example, if the training data is biased towards a particular chemical scaffold or a particular set of biological targets, the model may be less effective at predicting activity for compounds outside of these categories.
- Data noise, which can result from experimental error, biological assay variability, or inconsistent conditions, can obscure true signals and reduce the ability of models to learn association patterns, potentially resulting in high rates of false positives or false negatives.
- Duplicate records can skew the training process by giving excessive weight to certain data points, leading to overfitting of the model and a lack of generalization.
- If the data is highly imbalanced, for example with many more inactive than active compounds, the model may be biased towards predicting the majority class, which may result in poor performance in identifying active compounds.
- Redundant features or data points inflate the dimensionality of your data without adding any new information, which can lead to overfitting and poor model performance.
- Poor data quality undermines scientific reproducibility, as other researchers or systems may not be able to reproduce findings.
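Several of the problems above (missing values, inconsistent naming, duplicate records and class imbalance) can be surfaced with simple checks before any model is trained. The following is a minimal sketch in plain Python, not a definitive implementation. The record layout, the field names (`compound`, `ic50_nm`, `active`), the function names and the toy values are all hypothetical and were invented for illustration. The name normalization shown here only handles letter case and surrounding whitespace; real compound identifiers usually need a more careful mapping.

```python
from collections import Counter

# Hypothetical toy records, invented purely for illustration.
records = [
    {"compound": "CPD-001", "ic50_nm": 120.0, "active": 0},
    {"compound": "cpd-001 ", "ic50_nm": 120.0, "active": 0},  # same compound, different spelling
    {"compound": "CPD-002", "ic50_nm": None, "active": 0},    # missing measurement
    {"compound": "CPD-003", "ic50_nm": 45.0, "active": 1},
    {"compound": "CPD-004", "ic50_nm": 300.0, "active": 0},
]


def normalize_name(name):
    """Collapse case and surrounding whitespace so simple naming variants compare equal."""
    return name.strip().lower()


def quality_report(rows, name_key="compound", value_key="ic50_nm", label_key="active"):
    """Return basic data quality indicators for a list of record dictionaries."""
    n = len(rows)
    if n == 0:
        return {"missing_fraction": 0.0, "duplicate_rows": 0,
                "name_variants": {}, "minority_fraction": 0.0}

    # Completeness: fraction of rows with no measured value.
    missing = sum(1 for r in rows if r[value_key] is None)

    # Duplicates: rows that repeat an earlier (normalized name, value) pair.
    pair_counts = Counter((normalize_name(r[name_key]), r[value_key]) for r in rows)
    duplicate_rows = sum(count - 1 for count in pair_counts.values() if count > 1)

    # Consistency: normalized names that appear under more than one raw spelling.
    spellings = {}
    for r in rows:
        spellings.setdefault(normalize_name(r[name_key]), set()).add(r[name_key])
    name_variants = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}

    # Balance: share of rows in the rarest class (0.0 if only one class is present).
    label_counts = Counter(r[label_key] for r in rows)
    minority = min(label_counts.values()) / n if len(label_counts) > 1 else 0.0

    return {
        "missing_fraction": missing / n,
        "duplicate_rows": duplicate_rows,
        "name_variants": name_variants,
        "minority_fraction": minority,
    }


report = quality_report(records)
for key, value in report.items():
    print(f"{key}: {value}")
```

On the toy records, the report flags one missing measurement out of five rows, one duplicate row, one compound recorded under two spellings, and a minority class that makes up a fifth of the data. Checks like these do not detect bias or experimental noise, which generally require domain knowledge of how the data was produced.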
The next installment in the series is scheduled to be released on Friday the 13th of September. In it, we'll provide practical guidelines to help you improve your data quality and detect whether it is being compromised.
About the Author
Dr. Raminderpal Singh
Dr. Raminderpal Singh is a recognized visionary in the implementation of AI across industries with a focus on technology and science. He has over 30 years of global experience in leading and advising teams, helping early to mid-stage companies achieve breakthroughs in the effective use of computational modelling.
Raminderpal currently serves as Global Head of the AI and GenAI practice at 20/15 Visioneers. He also founded and leads the open source community HitchhikersAI.org and is co-founder of Incubate Bio, which serves life science companies looking to accelerate research and reduce wet lab costs through computer-based modelling.
Raminderpal has extensive experience building businesses both in Europe and the US. As a business executive at IBM Research in New York, Dr. Singh led the market launch of IBM Watson Genomics Analytics. He also served as Vice President and Head of the Microbiome Division at Eagle Genomics Ltd in Cambridge. Raminderpal received his PhD in Semiconductor Modelling in 1997. He has published several papers, two books and holds 12 patents. In 2003, he was named one of the 13 most influential people in the semiconductor industry by EE Times.
For more information, visit http://raminderpalsingh.com, http://20visioneers15.com, http://hitchhikersAI.org and http://incubate.bio.