Development of New Methods for the (Q)SAR Applicability Domain Assessment: Using Structural Information in a Statistical Study of the Errors in Prediction

Diaza, Rodolfo Gonella (2015). Development of New Methods for the (Q)SAR Applicability Domain Assessment: Using Structural Information in a Statistical Study of the Errors in Prediction. PhD thesis, The Open University.

DOI: https://doi.org/10.21954/ou.ro.0000efa1

Abstract

The main aim of (Q)SAR is to build models that evaluate and predict properties of molecules, such as biological and environmental effects and physicochemical properties. These models are built from available experimental data, whose quality and quantity heavily affect their ability to produce reliable predictions for new chemicals. A dataset can be viewed as a "sampling" of the whole chemical space: if the sample is too small and/or too homogeneous, the model will inevitably be limited in the types of chemicals it can predict.

From the point of view of protecting human health and the environment, it is preferable for a model to predict even only a small number of chemicals, provided it does so with the highest possible reliability. The resulting "coverage" issue can be overcome by integrating results from different models. From this perspective, clearly defining each model's applicability domain is crucial for identifying which model is most suitable for each chemical to be assessed.

The definition of the applicability domain (AD) of (Q)SAR models is still an open research field. Several approaches have been proposed and implemented over the years, including the use of structural features such as functional groups and atom-centered fragments. These features have also proven useful for an a priori definition of the AD, making it independent of the specific algorithm chosen to develop the model.

In this study, the definition of the applicability domain of (Q)SAR models has been investigated using structural features of differing complexity: thresholds for chemical composition and molecular weight, chemical classes associated with molecules that are commonly predicted well or badly, and statistically extracted structural fragments used to model the error in prediction. In the case studies considered, these approaches improved on the AD definition provided by the model developers, supporting their integration into the definition of the models' applicability domain.
