The drug discovery process is notoriously time-consuming and expensive. Extensive laboratory testing is required to evaluate potential drug candidates for absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. Compounds that fail to meet acceptable thresholds for these properties may be unsafe or ineffective in humans, leading to high failure rates in clinical trials. But what if there were a way to predict ADMET issues computationally before spending resources on intensive lab assays? This is the goal of a new artificial intelligence (AI) platform called SAFIRE (Suite of ADMET Predictions For In Silico Refinement and Evaluation). Developed by scientists at Eurofins Discovery, SAFIRE leverages machine learning models trained on a unique proprietary database to predict key ADMET and drug-drug interaction parameters.
The BioPrint Database: A Goldmine for AI Training
At the heart of SAFIRE is the BioPrint database – a collection of approximately 2500 compounds with over 1 million associated experimental datapoints. What makes BioPrint special is both the quality and breadth of its data. It contains results from a wide array of ADMET and pharmacology assays performed under standardized conditions. About 60% of the compounds are marketed drugs, with the rest a mix of clinical candidates, withdrawn drugs, and tool compounds.
While 2500 molecules may seem small compared to the enormous proprietary datasets amassed by large pharmaceutical companies, the developers believe BioPrint’s data quality gives it an edge for machine learning applications. Consistent assay protocols yield cleaner, more reliable data for AI models to learn from. Having a diverse mix of well-characterized compounds also helps models generalize to new molecular structures.
SAFIRE Models Perform on Par with Industry Standards
The SAFIRE platform currently includes 13 machine-learning models predicting properties such as metabolic stability, solubility, inhibition of the hERG (human ether-à-go-go-related gene) channel, which is linked to cardiac side effects, and inhibition of key drug-metabolizing enzymes (cytochrome P450s). Model performance was evaluated using metrics such as accuracy and the Matthews Correlation Coefficient (MCC).
Impressively, all SAFIRE models achieved at least 75% accuracy and an MCC of at least 0.4 on validation compound sets. This meets or exceeds thresholds used to assess similar models developed by large pharma companies. Some models, like those for certain CYP450 isoforms, did show room for improvement in their true positive or true negative rates. But overall, SAFIRE’s strong performance suggests high-quality training data can help overcome the limitations of dataset size.
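To make the two metrics concrete, here is a minimal Python sketch of how accuracy and MCC are computed for a binary classifier from its confusion-matrix counts. The function names and the toy labels are illustrative assumptions, not SAFIRE code or SAFIRE results; the formulas themselves are the standard definitions.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; defined as 0 when a row or column of the matrix is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy, invented labels: 10 positives and 90 negatives (an imbalanced set).
y_true = [1] * 10 + [0] * 90

# A model that always predicts "negative" looks accurate but has no real skill.
always_negative = [0] * 100
print(accuracy(*confusion_counts(y_true, always_negative)))  # 0.9
print(mcc(*confusion_counts(y_true, always_negative)))       # 0.0
```

The toy case shows why the two metrics are usually reported together: on imbalanced data, accuracy alone can be high for a model that never identifies a positive, while MCC stays at zero.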
The team also compared their models to leading open-source and commercial ADMET prediction tools. While details were not provided, they reported that SAFIRE performed as well as or better than these tools across the board. This speaks to the power of combining robust data with modern machine-learning architectures.
Mixing Public and Proprietary Data Boosts Model Robustness
For some ADMET endpoints, BioPrint did not have sufficient data to train models alone. So, the developers experimented with augmenting their data with carefully selected public datasets. One might expect proprietary pharmaceutical data to be inherently better than published numbers aggregated from multiple sources. However, the team actually found that including both types of data usually gave superior results.
The solubility model illustrates this well. Training on BioPrint data alone led to overprediction of high solubility, likely due to class imbalance in the dataset. Using only public data also gave suboptimal performance. However, the model trained on the combined dataset achieved the best overall accuracy and MCC. It was also better at distinguishing medium-solubility compounds, which are important “goldilocks” cases in drug discovery. Other models showed similar synergies between public and proprietary data.
Including diverse public data seems to help in two key ways. First, it provides more examples of underrepresented categories, creating a more balanced dataset for the AI to learn from. Second, it expands the chemical space covered by the training set. This allows the models to make accurate predictions on a broader array of molecular structures that may be absent from BioPrint.
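The first point, class balance, is easy to check before training. The following Python sketch compares class fractions in one dataset alone and after merging it with a second one. The label lists are invented toy values chosen only to show the effect; they are not BioPrint or public-dataset statistics, and `class_fractions` is a hypothetical helper name.

```python
from collections import Counter

def class_fractions(labels):
    """Return the fraction of examples in each class, keyed by class name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in sorted(counts)}

# Invented toy labels: a set dominated by "high" solubility compounds...
proprietary = ["high"] * 8 + ["medium"] * 1 + ["low"] * 1
# ...and a second set with more "medium" and "low" examples.
public = ["high"] * 2 + ["medium"] * 5 + ["low"] * 3

print(class_fractions(proprietary))
print(class_fractions(proprietary + public))
```

In this toy case the share of the dominant class drops from 80% to 50% after merging, which is the kind of rebalancing the article credits for reducing overprediction of high solubility.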
SAFIRE Training Set Covers Broad Chemical Space
To further examine the chemical space represented in SAFIRE, the developers turned to a data visualization technique called principal moments of inertia (PMI) plots. These plots depict the 3D shape diversity of compounds in a dataset, with the three vertices representing the extremes of rod-like, disc-like, and spherical structures.
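For readers unfamiliar with PMI plots, the Python sketch below shows where the two plotted coordinates come from: the three principal moments of inertia I1 ≤ I2 ≤ I3 of a 3D structure are computed, and the ratios I1/I3 and I2/I3 place the compound in the triangle, with a rod at (0, 1), a disc at (0.5, 0.5), and a sphere at (1, 1). This is a simplified illustration under stated assumptions: it treats atoms as point masses of equal weight by default and uses idealized toy geometries rather than real molecules, and the function names are hypothetical. It is not the SAFIRE team's code.

```python
import math

def inertia_tensor(coords, masses):
    """Inertia tensor (3x3, symmetric) of point masses about their center of mass."""
    total = sum(masses)
    center = [sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)]
    t = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        x, y, z = (c[k] - center[k] for k in range(3))
        t[0][0] += m * (y * y + z * z)
        t[1][1] += m * (x * x + z * z)
        t[2][2] += m * (x * x + y * y)
        t[0][1] -= m * x * y
        t[0][2] -= m * x * z
        t[1][2] -= m * y * z
    t[1][0], t[2][0], t[2][1] = t[0][1], t[0][2], t[1][2]
    return t

def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix, ascending (closed-form trigonometric method)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted([a[0][0], a[1][1], a[2][2]])
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    middle = 3.0 * q - largest - smallest
    return [smallest, middle, largest]

def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3), the two coordinates of a PMI plot."""
    masses = masses or [1.0] * len(coords)  # assumption: equal masses unless given
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords, masses))
    if i3 <= 0.0:
        raise ValueError("structure has no spatial extent")
    return i1 / i3, i2 / i3

# Idealized toy shapes, not real molecules.
rod = [(-1, 0, 0), (0, 0, 0), (1, 0, 0)]
disc = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
sphere = disc + [(0, 0, 1), (0, 0, -1)]
for name, shape in [("rod", rod), ("disc", disc), ("sphere", sphere)]:
    print(name, normalized_pmi(shape))
```

A real workflow would use 3D conformers and atomic masses from a cheminformatics toolkit instead of these hand-placed points; the ratio calculation is the same.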
Overlaying the PMI plots of FDA-approved drugs and the SAFIRE training sets revealed substantial overlap. This suggests the BioPrint database and supplementary public data cover a meaningful chunk of therapeutically relevant chemical space. As expected, additional literature data increased the degree of overlap compared to BioPrint alone, especially in the rod-like and disc-like regions. So, while not fully comprehensive, SAFIRE models are trained on chemically diverse structures representative of real-world drug space.
The team acknowledges this analysis is qualitative and that there are still large untapped regions of chemical space worthy of exploration. They plan to strategically expand BioPrint in the future to cover novel regions, with the goal of making SAFIRE even more robust and applicable.
Flexible Scoring Scheme Aids Decision Making
Raw ADMET predictions from computational models are useful but don’t always point to a clear course of action. A compound may have great solubility but terrible metabolic stability – so is it a promising drug lead or not? The SAFIRE developers aimed to help users make sense of the model outputs by devising a simple but flexible scoring system.
Here’s how it works: Each ADMET property prediction is assigned a favorability score between 0 and 1. Optimal values (e.g., high solubility and low CYP450 inhibition) get a 1, acceptable values a 0.5, and undesirable values a 0. Then, to get a bird’s eye view of a compound, the 13 property scores are combined into a sum score and a geometric mean score.
Despite its name, the sum score is simply the average across all properties. It provides a balanced view of the compound’s overall ADMET profile. A few suboptimal property predictions will lower the score but won’t tank it completely. This is useful for evaluating and prioritizing lead compounds, where a mix of favorable features and unfavorable ones that can still be optimized is expected.
The geometric mean score, in contrast, is much harsher. If any single ADMET property scores a 0, the whole compound gets a failing grade. This “weakest link” approach is valuable in the hit discovery stage, where you may have thousands of compounds and want to quickly filter out any with obvious liabilities. It ensures only the most promising all-around starting points rise to the top.
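The difference between the two aggregate scores can be sketched in a few lines of Python. This assumes the sum score is the plain average the article describes and that the geometric mean score is the standard nth root of the product of the scores; the function names are illustrative and are not taken from SAFIRE.

```python
import math

def sum_score(scores):
    """Arithmetic mean of the per-property favorability scores (each 0, 0.5, or 1)."""
    return sum(scores) / len(scores)

def geometric_mean_score(scores):
    """Geometric mean of the scores; a single 0 forces the result to 0."""
    return math.prod(scores) ** (1.0 / len(scores))

# A hypothetical compound: 12 optimal properties and one undesirable one.
one_liability = [1.0] * 12 + [0.0]
print(round(sum_score(one_liability), 3))             # 0.923
print(geometric_mean_score(one_liability))            # 0.0

# The same compound if that property were merely "acceptable".
one_acceptable = [1.0] * 12 + [0.5]
print(round(sum_score(one_acceptable), 3))            # 0.962
print(round(geometric_mean_score(one_acceptable), 3)) # 0.948
```

With one undesirable property out of 13, the average stays above 0.9 while the geometric mean collapses to 0, which is the “weakest link” behavior described above.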
Of course, drug discovery projects may need to place more or less emphasis on particular ADMET properties. The SAFIRE interface allows users to customize the scoring to fit their goals. Don’t care about hERG because it’s irrelevant to your therapeutic area? Take it out of the score calculation. Need rock-solid solubility for an oral dosing formulation? Boost its weighting. This flexibility, enabled by having a full panel of ADMET models, makes SAFIRE adaptable to diverse drug discovery needs.
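The article does not say how SAFIRE applies custom weights, so the Python sketch below is only one plausible reading: each property gets a weight (default 1), a weight of 0 removes it from the calculation, and both the average and the geometric mean are computed in their weighted forms. The function name, the property names, and the example scores are all hypothetical.

```python
import math

def weighted_scores(scores, weights=None):
    """Return (weighted average, weighted geometric mean) of favorability scores.

    scores  -- dict mapping property name to a score in [0, 1]
    weights -- optional dict of per-property weights; missing names default to 1,
               and a weight of 0 excludes the property entirely
    """
    weights = weights or {}
    active = {name: weights.get(name, 1.0) for name in scores}
    active = {name: w for name, w in active.items() if w > 0}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one property must have a positive weight")
    average = sum(scores[name] * w for name, w in active.items()) / total
    geometric = math.prod(scores[name] ** (w / total) for name, w in active.items())
    return average, geometric

# Hypothetical favorability scores for one compound.
scores = {"solubility": 1.0, "metabolic_stability": 0.5, "hERG": 0.0, "cyp_inhibition": 1.0}

print(weighted_scores(scores))  # equal weights: the hERG score of 0 zeroes the geometric mean
print(weighted_scores(scores, {"hERG": 0, "solubility": 2}))  # drop hERG, double solubility
```

Under this scheme, excluding hERG keeps its score of 0 from vetoing the compound, while doubling the solubility weight makes that property count twice as much as the others in both scores.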
User-Friendly Web Tool Enables Community Access
All too often, powerful computational tools remain out of reach for many in the research community. They may require specialized hardware, complex installation, and programming skills to use effectively. The SAFIRE developers wanted to make sure their platform was readily accessible to anyone who could benefit from its ADMET predictions.
To that end, they created a web-based interface for SAFIRE that lets users submit their compounds of interest and receive ADMET property predictions and scores in an intuitive graphical form. Importantly, this includes access to a limited free version of the platform, ensuring smaller academic and biotech groups can take advantage of SAFIRE.
Behind the scenes, SAFIRE runs on open-source tools, including the Python package scikit-learn for machine learning and RDKit for working with chemical structures. This increases transparency and keeps the door open to future enhancements by the computational drug discovery community. While the full BioPrint database remains proprietary, the developers seem committed to an open science ethos.
Advancing AI for Drug Discovery
The SAFIRE platform demonstrates how high-quality data and modern machine learning can accelerate drug discovery. Models trained on the BioPrint database achieved strong predictive performance for key ADMET properties while covering a broad span of chemical space. Including public data expanded model applicability without sacrificing quality. A flexible scoring scheme makes the AI results actionable for diverse drug discovery projects.
SAFIRE’s early success highlights the immense value of databases like BioPrint. While modest in size, the depth, consistency, and annotation quality of BioPrint’s experimental results gave SAFIRE a strong foundation. Continued expansion of such high-caliber databases, in both size and chemical diversity, may hold the key to achieving human-level drug discovery insight with AI.
Thoughtful strategies for mixing proprietary and public data will also be critical. Because they work at a contract research organization, SAFIRE’s developers have access to customer data that may not be fully reflected in public repositories. But they recognize the benefits of judiciously layering in well-curated public results. Figuring out optimal ways to merge these data streams while respecting intellectual property will be an important challenge.
Moving forward, explainable and interactive AI will be essential for drug discovery applications. The SAFIRE team made some progress here with their interpretable scoring schemes and user-friendly web interface. But as machine learning models grow more sophisticated, making their “reasoning” transparent and manipulable by domain experts will be key to gaining trust and driving real-world impact.
Keeping Pace with a Fast-Moving Field
SAFIRE is an impressive step forward, but the team knows they must keep innovating to stay relevant. They’re already planning a host of enhancements – expanding BioPrint into novel regions of chemical space, refining training set balance, incorporating new descriptors and algorithms, and devising project-specific models and scoring functions. Partnerships with industry and academia could help drive progress.
From a broader perspective, the field of AI-driven drug discovery is advancing at breakneck speed. Major developments in deep learning, dataset curation, and simulation are reported almost daily. Commercial platforms and collaborations are proliferating. Marquee achievements, like AlphaFold’s protein structure predictions, have shown the disruptive potential of AI in biomedical research.
To keep pace, aspiring entrants like SAFIRE will need to offer concrete value to drug hunters while retaining the agility to absorb new techniques. Closely tracking the literature and engaging with the research community will be critical. Participating in data-sharing initiatives and benchmarking exercises can spur innovation and build credibility. Above all, listening to users and crafting tools that fit their real-world needs should be the north star.
A Promising Outlook
The SAFIRE platform is a powerful demonstration of how far AI has come for drug discovery. With just a moderately sized proprietary dataset and some carefully selected public data, the developers built highly accurate and applicable ADMET prediction models. The BioPrint database shines as an example of how experimentally consistent, well-annotated results can punch above their weight for machine learning.
SAFIRE also illustrates the benefits of combining data sources and prediction targets into a unified platform. Having all key ADMET properties estimated side by side, with sensible scores and visualizations, will help researchers quickly identify and act on the most promising compounds. The option to tailor the platform for specific projects should bring SAFIRE’s AI into more and more real-world drug discovery workflows.
Of course, SAFIRE is just one piece of the drug discovery puzzle. To achieve the full potential of AI, platforms like it will need to be integrated with other computational tools – molecular simulations, biological network analysis, clinical trial analytics, and more. Building connections between these technologies and making them all more accessible and interpretable to domain experts will keep computational scientists busy for years to come.
In the meantime, SAFIRE is a noteworthy addition to the growing ecosystem of AI for drug discovery. It shows how powerful machine learning has become for predicting complex biological properties, given the right data and methodology. And it provides an accessible platform for researchers to start applying these tools to their own drug discovery challenges today. While there’s still much work ahead, tools like SAFIRE are moving the field closer to the ultimate goal – safer, more effective medicines delivered faster and cheaper.
Article source: Reference Paper | Reference Article | SAFIRE is available to users through the Eurofins Discovery website.
Dr. Tamanna Anwar is a Scientist and Co-founder of the Centre of Bioinformatics Research and Technology (CBIRT). She is a passionate bioinformatics scientist and a visionary entrepreneur. Dr. Tamanna has worked as a Young Scientist at Jawaharlal Nehru University, New Delhi. She has also worked as a Postdoctoral Fellow at the University of Saskatchewan, Canada. She has several scientific research publications in high-impact research journals. Her latest endeavor is the development of a platform that acts as a one-stop solution for all bioinformatics related information as well as developing a bioinformatics news portal to report cutting-edge bioinformatics breakthroughs.