ChatGPT has become a rage all over the globe ever since its launch. From brainstorming business ideas to explaining complex code in the most clear-cut manner, it has been of tremendous help to people from all walks of life. And now, researchers from Xi’an Jiaotong University have developed a revolutionary ChatGPT counterpart suited specifically for drug development called DrugGPT. It is an autoregressive model that greatly streamlines and accelerates the design of potential ligands in the drug development process. DrugGPT effectively overcomes the shortcomings associated with conventional methods as well as other deep learning-based methods and proves itself to be a highly stable, efficient, and adaptable tool for drug discovery and design.

Navigating the Current Landscape of Drug Development

Gone are the days when discovering life-saving drug candidates could only be attributed to serendipity or the identification of the active ingredient in traditional remedies. Both of these ways were tedious and involved a serious lack of certainty and direction. Then came the era in which bioinformatics and computational chemistry were incorporated into the drug discovery and design process. These two fields gave drug development much-needed certainty and freedom from depending upon chance discoveries. Now, scientists know exactly where to search for potential drug candidates, what properties to analyze, and how different drug candidates compare with each other. Despite that, drug discovery and design remain largely lengthy, costly, and inefficient. One significant challenge is the vast chemical space required in the process. Because virtually, the number of potential drug candidates approaches infinity, and a comprehensive exploration of all of them would take forever.

Computational drug development approaches such as molecular docking, Quantitative Structure-Activity Relationship (QSAR), machine learning, and deep learning have been able to mitigate the issues, but only up to a limited extent. The demand to effectively explore the vast chemical space makes the development of an innovative tool or technique that can condense the enormity of the chemical space necessary.

Deep learning has demonstrated great potential in streamlining and automating complex processes in multiple fields. So, the question arises as to whether deep learning can be fine-tuned further to suit drug development needs and overcome the present drawbacks.

Is There Really a Remedy to the Ailment of Massive Chemical Space?

To combat the problem of vast chemical space in drug development, Yuesen Li and his group have come up with the brilliant strategy of employing an autoregressive model based on the GPT framework called DrugGPT. The primary learning strategy of DrugGPT is the tokenization of proteins and ligands. Tokenization with the Byte Pair Encoding (BPE) algorithm reveals that even though the number of potential drug candidates approaches infinity, a finite vocabulary is capable of representing all of these compounds, as is evident from the application of the BPE algorithm in the ZINC20 database. This database houses more than 2 billion compounds. Upon application of the BPE algorithm, it was discovered that only 5,373 tokens were sufficient to represent all the compounds in the database. This indicates that the colossal chemical space can be efficiently explored by mastering the permutations and combinations of a finite vocabulary.

To develop DrugGPT, first, a finite vocabulary was created. It consisted of the combination of the ligand and protein vocabularies (which were created by the BPE algorithm), and care was taken to handle duplicate tokens in the vocabulary. This led to a balanced sequence length distribution. Then, libraries and datasets from the HuggingFace team were utilized because they provide superior natural language processing tools along with effortless integration with the PyTorch platform. In the next step, the GPT2LMHeadModel was selected from the transformers library owing to its spectacular performance in natural language processing. The model was then trained from scratch because ligand and protein structural information vary considerably from natural language text. The training essentially involved two stages. In the first stage, the model was trained on the ligand text dataset in order to make it learn compound structure, properties, and representation. In the second stage, it was trained on the protein-ligand pair text dataset for the creation of compounds specific to proteins. After five cycles of training, it was observed that the model’s loss on the validation set reached 0.04. This is an indication that the model had successfully mastered a substantial amount of necessary information and had attained the ability to accomplish its task. 

There are three ways in which DrugGPT designs ligands. It can design ligands based on protein sequences as inputs. It can also design ligands that fulfill specific criteria and can autonomously generate ligand designs as well, even when no input is given. After the ligands are generated, screening and optimization steps follow. Ultimately, the selected compounds are subjected to experimental validation.

Can DrugGPT Save the Day When it Comes to Drug Development?

DrugGPT is an autoregressive model. Its autoregression-based approach boosts the accuracy with which it grasps the relationships between the structure of chemicals and their activities. DrugGPT also shows superior stability, and it is more convenient to optimize during training. A huge advantage of DrugGPT is that it implements the back-propagation algorithm during training, preventing the Mode Collapse problem commonly encountered in Generative Adversarial Networks in drug design. It even excels in adaptability, as it can extrapolate known information to unknown or new situations in order to accomplish diverse tasks. 

Evidently, DrugGPT brilliantly solves the problem of vast chemical space and makes the process of drug development a swifter, more accurate, and more efficient one.


Drug development is crucial to saving numerous lives on time. As the variety of pathogens and diseases increases, the quest to make the most effective medicines with the least side effects and as fast as possible gets bolstered. In such circumstances, leveraging the power of deep learning and other computational methods in drug design becomes a necessity. And the emergence of DrugGPT is certainly a huge step in the right direction. It already shows the potential for transforming the lengthy and expensive drug development process into a faster and more streamlined technique. Therefore, further research and improvement in DrugGPT can open the doors for new possibilities in the pharmaceutical field. 

Image Credit: Neegar@CBIRT

Article Source: Reference Paper | Code and checkpoints employed in the study are available at GitHub & Hugging Face

Important Note: bioRxiv releases preprints that have not yet undergone peer review. As a result, it is important to note that these papers should not be considered conclusive evidence, nor should they be used to direct clinical practice or influence health-related behavior. It is also important to understand that the information presented in these papers is not yet considered established or confirmed.

Learn More:

 | Website

Neegar is a consulting scientific content writing intern at CBIRT. She's a final-year student pursuing a B.Tech in Biotechnology at Odisha University of Technology and Research. Neegar's enthusiasm is sparked by the dynamic and interdisciplinary aspects of bioinformatics. She possesses a remarkable ability to elucidate intricate concepts using accessible language. Consequently, she aspires to amalgamate her proficiency in bioinformatics with her passion for writing, aiming to convey pioneering breakthroughs and innovations in the field of bioinformatics in a comprehensible manner to a wide audience.


  1. Thank you for reaching out and showing interest in the article! This article is based on a pre-print posted on bioRxiv in June 2023; the paper has not yet been published. As of now, there haven’t been any recent updates on DrugGPT, however, the pre-print itself highlights some noteworthy features of the tool.

    One exciting capability of DrugGPT is its ability to generate new compounds from ligand tokens, akin to how a language processor generates new sentences from familiar words. By controlling the randomness of this process, the tool ensures the production of high-quality molecules. Additionally, DrugGPT has the added feature of generating compounds that match specific protein structures, expanding its versatility in drug discovery.

    All credit for this research goes to the authors of the pre-print. To stay updated on any future developments or updates regarding DrugGPT, we recommend following the code repository associated with the tool, which is available on GitHub and Hugging Face. This repository will serve as a reliable source for any future developments or updates regarding DrugGPT.

    We appreciate your engagement with the article and would be delighted to assist you further. Please feel free to let us know if you require any additional information or have any other inquiries. Thank you again for your interest in the article!


Please enter your comment!
Please enter your name here