From 2e46a6a4b2bb07d9340c0658c07c59a15a4e8691 Mon Sep 17 00:00:00 2001
From: ACID Design Lab <82499756+acid-design-lab@users.noreply.github.com>
Date: Fri, 7 Jun 2024 00:26:29 +0300
Subject: [PATCH] Update README.md

---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index 0b9f5d8..18ad76e 100644
--- a/README.md
+++ b/README.md
@@ -14,6 +14,17 @@
 
    Our ultimate goal is to develop precise machine learning (ML) model allowing to **design CPPs with superior activity**.
 
+### Project pipeline
+
+   The main steps toward building an accurate model for CPP design are:
+   **1. Data curation and cleaning.** All inappropriate or ambiguous data should be removed or corrected.
+   **2. Data unification.** The data in Datasets are heterogeneous and should be unified in terms of variables, measurement units, etc.
+   **3. System parametrization.** Choose a set of parameters that describes both the CPPs and the experimental setup. Most existing models use symbolic representations that lack the physicochemical properties crucial for predicting CPP activity.
+   **4. Model selection.** The best-performing models should be chosen for screening, depending on the task complexity (sequence classification or sequence generation).
+   **5. Feature selection.** After model selection, the features used in the model should be chosen to give optimal prediction performance, robustness, and interpretability.
+   **6. Evaluation.** Every model should be evaluated beyond its performance on train/test datasets, for example by structural analysis of CPP candidates or by modelling their interaction with cellular membranes.
+   **7. Project design.** All results should be structured and systematized on GitHub for transparency and reproducibility.
+
 ### Challenges
 
    The main challenge here is to develop **unbiased model** not limited to existing CPP structures and cell penetration mechanisms. Another challenge is to develop CPPs **for particular drug delivery system and setup**, which includes multi-property optimization (amphiphilicity, molecular weight, toxicity etc.). Finally, models should be **interpretable**, which means user should know why particular CPP demonstrates its activity, and what are the possible ways to improve it further.