diff --git a/README.md b/README.md
index db0ae97..ff12f9d 100644
--- a/README.md
+++ b/README.md
@@ -54,6 +54,8 @@
d) A hybrid algorithm is the best choice, since simple classification/regression models can be "inverted" using evolutionary algorithms. Moreover, results obtained by simpler models can be reused by more complex ones to compensate for insufficient data.
+

+

2. Create a database. Process the datasets, look for more data, merge, clean, and unify it, then create a database with a DBMS.

- study the organization of data in the datasets
@@ -66,6 +68,8 @@
- move the data to DBMS
- set up access, data retrieval etc.
+

+

3. Analyze the data. Perform sequence alignment, look for conserved patterns, study correlations.

- perform local or global sequence alignment on CPP and non-CPP sequences (either all or particular groups/clusters)
@@ -74,6 +78,8 @@
- make correlation plots for categorical and numeric parameters
- try to determine which parameters and sequence patterns lead to the best-performing CPPs
+

+
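As a rough illustration of the global alignment mentioned in step 3, the sketch below computes a Needleman-Wunsch score with a linear gap penalty. The function name `global_alignment_score`, the scoring values, and the toy sequences are assumptions made for this example and are not part of the project; a real analysis would more likely use a substitution matrix such as BLOSUM62 and an established alignment library.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Scoring values are illustrative assumptions, not tuned for peptides.
    """
    # prev[j] is the best score for aligning the processed prefix of `a` with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, residue_a in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix costs i gaps
        for j, residue_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if residue_a == residue_b else mismatch)
            gap_in_b = prev[j] + gap      # residue_a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # residue_b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# Toy arginine-rich strings, hypothetical and for illustration only.
score = global_alignment_score("RRRR", "RRR")
```

Only the score is returned here; recovering the aligned strings would require keeping the full matrix and tracing back from the last cell.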

4. Choose the models. Find the best-performing classification/regression/generation models to develop and compare.

- do not screen models which were shown to underperform most of the modern ML/DL models
@@ -81,6 +87,8 @@
- you can use models pre-trained on more abundant data (transfer learning)
- prioritize interpretable models over black-box ones
+

+

5. Build and optimize the models. Check model performance on default parameters, optimize hyperparameters and architecture.

- choose the logic of train/test split (random, stratified, rule-based etc.)
@@ -89,6 +97,8 @@
- make a list of architectures you want to test (for neural networks)
- choose a method for hyperparameter tuning (Optuna, Grid search etc.)
+

+
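To make the stratified option for the train/test split in step 5 concrete, here is a minimal standard-library sketch. The function name `stratified_split`, its parameters, and the toy labels are assumptions for illustration; in practice an existing library routine would probably be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions kept roughly equal."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 CPPs (1) and 40 non-CPPs (0), hypothetical counts.
toy_labels = [1] * 10 + [0] * 40
train, test = stratified_split(toy_labels, test_fraction=0.2, seed=1)
```

Splitting each class separately keeps the CPP to non-CPP ratio in the test set close to the ratio in the full set, which matters when one class is much rarer than the other.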

6. Choose the best-performing model. Prioritize the list of models by accuracy, interpretability, extrapolative power, and robustness.

- analyze model performance (use appropriate classification/regression metrics or loss functions)
@@ -97,6 +107,8 @@
- analyze model speed
- choose the best model according to these parameters
+

+
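As one possible example of a classification metric for step 6, the sketch below computes the Matthews correlation coefficient for binary CPP/non-CPP labels. The choice of this metric, the function name `matthews_corrcoef_binary`, and the toy labels are assumptions for illustration, not something the plan prescribes.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Matthews correlation coefficient for labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy labels, hypothetical and for illustration only.
mcc = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 1, 0])
```

Unlike plain accuracy, this coefficient uses all four cells of the confusion matrix, so it stays informative when CPPs and non-CPPs are present in unequal numbers.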

7. Validate your model. Use computational, predictive, or hybrid approaches to check model consistency with first principles and previous studies.

- choose the methods for additional model validation (computational models, benchmarked ML/DL models, hybrid approaches)
@@ -104,6 +116,8 @@
- analyze how well these methods explain the obtained results
- generate novel CPPs for validation
+

+

8. Formalize the results. Create a repo, systematize analysis results, submit the code, ensure the code is reproducible, usable, and readable.

- create a GitHub repository structure
@@ -149,7 +163,7 @@
- uptake type,
- sequence.
-
+

2. Natural CPPs

@@ -173,7 +187,7 @@
Represents a balanced dataset of CPPs and non-CPPs; often used for model benchmarking.
-
+

3. Non-CPPs

@@ -188,7 +202,7 @@
Contains non-CPP sequences shown not to demonstrate activity experimentally.
-
+

4. Non-Natural CPPs

@@ -219,7 +233,7 @@
drawing
-
+

Modelling of interaction with membrane

@@ -235,7 +249,7 @@
drawing
-
+

Membrane permeability prediction