Dataset pipeline v0

  • Why not just create a whole new dataset?

    • We want to keep the same dataset for training and evaluation.
    • The initial dataset is already good; we don't want to waste it.
  • Goal: introduce paraphrased document chunks to the training process

  • OK, let's just go with the plan below cuz it's FAST to implement!

    • Smol model (0.5B)
    • 3 simple prompts -> 3 paraphrased chunks for each original chunk (why 3? idk, it was revealed to me in a dream, but it's smol and fast to run); see the sketch below
      • short, medium, long
      • 3 different styles/personalities
  • Next (v0.1):
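
A minimal sketch of the v0 paraphrasing step, assuming Qwen2.5-0.5B-Instruct via the Hugging Face transformers pipeline (the model name and prompt wording are placeholders; the real prompts live in paraphrase-prompt.md):

```python
# Sketch: 3 prompts -> 3 paraphrases (short / medium / long, different
# personas) per chunk with a ~0.5B instruct model. Model and prompts are
# assumptions, not the project's actual config.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",  # assumed smol model
)

# Placeholder prompt templates; replace with the ones in paraphrase-prompt.md.
PROMPTS = {
    "short": "Paraphrase this text in one concise sentence:\n\n{chunk}",
    "medium": "Rewrite this text in your own words, keeping every fact:\n\n{chunk}",
    "long": "Explain this text in detail, like a friendly teacher:\n\n{chunk}",
}

def paraphrase(chunk: str) -> dict[str, str]:
    """Return one paraphrase per style for a single original chunk."""
    out = {}
    for style, template in PROMPTS.items():
        messages = [{"role": "user", "content": template.format(chunk=chunk)}]
        result = generator(messages, max_new_tokens=256, do_sample=True)
        # generated_text holds the full chat; the last message is the reply
        out[style] = result[0]["generated_text"][-1]["content"]
    return out
```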

How?

  • Please refer to 250324_generate_data_anatomy.ipynb for more details

    • There are already 3 files generated by the original generate_dataset.py script. The question JSON file contains a chunk id.
    • Modify the chunks file to include the paraphrased chunks.
    • Re-run the FAISS index build.
  • The final data has a "chunk_id" field in the question JSON file. Is it used in, or important for, training or evaluation? - no (checked with Ctrl+F); only "question" and "answer" matter -> so I can just iterate over the chunk file and add paraphrased chunks to the vector store.

    • How do I iterate over the chunk.pkl file? (see the sketch at the end of this section)
      • use pickle to load the file
      • iterate over the chunks
      • paraphrase each chunk using paraphrase-prompt.md
      • add the paraphrased chunks to the vector store (how? will it affect the original chunk ids?)
        • can we just append the new chunks to the existing file?
      • save the vector store
      • save the question JSON file
  • Should I add wrong information or not? How correct should the paraphrased chunks be? How many paraphrased chunks should I add for each original chunk? - v0.1? For now, just use simple paraphrasing with correct information.
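
A rough sketch of that loop, assuming chunk.pkl holds a plain list of chunk strings and the vector store is a flat FAISS index over sentence-transformers embeddings (file names, pickle layout, and the embedding model are guesses; check 250324_generate_data_anatomy.ipynb for the real ones):

```python
# Sketch of the v0 loop: load chunks, append paraphrases, rebuild FAISS.
# Assumes chunk.pkl is a list[str]; adjust if the real format differs.
import pickle

import faiss
from sentence_transformers import SentenceTransformer

with open("chunk.pkl", "rb") as f:
    chunks: list[str] = pickle.load(f)

# Append AFTER the originals so existing position-based chunk ids stay
# valid; the paraphrases simply get new ids at the end.
extended = list(chunks)
for chunk in chunks:
    extended.extend(paraphrase(chunk).values())  # paraphrase() from the sketch above

with open("chunk.pkl", "wb") as f:
    pickle.dump(extended, f)

# Rebuild the FAISS index from scratch over the extended chunk list.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = embedder.encode(extended, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "index.faiss")
```

Appending rather than inserting answers the chunk-id worry above: the original ids (positions) are untouched, and since "chunk_id" is unused in training anyway, the question JSON file can stay as-is.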