# Introduction
- This notebook is used to understand the behavior of the `generate_data.py` script

## Chunking and FAISS index generation notes

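- Saved 341 chunks to `saved_data/chunks.pkl`
- The chunking process is deterministic: for the same input file and splitter settings, **the chunks in chunks.pkl will be identical across runs**
- The **FAISS index creation can be non-deterministic**:
  - Embedding vectors can differ slightly between runs because of floating-point precision issues (especially on GPU)
  - Approximate nearest neighbor index types can rely on random initialization, and the order of vector insertion can affect the final index structure
  - Note: as far as I can tell, `FAISS.from_documents` builds a flat (exact) index by default, so the approximate-search caveats mainly matter if the index type is changed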
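
One way to check the determinism claim for `chunks.pkl` is to hash the file after each run and compare the digests. This is a minimal sketch, assuming the output of two runs has been copied to separate locations; the `file_sha256` and `same_artifact` helpers and the example paths are hypothetical and not part of `generate_data.py`.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_artifact(path_a: str, path_b: str) -> bool:
    """True only when both files exist and are byte-for-byte identical."""
    if not (Path(path_a).is_file() and Path(path_b).is_file()):
        return False
    return file_sha256(path_a) == file_sha256(path_b)
```

For example, `same_artifact("run1/chunks.pkl", "run2/chunks.pkl")` (hypothetical paths) should return `True` if chunking is deterministic. Differing digests are only a hint: pickle bytes can also change with library versions, so comparing each chunk's `page_content` is the stricter check.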

In [None]:
import sys

sys.path.append("..")

import json
import os
import pickle
import re
from typing import Dict, List, Optional, Tuple

from langchain.text_splitter import RecursiveCharacterTextSplitter

# ========= Part 1: Document Processing and Embedding Generation =========
# Load and split the markdown document using LangChain
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.vectorstores import FAISS

from embeddings import CustomHuggingFaceEmbeddings

# Load your markdown file (adjust the path as needed)
loader = UnstructuredMarkdownLoader("../data/mission_report.md")
docs = loader.load()

# Split the document into smaller chunks (each 1000 characters, no overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(docs)

# Save chunks for later use
os.makedirs("saved_data", exist_ok=True)
with open("saved_data/chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)
print(f"Saved {len(chunks)} chunks to saved_data/chunks.pkl")

embeddings = CustomHuggingFaceEmbeddings()

# Create a FAISS vector store from the document chunks and save it locally
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")
print("Saved FAISS index to 'faiss_index'")

## Try loading the chunks and paraphrasing them

In [None]:
# Load the chunks from the pickle file
with open("saved_data/chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

print(f"Loaded {len(chunks)} chunks from saved_data/chunks.pkl")

# Print content of first 3 chunks
for i, chunk in enumerate(chunks[:3]):  # OK cool, so this is much simpler than I expected!
    print(f"\n--- Chunk {i + 1}/{len(chunks)} ---")
    print(chunk.page_content)
    print("-" * 50)


## Try loading a simple local model for paraphrasing (works with 16 GB VRAM!)

In [None]:
from unsloth import FastLanguageModel
from vllm import SamplingParams

# Load the model (Qwen2-0.5B-Instruct; the Llama 3.1 8B option is commented out)
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    model_name="Qwen/Qwen2-0.5B-Instruct",
    max_seq_length=2048,  # Reduced from 4096
    load_in_4bit=True,
    fast_inference=True,
    gpu_memory_utilization=0.8,  # Increased from 0.6
)

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.3,
    top_p=0.95,
    max_tokens=512,  # Reduced for simple responses
)

## ✅ Paraphrase V0
OK, this should be good enough for now.
```bash
--- Original Chunk 2/3 ---
Mission Spacecraft Description Laumch date Launch site PA-1 BP-6 First pad abort Nov.7, 1963 White Sands Missile Range; A-001 BP-12 Transonic abort May 13, 1964 N.Mex. White Sands Missile Range, AS-101 BP-13 Nominal launch and exit environment May 28, 1964 N. Mex. Cape Kennedy. Fla. AS-102 BP-15 Nominal launch and exit environment Sept.18,1964 Cape Kennedy, Fla. A-002 BP-23 Maximum dynamic pressure abort Dec.8, 1964 White Sands Missile Range, AS-103 BP-16 Micrometeoroid experiment Feb. 16, 1965 N.Mex. Cape Kennedy, Fla. A-003 BP-22 Low-altitude abort (planned high- May 19, 1965 White Sands Missile Range, AS-104 BP-26 altitude abort) Micrometeoroid experiment and service module May 25, 1965 N.Mex, Cape Kennedy, Fla. PA-2 BP-23A RCS launch environment Second pad abort June 29,1965 White Sands Missile Range, AS-105 BP-9A Micrometeoroid experiment and service module July 30, 1965 N. Mex. Cape Kennedy, Fla. A-004 SC-002 RCS launch environment Power-on tumbling boundary abort Jan.20,1966
--------------------------------------------------

--- Style 1 Paraphrase ---
The mission spacecraft described in the text is a spacecraft that is designed to launch and exit into a specific environment based on the specifications provided. The spacecraft is named "Mission Spacecraft" and is launched from the White Sands Missile Range in New Mexico on July 28, 1964. The spacecraft is
--------------------------------------------------

--- Style 2 Paraphrase ---
The mission spacecraft described in the text is a rocket that is designed to launch into the space. It is launched from a specific site, which is located in the United States. The spacecraft is launched into the space on a nominal launch and exit environment, which is described as a high-pressure abort. The spacecraft is also launched into the space on a micrometeoroid experiment, which is described as a planned high-altitude abort. The spacecraft is launched into the space on a RCS launch environment, which is described as a second pad abort. The spacecraft is launched into the space on a micrometeoroid experiment and service module, which is described as a power-on tumbling boundary abort.
--------------------------------------------------

--- Style 3 Paraphrase ---
The mission spacecraft, known as the Laumch, was launched from the site of the White Sands Missile Range, located in the state of New Mexico. The spacecraft was scheduled to enter the launch and exit environment on May 28, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 18, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on September 8, 1964, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on January 8, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on February 16, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on May 19, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on June 29, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on July 30, 1965, from the White Sands Missile Range in New Mexico. The spacecraft was also scheduled to enter the nominal launch and exit environment on October 20, 1966, from the White Sands Missile Range in New Mexico
```

In [None]:
PARAPHRASE_PROMPTS = [
    """Rewrite this text in a formal, scholarly tone. Keep it very concise - summarize in 1-2 short sentences. Only output the paraphrased text:

    TEXT: {text}""",
    
    """Rewrite this text in a clear, simple way that's easy to understand. Provide a medium-length explanation with key details. Only output the paraphrased text:
    
    TEXT: {text}""",
    
    """Rewrite this text in a vivid, engaging style. Expand on the details and provide a comprehensive, detailed version. Only output the paraphrased text:
    
    TEXT: {text}"""
]

# Update sampling parameters for each style
sampling_params_short = SamplingParams(
    temperature=0.3,
    top_p=0.95,
    max_tokens=64,  # Short responses
)

sampling_params_medium = SamplingParams(
    temperature=0.3,
    top_p=0.95,
    max_tokens=256,  # Medium responses
)

sampling_params_long = SamplingParams(
    temperature=0.3,
    top_p=0.95,
    max_tokens=512,  # Long responses
)

def generate_response(text: str) -> list[str]:
    """
    Generate three different paraphrased versions with varying lengths.

    Args:
        text: Text to paraphrase

    Returns:
        List of three paraphrased versions (short, medium, long)
    """
    responses = []
    sampling_params_list = [sampling_params_short, sampling_params_medium, sampling_params_long]

    for prompt_template, sampling_params in zip(PARAPHRASE_PROMPTS, sampling_params_list):
        formatted_prompt = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt_template.format(text=text)}],
            tokenize=False,
            add_generation_prompt=True,
        )

        output = model.fast_generate(
            [formatted_prompt], sampling_params=sampling_params
        )
        responses.append(output[0].outputs[0].text)

    return responses


# Test example
test_text = "The blue whale is the largest animal on Earth"
responses = generate_response(test_text)
for i, response in enumerate(responses, 1):
    print(f"\nStyle {i}:")
    print(response)

In [None]:
# paraphrase only the first 3 chunks with multiple styles
paraphrased_chunks = []
for chunk in chunks[:3]:
    styles = generate_response(chunk.page_content)  # Now returns list of 3 styles
    paraphrased_chunks.append(styles)

# print the first 3 chunks and their paraphrased versions
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Original Chunk {i + 1}/3 ---")
    print(chunk.page_content)
    print("-" * 50)
    
    for j, style in enumerate(paraphrased_chunks[i], 1):
        print(f"\n--- Style {j} Paraphrase ---")
        print(style)
        print("-" * 50)

## 📦 Testing a simple single prompt and choosing a model


```bash
# llama3.1-8b-instruct
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.38s/it, est. speed input: 50.57 toks/s, output: 39.34 toks/s]
```
- Paraphrasing with llama3.1-8b-instruct locally takes about 3 s/item, so the full run might take forever, LOL
- Should go with a smaller/quantized model first, or probably just use the Groq/OpenRouter API for this?

```bash
# Qwen2-0.5B-Instruct
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  2.32it/s, est. speed input: 369.09 toks/s, output: 167.13 toks/s]
```
- ok cool much faster now.
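
The two log lines above give a rough way to size the full job. This is a back-of-the-envelope sketch, assuming the single-prompt timings carry over to every chunk (they probably won't exactly, since chunk length and batching change throughput). `estimate_total_seconds` is a hypothetical helper; the chunk count comes from the "Saved 341 chunks" message.

```python
def estimate_total_seconds(
    num_chunks: int, prompts_per_chunk: int, seconds_per_prompt: float
) -> float:
    """Naive sequential estimate: every prompt costs the same wall-clock time."""
    return num_chunks * prompts_per_chunk * seconds_per_prompt


NUM_CHUNKS = 341  # from the "Saved 341 chunks" message

# Per-prompt latency taken from the progress-bar logs above.
SECONDS_PER_PROMPT = {
    "llama3.1-8b-instruct": 3.38,      # logged as 3.38 s/it
    "Qwen2-0.5B-Instruct": 1 / 2.32,   # logged as 2.32 it/s
}

for model_name, seconds in SECONDS_PER_PROMPT.items():
    for prompts_per_chunk in (1, 3):  # single prompt vs. three paraphrase styles
        total = estimate_total_seconds(NUM_CHUNKS, prompts_per_chunk, seconds)
        print(f"{model_name}, {prompts_per_chunk} prompt(s)/chunk: ~{total / 60:.1f} min")
```

Treat the result as an order-of-magnitude guide only: the logged timings come from one short test prompt, and sending prompts in batches would likely change the picture.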

- Noo, it hallucinated, lol, but does this matter 👀?
```
--- Chunk 1/3 ---
MSC-02680

DISTRIBUTION AND REFERENCING

This paper is not suitable for general distribution or referencing. It may be referenced only in other working correspondence and documents by participating organizations.

MANNED SPACECRAFT CENTER HOUSTON.TEXAS SEPTEMBER1970
--------------------------------------------------

--- Paraphrased Chunk 1/3 ---
The estimated total energy transferred to the command module was approximately 129 ampere hours. The total of 410 ampere hours remained in the lunar module batteries at the time of undocking. The figure 7.2-l.- Lunar module water usage shows the amount of water used in the lunar module during flight. The figure 7.2-2.- Lunar module total battery capacity during flight shows the total capacity of the lunar module batteries at the time of flight. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. The figure ll.l-2.- Field meter locations in the proximity of the launch complex shows the locations of field meters in the proximity of the launch complex. The field meter locations show the locations of field meters in the proximity of the launch complex.
```
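
A cheap way to flag outputs like the one above is to measure how many of the paraphrase's content words also appear in the source chunk. This is a rough sketch, not a real hallucination detector: `content_tokens` and `grounding_score` are hypothetical helpers, and the length-based word filter is an assumption standing in for a proper stop-word list.

```python
import re


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of length >= 4 (crude stop-word filter)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 4}


def grounding_score(original: str, paraphrase: str) -> float:
    """Fraction of the paraphrase's content tokens that also occur in the original."""
    para_tokens = content_tokens(paraphrase)
    if not para_tokens:
        return 0.0
    return len(para_tokens & content_tokens(original)) / len(para_tokens)
```

A legitimate paraphrase introduces new wording, so a low score only marks a pair for manual review. A score near zero, as for the lunar module text generated from the distribution notice, suggests the model ignored the input entirely.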


In [None]:
# paraphrase only the first 3 chunks
paraphrased_chunks = []
for chunk in chunks[:3]:
    paraphrased_chunks.append(
        generate_response(f"Paraphrase the following chunk: {chunk.page_content}")
    )

# print the first 3 chunks and their paraphrased versions LGTM!
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i + 1}/3 ---")
    print(chunk.page_content)
    print("-" * 50)
    print(f"\n--- Paraphrased Chunk {i + 1}/3 ---")
    print(paraphrased_chunks[i])
    print("-" * 50)
