{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cook vllm - sglang - test eval script\n",
"## OpenAI API for reference\n",
"- https://platform.openai.com/docs/api-reference/completions/create\n",
"- https://platform.openai.com/docs/api-reference/completions/create"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cook vllm\n",
"- https://docs.vllm.ai/en/latest/getting_started/examples/gradio_webserver.html\n",
"- https://docs.vllm.ai/en/latest/getting_started/examples/gradio_openai_chatbot_webserver.html\n",
"- https://docs.vllm.ai/en/latest/getting_started/examples/basic.html#gguf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!vllm serve meta-llama/Llama-3.2-1B-Instruct --dtype auto"
]
},
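{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quick smoke test for the server above: a minimal sketch assuming the vllm default `--port 8000` (adjust the URL if you pass a different port)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the OpenAI-compatible endpoint served by `vllm serve` above.\n",
"# Assumes the default port 8000 (no --port override).\n",
"import requests\n",
"\n",
"url = \"http://localhost:8000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Llama-3.2-1B-Instruct\",\n",
" \"messages\": [{\"role\": \"user\", \"content\": \"Say hello in one sentence.\"}],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print(response.json())"
]
},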
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cook sglang\n",
"- https://docs.sglang.ai/backend/server_arguments.html\n",
"- > lora_paths: You may provide a list of adapters to your model as a list. Each batch element will get model response with the corresponding lora adapter applied. Currently cuda_graph and radix_attention are not supported with this option so you need to disable them manually. We are still working on through these issues.\n",
"- > To enable multi-GPU data parallelism, add --dp 2. Data parallelism is better for throughput if there is enough memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %pip install \"sglang[all]>=0.4.5\" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -q\n",
"!python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --dp 2 --port 30000"
]
},
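{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of the `lora_paths` variant described in the quote above: cuda graph and radix attention must be disabled manually. The adapter path is a placeholder, and the flag spellings (`--lora-paths`, `--disable-cuda-graph`, `--disable-radix-cache`) should be checked against the server arguments page linked above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: serve with a LoRA adapter applied per batch element.\n",
"# {your-lora-adapter-path} is a placeholder; cuda graph and radix\n",
"# attention are disabled as the quoted sglang note requires.\n",
"!python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct \\\n",
" --lora-paths {your-lora-adapter-path} \\\n",
" --disable-cuda-graph --disable-radix-cache --port 30000"
]
},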
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = f\"http://localhost:30000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Llama-3.2-1B-Instruct\",\n",
" \"messages\": [{\"role\": \"user\", \"content\": \"Hello em an com chua?\"}],\n",
"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print(response.json())"
]
},
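{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same request through the official `openai` client (a sketch assuming `pip install openai` and sglang's OpenAI-compatible `/v1` route; the api_key value is a dummy since the local server does not check it):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same chat request via the openai client instead of raw requests.\n",
"# base_url points at the sglang server launched above (port 30000).\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"EMPTY\")\n",
"\n",
"resp = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-1B-Instruct\",\n",
" messages=[{\"role\": \"user\", \"content\": \"Hello, have you eaten yet?\"}],\n",
")\n",
"print(resp.choices[0].message.content)"
]
},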
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cook eval script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Command to launch SGLang server\n",
"!python3 -m sglang.launch_server \\\n",
" --model-path \"janhq/250404-llama-3.2-3b-instruct-grpo-03-s250\" \\\n",
" --context-length 8192 \\\n",
" --enable-metrics \\\n",
" --dtype bfloat16 \\\n",
" --host 0.0.0.0 \\\n",
" --port 8002 \\\n",
" --trust-remote-code \\\n",
" --disable-overlap \\\n",
" --disable-radix-cache"
]
},
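{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the eval script, check that the server is up. A minimal sketch assuming sglang's standard `/health` and `/get_model_info` endpoints on the port chosen above (8002):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Readiness check for the eval server launched above (port 8002).\n",
"import requests\n",
"\n",
"base = \"http://localhost:8002\"\n",
"print(requests.get(f\"{base}/health\").status_code) # expect 200 when ready\n",
"print(requests.get(f\"{base}/get_model_info\").json())"
]
},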
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# cd scripts/evaluation\n",
"!python run_eval.py \\\n",
" --config_path eval_config.yaml \\\n",
" --method_name research \\\n",
" --data_dir {root/path/to/evaluation/data} \\\n",
" --dataset_name bamboogle \\\n",
" --split test \\\n",
" --save_dir {your-save-dir} \\\n",
" --save_note research_qwen7b_ins\n",
" --sgl_remote_url {your-launched-sgl-url} \\\n",
" --remote_retriever_url {your-hosted-retriever-url} \\\n",
" --generator_model {your-local-model-path} \\\n",
" --apply_chat True"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "deepsearch-py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}