{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cook vllm - sglang - test eval script\n", "## OpenAI API for reference\n", "- https://platform.openai.com/docs/api-reference/completions/create\n", "- https://platform.openai.com/docs/api-reference/completions/create" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cook vllm\n", "- https://docs.vllm.ai/en/latest/getting_started/examples/gradio_webserver.html\n", "- https://docs.vllm.ai/en/latest/getting_started/examples/gradio_openai_chatbot_webserver.html\n", "- https://docs.vllm.ai/en/latest/getting_started/examples/basic.html#gguf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!vllm serve meta-llama/Llama-3.2-1B-Instruct --dtype auto" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cook sglang\n", "- https://docs.sglang.ai/backend/server_arguments.html\n", "- > lora_paths: You may provide a list of adapters to your model as a list. Each batch element will get model response with the corresponding lora adapter applied. Currently cuda_graph and radix_attention are not supported with this option so you need to disable them manually. We are still working on through these issues.\n", "- > To enable multi-GPU data parallelism, add --dp 2. Data parallelism is better for throughput if there is enough memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %pip install \"sglang[all]>=0.4.5\" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python -q\n", "!python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --dp 2 --port 30000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = f\"http://localhost:30000/v1/chat/completions\"\n", "\n", "data = {\n", " \"model\": \"meta-llama/Llama-3.2-1B-Instruct\",\n", " \"messages\": [{\"role\": \"user\", \"content\": \"Hello em an com chua?\"}],\n", "}\n", "\n", "response = requests.post(url, json=data)\n", "print(response.json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cook eval script" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Command to launch SGLang server\n", "!python3 -m sglang.launch_server \\\n", " --model-path \"janhq/250404-llama-3.2-3b-instruct-grpo-03-s250\" \\\n", " --context-length 8192 \\\n", " --enable-metrics \\\n", " --dtype bfloat16 \\\n", " --host 0.0.0.0 \\\n", " --port 8002 \\\n", " --trust-remote-code \\\n", " --disable-overlap \\\n", " --disable-radix-cache" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cd scripts/evaluation\n", "!python run_eval.py \\\n", " --config_path eval_config.yaml \\\n", " --method_name research \\\n", " --data_dir {root/path/to/evaluation/data} \\\n", " --dataset_name bamboogle \\\n", " --split test \\\n", " --save_dir {your-save-dir} \\\n", " --save_note research_qwen7b_ins\n", " --sgl_remote_url {your-launched-sgl-url} \\\n", " --remote_retriever_url {your-hosted-retriever-url} \\\n", " --generator_model {your-local-model-path} \\\n", " --apply_chat True" ] } ], "metadata": { "kernelspec": { "display_name": "deepsearch-py311", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }