Running an API Server with vLLM

This post walks through building a personal LLM API server by combining vLLM, Meta-Llama-3-8B-Instruct, tool calling, and an OpenAI-compatible API.

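Among the Llama 3 models released by Meta, Meta-Llama-3-8B-Instruct can be served efficiently with vLLM and exposed as an OpenAI-compatible API server. With tool calling enabled as well, the model can return structured tool calls in its responses, which client code can then execute.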

Prerequisites

1. On Windows, vLLM has to run inside WSL.

Install WSL with the command below, then enter the Linux environment with the wsl command.

wsl --install

2. On Hugging Face, request access to the meta-llama/Meta-Llama-3-8B-Instruct model so that it can be downloaded.

3. Install the vllm library.

pip install vllm
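
Before starting the server, it can help to confirm that the install actually landed in the Python environment you are using. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration and is not part of vLLM.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Reports the vllm version if `pip install vllm` succeeded in this environment.
print("vllm:", installed_version("vllm") or "not installed")
```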

Running the server

python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --dtype float16 \
  --port 8000

--model
Name of the Hugging Face model to serve (e.g. meta-llama/Meta-Llama-3-8B-Instruct)

--enable-auto-tool-choice
Lets the model decide on its own whether to call a tool, and which one, based on the prompt

--tool-call-parser llama3_json
Uses the JSON-based tool calling parser for Llama 3 models

--dtype float16
Runs the model in float16 (half precision) to keep GPU memory usage down

--port 8000
Port number the API server listens on
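
If you would rather start the server from a script, the same flags can be assembled in Python. This is a rough sketch, not an official vLLM launcher: the helper names build_server_command, wait_until_ready, and start_server are hypothetical, and the readiness check simply assumes the OpenAI-compatible /v1/models endpoint answers with HTTP 200 once the model has loaded.

```python
import subprocess
import time
import urllib.error
import urllib.request
from typing import List


def build_server_command(model: str, port: int = 8000,
                         parser: str = "llama3_json",
                         dtype: str = "float16") -> List[str]:
    """Assemble the command line shown above as an argument list."""
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--enable-auto-tool-choice",
        "--tool-call-parser", parser,
        "--dtype", dtype,
        "--port", str(port),
    ]


def wait_until_ready(base_url: str, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


def start_server(model: str, port: int = 8000) -> subprocess.Popen:
    """Start the server as a child process; the caller must terminate it."""
    return subprocess.Popen(build_server_command(model, port))
```

A typical use would be to call start_server("meta-llama/Meta-Llama-3-8B-Instruct"), then wait_until_ready("http://localhost:8000") before sending requests, and finally call terminate() on the returned process. The long default timeout is a guess that leaves room for the first model download.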

Testing the API

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"meta-llama/Meta-Llama-3-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"안녕 kjun 이 몇 글자야?\"}]}"
{"id":"chatcmpl-410a11bd045a4692b0d01030615e5c5a","object":"chat.completion","created":1751985357,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"😊\n\nThe Korean phrase \"안녕 kjun\" can be translated to \"Hello kjun\".\n\nThe word \"안녕\" (annyeong) is a greeting that means \"hello\" or \"goodbye\".\n\nAs for the word count, \"안녕 kjun\" is 4 characters in Korean:\n\n1. 안 (an)\n2. 녕 (nyeong)\n3. kj (kj)\n4. 운 (un)\n\nSo, the answer is 4! 👋","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":20,"total_tokens":122,"completion_tokens":102,"prompt_tokens_details":null},"prompt_logprobs":null}
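
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from this post is listening on localhost:8000. The helper names and the count_characters tool definition are hypothetical examples; the tool schema follows the OpenAI function-calling format and is only attached when you pass it in.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

# Hypothetical tool definition in the OpenAI "function" format.
COUNT_CHARACTERS_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "count_characters",
        "description": "Count the characters in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def build_chat_payload(prompt: str, model: str = MODEL,
                       tools: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build the JSON body for POST /v1/chat/completions."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if tools:
        # Only advertise tools when the caller supplies some.
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload


def post_chat(payload: Dict[str, Any],
              base_url: str = "http://localhost:8000") -> Dict[str, Any]:
    """Send the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

With the server running, post_chat(build_chat_payload("안녕 kjun 이 몇 글자야?")) mirrors the curl call, and passing tools=[COUNT_CHARACTERS_TOOL] additionally offers the model a tool it may choose to call.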

Result (abridged and formatted)

{
  "id": "chatcmpl-410a11bd045a4692b0d01030615e5c5a",
  "object": "chat.completion",
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "😊\n\nThe Korean phrase \"안녕 kjun\" ... So, the answer is 4! 👋"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 102,
    "total_tokens": 122
  }
}
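
Once the response is decoded, the interesting fields are the assistant message, any tool calls, and the token usage. The following is a sketch of how client code might read them and run requested tools locally. It assumes the OpenAI-style response shape shown above, where each tool call carries its arguments as a JSON string; count_characters and the TOOLS registry are hypothetical stand-ins for your own functions.

```python
import json
from typing import Any, Callable, Dict, List


def count_characters(text: str) -> int:
    """Hypothetical local tool: number of characters, ignoring spaces."""
    return len(text.replace(" ", ""))


# Maps tool names the model may request to local Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {"count_characters": count_characters}


def read_reply(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the first choice's content, tool calls, and total token count."""
    message = response["choices"][0]["message"]
    return {
        "content": message.get("content"),
        "tool_calls": message.get("tool_calls") or [],
        "total_tokens": response.get("usage", {}).get("total_tokens"),
    }


def run_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Execute each requested tool and build the follow-up 'tool' messages."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        arguments = json.loads(call["function"]["arguments"])
        output = TOOLS[name](**arguments)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```

The messages returned by run_tool_calls would be appended to the conversation, after the assistant message that requested them, and sent back to the server so the model can produce its final answer. In the response shown above, tool_calls is empty, so only content and usage apply.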

References

https://docs.vllm.ai/en/latest/
