lmdepoly_wrapper
Module Contents
Classes
TritonClient is a wrapper of TritonClient for LLM.
- class lagent.llms.lmdepoly_wrapper.TritonClient(tritonserver_addr, model_name, session_len=32768, log_level='WARNING', **kwargs)
Bases:
lagent.llms.base_llm.BaseModel
TritonClient is a wrapper of TritonClient for LLM.
- Parameters:
tritonserver_addr (str) – the address of the triton inference server, in “ip:port” format
model_name (str) – the name of the model
session_len (int) – the context size
max_tokens (int) – the expected number of generated tokens
log_level (str) – log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]
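The “ip:port” format expected by tritonserver_addr can be checked before a client is built. The following is a minimal sketch, not part of lagent: parse_triton_addr and make_triton_client are hypothetical helper names, and the import of TritonClient is deferred so that the snippet loads even when lagent is not installed.

```python
from typing import Tuple


def parse_triton_addr(addr: str) -> Tuple[str, int]:
    """Split an 'ip:port' string into (host, port) and validate the port."""
    host, sep, port_text = addr.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'ip:port', got {addr!r}")
    if not port_text.isdigit():
        raise ValueError(f"port is not a number in {addr!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range in {addr!r}")
    return host, port


def make_triton_client(addr: str, model_name: str, **kwargs):
    """Hypothetical helper: validate the address, then build a TritonClient."""
    parse_triton_addr(addr)  # raises ValueError on a malformed address
    # Deferred import: lagent is only required when this function is called.
    from lagent.llms.lmdepoly_wrapper import TritonClient
    return TritonClient(tritonserver_addr=addr, model_name=model_name, **kwargs)
```

Validating first keeps a malformed address from surfacing later as a less readable connection error.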
- generate(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)
Start a new round of conversation in a session and return the chat completions in non-stream mode.
- Parameters:
inputs (str, List[str]) – user’s prompt(s) in this round
session_id (int) – the unique id of a session
request_id (str) – the unique id of this round of conversation
max_tokens (int) – the expected number of generated tokens
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
- Returns:
(a list of/batched) text/chat completion
- stream_chat(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)
Start a new round of conversation in a session and return the chat completions in stream mode.
- Parameters:
session_id (int) – the unique id of a session
inputs (List[dict]) – user’s inputs in this round of conversation
request_id (str) – the unique id of this round of conversation
max_tokens (int) – the expected number of generated tokens
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
- Returns:
status, text/chat completion, generated token number
- Return type:
tuple(Status, str, int)
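A caller that only needs the final result of a stream can drain it and keep the last tuple. The sketch below assumes that stream_chat yields (status, text, token count) tuples one at a time and that each text is the completion accumulated so far; if the texts turn out to be increments, they would need to be concatenated instead. drain_stream and fake_stream are hypothetical names, and the string statuses stand in for the real Status values.

```python
from typing import Iterable, Optional, Tuple


def drain_stream(
    stream: Iterable[Tuple[object, str, int]],
) -> Tuple[Optional[object], str, int]:
    """Consume (status, text, token_count) tuples and return the last one.

    Assumption: every yielded text is the completion accumulated so far,
    so the final tuple carries the full response.
    """
    last_status, last_text, last_count = None, "", 0
    for status, text, count in stream:
        last_status, last_text, last_count = status, text, count
    return last_status, last_text, last_count


def fake_stream():
    """Stand-in generator used to exercise drain_stream without a server."""
    yield "streaming", "Hel", 1
    yield "streaming", "Hello", 2
    yield "end", "Hello!", 3
```

With a real client, the call would look like drain_stream(client.stream_chat(inputs)), where client is a TritonClient instance.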
- class lagent.llms.lmdepoly_wrapper.LMDeployPipeline(path, model_name=None, tp=1, pipeline_cfg=dict(), **kwargs)
Bases:
lagent.llms.base_llm.BaseModel
- Parameters:
path (str) –
The path to the model. It could be one of the following options:
i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).
ii) The model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when path is a pytorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat” and so on.
tp (int) – tensor parallel
pipeline_cfg (dict) – config of pipeline
- generate(inputs, do_preprocess=None, **kwargs)
Return the chat completions in non-stream mode.
- Parameters:
inputs (Union[str, List[str]]) – input texts to be completed.
do_preprocess (bool) – whether to pre-process the messages. Defaults to True, which means chat_template will be applied.
- Returns:
(a list of/batched) text/chat completion
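Because generate accepts either a single prompt or a list of prompts, calling code often wants a uniform return shape. The sketch below assumes that a str input yields a str and a list input yields a list; generate_all and EchoModel are hypothetical names, and EchoModel only mimics the documented signature so the snippet runs without a model.

```python
from typing import List, Union


def generate_all(model, inputs: Union[str, List[str]], **kwargs) -> List[str]:
    """Call model.generate and always return a list of completions.

    Assumption: generate returns a str for a str input and a list of str
    for a list input.
    """
    result = model.generate(inputs, **kwargs)
    return [result] if isinstance(result, str) else list(result)


class EchoModel:
    """Stand-in with the documented generate(inputs, do_preprocess=None) shape."""

    def generate(self, inputs, do_preprocess=None, **kwargs):
        if isinstance(inputs, str):
            return inputs.upper()
        return [text.upper() for text in inputs]
```

With lagent installed, an LMDeployPipeline instance (for example one built with path="internlm/internlm-chat-7b") could be passed in place of EchoModel.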
- class lagent.llms.lmdepoly_wrapper.LMDeployServer(path, model_name=None, server_name='0.0.0.0', server_port=23333, tp=1, log_level='WARNING', serve_cfg=dict(), **kwargs)
Bases:
lagent.llms.base_llm.BaseModel
- Parameters:
path (str) –
The path to the model. It could be one of the following options:
i) A local directory path of a turbomind model which is converted by the lmdeploy convert command or downloaded from ii) and iii).
ii) The model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
iii) The model_id of a model hosted inside a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when path is a pytorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat” and so on.
server_name (str) – host ip for serving
server_port (int) – server port
tp (int) – tensor parallel
log_level (str) – log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]
- generate(inputs, session_id=2967, sequence_start=True, sequence_end=True, ignore_eos=False, timeout=30, **kwargs)
Start a new round of conversation in a session and return the chat completions in non-stream mode.
- Parameters:
inputs (str, List[str]) – user’s prompt(s) in this round
session_id (int) – the unique id of a session
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
ignore_eos (bool) – whether to ignore the EOS token
timeout (int) – max time to wait for response
- Returns:
(a list of/batched) text/chat completion
- Return type:
List[str]
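The sequence_start and sequence_end flags suggest how a multi-round session might be driven. The sketch below reads the documented “start flag” and “end flag” as: set sequence_start only on the first round and sequence_end only on the last. That reading is an assumption, and round_flags, run_session and RecordingModel are hypothetical names used for illustration.

```python
from typing import Iterator, List, Tuple


def round_flags(num_rounds: int) -> Iterator[Tuple[bool, bool]]:
    """Yield (sequence_start, sequence_end) for each round of one session."""
    for index in range(num_rounds):
        yield index == 0, index == num_rounds - 1


def run_session(model, prompts: List[str], session_id: int = 2967) -> list:
    """Send the prompts as consecutive rounds of a single session."""
    replies = []
    for prompt, (start, end) in zip(prompts, round_flags(len(prompts))):
        replies.append(
            model.generate(
                prompt,
                session_id=session_id,
                sequence_start=start,
                sequence_end=end,
            )
        )
    return replies


class RecordingModel:
    """Stand-in that records the arguments of each generate call."""

    def __init__(self):
        self.calls = []

    def generate(self, inputs, session_id=2967, sequence_start=True,
                 sequence_end=True, **kwargs):
        self.calls.append((inputs, session_id, sequence_start, sequence_end))
        return f"reply to {inputs}"
```

A single-round call degenerates to the documented defaults, since both flags are True when there is only one prompt.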
- stream_chat(inputs, session_id=0, sequence_start=True, sequence_end=True, stream=True, ignore_eos=False, timeout=30, **kwargs)
Start a new round of conversation in a session and return the chat completions in stream mode.
- Parameters:
session_id (int) – the unique id of a session
inputs (List[dict]) – user’s inputs in this round of conversation
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
stream (bool) – return in a streaming format if enabled
ignore_eos (bool) – whether to ignore the EOS token
timeout (int) – max time to wait for response
- Returns:
status, text/chat completion, generated token number
- Return type:
tuple(Status, str, int)
- class lagent.llms.lmdepoly_wrapper.LMDeployClient(path, url, **kwargs)
Bases:
LMDeployServer
- Parameters:
path (str) – The path to the model.
url (str) – address of the api_server, in the form ‘http://<ip>:<port>’
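The url argument follows the ‘http://<ip>:<port>’ pattern, so it can be assembled from a host and a port. The following is a small sketch under that assumption: api_server_url and make_lmdeploy_client are hypothetical helpers, and the LMDeployClient import is deferred so the snippet loads without lagent installed.

```python
def api_server_url(host: str, port: int) -> str:
    """Build the 'http://<ip>:<port>' address expected by the url parameter."""
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"http://{host}:{port}"


def make_lmdeploy_client(path: str, host: str, port: int, **kwargs):
    """Hypothetical helper: connect to an api_server that is already running."""
    # Deferred import: lagent is only required when this function is called.
    from lagent.llms.lmdepoly_wrapper import LMDeployClient
    return LMDeployClient(path=path, url=api_server_url(host, port), **kwargs)
```

For a server started on the documented default port, api_server_url("127.0.0.1", 23333) gives "http://127.0.0.1:23333".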