`lmdepoly_wrapper`

Module Contents

Classes

`TritonClient`	A wrapper around a Triton inference server client for LLMs.
`LMDeployPipeline`	Parameter `path` (str): the path to the model.
`LMDeployServer`	Parameter `path` (str): the path to the model.
`LMDeployClient`	Parameter `path` (str): the path to the model.

class lagent.llms.lmdepoly_wrapper.TritonClient(tritonserver_addr, model_name, session_len=32768, log_level='WARNING', **kwargs)

Bases: lagent.llms.base_llm.BaseModel

A wrapper around a Triton inference server client for LLMs.

Parameters:

tritonserver_addr (str) – the address of the Triton inference server, in “ip:port” format
model_name (str) – the name of the model
session_len (int) – the context size
max_tokens (int) – the expected number of generated tokens
log_level (str) – the log level, for example 'WARNING'
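
Because `tritonserver_addr` is a plain “ip:port” string, it can be worth checking the format before constructing a client. The following is a minimal sketch under that assumption; `split_triton_addr` is a hypothetical helper that is not part of lagent, and the address shown is a placeholder.

```python
from typing import Tuple


def split_triton_addr(addr: str) -> Tuple[str, int]:
    """Split an "ip:port" string into (host, port). Hypothetical helper."""
    # rpartition splits on the last colon, so the port is always the tail.
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdecimal():
        raise ValueError(f"expected 'ip:port', got {addr!r}")
    port_num = int(port)
    if not 0 < port_num < 65536:
        raise ValueError(f"port out of range: {port_num}")
    return host, port_num


# Placeholder address; substitute the address of a running server.
host, port = split_triton_addr("127.0.0.1:33337")
```

A string that passes this check can then be handed to `TritonClient` as `tritonserver_addr` unchanged.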

generate(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)

Start a new conversation round in a session. Return the chat completions in non-stream mode.

Parameters:

inputs (str, List[str]) – the user’s prompt(s) in this round
session_id (int) – the unique id of a session
request_id (str) – the unique id of this conversation round
max_tokens (int) – the expected number of generated tokens
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session

Returns:

(a list of/batched) text/chat completion
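
`inputs` may be a single prompt or a list of prompts, and the return value is described as either one completion or a batch of them. One common way to implement such a convention is sketched below. `generate_like` and `fake_complete` are hypothetical stand-ins, and returning a bare string for a single-string input is an assumption that this page does not confirm.

```python
from typing import Callable, List, Union


def generate_like(
    inputs: Union[str, List[str]],
    complete_batch: Callable[[List[str]], List[str]],
) -> Union[str, List[str]]:
    """Normalize inputs to a batch, run the model once, then unwrap if needed."""
    batched = not isinstance(inputs, str)
    prompts = list(inputs) if batched else [inputs]
    outputs = complete_batch(prompts)
    # Assumption: a single-string input yields a single string back.
    return outputs if batched else outputs[0]


def fake_complete(prompts: List[str]) -> List[str]:
    """Stand-in for the real model call; it just upper-cases each prompt."""
    return [p.upper() for p in prompts]


single = generate_like("hi", fake_complete)
batch = generate_like(["a", "b"], fake_complete)
```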

stream_chat(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)

Start a new conversation round in a session. Return the chat completions in stream mode.

Parameters:

session_id (int) – the unique id of a session
inputs (List[dict]) – the user’s inputs in this conversation round
request_id (str) – the unique id of this conversation round
max_tokens (int) – the expected number of generated tokens
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session

Returns:

status, text/chat completion, number of generated tokens

Return type:

tuple(Status, str, int)
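
A streaming call of this shape is usually consumed in a loop over `(status, text, token count)` tuples. The sketch below shows one way to do that against a fake stream. The `Status` enum, its member names, and `fake_stream` are stand-ins invented for illustration. It also assumes that each tuple carries the full text generated so far rather than only the newest fragment, which this page does not state.

```python
from enum import Enum
from typing import Iterable, Iterator, Tuple


class Status(Enum):
    """Stand-in for the status type in the documented tuple (hypothetical)."""
    STREAMING = 0
    END = 1


def fake_stream() -> Iterator[Tuple[Status, str, int]]:
    """Imitates a stream_chat-style generator without a running server."""
    yield Status.STREAMING, "Hel", 1
    yield Status.STREAMING, "Hello", 2
    yield Status.END, "Hello", 2


def last_chunk(stream: Iterable[Tuple[Status, str, int]]) -> Tuple[str, int]:
    """Drain the stream and return the final text and token count."""
    text, tokens = "", 0
    for status, text, tokens in stream:
        if status is Status.END:
            break
    return text, tokens


final_text, final_tokens = last_chunk(fake_stream())
```

If the real stream yields incremental fragments instead, the loop would need to concatenate them rather than keep only the last value.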

class lagent.llms.lmdepoly_wrapper.LMDeployPipeline(path, model_name=None, tp=1, pipeline_cfg=dict(), **kwargs)

Bases: lagent.llms.base_llm.BaseModel

Parameters:

path (str) –
The path to the model. It can be one of the following options:
- i) A local directory path of a turbomind model which is
  converted by the lmdeploy convert command or downloaded from ii) and iii).
- ii) The model_id of an lmdeploy-quantized model hosted
  inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
- iii) The model_id of a model hosted inside a model repo
  on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when path is a PyTorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat” and so on.
tp (int) – the tensor parallelism degree
pipeline_cfg (dict) – configuration of the pipeline

generate(inputs, do_preprocess=None, **kwargs)

Return the chat completions in non-stream mode.

Parameters:

inputs (Union[str, List[str]]) – the input texts to be completed.
do_preprocess (bool) – whether to pre-process the messages. Defaults to True, which means chat_template will be applied.

Returns:

(a list of/batched) text/chat completion
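
A possible way to assemble the constructor arguments and call `generate` is sketched below. It relies only on the signature shown on this page. `pipeline_kwargs` and `build_pipeline` are hypothetical helpers, the model id is one of the examples listed above, and the lagent import is deferred so the sketch can be read without lagent or lmdeploy installed. The signature shows `pipeline_cfg=dict()` as the default, so the helper creates a fresh dict on every call to avoid sharing a mutable default.

```python
from typing import Any, Dict, Optional


def pipeline_kwargs(
    path: str,
    model_name: Optional[str] = None,
    tp: int = 1,
    pipeline_cfg: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect constructor arguments; copies pipeline_cfg so callers never share it."""
    return {
        "path": path,
        "model_name": model_name,
        "tp": tp,
        "pipeline_cfg": dict(pipeline_cfg or {}),
    }


def build_pipeline(**kwargs: Any):
    """Construct the pipeline; requires lagent and lmdeploy to be installed."""
    from lagent.llms.lmdepoly_wrapper import LMDeployPipeline  # deferred import
    return LMDeployPipeline(**kwargs)


kwargs = pipeline_kwargs("internlm/internlm-chat-7b", model_name="internlm-chat-7b")

# With lagent installed and the model available, usage would look like:
# llm = build_pipeline(**kwargs)
# print(llm.generate("Hello", do_preprocess=True))
```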

class lagent.llms.lmdepoly_wrapper.LMDeployServer(path, model_name=None, server_name='0.0.0.0', server_port=23333, tp=1, log_level='WARNING', serve_cfg=dict(), **kwargs)

Bases: lagent.llms.base_llm.BaseModel

Parameters:

path (str) –
The path to the model. It can be one of the following options:
- i) A local directory path of a turbomind model which is
  converted by the lmdeploy convert command or downloaded from ii) and iii).
- ii) The model_id of an lmdeploy-quantized model hosted
  inside a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.
- iii) The model_id of a model hosted inside a model repo
  on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat” and so on.
model_name (str) – needed when path is a PyTorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat” and so on.
server_name (str) – the host ip for serving
server_port (int) – the server port
tp (int) – the tensor parallelism degree
log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]

generate(inputs, session_id=2967, sequence_start=True, sequence_end=True, ignore_eos=False, timeout=30, **kwargs)

Start a new conversation round in a session. Return the chat completions in non-stream mode.

Parameters:

inputs (str, List[str]) – the user’s prompt(s) in this round
session_id (int) – the unique id of a session
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
ignore_eos (bool) – whether to ignore the eos token
timeout (int) – the maximum time to wait for a response

Returns:

(a list of/batched) text/chat completion

Return type:

List[str]
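
The `sequence_start` and `sequence_end` flags mark the first and last rounds of a session. With both left at their default of True, each call is a self-contained single-round session. For a multi-round session sharing one `session_id`, the flags for each round could be derived as in the sketch below; `round_flags` is a hypothetical helper, and the interpretation of the flags is inferred from their descriptions on this page.

```python
from typing import Iterator, Tuple


def round_flags(num_rounds: int) -> Iterator[Tuple[bool, bool]]:
    """Yield (sequence_start, sequence_end) for each round of one session."""
    for i in range(num_rounds):
        # Only the first round starts the session; only the last one ends it.
        yield i == 0, i == num_rounds - 1


flags = list(round_flags(3))

# Intended use, with `llm` being an already constructed wrapper instance:
# for prompt, (start, end) in zip(prompts, round_flags(len(prompts))):
#     llm.generate(prompt, session_id=1, sequence_start=start, sequence_end=end)
```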

stream_chat(inputs, session_id=0, sequence_start=True, sequence_end=True, stream=True, ignore_eos=False, timeout=30, **kwargs)

Start a new conversation round in a session. Return the chat completions in stream mode.

Parameters:

session_id (int) – the unique id of a session
inputs (List[dict]) – the user’s inputs in this conversation round
sequence_start (bool) – start flag of a session
sequence_end (bool) – end flag of a session
stream (bool) – return in a streaming format if enabled
ignore_eos (bool) – whether to ignore the eos token
timeout (int) – the maximum time to wait for a response

Returns:

status, text/chat completion, number of generated tokens

Return type:

tuple(Status, str, int)

class lagent.llms.lmdepoly_wrapper.LMDeployClient(path, url, **kwargs)

Bases: LMDeployServer

Parameters:

path (str) – The path to the model.
url (str) – the address of the api_server to connect to, in the form ‘http://<ip>:<port>’
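
The `url` argument follows the pattern ‘http://<ip>:<port>’. A minimal sketch of building such a string is shown below. `api_server_url` is a hypothetical helper, the ip is a placeholder, and the default port of 23333 is borrowed from the `server_port` default in the `LMDeployServer` signature.

```python
from urllib.parse import urlsplit


def api_server_url(ip: str, port: int = 23333) -> str:
    """Build an 'http://<ip>:<port>' address and sanity-check it."""
    url = f"http://{ip}:{port}"
    parts = urlsplit(url)
    # Reject inputs that do not parse back into a host and the same port.
    if not parts.hostname or parts.port != port:
        raise ValueError(f"could not form a valid address from {ip!r}, {port!r}")
    return url


url = api_server_url("127.0.0.1")

# With lagent installed and an api_server running at that address:
# from lagent.llms.lmdepoly_wrapper import LMDeployClient
# client = LMDeployClient(path="internlm/internlm-chat-7b", url=url)
```

A server bound to 0.0.0.0 listens on all interfaces, so a client would normally connect using a concrete ip or hostname rather than 0.0.0.0.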
