lmdepoly_wrapper

Module Contents

Classes

TritonClient

A wrapper around a Triton inference server client for LLMs.

LMDeployPipeline

path – The path to the model.

LMDeployServer

path – The path to the model.

LMDeployClient

path – The path to the model.

class lagent.llms.lmdepoly_wrapper.TritonClient(tritonserver_addr, model_name, session_len=32768, log_level='WARNING', **kwargs)

Bases: lagent.llms.base_llm.BaseModel

A wrapper around a Triton inference server client for LLMs.

Parameters:
  • tritonserver_addr (str) – the address in format “ip:port” of triton inference server

  • model_name (str) – the name of the model

  • session_len (int) – the context size

  • max_tokens (int) – the expected generated token numbers

  • log_level (str) – the log level; defaults to 'WARNING'
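
A minimal sketch of constructing the client, assuming lagent is installed and a Triton inference server is reachable at the given address. The helpers split_addr and build_triton_client are hypothetical names introduced here for illustration; only the constructor arguments and their defaults come from the signature above.

```python
from typing import Tuple


def split_addr(addr: str) -> Tuple[str, int]:
    """Split an "ip:port" string into (host, port); raise ValueError if malformed."""
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'ip:port', got {addr!r}")
    return host, int(port)


def build_triton_client(addr: str, model_name: str):
    """Check the address format, then construct a TritonClient (hypothetical helper)."""
    split_addr(addr)  # fail early on a malformed address
    # Imported lazily so this module loads even where lagent is not installed.
    from lagent.llms.lmdepoly_wrapper import TritonClient

    return TritonClient(
        tritonserver_addr=addr,
        model_name=model_name,
        session_len=32768,    # default taken from the signature
        log_level="WARNING",  # default taken from the signature
    )
```

Validating the address before construction is a design choice of this sketch, not documented behavior of the class.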

generate(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)

Start a new conversation round in a session and return the chat completions in non-stream mode.

Parameters:
  • inputs (str, List[str]) – user’s prompt(s) in this round

  • session_id (int) – the unique ID of the session

  • request_id (str) – the unique ID of this conversation round

  • max_tokens (int) – the expected number of generated tokens

  • sequence_start (bool) – start flag of a session

  • sequence_end (bool) – end flag of a session

Returns:

(a list of/batched) text/chat completion
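
The sequence_start and sequence_end flags can be used to mark the boundaries of a multi-round session. The sketch below assumes that setting sequence_start only on the first round and sequence_end only on the last keeps all rounds in one session; that reading is an interpretation of the parameter descriptions above, not confirmed behavior. round_flags and run_session are hypothetical helpers, and client is any object exposing generate with the signature shown above.

```python
from typing import Iterator, List, Tuple


def round_flags(n_rounds: int) -> Iterator[Tuple[bool, bool]]:
    """Yield (sequence_start, sequence_end) for each round of one session."""
    for i in range(n_rounds):
        yield i == 0, i == n_rounds - 1


def run_session(client, prompts: List[str], session_id: int = 2967) -> List[str]:
    """Send each prompt as one round of the same session and collect the replies."""
    replies = []
    for prompt, (start, end) in zip(prompts, round_flags(len(prompts))):
        replies.append(
            client.generate(
                prompt,
                session_id=session_id,
                max_tokens=512,        # default taken from the signature
                sequence_start=start,  # True only for the first round
                sequence_end=end,      # True only for the last round
            )
        )
    return replies
```

A single-round call keeps both flags True, which matches the defaults in the signature.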

stream_chat(inputs, session_id=2967, request_id='', max_tokens=512, sequence_start=True, sequence_end=True, **kwargs)

Start a new conversation round in a session and return the chat completions in stream mode.

Parameters:
  • session_id (int) – the unique ID of the session

  • inputs (List[dict]) – user’s inputs in this conversation round

  • request_id (str) – the unique ID of this conversation round

  • max_tokens (int) – the expected number of generated tokens

  • sequence_start (bool) – start flag of a session

  • sequence_end (bool) – end flag of a session

Returns:

status, text/chat completion, generated token number

Return type:

tuple(Status, str, int)
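
A minimal sketch of consuming the stream. It assumes that stream_chat yields the (status, text, token count) tuples one at a time and that the last tuple carries the complete text; neither point is stated above, so treat both as assumptions. drain_stream is a hypothetical helper that accepts any iterable of such tuples.

```python
from typing import Iterable, Optional, Tuple

Chunk = Tuple[object, str, Optional[int]]


def drain_stream(chunks: Iterable[Chunk]) -> Chunk:
    """Consume (status, text, token_count) tuples and return the last one seen."""
    last: Chunk = (None, "", None)
    for status, text, n_tokens in chunks:
        # A UI would typically refresh its display with `text` at this point.
        last = (status, text, n_tokens)
    return last
```

With a live client this would be called as drain_stream(client.stream_chat(inputs)).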

class lagent.llms.lmdepoly_wrapper.LMDeployPipeline(path, model_name=None, tp=1, pipeline_cfg=dict(), **kwargs)

Bases: lagent.llms.base_llm.BaseModel

Parameters:
  • path (str) –

    The path to the model. It could be one of the following options:

      i) A local directory path of a turbomind model that was converted by the lmdeploy convert command or downloaded from ii) or iii).

      ii) The model_id of an lmdeploy-quantized model hosted in a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      iii) The model_id of a model hosted in a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat”, and so on.

  • model_name (str) – needed when path is a PyTorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat”, and so on.

  • tp (int) – the tensor-parallel size

  • pipeline_cfg (dict) – configuration for the pipeline

generate(inputs, do_preprocess=None, **kwargs)

Return the chat completions in non-stream mode.

Parameters:
  • inputs (Union[str, List[str]]) – input texts to be completed.

  • do_preprocess (bool) – whether to pre-process the messages. Defaults to True, which means chat_template will be applied.

Returns:

(a list of/batched) text/chat completion
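
A minimal sketch of in-process use, assuming lagent and lmdeploy are installed. It also assumes that generate returns a single string for a single prompt and a list for a batch, which is one reading of "(a list of/batched)" above. build_pipeline and complete are hypothetical helpers; llm is any object exposing generate as documented.

```python
from typing import List, Optional, Union


def build_pipeline(path: str, model_name: Optional[str] = None, tp: int = 1):
    """Construct an LMDeployPipeline (hypothetical helper; needs lagent installed)."""
    # Imported lazily so this module loads even where lagent is not installed.
    from lagent.llms.lmdepoly_wrapper import LMDeployPipeline

    return LMDeployPipeline(path=path, model_name=model_name, tp=tp)


def complete(llm, prompts: Union[str, List[str]], do_preprocess: bool = True) -> List[str]:
    """Call generate and always return a list, for one prompt or a batch."""
    result = llm.generate(prompts, do_preprocess=do_preprocess)
    # Assumption: a str comes back for a single prompt, a list for a batch.
    return [result] if isinstance(result, str) else list(result)
```

Normalizing the result to a list keeps downstream code identical for single and batched prompts.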

class lagent.llms.lmdepoly_wrapper.LMDeployServer(path, model_name=None, server_name='0.0.0.0', server_port=23333, tp=1, log_level='WARNING', serve_cfg=dict(), **kwargs)

Bases: lagent.llms.base_llm.BaseModel

Parameters:
  • path (str) –

    The path to the model. It could be one of the following options:

      i) A local directory path of a turbomind model that was converted by the lmdeploy convert command or downloaded from ii) or iii).

      ii) The model_id of an lmdeploy-quantized model hosted in a model repo on huggingface.co, such as “InternLM/internlm-chat-20b-4bit”, “lmdeploy/llama2-chat-70b-4bit”, etc.

      iii) The model_id of a model hosted in a model repo on huggingface.co, such as “internlm/internlm-chat-7b”, “Qwen/Qwen-7B-Chat”, “baichuan-inc/Baichuan2-7B-Chat”, and so on.

  • model_name (str) – needed when path is a PyTorch model on huggingface.co, such as “internlm-chat-7b”, “Qwen-7B-Chat”, “Baichuan2-7B-Chat”, and so on.

  • server_name (str) – the host IP address to serve on

  • server_port (int) – the server port

  • tp (int) – the tensor-parallel size

  • log_level (str) – the log level, one of [CRITICAL, ERROR, WARNING, INFO, DEBUG]

generate(inputs, session_id=2967, sequence_start=True, sequence_end=True, ignore_eos=False, timeout=30, **kwargs)

Start a new conversation round in a session and return the chat completions in non-stream mode.

Parameters:
  • inputs (str, List[str]) – user’s prompt(s) in this round

  • session_id (int) – the unique ID of the session

  • sequence_start (bool) – start flag of a session

  • sequence_end (bool) – end flag of a session

  • ignore_eos (bool) – whether to ignore the EOS token

  • timeout (int) – the maximum time to wait for a response

Returns:

(a list of/batched) text/chat completion

Return type:

List[str]

stream_chat(inputs, session_id=0, sequence_start=True, sequence_end=True, stream=True, ignore_eos=False, timeout=30, **kwargs)

Start a new conversation round in a session and return the chat completions in stream mode.

Parameters:
  • session_id (int) – the unique ID of the session

  • inputs (List[dict]) – user’s inputs in this conversation round

  • sequence_start (bool) – start flag of a session

  • sequence_end (bool) – end flag of a session

  • stream (bool) – return in a streaming format if enabled

  • ignore_eos (bool) – whether to ignore the EOS token

  • timeout (int) – the maximum time to wait for a response

Returns:

status, text/chat completion, generated token number

Return type:

tuple(Status, str, int)

class lagent.llms.lmdepoly_wrapper.LMDeployClient(path, url, **kwargs)

Bases: LMDeployServer

Parameters:
  • path (str) – The path to the model.

  • url (str) – the address ‘http://<ip>:<port>’ of the api_server to communicate with
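
A minimal sketch of connecting to an already running api_server, assuming lagent is installed. api_server_url and connect are hypothetical helpers; the default port 23333 is borrowed from the LMDeployServer signature and may not match a given deployment.

```python
def api_server_url(host: str, port: int) -> str:
    """Format the 'http://<ip>:<port>' address expected by the url parameter."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"http://{host}:{port}"


def connect(path: str, host: str, port: int = 23333):
    """Construct an LMDeployClient for a running api_server (hypothetical helper)."""
    # Imported lazily so this module loads even where lagent is not installed.
    from lagent.llms.lmdepoly_wrapper import LMDeployClient

    return LMDeployClient(path=path, url=api_server_url(host, port))
```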