Generators#

Underlying LLMs (or any function which completes text) are represented as generators in Rigging. They are typically instantiated using identifier strings and the get_generator function.

The base interface is flexible and designed to support optimizations where the underlying mechanism allows them (batching, async, K/V caching, etc.).

Identifiers#

Much like database connection strings, Rigging generators can be represented as strings which define what provider, model, API key, generation params, etc. should be used. They are formatted as follows:

<provider>!<model>,<**kwargs>

provider maps to a particular subclass of Generator.
model is any str value, typically used by the provider to indicate a specific LLM to target.
kwargs are used to carry:
1. Serialized GenerateParams fields like temperature, stop tokens, etc.
2. Additional provider-specific attributes to set on the constructed generator class. For instance, you can set the LiteLLMGenerator.max_connections property by passing ,max_connections= in the identifier string.

The provider is optional, and Rigging will fall back to litellm/LiteLLMGenerator by default. See the LiteLLM docs for more information about supported model providers and parameters.

Here are some examples of valid identifiers:

gpt-3.5-turbo,temperature=0.5
openai/gpt-4,api_key=sk-1234
litellm!claude-3-sonnet-2024022
anthropic/claude-2.1,stop=output:;---,seed=1337
together_ai/meta-llama/Llama-3-70b-chat-hf
openai/google/gemma-7b,api_base=https://integrate.api.nvidia.com/v1

Building generators from string identifiers is optional, but it is a convenient way to represent complex LLM configurations.
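
To make the format concrete, here is a minimal sketch of how such a string could be split into its parts. The parse_identifier helper is hypothetical and is not Rigging's actual parser; it assumes values never contain commas, keeps every kwarg as a raw string, and leaves any type conversion to the caller.

```python
def parse_identifier(
    identifier: str, default_provider: str = "litellm"
) -> tuple[str, str, dict[str, str]]:
    """Split '<provider>!<model>,<**kwargs>' into (provider, model, kwargs)."""
    # The "<provider>!" prefix is optional; use the default when it is absent.
    provider, sep, rest = identifier.partition("!")
    if not sep:
        provider, rest = default_provider, identifier

    # The first comma-separated field is the model; the rest are key=value pairs.
    model, *pairs = rest.split(",")
    kwargs: dict[str, str] = {}
    for pair in pairs:
        key, eq, value = pair.partition("=")
        if not key or not eq:
            raise ValueError(f"Malformed kwarg in identifier: {pair!r}")
        kwargs[key] = value  # values stay as raw strings in this sketch
    return provider, model, kwargs


print(parse_identifier("gpt-3.5-turbo,temperature=0.5"))
print(parse_identifier("anthropic/claude-2.1,stop=output:;---,seed=1337"))
```

Splitting on the first = only means values such as URLs pass through untouched.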

Back to Strings

Any generator can be converted back into an identifier using either to_identifier or get_identifier.

import rigging as rg

generator = rg.get_generator("gpt-3.5-turbo,temperature=0.5")
print(generator.to_identifier())
# litellm!gpt-3.5-turbo,temperature=0.5

API Keys#

All generators carry a .api_key attribute which can be set directly, or by passing ,api_key= as part of an identifier string. Not all generators will require one, but they are common enough that we include the attribute as part of the base class.

Typically you will be using a library like LiteLLM underneath, and can simply use environment variables:

export OPENAI_API_KEY=...
export TOGETHER_API_KEY=...
export TOGETHERAI_API_KEY=...
export MISTRAL_API_KEY=...
export ANTHROPIC_API_KEY=...
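
If you rely on environment variables, it can help to fail early when one is missing rather than partway through a run. The following is a small sketch using only the standard library; missing_api_keys is a hypothetical helper, and the variable names you check should match whichever providers you actually use.

```python
import os
from typing import Mapping, Optional, Sequence


def missing_api_keys(
    required: Sequence[str], environ: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping so the result does not depend on your shell.
fake_env = {"OPENAI_API_KEY": "sk-1234", "MISTRAL_API_KEY": ""}
print(missing_api_keys(["OPENAI_API_KEY", "MISTRAL_API_KEY", "ANTHROPIC_API_KEY"], fake_env))
```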

Rate Limits#

Generators that leverage remote services (LiteLLM) expose properties for managing connection/request limits, such as LiteLLMGenerator.max_connections.

However, a more flexible solution is to use ChatPipeline.wrap() with a library like backoff to catch either broad or specific errors, such as rate limits or general connection issues.

import rigging as rg

import backoff
import backoff.types


def on_backoff(details: backoff.types.Details) -> None:
    print(f"Backing off {details['wait']:.2f}s")

pipeline = (
    rg.get_generator("claude-3-haiku-20240307")
    .chat("Give me a 4 word phrase about machines.")
    .wrap(
        backoff.on_exception(
            backoff.expo,
            Exception,  # This should be scoped down
            on_backoff=on_backoff,
        )
    )
)

chats = await pipeline.run_many(50)

Exception mess

You'll find that the exception consistency inside LiteLLM is quite poor. Different providers throw different types of exceptions for all kinds of status codes, response data, etc. With that said, you can typically find a target list that works well for your use-case.
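
As an illustration of scoping retries to a target list, the sketch below hand-rolls a retry decorator with the standard library instead of using backoff. The retry_on helper and the RETRYABLE tuple are hypothetical; the exception types are built-in placeholders, not the ones LiteLLM raises, so substitute whatever you observe for your provider.

```python
import asyncio
import functools

# Hypothetical target list; swap in the exception types you actually observe.
RETRYABLE = (ConnectionError, TimeoutError)


def retry_on(exceptions, max_tries=5, base_delay=0.0):
    """Retry an async callable on the given exceptions with exponential delays."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of attempts, surface the last error
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = {"n": 0}


@retry_on(RETRYABLE, max_tries=5)
async def flaky() -> str:
    # Fails twice with a retryable error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "ok"


result = asyncio.run(flaky())
print(result, calls["n"])
```

Exceptions outside the tuple propagate immediately, which is the behavior you want once the list has been scoped down.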

Local Models#

We have experimental support for both vLLM and transformers generators for loading and running local models. In general vLLM is more consistent with Rigging's preferred API, but the dependency requirements are heavier.

Where needed, you can wrap an existing model in a Rigging generator using the VLLMGenerator.from_obj() or TransformersGenerator.from_obj() methods. These are helpful for any picky model construction that might not play well with Rigging's constructors.

Required Packages

The use of these generators requires the vllm and transformers packages to be installed. You can use rigging[all] to install them all at once, or pick your preferred package individually.

import rigging as rg

tiny_llama = rg.get_generator(
    "vllm!TinyLlama/TinyLlama-1.1B-Chat-v1.0,"
    "gpu_memory_utilization=0.3,"
    "trust_remote_code=True"
)

llama_3 = rg.get_generator(
    "transformers!meta-llama/Meta-Llama-3-8B-Instruct"
)

Loading and Unloading

You can use the Generator.load and Generator.unload methods to better control memory usage. Local providers are typically lazy and load the model into memory only when first needed.
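
The lazy-loading pattern described here can be sketched in a few lines. LazyModel is a hypothetical stand-in written for illustration, not one of Rigging's generator classes; it only tracks how many times the (pretend) weights are loaded.

```python
class LazyModel:
    """Hypothetical stand-in for a local generator that loads on first use."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._weights = None  # nothing is held in memory yet
        self.load_count = 0

    def load(self) -> "LazyModel":
        if self._weights is None:
            self._weights = f"<weights for {self.name}>"  # pretend this is expensive
            self.load_count += 1
        return self

    def unload(self) -> "LazyModel":
        self._weights = None  # drop the reference so memory can be reclaimed
        return self

    def generate(self, text: str) -> str:
        self.load()  # lazy: a no-op when the weights are already loaded
        return f"{self.name}: {text}"


model = LazyModel("tiny")
model.generate("hello")      # first use triggers a load
model.generate("again")      # already loaded, no extra work
model.unload()               # free memory between workloads
model.generate("once more")  # loads again on demand
print(model.load_count)
```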

Overload Generation Params#

When working with both CompletionPipeline and ChatPipeline, you can overload and update any generation params by using the associated .with_() function.

with_() as keyword arguments

import rigging as rg

pipeline = rg.get_generator("gpt-3.5-turbo,max_tokens=50").chat([
    {"role": "user", "content": "Say a haiku about boats"},
])

for temp in [0.1, 0.5, 1.0]:
    chat = await pipeline.with_(temperature=temp).run()
    print(chat.last.content)

with_() as GenerateParams

import rigging as rg

pipeline = rg.get_generator("gpt-3.5-turbo,max_tokens=50").chat([
    {"role": "user", "content": "Say a haiku about boats"},
])

for temp in [0.1, 0.5, 1.0]:
    chat = await pipeline.with_(rg.GenerateParams(temperature=temp)).run()
    print(chat.last.content)
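
One way to picture how an overload combines with the params already on the generator is a field-by-field merge where any value set on the override wins. The sketch below uses a hypothetical Params dataclass to show that idea; it is an assumption about the semantics for illustration, not Rigging's GenerateParams implementation.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class Params:
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None

    def merge_with(self, other: "Params") -> "Params":
        # Only fields explicitly set on `other` override the base values.
        overrides = {
            f.name: getattr(other, f.name)
            for f in fields(other)
            if getattr(other, f.name) is not None
        }
        return replace(self, **overrides)


base = Params(max_tokens=50)
for temp in [0.1, 0.5, 1.0]:
    merged = base.merge_with(Params(temperature=temp))
    print(merged)  # max_tokens carries over, temperature changes per loop
```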

Writing a Generator#

All generators should inherit from the Generator base class, and can elect to implement handlers for messages and/or texts:

async def generate_messages(...) - Used for ChatPipeline.run variants.
async def generate_texts(...) - Used for CompletionPipeline.run variants.

Optional Implementation

If your generator doesn't implement a particular method like text completions, Rigging will simply raise a NotImplementedError for you. It's currently undecided whether generators should prefer to provide weak overloads for compatibility, or whether they should ignore methods which can't be used optimally to help provide clarity to the user about capability. You'll find we've opted for the former strategy in our generators.

Generators operate in a batch context by default, taking in groups of message lists or texts. Whether your implementation takes advantage of this batching is up to you, but you should optimize for it where possible.
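
The batch contract can be sketched without any model at all: one output per input, in the same order, with each input paired with its own params. In this sketch complete_one is a hypothetical stand-in for a single model or API call, and params are plain dicts rather than GenerateParams.

```python
import asyncio
from typing import Sequence


async def complete_one(text: str, max_tokens: int) -> str:
    # Stand-in for one request; a real generator would call a model here.
    await asyncio.sleep(0)
    return text.upper()[:max_tokens]


async def generate_texts(
    texts: Sequence[str], params: Sequence[dict]
) -> list[str]:
    if len(texts) != len(params):
        raise ValueError("Expected exactly one params entry per input text")
    # Run the whole batch concurrently; gather preserves input order.
    return list(
        await asyncio.gather(
            *(complete_one(text, p["max_tokens"]) for text, p in zip(texts, params))
        )
    )


outputs = asyncio.run(
    generate_texts(["ahoy", "boats"], [{"max_tokens": 2}, {"max_tokens": 5}])
)
print(outputs)
```

A backend with native batching could replace the gather call with a single batched request while keeping the same inputs and outputs.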

Generators are Flexible

Generators don't make any assumptions about the underlying mechanism that completes text. You might use a local model, an API endpoint, static code, etc. The base class is designed to be flexible and support a wide variety of use cases. That said, api_key, model, and generation params are common enough that they are included in the base class.

import typing as t

from rigging import Generator, GenerateParams, Message, GeneratedMessage

class Custom(Generator):
    # model: str
    # api_key: str
    # params: GenerateParams

    custom_field: bool

    async def generate_messages(
        self,
        messages: t.Sequence[t.Sequence[Message]],
        params: t.Sequence[GenerateParams],
    ) -> t.Sequence[GeneratedMessage]:
        # merge_with is an easy way to combine overloads
        params = [
            self.params.merge_with(p).to_dict() for p in params 
        ]

        # Access self vars where needed
        api_key = self.api_key
        model_id = self.model
        custom = self.custom_field

        # Build output messages with stop reason, usage, etc.
        # output_messages = ...

        return output_messages


generator = Custom(model='foo', custom_field=True)
generator.chat(...)

Registering Generators

Use the register_generator function to add your generator class under a custom provider id so it can be used with get_generator.

import rigging as rg

rg.register_generator('custom', Custom)
custom = rg.get_generator('custom!foo')
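
Conceptually, registration is a lookup table from provider id to class. The sketch below shows that pattern with hypothetical register and get functions and a toy Echo class; it handles only the provider!model portion of an identifier and is not how Rigging implements register_generator or get_generator.

```python
_REGISTRY: dict[str, type] = {}


def register(provider: str, cls: type) -> None:
    """Map a provider id to the class that should handle it."""
    _REGISTRY[provider] = cls


def get(identifier: str, default_provider: str = "litellm"):
    """Build an instance from '<provider>!<model>' (kwargs are ignored here)."""
    provider, sep, model = identifier.partition("!")
    if not sep:
        provider, model = default_provider, identifier
    if provider not in _REGISTRY:
        raise ValueError(f"Unknown provider: {provider!r}")
    return _REGISTRY[provider](model)


class Echo:
    # Toy class standing in for a custom generator.
    def __init__(self, model: str) -> None:
        self.model = model


register("echo", Echo)
echo = get("echo!foo")
print(type(echo).__name__, echo.model)
```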