How To Fix OpenAI Rate Limits & Timeout Errors
LLMs are being used across a wide variety of use cases: translation, sentiment analysis, code generation, blogs, emails, and more. However, integrating the OpenAI API directly into production comes with challenges, as the service is still relatively new. The API ships with no SLA or guarantee of uptime or performance, and there are rate limits on tokens per minute and requests per minute.
OpenAI recommends using various techniques to mitigate this. Let's explore a few of them briefly.
Exponential Backoff
Exponential backoff is a strategy for handling rate limits by progressively increasing the wait time between retries whenever a rate-limiting error occurs. Below is an example in Node.js:
const axios = require('axios'); // Make sure to install axios with npm or yarn.

const BASE_URL = 'https://api.openai.com/v1/chat/completions';

async function makeRequestWithBackoff(endpoint, params, retries = 3, backoffDelay = 500) {
  try {
    const response = await axios.post(endpoint, params, {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer YOUR_OPENAI_API_KEY`,
      },
    });
    return response.data;
  } catch (error) {
    if (error.response && error.response.status === 429 && retries > 0) { // 429 is the HTTP status code for Too Many Requests
      // Wait for the current backoff delay plus random jitter, then double the delay for the next retry
      const delay = backoffDelay + Math.random() * backoffDelay;
      console.log(`Rate limit hit, retrying in ${delay}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
      return makeRequestWithBackoff(endpoint, params, retries - 1, backoffDelay * 2);
    } else {
      // If it's not a rate limit error or we ran out of retries, throw the error
      throw error;
    }
  }
}

const params = {
  messages: [
    { role: "user", content: "Hi, who are you?" }
  ],
  max_tokens: 50,
  model: "gpt-3.5-turbo"
};

makeRequestWithBackoff(BASE_URL, params)
  .then(data => console.log(data))
  .catch(error => console.error(error));
You can also modify the logic to use a linear or purely random backoff instead of an exponential one, as in the sketch below.
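For example, a linear schedule (a rough sketch, not part of the original example) only changes how the delay grows between attempts:

// Linear backoff: the delay grows by a fixed step per attempt instead of doubling,
// e.g. attempt 1 -> 500ms, attempt 2 -> 1000ms, attempt 3 -> 1500ms.
// Use this in place of the exponential delay inside the retry branch.
function linearBackoffDelay(attempt, stepMs = 500) {
  return attempt * stepMs + Math.random() * 100; // small jitter to avoid synchronized retries
}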
Batching
OpenAI also allows batching multiple prompts into a single request on the /completions endpoint. This helps when you are hitting the requests-per-minute limit but still have headroom on tokens per minute. Keep in mind, though, that this legacy endpoint is being deprecated. Reusing the makeRequestWithBackoff helper from above:
const BASE_URL = "https://api.openai.com/v1/completions";
const params = {
model: "curie",
prompts: [
"Once upon a time there was a dog",
"Once upon a time there was a cat",
"Once upon a time there was a human"
]
};
makeRequestWithBackoff(BASE_URL, params)
.then(data => console.log(data))
.catch(error => console.error(error));
There are other techniques that you can use over and above these.
Caching
A lot of the time your users are querying the same thing, so a simple exact-match or semantic caching layer in front of your requests can save cost and response time. In this context, it also reduces the number of calls made to OpenAI; a minimal sketch follows below.
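As a rough illustration (building on the makeRequestWithBackoff helper from earlier; the TTL and the JSON-based cache key are arbitrary choices), an exact-match in-memory cache could look like this:

// A minimal in-memory exact-match cache keyed by the serialized request params.
// In production you would more likely use Redis or a semantic cache instead.
const cache = new Map();
const CACHE_TTL_MS = 5 * 60 * 1000; // keep entries for 5 minutes (arbitrary)

async function cachedRequest(endpoint, params) {
  const key = JSON.stringify({ endpoint, params });
  const hit = cache.get(key);
  if (hit && Date.now() - hit.timestamp < CACHE_TTL_MS) {
    return hit.data; // served from cache, no OpenAI call made
  }
  const data = await makeRequestWithBackoff(endpoint, params);
  cache.set(key, { data, timestamp: Date.now() });
  return data;
}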
Switching Between OpenAI and Azure
You can apply for access to the Azure OpenAI Service and set up load balancing between the two providers. That way, even if one of them is down or slow, you can switch to the other; a rough failover sketch follows below.
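As a rough sketch (reusing axios from the earlier examples; the Azure resource name, deployment name, and api-version are placeholders for your own values), a simple failover might look like this:

// A simple failover: try OpenAI first, and fall back to Azure OpenAI on failure.
const OPENAI_URL = 'https://api.openai.com/v1/chat/completions';
const AZURE_URL =
  'https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2024-02-01';

async function chatWithFailover(params) {
  try {
    const res = await axios.post(OPENAI_URL, params, {
      headers: { 'Authorization': 'Bearer YOUR_OPENAI_API_KEY' },
      timeout: 30000, // treat a slow provider as a failure
    });
    return res.data;
  } catch (err) {
    console.warn('OpenAI failed, switching to Azure:', err.message);
    // Azure routes by deployment, so the "model" field is not used to pick the model.
    const res = await axios.post(AZURE_URL, params, {
      headers: { 'api-key': 'YOUR_AZURE_API_KEY' },
      timeout: 30000,
    });
    return res.data;
  }
}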
Always Stream Responses
The OpenAI API provides a streaming option that lets you receive partial model responses in real time, as they are generated. This is a significant advantage over non-streaming calls, where you might not notice a timeout until the full response time has elapsed, which varies with request complexity and the max_tokens value.
Streaming ensures that, regardless of the request's size or the max_tokens setting, the model typically begins delivering tokens within the first 5–6 seconds. A delay beyond this brief window is an early sign that the request may time out or was not processed as expected, so you can terminate it and retry. A rough sketch of this pattern follows.
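As an illustrative sketch (reusing axios from the earlier examples; the 6-second first-token window and the raw handling of the event stream are simplifying assumptions), you could abort any request whose first token does not arrive in time and retry it:

// Stream the response and abort if no token arrives within a few seconds.
async function streamWithFirstTokenTimeout(params, firstTokenTimeoutMs = 6000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), firstTokenTimeoutMs);

  const response = await axios.post(
    'https://api.openai.com/v1/chat/completions',
    { ...params, stream: true },
    {
      headers: { 'Authorization': 'Bearer YOUR_OPENAI_API_KEY' },
      responseType: 'stream',
      signal: controller.signal,
    }
  );

  return new Promise((resolve, reject) => {
    let text = '';
    response.data.on('data', (chunk) => {
      clearTimeout(timer); // first bytes arrived, so the request is alive
      text += chunk.toString(); // raw server-sent events; parse the "data:" lines as needed
    });
    response.data.on('end', () => resolve(text));
    response.data.on('error', reject);
  });
}

// If the call above throws (for example because the abort fired before the first token),
// you can retry it, e.g. with the same backoff logic shown earlier.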
Setting Up Fallbacks
For use cases where it is acceptable to get responses from other models, you can set up fallbacks to those models. Good alternatives include Llama 70B, Gemini, or smaller models such as Mixtral 8x7B and Claude Instant, to name a few. These are some common techniques that can be used to mitigate errors in production-grade applications.
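As a final illustration, here is a rough sketch of such a fallback chain; callFallbackProvider is a hypothetical helper standing in for whichever SDK or HTTP call your alternate provider needs:

// A rough sketch of a model fallback chain, reusing makeRequestWithBackoff from above.
const modelChain = [
  { provider: 'openai', model: 'gpt-3.5-turbo' },
  { provider: 'other', model: 'mixtral-8x7b' }, // adjust to the providers you actually use
];

async function completeWithFallback(messages) {
  let lastError;
  for (const { provider, model } of modelChain) {
    try {
      if (provider === 'openai') {
        return await makeRequestWithBackoff(
          'https://api.openai.com/v1/chat/completions',
          { model, messages, max_tokens: 50 }
        );
      }
      return await callFallbackProvider(model, messages); // hypothetical helper
    } catch (error) {
      lastError = error;
      console.warn(`Model ${model} failed, trying the next fallback:`, error.message);
    }
  }
  throw lastError;
}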
That's it. Thank you for reading, and follow Merlin on Twitter. At Merlin API we provide all of these features and a lot more, with 20+ models to choose from. We focus on the reliability of the API: we handle all the switching, fallbacks, caching, and rate-limit handling for you, and we provide one unified API with a single response format across all models.
A small example of how to use the Merlin API with Node.js:
import { Merlin } from "merlin-node"; // npm install merlin-node
// WARNING: test api key.
// Replace with your API key from Merlin Dashboard
// https://api.getmerlin.in
const apiKey = "merlin-test-3b7d-4bad-9bdd-2b0d7b3dcb6d";
const merlin = new Merlin({ merlinConfig: { apiKey } });

const initChat = {
  role: "system",
  content: "You are a helpful assistant."
};

async function createCompletion() {
  try {
    const completion = await merlin.chat.completions.create({
      messages: [initChat],
      model: "gpt-3.5-turbo", // 20+ models as needed
    });
    console.log(completion); // inspect the completion response
  } catch (error) {
    console.error("Error creating completion:", error);
  }
}

createCompletion();
Experience the full potential of ChatGPT with Merlin
Kalpna Thakur
Our marketing powerhouse, she crafts innovative solutions for every growth challenge, all while keeping the fun in our team!