📑 Table of Contents

The Complete Guide to Best Practices for Large Language Model API Calls

📅 · 📁 Tutorials · 👁 13 views · ⏱️ 8 min read
💡 As large language model APIs become the core infrastructure for AI application development, efficiently and cost-effectively calling mainstream API platforms such as OpenAI, Anthropic, and DashScope has become an essential skill for developers. This article systematically outlines best practice strategies.

Introduction: API Calls Have Become a Required Course in AI Development

Since 2024, the volume of large language model API calls has experienced explosive growth. From OpenAI's GPT series and Anthropic's Claude series to Alibaba Cloud's DashScope (Lingji) platform hosting the Qwen series, an increasing number of developers and enterprises are choosing to integrate LLM capabilities via APIs rather than training from scratch or deploying locally. However, while API calls may seem straightforward, real-world implementation is riddled with hidden pitfalls — from runaway token consumption and soaring response latency to service crashes caused by improper error handling. These issues are plaguing a large number of development teams.

How do you write high-quality API call code? How do you dramatically reduce costs while maintaining performance? This article systematically outlines best practices for large language model API calls from a hands-on perspective.

Core Practice 1: Choose the Right Model for the Right Scenario

Many developers default to selecting the most powerful model when integrating LLMs. For example, sending all requests to GPT-4o or Claude 3.5 Sonnet — but this approach often results in serious resource waste.

The best practice is to establish a tiered model calling strategy:

  • Simple tasks (such as text classification, keyword extraction, format conversion): Use lightweight models like GPT-4o-mini, Claude 3.5 Haiku, or Qwen Turbo, reducing costs by over 90%.
  • Medium tasks (such as summary generation, general Q&A, content rewriting): Use mid-tier models like GPT-4o or Claude 3.5 Sonnet.
  • Complex tasks (such as long document analysis, complex reasoning, code generation): Only then should you call top-tier models like Claude 3.5 Opus or GPT-4o with high-parameter configurations.

The DashScope platform offers particularly flexible options in this regard. The Qwen series ranges from qwen-turbo to qwen-max across multiple tiers, allowing developers to precisely match models to task complexity.

Core Practice 2: Prompt Engineering and Token Optimization

Tokens are the core billing unit for APIs, and optimizing token consumption is key to cost control. The following techniques are worth noting:

1. Streamline Your System Prompt

Many developers pack excessive instructions into their System Prompt, repeatedly sending thousands of tokens of system instructions with every request. It is recommended to keep the System Prompt under 500 tokens and distill core instructions into concise, clear rules.

2. Leverage Few-shot Over Zero-shot

In scenarios requiring specific output formats, providing 2-3 carefully selected examples is often more effective than lengthy format descriptions, while also reducing retry overhead caused by format errors.

3. Set Reasonable max_tokens

Set a reasonable maximum output length for each request to prevent models from generating excessively long, useless content. For example, a sentiment analysis task only needs max_tokens set to 10, rather than the default 4096.

4. Utilize Caching Mechanisms

Both OpenAI and Anthropic have introduced Prompt Caching features. For scenarios involving large amounts of fixed context (such as document Q&A), enabling caching can significantly reduce billing for repeated tokens. Anthropic's caching mechanism can even reduce the cost of repeated portions to one-tenth of the original price.

Core Practice 3: Production-Grade Calling Strategies

API calls in production environments must account for stability and reliability:

Exponential Backoff Retry Mechanism

When encountering status codes like 429 (rate limit) or 500 (server error), you should employ an exponential backoff strategy for retries rather than immediately resending requests. It is recommended to use retry libraries such as tenacity (Python), setting maximum retries to 3-5 attempts, with an initial wait time of 1 second that doubles with each attempt.

Streaming Output

For user-facing applications, always enable stream mode. OpenAI, Anthropic, and DashScope all support SSE (Server-Sent Events) streaming responses, allowing users to see generated content in real time and significantly improving the experience. Additionally, streaming mode reduces Time to First Token from several seconds to just a few hundred milliseconds.

Concurrency Control and Rate Management

All platforms have RPM (requests per minute) and TPM (tokens per minute) limits. It is recommended to use semaphores or token bucket algorithms to control concurrency and avoid triggering rate limits. For the DashScope platform, developers can view and request quota increases through the console.

Cost Analysis: Pricing Comparison Across Three Major Platforms

In terms of pricing, each of the three major platforms has its advantages:

  • OpenAI: GPT-4o is priced at $2.5 per million input tokens and $10 per million output tokens; GPT-4o-mini goes as low as $0.15 per million input tokens and $0.6 per million output tokens, offering exceptional value.
  • Anthropic: Claude 3.5 Sonnet is priced at $3 per million input tokens and $15 per million output tokens, excelling in long-text processing and complex reasoning scenarios.
  • DashScope: The Qwen series is priced in RMB, with qwen-turbo as low as 2 RMB per million tokens, making it extremely friendly for domestic Chinese developers while eliminating the hassle of cross-border payments.

Overall, it is recommended that developers in China use DashScope as their primary platform for routine tasks, calling OpenAI or Anthropic models on demand when specific capabilities are needed, achieving the optimal balance between cost and performance.

The large language model API ecosystem is evolving rapidly. Several trends worth watching include:

First, multi-model orchestration will become the norm. Frameworks like LangChain and LlamaIndex are making it increasingly convenient for a single application to call multiple models, enabling developers to automatically route tasks to the most suitable model.

Second, the adoption of Batch APIs will further reduce costs. OpenAI has already launched a Batch API priced at just half the cost of real-time calls, suitable for offline processing scenarios where latency is not a concern.

Finally, hybrid architectures combining local models and cloud APIs are on the rise. Simple tasks are handled by locally deployed small models, while complex tasks are routed to cloud APIs. This architecture offers significant advantages in both privacy protection and cost control.

Mastering API call best practices is not only a demonstration of technical capability but also a critical foundation for building sustainable products in the AI era. We hope this article provides practical and actionable guidance for developers everywhere.