
How We Reduced API Response Time by 80%

A deep dive into caching, query optimization, and architecture changes.

Priya Patel Feb 10, 2026 8 min read
Performance API Design Caching PostgreSQL

When our client's healthcare platform started experiencing 2-3 second API response times during peak hours, it wasn't just a technical problem — it was directly impacting patient care. Doctors waiting for patient records, nurses delayed in updating vitals, administrators unable to pull reports. We were tasked with getting those response times under 500ms, and we ended up achieving an average of 380ms — an 80% reduction.

This post walks through the exact steps we took: profiling the bottlenecks, implementing multi-layer caching, optimizing database queries, and redesigning the API architecture. Every technique here is battle-tested and applicable to most backend systems.

Figure: performance monitoring dashboard. Before optimization, average response time was 1,900ms during peak hours.

Step 1: Profiling and Identifying Bottlenecks

Before optimizing anything, we needed to understand where time was being spent. We instrumented the API with distributed tracing using OpenTelemetry and visualized the results in Grafana. The trace data told a clear story: 70% of response time was spent in database queries, 20% in serialization/transformation, and 10% in network overhead.

middleware/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api-service');

export async function withTracing<T>(
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      // Record the exception on the span so it shows up in the trace UI
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Always profile before optimizing. We initially assumed the bottleneck was in our application logic, but tracing revealed that 70% of the time was spent on just three poorly optimized database queries.

Step 2: Database Query Optimization

The three worst offenders were a patient search query doing a full table scan on 2M+ rows, a medical history query with 6 nested JOINs, and an appointment lookup using an unindexed date range filter. Here's how we fixed each one.

For the patient search, we replaced the LIKE '%term%' query (a leading wildcard defeats B-tree indexes, forcing a full table scan) with PostgreSQL's full-text search backed by a GIN index. This single change cut the search query from 1,200ms to 45ms.

migrations/add_search_index.sql
-- Before: LIKE query (1,200ms on 2M rows)
-- SELECT * FROM patients WHERE name LIKE '%smith%';

-- After: Full-text search with GIN index (45ms)
ALTER TABLE patients ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(first_name, '') || ' ' || coalesce(last_name, '') || ' ' || coalesce(email, ''))
  ) STORED;

CREATE INDEX idx_patients_search ON patients USING GIN(search_vector);

-- Query using the index
SELECT * FROM patients
WHERE search_vector @@ plainto_tsquery('english', 'smith')
ORDER BY ts_rank(search_vector, plainto_tsquery('english', 'smith')) DESC
LIMIT 20;
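The appointment lookup had the same root cause: no index covering the date-range filter. A composite B-tree index fixed it. The table and column names below are illustrative, not our actual schema:

```sql
-- Hypothetical schema: appointments filtered by provider and date range.
-- Without an index, this predicate forces a sequential scan:
-- SELECT * FROM appointments
-- WHERE provider_id = $1 AND scheduled_at BETWEEN $2 AND $3;

-- A composite B-tree index serves the equality + range predicate,
-- with the equality column first so the range scan stays contiguous:
CREATE INDEX idx_appointments_provider_date
  ON appointments (provider_id, scheduled_at);
```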

For the medical history query with 6 JOINs, we introduced a materialized view that pre-joins the most commonly accessed data. We refresh it every 5 minutes with CONCURRENTLY so reads are never blocked.
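The view itself looked roughly like the sketch below; table and column names are illustrative. One gotcha worth calling out: REFRESH ... CONCURRENTLY requires a unique index on the materialized view.

```sql
-- Illustrative pre-joined view over the hot read path
CREATE MATERIALIZED VIEW patient_history_summary AS
SELECT p.id AS patient_id,
       p.first_name,
       p.last_name,
       d.code AS diagnosis_code,
       d.diagnosed_at
FROM patients p
JOIN diagnoses d ON d.patient_id = p.id;

-- CONCURRENTLY refreshes need a unique index on the view
CREATE UNIQUE INDEX idx_history_summary_pk
  ON patient_history_summary (patient_id, diagnosis_code, diagnosed_at);

-- Scheduled every 5 minutes (pg_cron or an application-level job);
-- reads against the view are never blocked during the refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY patient_history_summary;
```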

Figure: query execution plans revealed the exact bottlenecks in our database layer.

Step 3: Multi-Layer Caching Strategy

We implemented a three-tier caching strategy: in-memory cache (Node.js LRU), Redis distributed cache, and HTTP cache headers. Each layer serves a different purpose and has different invalidation strategies.

  • L1 — In-Memory LRU Cache (sub-millisecond): Hot data like config, feature flags, and frequently-accessed reference data. TTL of 60 seconds.
  • L2 — Redis Distributed Cache (2-5ms): User sessions, computed aggregations, and API responses. TTL varies by data type (5 min to 1 hour).
  • L3 — HTTP Cache Headers: Static resources and rarely-changing API responses. Leverages CDN edge caching for global distribution.
lib/cache.ts
import { LRUCache } from 'lru-cache';
import { redis } from './redis';

const memoryCache = new LRUCache<string, unknown>({
  max: 500,
  ttl: 60_000, // 60 seconds
});

export async function cached<T>(
  key: string,
  fetcher: () => Promise<T>,
  ttlSeconds = 300
): Promise<T> {
  // L1: Check memory cache (explicit undefined check so falsy cached
  // values like 0, '' or false still count as hits)
  const memResult = memoryCache.get(key);
  if (memResult !== undefined) return memResult as T;

  // L2: Check Redis (GET returns null on a miss)
  const redisResult = await redis.get(key);
  if (redisResult !== null) {
    const parsed = JSON.parse(redisResult) as T;
    memoryCache.set(key, parsed); // backfill L1 on an L2 hit
    return parsed;
  }

  // Cache miss: fetch and populate both layers
  const data = await fetcher();
  memoryCache.set(key, data);
  await redis.setex(key, ttlSeconds, JSON.stringify(data));
  return data;
}
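The L3 HTTP layer is not part of cached() above; it is applied when the response is written. A minimal, framework-agnostic sketch of how the header values can be built — the TTL split between browser max-age and CDN s-maxage is the tunable part, and the specific values here are assumptions, not our production settings:

```typescript
// Build a Cache-Control header value for the L3 HTTP layer.
// maxAge: browser TTL in seconds; sMaxAge: CDN edge TTL (s-maxage).
function cacheControl(opts: { maxAge: number; sMaxAge?: number; private?: boolean }): string {
  const parts = [opts.private ? 'private' : 'public', `max-age=${opts.maxAge}`];
  if (opts.sMaxAge !== undefined) parts.push(`s-maxage=${opts.sMaxAge}`);
  return parts.join(', ');
}

// Rarely-changing reference data: short browser TTL, longer edge TTL
cacheControl({ maxAge: 300, sMaxAge: 3600 }); // → 'public, max-age=300, s-maxage=3600'

// Patient-specific responses must never land in a shared cache
cacheControl({ maxAge: 0, private: true }); // → 'private, max-age=0'
```

Marking anything patient-specific as private is the non-negotiable part; the CDN only ever sees reference data.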

Step 4: API Architecture Changes

Beyond caching and query optimization, we made two architectural changes with significant impact. First, we enabled Brotli response compression, cutting payload sizes by 60-70%. Second, we introduced connection pooling with PgBouncer, eliminating the overhead of establishing a new database connection on every request.
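As a rough illustration of the compression trade-off, here is a standalone sketch using Node's built-in zlib Brotli bindings. The quality setting of 4 is an assumption for dynamic per-request compression, not our exact config; the library default of 11 is tuned for static assets and is too slow to run inline:

```typescript
import { brotliCompressSync, constants } from 'node:zlib';

// Compress a JSON payload the way a Brotli response middleware would.
function compressPayload(json: unknown): Buffer {
  const raw = Buffer.from(JSON.stringify(json));
  return brotliCompressSync(raw, {
    // Lower quality = faster compression; 4 is a common choice for
    // dynamic responses where latency matters more than ratio.
    params: { [constants.BROTLI_PARAM_QUALITY]: 4 },
  });
}

const payload = { patients: Array.from({ length: 200 }, (_, i) => ({ id: i, status: 'active' })) };
const compressed = compressPayload(payload);
// Repetitive JSON compresses to a small fraction of its raw size
```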

We also restructured our API endpoints to support partial responses. Instead of returning entire patient records (which could be 50KB+ with full medical history), clients can now request only the fields they need using a fields query parameter. This alone reduced average payload size from 48KB to 6KB.
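A hypothetical sketch of the field projection (the helper name and the `fields=id,name` parameter shape are illustrative; the actual endpoint contract may differ):

```typescript
// Project a record down to the fields the client asked for,
// e.g. GET /patients/123?fields=id,name
function pickFields<T extends Record<string, unknown>>(
  record: T,
  fieldsParam: string | undefined
): Partial<T> {
  if (!fieldsParam) return record; // no filter: return the full record
  const wanted = new Set(fieldsParam.split(',').map((f) => f.trim()));
  return Object.fromEntries(
    Object.entries(record).filter(([key]) => wanted.has(key))
  ) as Partial<T>;
}

const patient = { id: 123, name: 'Jane Smith', dob: '1980-04-02', history: ['2019: appendectomy'] };
pickFields(patient, 'id,name'); // → { id: 123, name: 'Jane Smith' }
```

Doing the projection at the database layer (selecting only the requested columns) saves even more, but even this response-side version avoids serializing and shipping the heavy history field.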

Results

After implementing all four optimization layers, the results exceeded our targets:

  • Average response time: 1,900ms → 380ms (80% reduction)
  • P95 response time: 4,200ms → 650ms (85% reduction)
  • Database CPU utilization: 78% → 23% during peak hours
  • API throughput: 450 req/s → 2,800 req/s (6x increase)
  • Average payload size: 48KB → 6KB (87% reduction)

"The performance improvements transformed the daily experience for our medical staff. What used to be a frustrating wait is now instant. Patient care has measurably improved because clinicians spend less time waiting and more time with patients."

Dr. Sarah Mitchell, Chief Medical Information Officer
Figure: dashboard after optimization, showing consistent sub-500ms response times even during peak load.

Key Takeaways

Performance optimization is not a one-time task — it's an ongoing discipline. The most impactful lesson from this project is that profiling should always come first. Without distributed tracing, we would have spent weeks optimizing the wrong things. With it, we identified the three critical paths in hours and had a clear roadmap for improvement.


Priya Patel

Senior Backend Engineer