Langfuse v3 stable release
Langfuse v3 is now stable and ready for production use when self-hosting. The release includes many scalability and architectural improvements.
This is the biggest Langfuse release since we initially launched early last year. Thank you to everyone who contributed to the release and provided feedback via GitHub Discussions (v3 thread)!
What’s changing?
If you use Langfuse Cloud, nothing changes besides significantly better performance and reliability; in fact, you have already been using large parts of v3 over recent months. At the core, these changes help Langfuse scale reliably and unlock analytical product features for learning from large volumes of production data.
If you self-host Langfuse, a whole lot is changing. With Langfuse v3, Langfuse is getting a new architecture that is optimized for scalability, reliability, and performance. Read on to learn more about the new architecture and the new features.
Infrastructure changes
Langfuse has gained significant traction over the last months, both in our Cloud environment and in self-hosted setups. With Langfuse v3 we introduce changes that allow our backend to handle hundreds of events per second with higher reliability. To achieve this scale, we introduce a second Langfuse container and additional storage services like S3/Blob store, Clickhouse, and Redis which are better suited for the required workloads than our previous Postgres-based setup.
In short, Langfuse v3 adds:
- A new worker container that processes events asynchronously.
- A new S3/Blob store for storing large objects.
- A new Clickhouse instance for storing traces, observations, and scores.
- Redis/Valkey for queuing events and caching data.
Architecture Diagram
Langfuse consists of two application containers, storage components, and an optional LLM API/Gateway.
- Application Containers
- Langfuse Web: The main web application serving the Langfuse UI and APIs.
- Langfuse Worker: A worker that asynchronously processes events.
- Storage Components:
- Postgres: The main database for transactional workloads.
- Clickhouse: High-performance OLAP database which stores traces, observations, and scores.
- Redis/Valkey cache: A fast in-memory data structure store. Used for queue and cache operations.
- S3/Blob Store: Object storage to persist all incoming events, multi-modal inputs, and large exports.
- LLM API / Gateway: Some features depend on an external LLM API or gateway.
Langfuse can be deployed within a VPC or on-premises in high-security environments. Internet access is optional. See networking documentation for more details.
Benefits: improved scalability, reliability, and performance
Since running this infrastructure on Langfuse Cloud, we have observed a significant improvement in reliability and performance. These are some of the largest benefits:
- Queued trace ingestion: All traces are received in batches by the Langfuse Web container and immediately written to S3. Only a reference is persisted in Redis for queueing. Afterward, the Langfuse Worker picks up the traces from S3 and ingests them into Clickhouse. This ensures that spikes in request load do not lead to timeouts or errors caused by database bottlenecks.
- Caching of API keys: API keys are cached in Redis. As a result, the database is not hit on every API call, and unauthorized requests can be rejected with very low resource usage.
- Caching of prompts (SDKs and API): Even though prompts are cached client-side by the Langfuse SDKs and only revalidated in the background (docs), they need to be fetched from Langfuse on first use. Thus, fast API response times matter. Prompts are kept in a read-through cache in Redis, so hot prompts can be served without hitting the database.
- OLAP database: All read-heavy analytical operations are offloaded to an OLAP database (Clickhouse) for fast query performance.
- Recoverability of events: All incoming tracing and evaluation events are persisted in S3/Blob Storage first. Only after successful processing, the events are written to the database. This ensures that even if the database is temporarily unavailable, the events are not lost and can be processed later.
- Background migrations: Long-running migrations that are required by an upgrade but not blocking for regular operations are offloaded to a background job. This massively reduces the downtime during an upgrade. Learn more here.
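To make the caching benefit above concrete, here is a minimal sketch of the read-through pattern in Python. The class, function, and key names are hypothetical, and an in-memory dict stands in for Redis; this is not Langfuse's actual implementation.

```python
import time

class ReadThroughCache:
    """Minimal read-through cache: serve from cache, fall back to the
    database loader on a miss, and populate the cache with a TTL."""

    def __init__(self, fetch_from_db, ttl_seconds=60):
        self._fetch = fetch_from_db   # loader called only on cache misses
        self._ttl = ttl_seconds
        self._store = {}              # stand-in for Redis: key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]           # cache hit: no database access
        value = self._fetch(key)      # cache miss: hit the database once
        self._store[key] = (value, time.time() + self._ttl)
        return value

# Usage: count loader calls to show repeated reads are served from cache.
db_calls = []
def load_prompt(name):
    db_calls.append(name)
    return f"prompt-body-for-{name}"

cache = ReadThroughCache(load_prompt, ttl_seconds=60)
cache.get("chat-v1")
cache.get("chat-v1")
print(len(db_calls))  # → 1
```

The same pattern applies to API keys: hot keys are answered from the cache, and the TTL bounds how stale a cached entry can get.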
New capabilities
Some of the more recent launches were dependent on the new architecture and thus only available on Langfuse Cloud for now. This changes today with v3.0.0. We do not plan to have a feature gap when self-hosting Langfuse (OSS, Pro, Enterprise).
- LLM-as-a-Judge Evaluators: Using the new worker container, evaluators can be run in a scalable and reliable way.
- Prompt Experiments: Run prompt experiments against datasets within Langfuse.
- Batch exports: Export large amounts of data in a single request from the UI as CSV/JSON.
Reasoning for the architectural changes
1. Why Clickhouse
We made the strategic decision to migrate our traces, observations, and scores tables from Postgres to Clickhouse. Both we and our self-hosters observed bottlenecks in Postgres when dealing with millions of rows of tracing data, both on ingestion and retrieval of information. Our core requirement was a database that could handle massive volumes of trace and event data with exceptional query speed and efficiency while also being available for free to self-hosters.
Limitations of Postgres
Initially, Postgres was an excellent choice due to its robustness, flexibility, and the extensive tooling available. As our platform grew, we encountered performance bottlenecks with complex aggregations and time-series data. The row-based storage model of PostgreSQL becomes increasingly inefficient when dealing with billions of rows of tracing data, leading to slow query times and high resource consumption.
Our requirements
- Analytical queries: All queries for our dashboards (e.g. the sum of LLM tokens consumed over time).
- Table queries: Finding tracing data based on filtering and ordering selected via tables in our UI.
- Select by ID: Quickly locating a specific trace by its ID.
- High write throughput with support for updates: Our tracing data can be updated from the SDKs, so we need the ability to update rows in the database.
- Self-hosting: We needed a database that is free to use for self-hosters, avoiding dependencies on specific cloud providers.
- Low operational effort: As a small team, we focus on building features for our users. We try to keep operational efforts as low as possible.
Why Clickhouse is great
- Optimized for Analytical Queries: ClickHouse is a modern OLAP database capable of ingesting data at high rates and querying it with low latency. It handles billions of rows efficiently.
- Rich feature set: Clickhouse offers different table engines, materialized views, several types of indices, and many integrations, which help us build quickly and achieve low-latency read queries.
- Self-hosting: Our self-hosters can use the official Clickhouse Helm Charts and Docker Images for deploying in the cloud infrastructure of their choice.
- Clickhouse Cloud: Clickhouse Cloud is a managed SaaS offering that reduces operational effort on our side.
When talking to other companies and looking at their code bases, we learned that Clickhouse is a popular choice these days for analytical workloads. Many modern observability tools, such as Signoz or Posthog, as well as established companies like Cloudflare, use Clickhouse for their analytical workloads.
Clickhouse vs. others
We think there are many great OLAP databases out there and are sure that we could have chosen an alternative and would also succeed with it. However, here are some thoughts on alternatives:
- Druid: Unlike Druid’s modular architecture, ClickHouse provides a more straightforward, unified instance approach. Hence, it is easier for teams to manage Clickhouse in production as there are fewer moving parts. This reduces the operational burden especially for our self-hosters.
- StarRocks: We think StarRocks is great but early. The vast amount of features in Clickhouse help us to remain flexible with our requirements while benefiting from the performance of an OLAP database.
Building an adapter to support multiple databases
We explored building a multi-database adapter to support Postgres for smaller self-hosted deployments. After talking to engineers and reviewing some of PostHog’s Clickhouse implementation, we decided against this path due to its complexity and maintenance overhead. This allows us to focus our resources on building user features instead.
2. Why Redis
We added a Redis instance to serve cache and queue use-cases within our stack. With its open source license, broad native support by major cloud vendors, and ubiquity in the industry, Redis was a natural choice for us.
3. Why S3/Blob Store
Observability data for LLM applications tends to contain large, semi-structured bodies of data representing inputs and outputs. We chose S3/Blob Store as a scalable, secure, and cost-effective solution to store these large objects. It allows us to store all incoming events for further processing and acts as a native backup solution, as the full state can be restored from the events stored there.
4. Why Worker Container
When processing observability data for LLM applications, there are many CPU-heavy operations that block the event loop in our Node.js backend, e.g. tokenization and other parsing of event bodies. To achieve high availability and low latencies across client applications, we decided to move the heavy processing into an asynchronous worker container. It accepts events from a Redis queue and ensures that they are eventually upserted into Clickhouse.
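The end-to-end ingestion flow described in this post (the web container persists raw events to blob storage and enqueues only a reference; the worker dequeues, loads the payload, and upserts it into the OLAP store) can be sketched as follows. All names are illustrative, and in-memory structures stand in for S3, Redis, and Clickhouse; this is not Langfuse's actual code.

```python
import json
from collections import deque

blob_store = {}   # stand-in for S3: object key -> raw payload
queue = deque()   # stand-in for the Redis queue: holds object keys only
olap_rows = []    # stand-in for the Clickhouse traces table

def ingest(event: dict) -> None:
    """Web container: persist the raw event, enqueue only a reference.

    The request returns as soon as the blob write and enqueue succeed,
    so load spikes do not translate into database pressure."""
    key = f"events/{event['id']}.json"
    blob_store[key] = json.dumps(event)
    queue.append(key)

def process_one() -> bool:
    """Worker container: pull a reference, load the payload, upsert it."""
    if not queue:
        return False
    key = queue.popleft()
    event = json.loads(blob_store[key])  # raw event survives in blob storage
    olap_rows.append(event)              # write into the OLAP store
    return True

ingest({"id": "trace-1", "name": "chat-completion"})
while process_one():
    pass
print(len(olap_rows))  # → 1
```

Because the raw payload stays in the blob store after processing, a failed or delayed worker run can simply be retried from the stored event, which is the recoverability property described above.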
How to upgrade from v2 to v3 (self-hosted)?
We have released an extensive migration guide for upgrading from Langfuse v2 to v3.
High-level upgrade steps:
- If on SDKs v1: Upgrade to SDKs v2 or later
- Deploy production-ready Langfuse v3 using one of the deployment guides
- Migrate data from v2 to v3 using the fully managed background migrations
- Use v3
Watch this video to get an understanding of the upgrade process:
Please reach out in case you have any questions while upgrading! We tried to make the upgrade as seamless as possible as there are thousands of teams who rely on Langfuse in production.
New self-hosting documentation
We used the v3 release as an opportunity to overhaul the self-hosting documentation. It includes all the information you need to know when self-hosting Langfuse and answers to many questions that came up in the community.
Feel free to add to the docs and share any feedback that you might have!
What’s next?
Over the next weeks, we will be adding more deployment templates for different cloud providers (tracking this here for AWS, Google Cloud, Azure). Let us know if any additional documentation would be helpful!
Thank you everyone for your feedback!
The v3 thread is by far the most extensive Langfuse discussion thread. Thanks to everyone who contributed to the thread and helped us shape this release.
A special thank you to those who tested v3 ahead of the stable release (v3.0.0-rc*) and provided detailed feedback on the documentation and upgrade process. Thanks for your help in making this process smoother for everyone else!
We are super excited to see what you will build with Langfuse v3 and how it unlocks many new roadmap items that were constrained by Langfuse v2!
👋 Greetings from the Langfuse HQ, big day here!
Core team celebrating v3 release