Show HN: HelixDB – A graph database built on object storage
Posted by GeorgeCurtis 6 days ago
Hey HN, it’s been just over a year since we launched HelixDB (https://news.ycombinator.com/item?id=43975423), a project a friend and I started in college. It’s an OLTP graph database built on object storage, with native vector search and full-text search (FTS).
Why graph, vector and FTS? Graph databases provide a natural cognitive model for data, vectors allow for a semantic understanding of the entities and relationships in the graph, and FTS provides more specific filtering. Many AI-driven applications attempt to combine all of these functionalities by stitching together multiple disconnected systems, but even then there’s no native way to perform joins or queries that span all systems. You still need to handle this logic at the application level.
Helix started as a graph DB, but we moved to a hybrid graph/vector approach after attempting to build an AI memory system, which led us down the GraphRAG and HybridRAG rabbit hole, where we would need separate graph and vector databases.
We knew scalability would be a challenge at each stage of our product's development. However, our initial focus this past year was to prove out the product through local deployments, so it was only meant to run on a single node. Scaling graph DBs remained a difficult and expensive problem we’d have to solve later. Other graph DBs commonly solve scaling either by duplicating entire datasets across distributed machines (extremely expensive per node) or by sharding the data.
Sharding databases is effective and affordable. However, graph data doesn’t have explicit partitions the way relational data does. For example, sharding a relational DB involves splitting up tables. In a graph DB, edges can span any of the partitions, and hopping across multiple machines when traversing nodes is inefficient and computationally expensive.
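To make the cross-partition problem concrete, here is a minimal Python sketch. It assumes a naive hash-based sharding scheme; `shard_of`, `cross_shard_edges`, and `path_network_hops` are hypothetical names for illustration and are not part of HelixDB.

```python
import zlib


def shard_of(node_id: str, num_shards: int) -> int:
    """Assign a node to a shard by hashing its ID (hypothetical scheme)."""
    return zlib.crc32(node_id.encode("utf-8")) % num_shards


def cross_shard_edges(edges, num_shards: int) -> int:
    """Count edges whose two endpoints land on different shards."""
    return sum(
        1
        for src, dst in edges
        if shard_of(src, num_shards) != shard_of(dst, num_shards)
    )


def path_network_hops(path, num_shards: int) -> int:
    """Count the steps of a traversal path that cross a shard boundary."""
    return sum(
        1
        for a, b in zip(path, path[1:])
        if shard_of(a, num_shards) != shard_of(b, num_shards)
    )


if __name__ == "__main__":
    # A tiny made-up graph: a triangle of users.
    edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]
    for shards in (1, 2, 4):
        print(shards, "shards ->", cross_shard_edges(edges, shards), "cross-shard edges")
```

With a single shard nothing crosses a boundary. With more shards, every cross-shard edge on a traversal path becomes a network round trip, which is the cost described above.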
Replicating graph DBs for high availability and better throughput drastically increases the operational cost of the DB, and there is still a limit to how far you can scale vertically. The workloads we’re used for require storing a huge amount of data for agents, where only a subset of that data is needed at any one time. So rather than keeping the whole thing in memory, we can store it all in object storage and fetch the bits we need when they’re needed.
Agents benefit from better context, which comes from more and better data (more relationships, etc.). By using S3 as the persistence/data layer, there is no limit to how big the graph can be or how many relationships you can have, and we can scale throughput horizontally by spinning up nodes and caching relevant subsets of the graph on each node. This way, you get extremely low latency for “hot” data and a p99 of ~100ms for writes and ~50ms for reads from cold storage (S3). Plus you get the benefit of dirt cheap storage.
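The hot/cold split can be illustrated with a read-through cache. This is only a sketch of the general pattern, not HelixDB's implementation: a plain dict stands in for object storage, LRU eviction is an assumed policy, and `ReadThroughCache` is a hypothetical name.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Keep recently used records in memory; fall back to a cold store on a miss."""

    def __init__(self, cold_store, capacity: int):
        self.cold_store = cold_store  # stand-in for object storage (here: a dict)
        self.capacity = capacity      # max number of records held in memory
        self.hot = OrderedDict()      # insertion order doubles as recency order
        self.cold_reads = 0           # how many times we had to go to cold storage

    def get(self, key):
        if key in self.hot:
            # Hot path: mark as most recently used and return from memory.
            self.hot.move_to_end(key)
            return self.hot[key]
        # Cold path: fetch from the backing store, then cache the result.
        value = self.cold_store[key]
        self.cold_reads += 1
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            # Evict the least recently used record.
            self.hot.popitem(last=False)
        return value


if __name__ == "__main__":
    # Made-up adjacency lists keyed by node ID.
    cold = {"a": ["b"], "b": ["c"], "c": []}
    cache = ReadThroughCache(cold, capacity=2)
    cache.get("a")
    cache.get("a")  # served from memory, no extra cold read
    print(cache.cold_reads)
```

Only the first read of a record pays the cold-storage cost. Repeated reads of the same working set stay in memory, which is why the subset of data an agent actually touches matters more than the total graph size.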
Workloads that HelixDB is currently supporting:
- Huge amounts of data (TBs) from which the agents need to search and traverse over
- Offering affordable graph storage for companies where cost of graph data is a bottleneck
- Consolidating multiple databases, enabling AI agents to have autonomy over companies, helping them become more autonomous
- AI memory
- Company brains
We’re currently working on our own generalised AI memory layer, which will use HelixDB under the hood and be completely open-source. We’re also finishing up pre-filtering for vector search, which will let you pre-filter based on relationships in the graph, metadata, and sub-graphs. Lastly, GA cloud will be available in the coming weeks.
If you want to run Helix locally (either on-disk or in-memory), you can find more info on our GitHub (https://github.com/HelixDB/helix-db) or in our docs (https://docs.helix-db.com/database/local-development). If you’re interested in getting started with our distributed cloud, please email us at founders@helix-db.com.
Many thanks! Comments and feedback welcome!
Comments
Comment by jesol 6 days ago
Comment by mentioum 6 days ago
What's your p99 like for multi hops?
Comment by GeorgeCurtis 6 days ago
Comment by mentioum 6 days ago
I'm more concerned about whether the p99s stay consistent when things get spiky.
dgraph is fine otherwise...
Comment by GeorgeCurtis 6 days ago
Comment by keynha 6 days ago
Comment by zw17 6 days ago
Comment by mentioum 6 days ago
We're fine with ClickHouse and Redshift for the OLAP work we do. I've been looking at ParaQuery lately if I really want to speed that up.
Comment by GeorgeCurtis 6 days ago
email us: founders@helix-db.com
Comment by GeorgeCurtis 6 days ago
We’re just two young founders sharing what we’ve been building, so I’ll take the drive-by competitor plug as a compliment :)
Definitely a different focus though. Helix is OLTP, built for operational graph + vector workloads, especially apps/agent memory where low-latency traversals and writes are concerned.
Comment by jauntywundrkind 6 days ago
Comment by fouc 6 days ago
Comment by GeorgeCurtis 5 days ago
Comment by nulltrace 6 days ago
Comment by thedreammachine 6 days ago
Comment by GeorgeCurtis 5 days ago
As long as the sub-graph you're trying to hop through is cached, there are no latency issues. However, if you need to do a deep hop query where all those nodes and edges are in cold storage, each hop costs ~50ms. So a 10-hop query would take ~0.5 seconds.
Again though, we find most people are using us for agentic workloads, so even in this worst-case scenario the LLMs make up the majority of the latency.
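The arithmetic above can be written as a small back-of-the-envelope estimator. It is a sketch that assumes hops run sequentially (each hop waits on the previous one, with no batching or prefetching) and treats cached hops as effectively free by default; `traversal_latency_ms` and `warm_hop_ms` are hypothetical names, and only the ~50ms cold figure comes from the comment.

```python
COLD_HOP_MS = 50.0  # approximate cold-read latency per hop, as quoted above


def traversal_latency_ms(
    total_hops: int,
    cached_hops: int = 0,
    cold_hop_ms: float = COLD_HOP_MS,
    warm_hop_ms: float = 0.0,  # assumption: cached hops are negligible
) -> float:
    """Estimate traversal latency when hops are resolved one after another."""
    if not 0 <= cached_hops <= total_hops:
        raise ValueError("cached_hops must be between 0 and total_hops")
    cold_hops = total_hops - cached_hops
    return cold_hops * cold_hop_ms + cached_hops * warm_hop_ms


if __name__ == "__main__":
    print(traversal_latency_ms(10))                 # all 10 hops cold
    print(traversal_latency_ms(10, cached_hops=6))  # only 4 hops cold
```

Ten fully cold hops come out at 500 ms, matching the ~0.5 seconds quoted. Any hops that hit the cache reduce the total proportionally.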
Comment by rgbrgb 6 days ago
can you host this yourself or do you need to use helix-cloud? the chat thing on the side seems to push me to helix-cloud but it looks like that starts at like $600/mo which is above my experimentation budget.
looking for a db for an agent memory application and i'd probably start with something that's just self-hosted / freeish. postgres is working ok but I want to start ingesting server and chat logs.
Comment by GeorgeCurtis 6 days ago
We aim to launch our GA cloud at the end of this month, which will be much more affordable.
Comment by Onawa 6 days ago
Comment by GeorgeCurtis 6 days ago
Soon you’ll be able to host it yourself AND have access to the source code
Comment by cjlm 6 days ago
Comment by dig1 6 days ago
Comment by cjlm 6 days ago
Comment by GeorgeCurtis 6 days ago
Comment by let_rec 6 days ago
What reassurance can you offer devs that are hesitant to try a new data-store?
Comment by GeorgeCurtis 5 days ago
I'd encourage them to start a local instance with claude/codex to build a mini project and see what it's like.
Comment by aitchnyu 5 days ago
Comment by GeorgeCurtis 20 hours ago
Comment by aitchnyu 2 hours ago
Comment by caust1c 6 days ago
Congrats on the launch!
Comment by GeorgeCurtis 6 days ago
We’re 100% committed to going back to open-source on an Apache 2.0 license as soon as possible. In the meantime, you can continue to deploy us completely for free, however you like, using the compiled docker container.
Comment by tao_oat 6 days ago
Comment by GeorgeCurtis 5 days ago
Comment by ymir_e 6 days ago
Looking forward to looking into the generalised AI memory layer when it comes out.
Comment by lennertjansen 5 days ago
Comment by GeorgeCurtis 19 hours ago
Comment by maxrumpf 6 days ago
Comment by GeorgeCurtis 6 days ago
Comment by brene 6 days ago
Comment by GeorgeCurtis 6 days ago
For vector search we have warm and cold p99s of approx 20ms and 400ms respectively. For FTS, warm and cold query p99s of approx 15ms and 250ms respectively.
Both of these benchmarks were run on 1m docs.
Comment by rajit 6 days ago
Comment by GeorgeCurtis 6 days ago
Comment by Bnjoroge 6 days ago
Comment by GeorgeCurtis 6 days ago
We're a graph database with vector and FTS capabilities. Our vector and FTS benchmarks are comparable with tpuffer, but you would primarily use us for building whole applications, knowledge graphs, or AI memory/retrieval. Anything that is relationship intense.
Let me know if this properly answers your question
Comment by Bnjoroge 5 days ago
Comment by raufakdemir 6 days ago
Comment by GeorgeCurtis 6 days ago
You can query HelixDB using JSON or directly in your programming language of choice by using our Rust, TypeScript, Go or Python SDKs. We’ve found AI is very good at working with the SDKs and JSON itself to query, making the development experience much better than before: https://docs.helix-db.com/database/querying
Comment by AmareshHebbar 4 days ago
Comment by busraugur 6 days ago
Comment by miningmai 5 days ago
Comment by lvca 6 days ago
Comment by sonixaep 5 days ago
Comment by NexoraDev 6 days ago