# Graph database

A graph database uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency: every element holds a direct pointer to its adjacent elements, so no index lookup is needed to move from one to the next. General graph databases that can store any graph are distinct from specialized graph databases such as triplestores and network databases.
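
Index-free adjacency can be illustrated with a minimal in-memory sketch. The `Node` class and the `connect` and `two_hops` helpers below are hypothetical names chosen for illustration and do not correspond to any particular product's API. Each node keeps direct references to its neighbours, so a traversal follows those references instead of consulting a global index.

```python
class Node:
    """A graph element that stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct pointers, no lookup table involved


def connect(a, b):
    """Link two nodes in both directions (an undirected edge)."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def two_hops(start):
    """Return nodes exactly two hops from start by following pointers only."""
    found = []
    for near in start.neighbors:
        for far in near.neighbors:
            is_new = all(far is not seen for seen in found)
            is_direct = any(far is n for n in start.neighbors)
            if far is not start and not is_direct and is_new:
                found.append(far)
    return found


alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
connect(alice, bob)
connect(bob, carol)

print([n.name for n in two_hops(alice)])  # ['Carol']
```

The cost of each hop in this sketch depends only on the number of neighbours of the node being visited, not on the total size of the graph.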

## Structure

Graph databases are based on graph theory. Graph databases employ nodes, properties, and edges. Nodes are very similar in nature to the objects that object-oriented programmers will be familiar with. Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of.

Properties are pertinent information that relate to nodes. For instance, if "Wikipedia" were one of the nodes, one might have it tied to properties such as "website", "reference material", or "word that starts with the letter 'w'", depending on which aspects of "Wikipedia" are pertinent to the particular database.

Edges are the lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. Most of the important information is stored in the edges. Meaningful patterns emerge when one examines the connections and interconnections of nodes, properties, and edges.
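
The three building blocks can be sketched in a few lines of Python. The `PropertyGraph` class, its method names, and the sample data are assumptions made for illustration. The sketch also follows the common property-graph convention of storing properties as key/value pairs on nodes and edges, which is one possible reading of the description above rather than a definitive model.

```python
class PropertyGraph:
    """A toy property graph: nodes and edges both carry key/value properties."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = []  # (source id, label, target id, dict of properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.edges.append((source, label, target, properties))

    def incoming(self, target):
        """List (source, label) pairs for every edge pointing at target."""
        return [(s, label) for s, label, t, _ in self.edges if t == target]


g = PropertyGraph()
g.add_node("wikipedia", kind="website", category="reference material")
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_edge("alice", "EDITS", "wikipedia", since=2009)
g.add_edge("bob", "READS", "wikipedia")
g.add_edge("alice", "KNOWS", "bob")

print(g.nodes["wikipedia"]["kind"])  # website
print(g.incoming("wikipedia"))       # [('alice', 'EDITS'), ('bob', 'READS')]
```

The node properties describe each entity in isolation, while the question "who interacts with Wikipedia, and how?" is answered entirely from the edges.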

## Properties

Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are better suited to managing ad hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements.
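
The difference in query shape can be sketched by answering the same two-hop question in both styles. The table name `knows`, its columns, and the sample data are hypothetical. The example uses Python's built-in `sqlite3` module and a plain dictionary, and it shows only how the queries are formed, not how they perform.

```python
import sqlite3

pairs = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]

# Relational form: one row per directed "knows" relationship.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO knows VALUES (?, ?)", pairs)

# Two hops require joining the table with itself; each extra hop adds a join.
rows = conn.execute(
    "SELECT k2.dst FROM knows AS k1 "
    "JOIN knows AS k2 ON k1.dst = k2.src "
    "WHERE k1.src = ? ORDER BY k2.dst",
    ("alice",),
).fetchall()
via_join = [row[0] for row in rows]

# Graph form: each node maps directly to the nodes it points at.
adjacency = {}
for src, dst in pairs:
    adjacency.setdefault(src, []).append(dst)

# Two hops are two nested walks over adjacency lists, with no join step.
via_traversal = sorted(
    far
    for near in adjacency.get("alice", [])
    for far in adjacency.get(near, [])
)

print(via_join)       # ['carol']
print(via_traversal)  # ['carol']
```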

Graph databases are a powerful tool for graph-like queries, for example computing the shortest path between two nodes in the graph. Other graph-like queries, such as computing a graph's diameter or detecting communities, can also be performed over a graph database in a natural way.
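
As an illustration of such a query, the following sketch finds a shortest path in an unweighted graph using breadth-first search. The function name `shortest_path` and the adjacency-dictionary representation are assumptions for this example. A real graph database would typically expose the operation through its own query language or API.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search for a path with the fewest edges, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor in previous:
                continue  # already reached by a path that is no longer
            previous[neighbor] = current
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

print(shortest_path(graph, "a", "e"))  # ['a', 'b', 'd', 'e']
print(shortest_path(graph, "e", "a"))  # None
```

Breadth-first search is sufficient here because every edge counts as one step; weighted graphs would need a different algorithm, such as Dijkstra's.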

## Graph database projects

The following is a list of several well-known graph database projects:
• AllegroGraph - a scalable, high-performance RDF and graph database.
• Bigdata - a highly scalable RDF/graph database capable of 10B+ edges on a single node or clustered deployment for very high throughput.
• CloudGraph - a disk- and memory-based, fully transactional .NET graph database that uses graphs and key/value pairs to store data.
• Cytoscape - open-source platform, outgrowth of bioinformatics
• DEX - a high-performance graph database from Sparsity Technologies, a technology transition company from DAMA-UPC
• Filament - graph persistence framework and associated toolkits based on a navigational query style.
• GraphBase - a customizable, distributed, small-footprint, high-performance graph store with a rich tool set from FactNexus
• Graphd - the proprietary backend of Freebase
• Horton - a graph database from Microsoft Research Extreme Computing Group (XCG) based on the cloud programming infrastructure Orleans
• HyperGraphDB - an open-source (LGPL) graph database supporting generalized hypergraphs where edges can point to other edges
• InfiniteGraph - a highly scalable, distributed and cloud-enabled commercial product with flexible licensing for startups.
• InfoGrid - an open-source / commercial (AGPLv3, free for small entities) graph database with web front end and configurable storage engines (MySQL, PostgreSQL, Files, Hadoop)
• Neo4j - an open-source / commercial (GPLv3 community edition, AGPLv3 advanced and enterprise edition) graph database
• OrientDB - a high-performance open source document-graph database
• OQGRAPH - Graph computation engine (GPLv2 licensed) for MySQL and Drizzle
• sones GraphDB - an open-source / commercial (AGPLv3) graph database and universal access layer (funded by Deutsche Telekom AG)
• VertexDB - high performance graph database server that supports automatic garbage collection.
• Virtuoso Universal Server - a clustered high performance and scalable RDF graph database server
• R2DF - a framework for ranked path queries over weighted RDF graphs

## Distributed Graph Processing (mostly in-memory-only)

• Angrapa - graph package in Hama, a bulk synchronous parallel (BSP) platform
• FlockDB - an open source distributed, fault-tolerant graph database based on MySQL and the Gizzard framework for managing Twitter-like graph data (single-hop relationships) at webscale.
• Giraph - a Graph processing infrastructure that runs on Hadoop (see Pregel).
• GoldenOrb - Pregel implementation built on top of Apache Hadoop
• Phoebus - Pregel implementation written in Erlang
• Pregel - Google's internal graph processing platform; details were released in an ACM paper.
• Trinity - Distributed in-memory graph engine under development at Microsoft Research Labs.
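
Several of the systems above follow the bulk synchronous parallel style of computation, in which vertices exchange messages in synchronized rounds called supersteps. The single-process sketch below propagates the largest value through a graph in that style. The function name `propagate_max` and the data layout are hypothetical, and the code is a simplified illustration of the idea rather than the API of Pregel or any listed project.

```python
def propagate_max(values, adjacency):
    """Run synchronous supersteps until every vertex holds the largest
    value that can reach it along the edges."""
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its out-neighbours.
    inbox = {vertex: [] for vertex in values}
    for vertex, value in values.items():
        for neighbor in adjacency.get(vertex, []):
            inbox[neighbor].append(value)

    # Later supersteps: a vertex that learns a larger value adopts it and
    # forwards it; otherwise it stays silent. The loop halts when no
    # messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in values}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbor in adjacency.get(vertex, []):
                    outbox[neighbor].append(values[vertex])
        inbox = outbox  # barrier: messages become visible next superstep
    return values


chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}

print(propagate_max(start, chain))  # {'a': 6, 'b': 6, 'c': 6, 'd': 6}
```

In a distributed setting the vertices would be partitioned across machines and the inbox and outbox would be network messages, but the superstep structure stays the same.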

## APIs and Graph Query/Programming Languages

• Blueprints - a Java API for Property Graphs from TinkerPop and supported by a few graph database vendors.
• Blueprints.NET - a C#/.NET API for generic Property Graphs.
• Cypher - a Property Graph Query Language developed by Neo4j.
• Gremlin - an open-source graph programming language that works over various graph database systems.
• Pacer - a Ruby dialect/implementation of the Gremlin graph traversal language.
• Pipes - a lazy dataflow framework written in Java that forms the foundation for various property graph traversal languages.
• Pipes.NET - a data flow framework for C#/.NET for processing generic graphs and Property Graphs.
• PYBlueprints - a Python API for Property Graphs.
• Rexster - an HTTP/REST API for accessing remote graph databases, supported by a few graph database vendors.

## See also

• NoSQL (concept)
• Document-oriented database
• Structured storage
• Object database
• Resource Description Framework (RDF) - framework to express node-edge graphs
• Graph transformation for a complementary topic (rule-based in-memory manipulation of graphs instead of transaction-safe persistence).