Importing data into Neo4j via CSV
At GrapheneDB, a common question we receive is how to import a CSV file into Neo4j. This article provides a step-by-step guide on importing data from a CSV file into Neo4j, as well as specific considerations for GrapheneDB users to ensure a smooth import process.
Neo4j provides a powerful tool for data import through the LOAD CSV Cypher clause. This feature allows for efficient ETL (Extract, Transform, Load) operations and supports a variety of data import scenarios:
- It can load a CSV file from a remote URI (e.g. S3, Dropbox, GitHub).
- It can perform multiple operations in a single statement.
- Input data is mapped directly into a complex graph structure as outlined by the user.
- The clause also supports runtime manipulation or computation of values, giving you flexibility in how data is processed during import.
- It can merge data into existing nodes, relationships, and properties rather than just adding new records to the store.
Prepare your graph data model
Before starting the import process, ensure that you have a clear graph data model. Define how your data will map onto the graph, identifying the nodes, relationships, and their corresponding properties.
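For example, the artists.csv file used in the import example later in this article could look like the following (hypothetical data, with the second column mapping to the name property of an Artist node and the third to its year property):
1,ABBA,1992
2,Roxette,1986
3,Europe,1979
4,The Cardigans,1992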
Optimize cache and heap configuration
To handle large datasets efficiently during import, on DS2 plans and higher you can configure the heap size to accommodate the entire dataset (the value for the page cache will be automatically calculated to use the remaining available memory).
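On a self-hosted instance these memory settings live in neo4j.conf (GrapheneDB manages this file for you). A minimal sketch, assuming Neo4j 3.x configuration keys:
# Heap used for query execution and transaction state
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
# Page cache used to cache the store files on disk
dbms.memory.pagecache.size=2g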
Estimate the required disk space
You can estimate the required disk space for your dataset using the guidelines in official Neo4j docs. For example, if you’re storing 100K nodes, 1M relationships, and each node/relationship has a fixed-size property (e.g., an integer):
- Node store: 100,000 * 15B = 1.5 MB
- Relationship store: 1,000,000 * 34B = 34 MB
- Property store: 1,100,000 * 41B = 45.1 MB
These calculated values should guide the minimum settings for your page cache configuration.
Set up indexes and constraints
To improve performance during and after the data import process, it’s important to set up indexes. Indexes speed up lookups, especially when using MERGE queries. Ensure that an index is created for each property used to locate nodes in these queries.
You can create an index using the CREATE INDEX clause, and you can find detailed information in the official Neo4j docs.
If a property needs to be unique, you can add a constraint, which will also implicitly create an index.
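For example, if nodes will be located by the name property of Artist nodes during the import, a minimal sketch (assuming Neo4j 3.x syntax; newer versions use the CREATE INDEX ... FOR (n:Label) ON (n.prop) form) looks like this:
// Speeds up MERGE lookups on the name property of Artist nodes
CREATE INDEX ON :Artist(name);
// If names must be unique, use a constraint instead; it implicitly creates the index
CREATE CONSTRAINT ON (a:Artist) ASSERT a.name IS UNIQUE;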
Loading and mapping the data
The most straightforward way to load data from a CSV file into Neo4j is by using the LOAD CSV statement. This command supports various options, such as accessing data by column header or index, configuring the field terminator character, and more. For detailed information on additional options, refer to the official Neo4j documentation.
Example of a LOAD CSV Cypher query:
LOAD CSV FROM "https://example.com/dummy-data/artists.csv"
AS row
MERGE (n:Artist {name: row[1], year: toInteger(row[2])})
where:
- FROM takes a STRING containing the path where the CSV file is located. Here the file is assumed to be located at https://example.com/dummy-data/artists.csv.
- The clause parses one row at a time, temporarily storing the current row in the variable specified with AS.
- The MERGE clause accesses the row variable to insert data into the database.
You can run this query manually through the Neo4j Browser UI, for instance.
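If your CSV file includes a header row, you can access columns by name instead of by index, which makes the mapping easier to read. A sketch of the same import, assuming the hypothetical file carries name and year headers:
// Columns are accessed by header name instead of numeric index
LOAD CSV WITH HEADERS FROM "https://example.com/dummy-data/artists.csv"
AS row
// Append FIELDTERMINATOR ';' after AS row for semicolon-separated files
MERGE (n:Artist {name: row.name, year: toInteger(row.year)})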
Considerations for GrapheneDB users
When loading data into your Neo4j instance on GrapheneDB, keep the following in mind:
- Heap and page cache configuration: On DS2 and higher plans, you can configure the heap size, and the page cache will be automatically adjusted to use the remaining memory. For DS1 plans, these settings are fixed and cannot be changed.
- neo4j-shell limitations: The neo4j-shell does not support authentication, which means it cannot be used to load data into a GrapheneDB-hosted instance or any instance that requires authentication.
- Neo4j Browser UI file access: When running commands from the Neo4j Browser UI, note that Neo4j cannot access your local filesystem. Instead, you must provide a publicly accessible URL, such as a file hosted on AWS S3.
- Handling large datasets: For larger datasets, it is recommended to run the import process locally. Once the local import is complete, you can import the resulting database into your GrapheneDB instance. If you do load data directly into a remote instance, batch the import as shown below.
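When running LOAD CSV against a remote instance, committing in batches prevents a single huge transaction from exhausting the heap. A minimal sketch using USING PERIODIC COMMIT (available up to Neo4j 4.x; later versions replace it with CALL { ... } IN TRANSACTIONS):
// Commit every 1000 rows instead of building one large transaction
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://example.com/dummy-data/artists.csv"
AS row
MERGE (n:Artist {name: row.name, year: toInteger(row.year)})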
Please feel free to Open a Support Case if you are having issues loading data into your GrapheneDB instance; we’ll be happy to help.