Importing data into Neo4j via CSV
At GrapheneDB, a common question we receive is how to import a CSV file into Neo4j. This article provides a step-by-step guide on importing data from a CSV file into Neo4j, as well as specific considerations for GrapheneDB users to ensure a smooth import process.
Neo4j provides a powerful tool for data import through the LOAD CSV Cypher clause. This feature allows for efficient ETL (Extract, Transform, Load) operations and supports a variety of data import scenarios:
- It can load a CSV file from a remote URI (e.g. S3, Dropbox, GitHub).
- It can perform multiple operations in a single statement.
- Input data is mapped directly into a complex graph structure as outlined by the user.
- The clause also supports runtime manipulation or computation of values, giving you flexibility in how data is processed during import.
- It can merge data into existing nodes, relationships, and properties rather than just adding new records to the store.
Prepare your graph data model
Before starting the import process, ensure that you have a clear graph data model. Define how your data will map onto the graph, identifying the nodes, relationships, and their corresponding properties.
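For example, the artists.csv file used in the import example later in this article could look like the following (hypothetical data, with the second column mapping to the name property of an Artist node and the third to its year property):
1,ABBA,1992
2,Roxette,1986
3,Europe,1979
4,The Cardigans,1992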
Optimize cache and heap configuration
To handle large datasets efficiently during import, on DS2 plans and higher you can configure the heap size to accommodate the entire dataset (the value for the page cache will be automatically calculated to use the remaining available memory).
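On a self-hosted instance these memory settings live in neo4j.conf (GrapheneDB manages this file for you). A minimal sketch, assuming Neo4j 3.x configuration keys:
# Heap used for query execution and transaction state
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
# Page cache used to cache the store files on disk
dbms.memory.pagecache.size=2g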
Estimate the required disk space
You can estimate the required disk space for your dataset using the guidelines in official Neo4j docs. For example, if you’re storing 100K nodes, 1M relationships, and each node/relationship has a fixed-size property (e.g., an integer):
- Node store: 100,000 * 15B = 1.5 MB
- Relationship store: 1,000,000 * 34B = 34 MB
- Property store: 1,100,000 * 41B = 45.1 MB
These calculated values should guide the minimum settings for your page cache configuration.
Set up indexes and constraints
To improve performance during and after the data import process, it’s important to set up indexes. Indexes speed up lookups, especially when using MERGE queries. Ensure that an index is created for each property used to locate nodes in these queries.
You can create an index using the CREATE INDEX clause, and you can find detailed information in the official Neo4j docs.
If a property needs to be unique, you can add a constraint, which will also implicitly create an index.
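For example, if nodes will be located by the name property of Artist nodes during the import, a minimal sketch (assuming Neo4j 3.x syntax; newer versions use the CREATE INDEX ... FOR (n:Label) ON (n.prop) form) looks like this:
// Speeds up MERGE lookups on the name property of Artist nodes
CREATE INDEX ON :Artist(name);
// If names must be unique, use a constraint instead; it implicitly creates the index
CREATE CONSTRAINT ON (a:Artist) ASSERT a.name IS UNIQUE;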
Loading and mapping the data
The most straightforward way to load data from a CSV file into Neo4j is by using the LOAD CSV statement. This command supports various options, such as accessing data by column header or index, configuring the field terminator character, and more. For detailed information on additional options, refer to the official Neo4j documentation.
Example of a LOAD CSV Cypher query:
LOAD CSV FROM "https://example.com/dummy-data/artists.csv"
AS row
MERGE (n:Artist {name: row[1], year: toInteger(row[2])})
where:
- FROM takes a STRING containing the path where the CSV file is located. Here the file is assumed to be located at https://example.com/dummy-data/artists.csv.
- The clause parses one row at a time, temporarily storing the current row in the variable specified with AS.
- The MERGE clause accesses the row variable to insert data into the database.
You can run this query manually through the Neo4j Browser UI, for instance.
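If your CSV file includes a header row, you can access columns by name instead of by index, which makes the mapping easier to read. A sketch of the same import, assuming the hypothetical file carries name and year headers:
// Columns are accessed by header name instead of numeric index
LOAD CSV WITH HEADERS FROM "https://example.com/dummy-data/artists.csv"
AS row
// Append FIELDTERMINATOR ';' after AS row for semicolon-separated files
MERGE (n:Artist {name: row.name, year: toInteger(row.year)})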
Considerations for GrapheneDB users
When loading data into your Neo4j instance on GrapheneDB, keep the following in mind:
- Heap and page cache configuration: On DS2 and higher plans, you can configure the heap size, and the page cache will be automatically adjusted to use the remaining memory. For DS1 plans, these settings are fixed and cannot be changed.
- neo4j-shell limitations: The neo4j-shell does not support authentication, which means it cannot be used to load data into a GrapheneDB-hosted instance or any instance that requires authentication.
- Neo4j Browser UI file access: When running commands from the Neo4j Browser UI, note that Neo4j cannot access your local filesystem. Instead, you must provide a publicly accessible URL, such as a file hosted on AWS S3.
- Handling large datasets: For larger datasets, it is recommended to run the import process locally. Once the local import is complete, you can import the resulting database into your GrapheneDB instance. If you do load data directly into a remote instance, batch the import as shown below.
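When running LOAD CSV against a remote instance, committing in batches prevents a single huge transaction from exhausting the heap. A minimal sketch using USING PERIODIC COMMIT (available up to Neo4j 4.x; later versions replace it with CALL { ... } IN TRANSACTIONS):
// Commit every 1000 rows instead of building one large transaction
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "https://example.com/dummy-data/artists.csv"
AS row
MERGE (n:Artist {name: row.name, year: toInteger(row.year)})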
Please feel free to Open a Support Case if you are having issues loading data into your GrapheneDB instance; we’ll be happy to help.