# CSV Import Tool Documentation

CSV is a universal and versatile data format used to store large quantities of
data. Each Memgraph database instance has a CSV import tool installed called
`mg_import_csv`. The CSV import tool should be used for the initial bulk
ingestion of data into the database. Upon ingestion, the CSV importer creates a
snapshot that the database uses to recover its state on its next startup.

If you are already familiar with the Neo4j bulk import tool, then using the
`mg_import_csv` tool should be easy. The CSV import tool is fully compatible
with the [Neo4j CSV
format](https://neo4j.com/docs/operations-manual/current/tools/import/). If you
already have a pipeline set up for Neo4j, you only need to replace
`neo4j-admin import` with `mg_import_csv`.

## CSV File Format

Each row of a CSV file represents a single entry that should be imported into
the database. Both nodes and relationships can be imported into the database
using CSV files.

Each set of CSV files must have a header that describes the data stored in the
CSV files. Each field in the CSV header has the format `<name>[:<type>]`, which
specifies the name and the type that should be used for that column. The type
is optional and defaults to `string` (see the Properties section below).

Fields in a CSV row must be separated by the delimiter, and each field can be
either quoted or unquoted. When a field is quoted, the first and last character
of the field *must* be the quote character. If a field isn't quoted and a quote
character appears in it, the quote character is treated as a regular character.
If a quote character appears inside a quoted field, it must be doubled in order
to escape it. Line feeds and carriage returns are ignored in the CSV file. The
file can't contain a NULL character.

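
The quoting rules above can be tried out with Python's `csv` module, which the CSV Parser Logic section names as the model for the importer's parser. The following is only a minimal sketch, not the importer's own code, and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample data: a quoted field containing the delimiter, a quoted
# field with a doubled (escaped) quote, and an unquoted field containing a quote.
data = (
    'id,comment\n'
    '1,"Hello, world"\n'
    '2,"She said ""hi"""\n'
    '3,5" disk\n'
)

rows = list(csv.reader(io.StringIO(data), delimiter=',', quotechar='"', strict=True))

for row in rows:
    print(row)
```

The second row keeps its comma because the field is quoted, the third row collapses each doubled quote into a single one, and the quote in the last row is kept as a regular character because the field itself is unquoted.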
## Properties

Both nodes and relationships can have properties added to them. When importing
properties, the CSV importer uses the name specified in the header of the
corresponding CSV column as the name of the property. A property is designated
by specifying one of the following types in the header:
- `integer`, `int`, `long`, `byte`, `short`: creates an integer property
- `float`, `double`: creates a float property
- `boolean`, `bool`: creates a boolean property
- `string`, `char`: creates a string property

When importing a boolean value, the CSV field should contain exactly the text
`true` to import a `True` boolean value. All other text values are treated as a
boolean value `False`.

If you want to import an array of values, you can do so by appending `[]` to any
of the above types. The values of the array are then determined by splitting
the raw CSV value using the array delimiter character.

For example, assuming that the array delimiter is `;`, the following CSV file:
```plaintext
first_name,last_name:string,number:integer,aliases:string[]
John,Doe,1,Johnny;Jo;J-man
Melissa,Doe,2,Mel
```

yields the same result as the following queries:
```plaintext
CREATE ({first_name: "John", last_name: "Doe", number: 1, aliases: ["Johnny", "Jo", "J-man"]});
CREATE ({first_name: "Melissa", last_name: "Doe", number: 2, aliases: ["Mel"]});
```

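
The header and type rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the importer's implementation: the helper names (`parse_header_field`, `convert_scalar`, `convert_value`) are hypothetical, type names are matched case-sensitively, and edge cases such as empty array values are not modeled.

```python
# Type names taken from the list above.
INTEGER_TYPES = {"integer", "int", "long", "byte", "short"}
FLOAT_TYPES = {"float", "double"}
BOOLEAN_TYPES = {"boolean", "bool"}
STRING_TYPES = {"string", "char"}


def parse_header_field(field):
    """Split '<name>[:<type>]' into (name, type); the type defaults to 'string'."""
    name, _, type_name = field.partition(":")
    return name, type_name or "string"


def convert_scalar(raw, type_name):
    """Convert one raw CSV value to the Python value for the given type."""
    if type_name in INTEGER_TYPES:
        return int(raw)
    if type_name in FLOAT_TYPES:
        return float(raw)
    if type_name in BOOLEAN_TYPES:
        return raw == "true"  # only the exact text 'true' yields True
    if type_name in STRING_TYPES:
        return raw
    raise ValueError(f"unsupported type: {type_name}")


def convert_value(raw, type_name, array_delimiter=";"):
    """Convert a raw value, splitting it first if the type ends with '[]'."""
    if type_name.endswith("[]"):
        base = type_name[:-2]
        return [convert_scalar(item, base) for item in raw.split(array_delimiter)]
    return convert_scalar(raw, type_name)


header = "first_name,last_name:string,number:integer,aliases:string[]".split(",")
fields = [parse_header_field(field) for field in header]

row = "John,Doe,1,Johnny;Jo;J-man".split(",")
properties = {name: convert_value(raw, t) for (name, t), raw in zip(fields, row)}
print(properties)
```

The printed dictionary corresponds to the properties of the first `CREATE` query in the example above.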
### Nodes

When importing nodes, several more types can be specified in the header of the
CSV file (along with all property types):
- `ID`: id of the node that should be used as the node ID when importing
  relationships
- `LABEL`: designates that the field contains additional labels for the node
- `IGNORE`: designates that the field should be ignored

The `ID` field type sets the internal ID that will be used for the node when
creating relationships. It is optional; nodes that don't have an ID value
specified will be imported, but can't be connected to any relationships. If you
want to save the ID value as a property in the database, just specify a name for
the ID (`user_id:ID`). If you just want to use the ID during the import, leave
out the name of the field (`:ID`). The `ID` field also supports creating
separate ID spaces. The ID space is specified with the ID space name appended
to the `ID` type in parentheses (`ID(user)`). That allows you to have the same
IDs (by value) in multiple different node files (for example, numbers from 1 to
N). The IDs in each ID space are treated as an independent set of IDs that
doesn't interfere with the IDs in another ID space.

The `LABEL` field type adds additional labels to the node. The value is treated
as an array type so that multiple additional labels can be specified for each
node. The value is split using the array delimiter (`--array-delimiter` flag).

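
One way to picture ID spaces is as a set of `(ID space, ID)` pairs, so that the same value can appear once per space. The sketch below is an assumption-laden illustration, not the importer's code: `parse_id_field` and `NodeRegistry` are hypothetical names, and the regular expression only covers the three header forms mentioned above.

```python
import re

# Matches an optional name, then 'ID', then an optional ID space in parentheses,
# e.g. ':ID', 'user_id:ID' or 'user_id:ID(user)'.
ID_FIELD = re.compile(r"^(?P<name>[^:]*):ID(?:\((?P<space>[^)]*)\))?$")


def parse_id_field(field):
    """Return (property_name_or_None, id_space_or_None) for an ID header field."""
    match = ID_FIELD.match(field)
    if match is None:
        raise ValueError(f"not an ID field: {field}")
    return match.group("name") or None, match.group("space")


class NodeRegistry:
    """Tracks imported node IDs; IDs only need to be unique within an ID space."""

    def __init__(self):
        self._seen = set()

    def add(self, id_space, node_id):
        """Register a node ID; return False if it duplicates an earlier one."""
        key = (id_space, node_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def __contains__(self, key):
        return key in self._seen


registry = NodeRegistry()
print(parse_id_field("user_id:ID(user)"))  # ('user_id', 'user')
print(registry.add("user", "1"))   # True
print(registry.add("movie", "1"))  # True: same value, different ID space
print(registry.add("user", "1"))   # False: duplicate within the 'user' space
```

A caller could treat a `False` result either as an error or as a node to skip, which is the choice the `--skip-duplicate-nodes` flag exposes.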
### Relationships

In order to be able to import relationships, you must import the nodes in the
same invocation of `mg_import_csv` that is used to import the relationships.

When importing relationships, several more types can be specified in the header
of the CSV file (along with all property types):
- `START_ID`: id of the start node that should be connected with the
  relationship
- `END_ID`: id of the end node that should be connected with the relationship
- `TYPE`: designates the type of the relationship
- `IGNORE`: designates that the field should be ignored

The `START_ID` field type sets the start node that should be connected with the
relationship to the end node. The field *must* be specified and the node ID
must be one of the node IDs that were specified in the node CSV files. The name
of this field is ignored. If the node ID is in an ID space, you can specify the
ID space in the same way as for the node ID (`START_ID(user)`).

The `END_ID` field type sets the end node that should be connected with the
relationship to the start node. The field *must* be specified and the node ID
must be one of the node IDs that were specified in the node CSV files. The name
of this field is ignored. If the node ID is in an ID space, you can specify the
ID space in the same way as for the node ID (`END_ID(user)`).

The `TYPE` field type sets the type of the relationship. Each relationship
*must* have a relationship type, but it doesn't necessarily need to be specified
in the CSV file; it can also be set externally for the whole CSV file. The name
of this field is ignored.

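
The constraints on `START_ID`, `END_ID` and `TYPE` can be summarized in a small sketch. Everything here is hypothetical and for illustration only: the `connect` function, its `default_type` and `skip_bad` parameters, and the sample data are not part of the importer. In particular, letting a per-row `TYPE` value take precedence over the externally supplied type is an assumption of this sketch, and ID spaces are left out for brevity.

```python
def connect(relationships, node_ids, default_type=None, skip_bad=False):
    """Yield (start, type, end) triples for relationships whose endpoints exist.

    relationships: iterable of dicts with 'START_ID', 'END_ID', optional 'TYPE'.
    node_ids: set of node IDs that were read from the node files.
    default_type: type used when a row has no 'TYPE' value (set externally).
    skip_bad: if True, skip rows with unknown endpoints instead of raising.
    """
    for rel in relationships:
        rel_type = rel.get("TYPE") or default_type  # assumed precedence
        if rel_type is None:
            raise ValueError("each relationship must have a type")
        missing = [key for key in ("START_ID", "END_ID") if rel[key] not in node_ids]
        if missing:
            if skip_bad:
                continue
            raise ValueError(f"unknown node for {missing[0]} '{rel[missing[0]]}'")
        yield rel["START_ID"], rel_type, rel["END_ID"]


node_ids = {"1", "2"}
relationships = [
    {"START_ID": "1", "END_ID": "2", "TYPE": "FRIENDS_WITH"},
    {"START_ID": "2", "END_ID": "1"},   # type comes from default_type
    {"START_ID": "1", "END_ID": "99"},  # END_ID is not a known node
]

edges = list(connect(relationships, node_ids, default_type="KNOWS", skip_bad=True))
print(edges)
```

With `skip_bad=False` the third row raises an error instead of being dropped, which mirrors the choice described for the `--skip-bad-relationships` flag.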
## CSV Importer Flags

The importer has many command line options that allow you to customize the way
the importer loads your data.

The two main flags that are used to specify the input CSV files are `--nodes`
and `--relationships`. A basic description of all flags is provided in the
table below, and a more detailed explanation of `--nodes` and `--relationships`
follows the table.

| Flag | Description |
|------|-------------|
| `--nodes` | Specifies the CSV files that contain the nodes. |
| `--relationships` | Specifies the CSV files that contain the relationships. |
| `--delimiter` | Sets the delimiter that should be used when splitting the CSV fields (default `,`) |
| `--quote` | Sets the quote character that should be used to quote a CSV field (default `"`) |
| `--array-delimiter` | Sets the delimiter that should be used when splitting array values (default `;`) |
| `--id-type` | Specifies which data type should be used to store the supplied <br /> node IDs when storing them as properties (if the field name is supplied). <br /> The supported values are either `STRING` or `INTEGER`. (default `STRING`) |
| `--ignore-empty-strings` | Instructs the importer to treat all empty strings as `Null` values <br /> instead of an empty string value (default `false`) |
| `--ignore-extra-columns` | Instructs the importer to ignore (instead of raising an error) <br /> all columns that appear after the last column specified in the CSV header. (default `false`) |
| `--skip-bad-relationships` | Instructs the importer to ignore all relationships (instead of raising an error) <br /> that refer to nodes that don't exist in the node files. (default `false`) |
| `--skip-duplicate-nodes` | Instructs the importer to ignore all duplicate nodes (instead of raising an error). <br /> Duplicate nodes are nodes that have an ID that is the same as another node that was already imported. (default `false`) |
| `--trim-strings` | Instructs the importer to trim all of the loaded CSV field values before processing them further. <br /> Trimming the fields removes all leading and trailing whitespace from them. (default `false`) |

The `--nodes` and `--relationships` flags are used to specify the CSV files that
contain the nodes and relationships. Multiple files can be specified in each
supplied `--nodes` or `--relationships` flag. Files that are supplied in one
`--nodes` or `--relationships` flag are treated by the CSV parser as one big
CSV file. Only the first line of the first file is parsed for the CSV header;
all other files (and rows) are treated as data. This is useful when you have a
very large CSV file and don't want to edit its first line just to add a CSV
header. Instead, you can specify the header in a separate file (e.g.
`users_header.csv` or `friendships_header.csv`) and keep the data intact in the
large file (e.g. `users.csv` or `friendships.csv`). You can also supply
additional labels for each set of node files.

The format of the `--nodes` flag is:
`[<label>[:<label>]...=]<file>[,<file>][,<file>]...`. Take note that only the
first `<file>` part is mandatory; all other parts of the flag value are
optional. Multiple `--nodes` flags can be supplied to describe multiple sets of
different node files. For the importer to work, at least one `--nodes` flag
*must* be supplied.

The format of the `--relationships` flag is:
`[<type>=]<file>[,<file>][,<file>]...`. Take note that only the first `<file>`
part is mandatory; all other parts of the flag value are optional. Multiple
`--relationships` flags can be supplied to describe multiple sets of different
relationship files. The `--relationships` flag isn't mandatory.

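
As an illustration of the two flag formats, the following sketch assembles the flag values and a command line in Python. The helper names (`nodes_flag`, `relationships_flag`), the `User` label and the `FRIENDS_WITH` type are hypothetical; the file names are the ones used as examples above. Passing each flag and its value as separate arguments is also an assumption of this sketch. The command is only printed, not executed.

```python
import shlex


def nodes_flag(files, labels=()):
    """Format a --nodes value: [<label>[:<label>]...=]<file>[,<file>]..."""
    if not files:
        raise ValueError("at least one file is required")
    prefix = ":".join(labels) + "=" if labels else ""
    return prefix + ",".join(files)


def relationships_flag(files, rel_type=None):
    """Format a --relationships value: [<type>=]<file>[,<file>]..."""
    if not files:
        raise ValueError("at least one file is required")
    prefix = rel_type + "=" if rel_type else ""
    return prefix + ",".join(files)


command = [
    "mg_import_csv",
    # Header file first, data file second: only the first line of the
    # first file is parsed as the CSV header.
    "--nodes", nodes_flag(["users_header.csv", "users.csv"], labels=["User"]),
    "--relationships",
    relationships_flag(["friendships_header.csv", "friendships.csv"], "FRIENDS_WITH"),
    "--array-delimiter", ";",
]

# shlex.join quotes arguments such as ';' so the line is safe to paste in a shell.
print(shlex.join(command))
```

Keeping the command as a list of arguments avoids shell quoting problems if it is later handed to something like `subprocess.run`.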
## CSV Parser Logic

The CSV parser uses the same logic as the standard Python CSV parser. The data
is parsed in the same way as the following snippet:

```python
import csv

# 'stream' is an open text file (or any iterable of lines) holding the CSV data
for row in csv.reader(stream, strict=True):
    ...  # process 'row'
```

Python uses 'excel' as the default dialect when parsing CSV files and the
default settings for the CSV parser are:
- delimiter: `','`
- doublequote: `True`
- escapechar: `None`
- lineterminator: `'\r\n'`
- quotechar: `'"'`
- skipinitialspace: `False`

The above snippet can be expanded to:

```python
import csv

# 'stream' is an open text file (or any iterable of lines) holding the CSV data
for row in csv.reader(stream, delimiter=',', doublequote=True,
                      escapechar=None, lineterminator='\r\n',
                      quotechar='"', skipinitialspace=False,
                      strict=True):
    ...  # process 'row'
```

For more information about the meaning of the above values, see:
https://docs.python.org/3/library/csv.html#csv.Dialect

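
To see what `strict=True` changes, the following sketch feeds the same malformed line to Python's `csv.reader` in lenient and in strict mode. It only demonstrates the behavior of the Python module; the sample data and the `parse` helper are made up for illustration, and the importer's own error messages may differ.

```python
import csv
import io

# Malformed input: text follows the closing quote of a quoted field.
malformed = 'id,name\n1,"John" Doe\n'


def parse(text, strict):
    """Parse CSV text into a list of rows with the given strictness."""
    return list(csv.reader(io.StringIO(text), strict=strict))


# Lenient mode silently accepts the malformed field.
print(parse(malformed, strict=False))

# Strict mode, as used in the snippets above, rejects it.
try:
    parse(malformed, strict=True)
except csv.Error as error:
    print("rejected:", error)
```

In lenient mode the stray text is simply appended to the field, so a strict parser is the safer choice for bulk imports where silent data changes are hard to notice.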
## Errors

1. [Skipping duplicate node with ID '{}'. For more details, visit:
   memgr.ph/csv-import-tool.](#error-1)
2. [Skipping bad relationship with START_ID '{}'. For more details, visit:
   memgr.ph/csv-import-tool.](#error-2)
3. [Skipping bad relationship with END_ID '{}'. For more details, visit:
   memgr.ph/csv-import-tool.](#error-3)

## Skipping duplicate node with ID {} {#error-1}

Duplicate nodes are nodes that have an ID that is the same as another node that
was already imported. You can instruct the importer to ignore all duplicate
nodes (instead of raising an error) by using the `--skip-duplicate-nodes` flag.

## Skipping bad relationship with START_ID {} {#error-2}

The node referenced by the relationship's `START_ID` doesn't exist in the node
files. You can instruct the importer to ignore all such bad relationships
(instead of raising an error) by using the `--skip-bad-relationships` flag.

## Skipping bad relationship with END_ID {} {#error-3}

The node referenced by the relationship's `END_ID` doesn't exist in the node
files. You can instruct the importer to ignore all such bad relationships
(instead of raising an error) by using the `--skip-bad-relationships` flag.