Data structure and format

Data path and naming convention
Raw data file format
Matlab data structures
Data containers
Example dataset

Data path and naming convention

CellExplorer uses one main path/directory for each session, called the basepath. Each session is identified by a basename. The data in the basepath should follow the naming convention basename.*, e.g. basename.dat and basename.session.mat. The clustered data can be in a subdirectory, defined in session.spikeSorting.relativePath, or in the basepath. All files generated by CellExplorer will be saved to the basepath. The raw ephys data should also be stored in the basepath. It is recommended, but not required, to follow the directory convention /path_to_data/basename/, meaning that the directory containing the dataset is named basename.
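The naming convention can be illustrated with a short Python sketch. It is not part of CellExplorer; the helper name session_file and the example directory are hypothetical, and the sketch assumes the recommended layout in which the directory is named after the basename.

```python
from pathlib import Path

def session_file(basepath, suffix, basename=None):
    """Return the path basepath/basename.suffix.

    If basename is not given, it is taken from the directory name,
    following the recommended /path_to_data/basename/ layout.
    """
    basepath = Path(basepath)
    if basename is None:
        basename = basepath.name
    return basepath / f"{basename}.{suffix}"

# Hypothetical session directory, used only for illustration.
basepath = "/path_to_data/rec1"
dat_file = session_file(basepath, "dat")              # raw data file
session_mat = session_file(basepath, "session.mat")   # session metadata
print(dat_file.name, session_mat.name)
```

When the directory name differs from the basename, the basename has to be passed explicitly.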

Raw data file format

CellExplorer supports raw binary data files (.dat files). This format is also supported by IntanTech, OpenEphys, KiloSort, Phy, NeuroSuite, Spyking Circus, Klustakwik, and many other tools.

Suppose you have N channels:

\[c_1, c_2, ... , c_N\]

And if you assume that $c_i(t)$ is the value of channel $c_i$ at time $t$, then your data file should be a raw file with values:

\[c_1(1), c_2(1), ... , c_N(1), c_1(2), ... , c_N(2), ... , c_N(T)\]

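This is simply the flattened version of your recordings matrix, which has size $N \times T$: all channel values for the first time point come first, followed by all channel values for the second time point, and so on.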
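As a worked consequence of this layout, the position of any value in the file can be computed directly. With 1-based indices as above, $k(i, t)$ is the position of $c_i(t)$ in the sequence; the symbol $b$ (bytes per value, e.g. 2 for int16) is introduced here only for illustration:

\[k(i, t) = (t - 1) N + i, \qquad \mathrm{offset}(i, t) = (k(i, t) - 1) \, b\]

For example, $c_1(1)$ sits at position 1 (byte offset 0) and $c_1(2)$ at position $N + 1$.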

The related metadata is specified in the session struct:

Precision (the numeric data type): session.extracellular.precision. The default precision is int16, but other numeric data types that MATLAB supports can also be used: double, single, uint16, int32, int64.
Number of channels: session.extracellular.nChannels.
Sampling rate: session.extracellular.sr.
Least significant bit: session.extracellular.leastSignificantBit. The range/precision in µV. Intan system: 0.195µV/bit.
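How these metadata fields are used when reading a raw file can be sketched in Python with the standard library only. This is an illustration, not CellExplorer code: the function read_dat and the TYPECODES table are hypothetical, only a subset of precisions is mapped, and native byte order is assumed.

```python
import array
import os
import tempfile

# Assumed mapping from a few precision names to `array` type codes.
TYPECODES = {"int16": "h", "uint16": "H", "single": "f", "double": "d"}

def read_dat(path, n_channels, precision="int16", lsb_uv=0.195):
    """Read an interleaved binary file; return one list of µV values per channel."""
    values = array.array(TYPECODES[precision])
    n_values = os.path.getsize(path) // values.itemsize
    n_samples = n_values // n_channels
    with open(path, "rb") as f:
        values.fromfile(f, n_samples * n_channels)
    # Channel c holds every n_channels-th value, starting at offset c.
    return [[v * lsb_uv for v in values[c::n_channels]] for c in range(n_channels)]

# Demo with a tiny synthetic file: 2 channels, 3 samples.
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as f:
    array.array("h", [1, 10, 2, 20, 3, 30]).tofile(f)
channels = read_dat(f.name, n_channels=2, lsb_uv=1.0)
os.remove(f.name)
print(channels)  # [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
```

Loading everything into Python lists is only practical for small files; for real recordings a memory-mapped array would be the usual choice.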

CellExplorer also uses a basename.lfp file: a low-pass filtered and down-sampled version of the raw data file for LFP analysis (for efficient data analysis and data storage; typically down-sampled to 1250 Hz). The lfp file is automatically generated in the pipeline from the raw data file, using the script ce_LFPfromDat. The sampling rate is specified in the session struct (session.extracellular.srLfp). The LFP file has the same channel count and scaling as the dat file.
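Because the dat and lfp files share the same channel count, the number of samples and the duration of each follow from the file size alone. The Python sketch below shows this arithmetic; the function name, the byte-size table, and the example numbers (64 channels, 20 kHz) are assumptions made for illustration.

```python
BYTES_PER_VALUE = {"int16": 2, "uint16": 2, "int32": 4, "int64": 8,
                   "single": 4, "double": 8}

def n_samples_from_size(file_size_bytes, n_channels, precision="int16"):
    """Number of samples per channel implied by the size of a raw binary file."""
    bytes_per_sample = n_channels * BYTES_PER_VALUE[precision]
    if file_size_bytes % bytes_per_sample:
        raise ValueError("file size is not a whole number of samples")
    return file_size_bytes // bytes_per_sample

# Hypothetical recording: 64 int16 channels, 10 s at 20 kHz, LFP at 1250 Hz.
dat_samples = n_samples_from_size(64 * 2 * 200_000, n_channels=64)
lfp_samples = n_samples_from_size(64 * 2 * 12_500, n_channels=64)
print(dat_samples / 20_000, lfp_samples / 1250)  # 10.0 10.0
```

Both files should imply the same duration; a mismatch usually points to a wrong nChannels, sr, or precision value.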

Matlab data structures

Each type of data is saved in its own MATLAB structure. Please see the list of data containers in the next section.

Session metadata

A MATLAB struct session stored in a .mat file: basename.session.mat. The session struct contains all session-level metadata. It can be generated using sessionTemplate.m and inspected with gui_session.m. The basename.session.mat file should be stored in the basepath. It is structured as defined below:

general
- name : name of session (also referred to as the basename of the session)
- basePath : basepath of session
- investigator : investigator of the session
- projects : projects the session belongs to
- date : date the session was recorded
- time : start time of the session
- location : location where the session took place
- experimenters : who performed the experiments
- duration : the total duration of the session (seconds)
- sessionType : type of session (chronic/acute)
- notes : any notes
animal
- name : name of animal
- sex : sex of animal
- species : species of animal
- strain : strain of animal
- geneticLine : genetic line of animal
- probeImplants : struct-array with probe implants information
  - probe : Name of probe implanted
  - supplier : Name of probe supplier
  - descriptiveName : Descriptive name of probe
  - brainRegion : brain region
  - ap : Anterior-Posterior coordinate (mm)
  - ml : Medial-Lateral coordinate (mm)
  - depth : Implantation depth (mm)
  - ap_angle : ap-angle of probe implantation (degrees)
  - ml_angle : ml angle of probe implantation (degrees)
  - rotation : rotation of probe (degrees)
- opticFiberImplants : struct-array with optic fiber implants information
  - opticFiber : Name of optic fiber implanted
  - supplier : Name of optic fiber supplier
  - brainRegion : brain region
  - ap : Anterior-Posterior coordinate (mm)
  - ml : Medial-Lateral coordinate (mm)
  - depth : Implantation depth (mm)
  - ap_angle : ap-angle of optic fiber implantation (degrees)
  - ml_angle : ml angle of optic fiber implantation (degrees)
  - notes : notes
- surgeries : struct-array with surgery information
  - date : date of surgery
  - start_time : start time of surgery
  - end_time : end time of surgery
  - weight : weight of animal before surgery
  - type_of_surgery : type of surgery (Chronic or Acute)
  - room : room of surgery
  - persons_involved : Persons involved in the surgery
  - anesthesia : anesthesia used during surgery (e.g. Isoflurane)
  - analgesics : analgesics applied after surgery
  - antibiotics : antibiotics applied after surgery
  - complications : complications
  - notes : notes about the surgery
- virusInjections : struct-array with virus injections information
  - virus : Name of virus injected
  - brainRegion : brain region
  - injection_schema : injection schema
  - injection_volume : injection volume
  - injection_rate : injection rate
  - ap : Anterior-Posterior coordinate (mm)
  - ml : Medial-Lateral coordinate (mm)
  - depth : Implantation depth (mm)
  - ap_angle : ap-angle of virus injection (degrees)
  - ml_angle : ml angle of virus injection (degrees)
  - notes : notes
epochs
- name
- behavioralParadigm
- builtMaze
- mazeType
- manipulations
- startTime
- stopTime
extracellular
- equipment : hardware used to acquire the data
- fileFormat : file format of the raw data file (e.g. dat or bin)
- fileName : name of the raw data file (can also be a relative path to the raw data file)
- sr : sampling rate
- nChannels : number of channels
- nSamples : number of samples
- nElectrodeGroups : number of electrode groups
- electrodeGroups (struct) : struct with definition of electrode groups (1-indexed)
  - channels : numeric cell array with list of channels (electrodes) in electrode groups (1-indexed)
  - label : char cell array with labels of electrode groups
- nSpikeGroups : number of spike groups
- spikeGroups (struct) : struct with definition of spike groups (1-indexed)
  - channels : numeric cell array with list of channels (electrodes) in spike groups (1-indexed)
  - label : char cell array with labels of spike groups
- precision : Numeric data type. The default precision is int16, but all numeric data types that Matlab supports are included: double, single, uint16, int32, int64.
- leastSignificantBit : range/precision in µV. Intan system: 0.195µV/bit
- srLfp : sampling rate of the LFP file
- chanCoords : struct with channel coordinates (x and y coordinates)
  - x : x coordinates of channels (1 x nChannels)
  - y : y coordinates of channels (1 x nChannels)
  - source : Source of channel coordinates
  - layout : Probe layout (e.g. linear,staggered,poly2,poly3,poly4,poly5)
  - shankSpacing : Shank spacing (in µm)
  - verticalSpacing : Vertical spacing between channels (in µm)
brainRegions
- regionAcronym : e.g. CA1 or HIP, Allen institute Atlas
  - brainRegion
  - channels : list of channels
  - electrodeGroups : list of electrode groups
channelTags
- tagName (e.g. Theta, Cortical, Ripple, Bad)
  - channels : list of channels (1-indexed)
  - electrodeGroups : list of electrode groups (1-indexed)
behavioralTracking
- equipment : hardware used to acquire the data
- filenames : file names containing the tracking
- framerate : frame rate of the tracking
- notes
inputs
- inputTag : unique name, e.g. temperature, stimPulses, OptitrackTTL
  - equipment : hardware used to acquire the data
  - inputType : adc, aux, dig, dat, …
  - channels : list of channels (1-indexed)
  - description
analysisTags
- tagName: the numeric or string values saved in the tag
spikeSorting
- method : KiloSort, KiloSort2, SpyKING CIRCUS, Klustakwik, MaskedKlustakwik, MountainSort, IronClust, MClust, UltraMegaSort2000
- format : Phy, KiloSort, SpyKING CIRCUS, Klustakwik, KlustaViewa, Neurosuite, MountainSort, IronClust, ALF, AllenSDK, MClust, UltraMegaSort2000
- relativePath : relative to base/sessionpath
- channels : list of channels selected.
- spikeSorter : person who performed the manual spike sorting
- notes
- cellMetrics : (boolean) if the cell metrics have been run
- manuallyCurated : (boolean) if manual curation has been performed
timeseries
- typeTag : unique type (adc, aux, dat, dig …)
  - fileName : file name
  - precision : e.g. int16
  - nChannels : number of channels
  - sr : sampling rate
  - nSamples : number of samples
  - leastSignificantBit : range/precision in µV. Intan system: 0.195µV/bit
  - equipment : hardware used to acquire the data

Spikes

A MATLAB struct spikes stored in a .mat file: basename.spikes.cellinfo.mat. It can be generated with loadSpikes.m. The processing module ProcessCellMetrics.m uses the script loadSpikes.m to load spike data from various pipelines/data formats, including KiloSort, Phy, and Neurosuite, and saves it to a spikes struct, basename.spikes.cellinfo.mat, in the basepath. The struct has the following fields:

ts: a 1xN cell-struct for N units each containing a [nSpikes x 1] vector with nSpikes spike events in samples.
times: a 1xN cell-struct for N units each containing a [nSpikes x 1] vector with nSpikes spike events in seconds.
cluID: a 1xN vector with inherited IDs from the applied clustering algorithm.
UID: a 1xN vector with values 1:N.
shankID: a 1xN vector containing the corresponding shank/electrode-group of each unit (1-indexed).
maxWaveformCh: a 1xN vector with the channel for the maximum waveform for the units (0-indexed)
maxWaveformCh1: a 1xN vector with the channel for the maximum waveform for the units (1-indexed)
total: a 1xN vector with the total number of spikes for each unit.
peakVoltage: a 1xN vector with spike waveform amplitude (µV).
filtWaveform: a 1xN cell-struct with spike waveforms from maxWaveformChannel (µV).
filtWaveform_std: a 1xN cell-struct with the std of the spike waveforms (µV).
rawWaveform: a 1xN cell-struct with raw spike waveforms (µV).
rawWaveform_std: a 1xN cell-struct with std of the raw spike waveforms (µV).
timeWaveform: a 1xN cell-struct with spike timestamps for the waveforms (ms).
maxWaveform_all: a 1xN vector with channel indexes for the _all waveforms for the units (1-indexed)
rawWaveform_all: a 1xN cell-struct with raw spike waveforms from maxWaveform_all (µV).
filtWaveform_all: a 1xN cell-struct with filtered spike waveforms from maxWaveform_all (µV).
timeWaveform_all: a 1xN cell-struct with spike timestamps for the _all waveforms (ms).
numcells: number of cells.
basename: name of the session (string).
spindices: a Kx2 matrix where the first column contains the K spike times for all units and the second column contains the unit index for each spike.
processinginfo: a substruct with information about how the spikes were generated, including the name of the function, version, date and the parameters.

Any extra field can be added with info about the units, e.g. the theta phase of each spike for the units, or the position/speed of the animal for each spike.
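The relation between ts, times, and spindices can be sketched in Python. The function names are hypothetical, and the use of 1-based unit indices in the second column is an assumption based on UID taking the values 1:N.

```python
def samples_to_seconds(ts, sr):
    """Convert per-unit spike sample indices (ts) to seconds (times)."""
    return [[s / sr for s in unit] for unit in ts]

def build_spindices(times):
    """Merge per-unit spike times into (spike time, unit index) rows, sorted by time."""
    rows = [(t, uid)
            for uid, unit_times in enumerate(times, start=1)
            for t in unit_times]
    return sorted(rows)

ts = [[20000, 60000], [40000]]            # two hypothetical units, in samples
times = samples_to_seconds(ts, sr=20000)  # [[1.0, 3.0], [2.0]]
spindices = build_spindices(times)
print(spindices)  # [(1.0, 1), (2.0, 2), (3.0, 1)]
```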

Cell metrics

The cell metrics are kept in a cell_metrics struct as described here. The cell metrics are stored in: basename.cell_metrics.cellinfo.mat.

Events

This is a data container for event data. A MATLAB struct eventName stored in a .mat file: basename.eventName.events.mat with the following fields:

timestamps: [nEvents x 2] matrix with intervals for the nEvents in units of seconds.
peaks: Event time for the peak of each event in seconds (nEvents x 1).
amplitude: amplitude of each event (nEvents x 1).
amplitudeUnits: specify the units of the amplitude vector.
eventID: numeric ID for classifying various event types (nEvents x 1).
eventIDlabels: cell array with labels for classifying various event types defined in eventID (cell array, nEvents x 1).
eventIDbinary: boolean specifying if eventID should be read as binary values (default: false).
center: center time-point of event (in seconds; calculated from timestamps; nEvents x 1).
duration: duration of event (in seconds; calculated from timestamps; nEvents x 1).
detectorinfo: info about how the events were detected.
- detectorname: Name of detector script.
- detectiondate: Detection date.
- detectionintervals: Detection intervals.
- detectionparms: Detection parameters.
- detectionchannel: Detection channel (0-indexed).
- detectionchannel1: Detection channel (1-indexed).

The *.events.mat files should be stored in the basepath. Any events files located in the basepath will be detected in the pipeline ProcessCellMetrics.m and average PSTHs will be generated.
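The center and duration fields are derived from the timestamps intervals. A minimal Python sketch of that derivation follows; taking the center as the midpoint of each interval is an assumption, and the function name is hypothetical.

```python
def center_and_duration(timestamps):
    """Derive per-event center and duration from [start, stop] intervals (seconds)."""
    center = [(start + stop) / 2 for start, stop in timestamps]
    duration = [stop - start for start, stop in timestamps]
    return center, duration

timestamps = [[1.0, 1.5], [4.0, 4.25]]  # two hypothetical events
center, duration = center_and_duration(timestamps)
print(center)    # [1.25, 4.125]
print(duration)  # [0.5, 0.25]
```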

Manipulations

This is a data container for manipulation data. A MATLAB struct manipulationName stored in a .mat file: basename.manipulationName.manipulation.mat with the following fields:

timestamps: [nEvents x 2] matrix with intervals for the nEvents in units of seconds.
peaks: Event time for the peak of each event in seconds (nEvents x 1).
amplitude: amplitude of each event (nEvents x 1).
amplitudeUnits: specify the units of the amplitude vector.
eventID: numeric ID for classifying various event types (nEvents x 1).
eventIDlabels: cell array with labels for classifying various event types defined in stimID (cell array, nEvents x 1).
eventIDbinary: boolean specifying if eventID should be read as binary values (default: false).
center: center time-point of event (in seconds; calculated from timestamps; nEvents x 1).
duration: duration of event (in seconds; calculated from timestamps; nEvents x 1).
detectorinfo: info about how the events were detected.

The *.manipulation.mat files should be stored in the basepath. Any manipulation files located in the basepath will be detected in the pipeline (ProcessCellMetrics.m) and an average PSTH will be generated. Events and manipulation files are similar in content, but only manipulation intervals are excluded in the pipeline.
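What excluding manipulation intervals means for spike data can be sketched in Python. This is a generic illustration with a hypothetical function name; how ProcessCellMetrics.m performs the exclusion internally is not described here, and treating the interval bounds as inclusive is an assumption.

```python
def exclude_intervals(spike_times, intervals):
    """Drop spike times (seconds) that fall inside any [start, stop] interval."""
    return [t for t in spike_times
            if not any(start <= t <= stop for start, stop in intervals)]

spike_times = [0.5, 1.2, 2.0, 3.7]     # hypothetical spike times
intervals = [[1.0, 1.5], [3.5, 4.0]]   # hypothetical manipulation timestamps
kept = exclude_intervals(spike_times, intervals)
print(kept)  # [0.5, 2.0]
```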

Channels

This is a data container for channel-wise data. A MATLAB struct ChannelName stored in a .mat file: basename.ChannelName.channelinfo.mat with the following optional fields:

data: a [nSamples x nChannels] data container (optional).
channel: a [nChannels x 1] vector containing a list of channels (1-indexed).
channelClass: a [nChannels x 1] cell with classification assigned to each channel (char).
processinginfo: a struct with information about how the mat file was generated including the name of the function, version, date and parameters.
detectorinfo: If the channelinfo struct is based on determined events, detectorinfo contains info about how the event was processed.

The *.channelinfo.mat files should be stored in the basepath.

Channel coordinates

chanCoords : channel coordinates struct (probe layout) with the x and y position of each recording channel, saved to basename.chanCoords.channelinfo.mat with the following fields:

x : x position of each channel (in µm; [nChannels x 1]).
y : y position of each channel (in µm; [nChannels x 1]).
source : source of the channel coordinates (optional).
layout : probe layout, e.g. linear, staggered, poly2, poly3, poly4, poly5 (optional).
shankSpacing : shank spacing (in µm; optional).
channel : Channel list ([nChannels x 1]; optional).
verticalSpacing : Vertical spacing between channels (in µm).

This works as a simple 2D representation of recordings and will help you determine the location of your neurons. It is also used to determine the spike amplitude length constant of the spike waveforms across channels.
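To make the chanCoords fields concrete, the Python sketch below builds x and y values for a multi-shank probe with a linear layout from shankSpacing and verticalSpacing. The function name, the channel ordering, and the sign convention (deeper sites get more negative y) are assumptions for illustration, not a description of how CellExplorer generates coordinates.

```python
def linear_probe_coords(n_shanks, channels_per_shank, shank_spacing, vertical_spacing):
    """Build a chanCoords-like dict (positions in µm) for a linear multi-shank probe."""
    x, y = [], []
    for shank in range(n_shanks):
        for site in range(channels_per_shank):
            x.append(shank * shank_spacing)
            y.append(-site * vertical_spacing)  # assumed: deeper sites are more negative
    return {"x": x, "y": y, "layout": "linear",
            "shankSpacing": shank_spacing, "verticalSpacing": vertical_spacing}

# Hypothetical probe: 2 shanks, 2 sites each, 200 µm between shanks, 20 µm between sites.
coords = linear_probe_coords(2, 2, shank_spacing=200, vertical_spacing=20)
print(coords["x"])  # [0, 0, 200, 200]
print(coords["y"])  # [0, -20, 0, -20]
```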

Allen Institute’s Common Coordinate Framework

ccf : Allen Institute’s Common Coordinate Framework (CCF) coordinates for each recording channel, saved to basename.ccf.channelinfo.mat with the following fields:

x : Anterior-Posterior position of each channel (µm; nChannels x 1).
y : Superior-Inferior position of each channel (µm; nChannels x 1).
z : Left-Right position of each channel (µm; nChannels x 1; right hemisphere positive direction).
channel : Channel list (nChannels x 1; optional).

The Allen Institute’s Common Coordinate Framework allows you to visualize your cells in the standardized mouse atlas.

Time series

This is a data container for other time series data (check other containers for specific formats like intracellular). A MATLAB struct timeserieName stored in a .mat file: basename.timeserieName.timeseries.mat with the following fields:

data : a [nSamples x nChannels] matrix with time series data.
timestamps : a [nSamples x 1] vector with timestamps.
precision : e.g. int16.
units : e.g. mV.
nChannels : number of channels.
channelNames : struct with names of channels.
sr : sampling rate.
nSamples : number of samples.
leastSignificantBit : range/precision in µV. Intan system: 0.195µV/bit.
equipment : hardware used to acquire the data.
notes : Human-readable notes about this time series data.
description : Description of this time series data.
processinginfo : a struct with information about how the .mat file was generated including the name of the function, version, date, source file, and parameters.
- sourceFileName : file name.

Any other field can be added to the struct containing time series data. The *.timeseries.mat files should be stored in the basepath.

States

This is a data container for brain states data. A MATLAB struct states stored in a .mat file: basename.statesName.states.mat. States can contain multiple temporal states defined by intervals, e.g. sleep/wake states (awake/nonREM/REM) and cortical states (Up/Down). It has the following fields:

ints: a struct containing intervals (start and stop times) for each state (required).
- .stateName: start/stop time for each instance of state stateName (required).
processinginfo: a struct with information about how the .mat file was generated including the name of the function, version, date and parameters.
detectorinfo: a struct with information about how the states were detected.

Optional fields

idx: a struct containing timestamps for each state.
- .states: a [t x 1] vector giving the state at each point in time (t: number of timestamps).
- .timestamps: a [t x 1] vector with timestamps.
- .statenames: {Nstates} cell array for the name of each state.

Any other field can be added to the struct containing states data. The *.states.mat files should be stored in the basepath.
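The two representations, ints (intervals per state) and idx (one state code per timestamp), carry the same information. The Python sketch below converts one into the other; the function name, the state names, the half-open interval handling, and the use of 0 for "no state" are assumptions made for illustration.

```python
def ints_to_idx(ints, timestamps):
    """Convert per-state [start, stop] intervals into a state code per timestamp."""
    statenames = list(ints)
    states = [0] * len(timestamps)  # 0 marks "no state" (assumed convention)
    for code, name in enumerate(statenames, start=1):
        for start, stop in ints[name]:
            for k, t in enumerate(timestamps):
                if start <= t < stop:
                    states[k] = code
    return {"states": states, "timestamps": timestamps, "statenames": statenames}

ints = {"awake": [[0, 2]], "nonREM": [[2, 5]]}  # hypothetical intervals in seconds
idx = ints_to_idx(ints, timestamps=[0, 1, 2, 3, 4, 5])
print(idx["states"])      # [1, 1, 2, 2, 2, 0]
print(idx["statenames"])  # ['awake', 'nonREM']
```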

Behavior

This is a data container for behavioral tracking data. A MATLAB struct behaviorName stored in a .mat file: basename.behaviorName.behavior.mat with the following fields:

timestamps: [nSamples x 1] array of timestamps that match the data subfields (in seconds).
sr: sampling rate (Hz).
SpatialSeries: several options (position, pupil, orientation) as defined below, each with optional subfields:
- units: defines the units of the data.
- resolution: The smallest meaningful difference (in specified unit) between values in data.
- referenceFrame: description defining what the zero-position is.
- coordinateSystem: position: cartesian[default] or polar. orientation: euler or quaternion[default].
position: .x, .y, and .z spatial position. Default units: cm.
- linearized: a projection of spatial parameters into a 1 dimensional representation.
speed: a 1D representation of the running speed (cm/s).
acceleration: a 1D representation of the acceleration (cm/s^2).
orientation: .x, .y, .z, and .w (default units: radians)
pupil: pupil-tracking data: .x, .y, .diameter.
epochs: behaviorally derived epochs.
trials: struct with trials information.
- *: the name of the trial analysis. e.g. alternation
  - start: trial start times in seconds.
  - stop: trial end times in seconds.
  - trials: continuous vector with numeric trial numbers.
  - nTrials: number of trials.
  - stateName: Name describing what the trials fields describe (e.g. Alternative running on track).
states: e.g. spatially defined regions like central arm or waiting area in a maze. Can be binary or numeric.
stateNames: names of the states.
timeSeries: can contain any derived time traces projected into the behavioral timestamps e.g. temperature, oscillation frequency, power etc.
notes: Human-readable notes about this TimeSeries dataset.
description: Description of this TimeSeries dataset.
processinginfo: a struct with information about how the .mat file was generated, including the name of the function, version, date, and parameters.

Any other field can be added to the struct containing behavior data. The *.behavior.mat files should be stored in the basepath.
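As an example of how the speed field relates to position and timestamps, the Python sketch below computes running speed (cm/s) from x and y positions in cm. It is a generic finite-difference estimate with a hypothetical function name, without the smoothing a real pipeline would likely apply.

```python
import math

def running_speed(x, y, timestamps):
    """Speed between consecutive samples: distance travelled divided by elapsed time."""
    speed = []
    for k in range(1, len(timestamps)):
        dist = math.hypot(x[k] - x[k - 1], y[k] - y[k - 1])
        speed.append(dist / (timestamps[k] - timestamps[k - 1]))
    return speed

x = [0.0, 3.0, 3.0]           # hypothetical positions in cm
y = [0.0, 4.0, 4.0]
timestamps = [0.0, 0.5, 1.0]  # seconds
speed = running_speed(x, y, timestamps)
print(speed)  # [10.0, 0.0]
```

The result has one value fewer than timestamps; padding or interpolation is needed if speed should align sample by sample with the position data.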

Firing rate maps

This is a data container for firing rate map data. A MATLAB struct ratemap containing 1D or linearized firing rate maps, stored in a .mat file: basename.ratemap.firingRateMap.mat. The firing rate maps have the following fields:

map: a [1 x N] cell-struct for N units each containing a [nBins x nStates] matrix, where nBins corresponds to the bin count and nStates to the number of states. States can be trials, manipulation states, left-right states, etc.
x_bins: a 1 x nBins vector with the bin values used to generate the firing rate map.
x_label: a 1 x nStates vector with names of the states.
stateNames: a 1 x nStates vector with names of the states.
boundaries: a 1 x nStates vector with spatial boundaries.
boundaryNames: a 1 x nStates vector with labels for the boundaries.
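A 1D firing rate map of this kind is commonly computed as spike count per bin divided by the time spent in that bin. The Python sketch below shows this for a single unit and a single state; it is a generic illustration with hypothetical names and data, not the method used to produce the ratemap struct.

```python
def rate_map_1d(spike_pos, tracked_pos, dt, bin_edges):
    """Firing rate (Hz) per spatial bin: spike count divided by occupancy time."""
    n_bins = len(bin_edges) - 1

    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            for b in range(n_bins):
                if bin_edges[b] <= v < bin_edges[b + 1]:
                    counts[b] += 1
                    break
        return counts

    spikes = histogram(spike_pos)
    occupancy = [c * dt for c in histogram(tracked_pos)]  # seconds spent in each bin
    return [s / o if o > 0 else float("nan") for s, o in zip(spikes, occupancy)]

# Hypothetical data: positions in cm, tracking sampled every 0.5 s.
tracked_pos = [1, 2, 3, 4, 12, 15]  # animal position at each tracking sample
spike_pos = [2, 3, 3, 14]           # animal position at each spike
rates = rate_map_1d(spike_pos, tracked_pos, dt=0.5, bin_edges=[0, 10, 20])
print(rates)  # [1.5, 1.0]
```

Bins the animal never visited yield NaN rather than zero, so unvisited locations are not mistaken for silent ones.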

Intracellular time series

This is a data container for intracellular recordings. Any MATLAB struct intracellularName containing intracellular data would be stored in a .mat file: basename.intracellularName.intracellular.mat. It contains fields inherited from timeSeries with the following fields:

data : a [nSamples x nChannels] matrix with time series data.
timestamps : a [nSamples x 1] vector with timestamps in seconds.
precision : e.g. int16.
units : e.g. mV.
nChannels : number of channels.
channelNames : struct with names of channels.
sr : sampling rate.
nSamples : number of samples.
leastSignificantBit : range/precision in µV. Intan system: 0.195µV/bit.
equipment : hardware used to acquire the data.
notes : Human-readable notes about this time series data.
description : Description of this time series data.
intracellular : Intracellular specific fields
- clamping : clamping method: current,voltage.
- type : recording type (Patch or Sharp).
- solution : Glass pipette solution (string describing the solution).
- bridgeBalance : bridge balance (M ohm).
- seriesResistance : series resistance (M ohm).
- inputResistance : input resistance (M ohm).
- membraneCaparitance : membrane capacitance.
- electrodeMaterial : glass,tungsten, etc.
- groundElectrode : description of ground electrode.
- referenceElectrode : description of reference electrode.
processinginfo : a struct with information about how the .mat file was generated including:
- function : which function was used to generate the struct.
- version : version of function if any.
- date : date of processing.
- parameters : input parameters when the struct was created.
- sourceFileName : file name of the original source file.

Any other field can be added to the struct containing intracellular time series data. The *.intracellular.mat files should be stored in the basepath.

Data containers

The data is organized into data-type specific containers, a concept introduced in the Buzcode toolbox repository:

basename.session.mat: session level metadata.
basename.*.lfp.mat: derived ephys signals including theta-band filtered lfp.
basename.*.cellinfo.mat: spike-derived data, including spikes, cell_metrics, mono_res.
basename.*.firingRateMap.mat: firing rate maps. Derived from behavior and spikes, e.g. ratemap.
basename.*.events.mat: events data, including ripples, SWR.
basename.*.manipulation.mat: manipulation data.
basename.*.channelinfo.mat: channel-wise data, including impedance.
basename.*.timeseries.mat: time series data, could be temperature measures, or other data collected together with the ephys data.
basename.*.behavior.mat: behavior data, including position tracking and pupil tracking.
basename.*.states.mat: brain states derived data including SWS/REM/awake and up/down states.
basename.*.intracellular.mat: intracellular data.
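Since every container file follows the pattern basename.name.container.mat, the container type can be recovered from the file name. The Python sketch below does this for the containers listed above; the function name and the example file names are hypothetical.

```python
CONTAINERS = {"lfp", "cellinfo", "firingRateMap", "events", "manipulation",
              "channelinfo", "timeseries", "behavior", "states", "intracellular"}

def parse_container_file(filename, basename):
    """Split 'basename.name.container.mat' into (name, container)."""
    prefix, suffix = basename + ".", ".mat"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError("not a .mat file of this session")
    middle = filename[len(prefix):-len(suffix)]
    if middle == "session":
        return ("session", "session")
    name, _, container = middle.rpartition(".")
    if not name or container not in CONTAINERS:
        raise ValueError(f"unknown container in {filename!r}")
    return (name, container)

ripples = parse_container_file("rec1.ripples.events.mat", "rec1")
spikes = parse_container_file("rec1.spikes.cellinfo.mat", "rec1")
print(ripples, spikes)  # ('ripples', 'events') ('spikes', 'cellinfo')
```

Matching on the basename prefix rather than splitting on every dot keeps the sketch working when the basename itself contains dots.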

Example dataset

There is an example dataset available to help you understand the data structure. The dataset contains: a .dat file (raw ephys data; 62 GB), a .lfp file (low-pass filtered and down-sampled data file; 4 GB), session.mat, spikes, events, behavior, trials, timeseries, states, firingRateMap, cell_metrics, mono_res, and spike sorted data processed with KiloSort and curated in Phy. It is available from our Webshare and our Globus endpoint. The size of the full dataset is 75 GB, but files can be downloaded individually.