Python MRCZ Documentation Reference¶
Contents:
Introduction¶
MRCZ is a union of the MRC file format with blosc
meta-compression. blosc
is not a compression algorithm, rather it is a standard that supports most
popular compression algorithms. It can also apply lossless filters that
improve compression performance, such as the bitshuffle
filter. It achieves
high-performance through the use of multi-threading the supported compression
codecs. Generally you should expect MRCZ to result in faster file read/write
rates, as the compression is faster than hard drive read/write rates, as well
as near entropy-limited compression ratios. So you get something for nothing.
Typical usage patterns are:
imageData, imageMeta = mrcz.readMRC('my_filename.mrcz')
where imageData
is a numpy.ndarray
and imageMeta
is a Python dict
containing metadata. After some manipulation, you may want to then save to disk
so the file can be passed into a third-party application, such as a CTF estimation
tool. Here for maximum compatibility we will save it uncompressed (which
is the default keyword argument for compressor
):
mrcz.writeMRC( imageData, 'passed_file.mrc', compressor=None )
Alternatively you may want to save an archival compression version of your data in the background using the asynchronous feature. In this case, the exact time when the write finishes is typically not a concern (although see the the function documentation for finer control):
mrcz.asyncWriteMRC( imageData, 'my_newfile.mrcz', meta=newMeta, compressor='zstd', clevel=1 )
See the API reference docs for detailed information on usage. The recommended compression codecs and levels are:
compressor='zstd'
andclevel=1
for general archival use.compressor='lz4'
andclevel=9
for speed-critical applications.
The bitshuffle
filter is always used in MRCZ compressed files as it was
found to improve both compression rate and ratio with representative
electron microscopy data.
MRCZ Specification¶
In general MRCZ follows the CCPEM MRC2014 standard as outlined here:
http://www.ccpem.ac.uk/mrc_format/mrc2014.php
Please note we count bytes starting from 0
. CCP-EM counts bytes starting
from 1
.
Required deviations from CCPEM MRC2014 standard¶
Word 4 (@ byte 16): The MODE parameter is now the sum of the MRC2014 MODE plus the
blosc
compression used * 1000.The compressor enumeration is:
{ 0:None, 1:'blosclz', 2:'lz4', 3:'lz4hc', 4:'snappy', 5:'zlib', 6:'zstd' }
Unpacking is generally performed as follows:
mrcMode = numpy.mod(mrczMode, 1000) compressor = numpy.floor_divide(mrczMode, 1000)
In practice any MODE > 1000 indicates the use of a compression codec.
blosc
will discover the actual codec used itself.In the case where
compressor != None
, starting at byte 1024 (or 1024 + EXTRA if the extended header is used) ac-blosc
header is found. Thec-blosc
header format specification may be found here:https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst
blosc
is limited to2**31
bytes per chunk. Chunking for compression is accomplished by compressing each slice/frame in the z-axis with a separate call toblosc.compress()
. Therefore the data section consists of NZ structs ofc-blosc
headers followed by the packed bytes for the associated slice/frame.
Optional deviations from CCPEM MRC2014 standard¶
- Word 33 (@ byte 132): Accelerating voltage in keV, float-32 format. Deprecated.
- Word 34 (@ byte 136): Spherical aberration in mm, float-32 format. Deprecated.
- Word 35 (@ byte 140): Detector gain in e^-/DN, defaults to 1.0. Deprecated.
- Word 36-37 (@ byte 14): Size of compressed data in bytes stored as a 64-bit
integer, including
blosc
headers. Present for convenience only. - Word 57 (@ byte 224): The ascii-encoded identifier label ‘MRCZ<version>’. For example, ``b’MRCZ0.3.1’.
Failure to include any of these variables will not result in an exception.
JSON extended meta-data¶
When the keyword argument meta
is used with writeMRC
and
asyncWriteMRC
the passed dictionary will be converted to UTF-8 encoded JSON
and written into the extended header. This is indicated by the ascii-encoded
bytes 'json'
written into the EXTTYP variable of the MRC2014 header. The
length of the encoded JSON metadata is stored in the EXTRA variable of the
MRC2014 header.
Note: python-rapidjson
is preferred but the standard library json
module is used as a fallback.
Frequently Asked Questions¶
Can I access individual slices in a compressed MRCZ file?
Currently not, as blosc
does not record the indices of its individual
compressed blocks. The new feature of ‘super-chunks’, equivalent to slices or
frames in the context of microscopy, is expected to be implemented in
c-blosc2
, currently under development at https://github.com/Blosc/c-blosc2
API Reference¶
-
mrcz.
readMRC
(MRCfilename, idx=None, endian='le', pixelunits='\AA', fileConvention='ccpem', useMemmap=False, n_threads=None, slices=None)[source]¶ Imports an MRC/Z file as a NumPy array and a meta-data dict.
Parameters: - image: numpy.ndarray
a 1-3 dimension
numpy.ndarray
with one of the supported data types inmrcz.REVERSE_CCPEM_ENUM
- meta: dict
a
dict
with various fields relating to the MRC header information. Can also hold arbitrary meta-data, but the use of large numerical data is not recommended as it is encoded as text via JSON.- idx: Tuple[int]
Index tuple
(first, last)
where first (inclusive) and last (not inclusive) indices of images to be read from the stack. Index of first image is 0. Negative indices can be used to count backwards. A singleton integer can be provided to read only one image. If omitted, will read whole file. Compression is currently not supported with this option.- pixelunits: str
can be
'AA' (Angstoms), 'nm', '\mum', or 'pm'
. Internally pixel sizes are always encoded in Angstroms in the MRC file.- fileConvention: str
can be
'ccpem'
(equivalent to IMOD) or'eman2'
, which is only partially supported at present.- endian: str
can be big-endian as
'be'
or little-endian as'le'
. Defaults to ‘le’ as the vast majority of modern computers are little-endian.- n_threads: int
is the number of threads to use for decompression, defaults to use all virtual cores.
- useMemmap: bool = True
returns a
numpy.memmap
instead of anumpy.ndarray
. Not recommended as it will not work with compression.- slices: Optional[int] = None
Reflects the number of slices per frame. For example, in time-series with multi-channel STEM, would be
4
for a 4-quadrant detector. Data is always written contiguously in MRC, but will be returned as a list of[slices, *shape]
-shaped arrays. The default optionNone
will check for a'slices'
field in the meta-data and use that, otherwise it defaults to0
which is one 3D array.
Returns: - image: Union[list[numpy.ndarray], numpy.ndarray]
If
slices == 0
then a monolithic array is returned, else alist
of[slices, *shape]
-shaped arrays.- meta: dict
the stored meta-data in a dictionary. Note that arrays are generally returned as lists due to the JSON serialization.
-
mrcz.
writeMRC
(input_image, MRCfilename, meta=None, endian='le', dtype=None, pixelsize=[0.1, 0.1, 0.1], pixelunits='\AA', shape=None, voltage=0.0, C3=0.0, gain=1.0, compressor=None, clevel=1, n_threads=None, quickStats=True, idx=None)[source]¶ Write a conventional MRC file, or a compressed MRCZ file to disk. If compressor is
None
, then backwards compatibility with other MRC libraries should be preserved. Other libraries will not, however, recognize the JSON extended meta-data.Parameters: - input_image: Union[numpy.ndarray, list[numpy.ndarray]]
The image data to write, should be a 1-3 dimension
numpy.ndarray
or a list of 2-dimensional ``numpy.ndarray``s.- meta: dict
will be serialized by JSON and written into the extended header. Note that
rapidjson
(the default) orjson
(the fallback) cannot serialize all Python objects, so sanitizingmeta
to remove non-standard library data structures is advisable, includingnumpy.ndarray
values.- dtype: Union[numpy.dtype, str]
will cast the data before writing it.
- pixelsize: Tuple[x,y,z]
is [z,y,x] pixel size (singleton values are ok for square/cubic pixels)
- pixelunits: str = u’AA’
one of -
'\AA'
for Angstroms -'pm'
for picometers -'\mum'
for micrometers -'nm'
for nanometers. MRC standard is always Angstroms, so pixelsize is converted internally from nm to Angstroms as needed.- shape: Optional[Tuple[int]]
is only used if you want to later append to the file, such as merging together Relion particles for Frealign. Not recommended and only present for legacy reasons.
- voltage: float = 300.0
accelerating potential in keV
- C3: float = 2.7
spherical aberration in mm
- gain: float = 1.0
detector gain in units (counts/primary electron)
- compressor: str = None
is a choice of
None, 'lz4', 'zlib', 'zstd'
, plus'blosclz'
,'lz4hc'
-'lz4'
is generally the fastest. -'zstd'
generally gives the best compression performance, and is still almostas fast as ‘lz4’ with
clevel == 1
.- clevel: int = 1
the compression level, 1 is fastest, 9 is slowest. The compression ratio will rise slowly with clevel (but not as fast as the write time slows down).
- n_threads: int = None
number of threads to use for blosc compression. Defaults to number of virtual cores if
== None
.- quickStats: bool = True
estimates the image mean, min, max from the first frame only, which saves computational time for image stacks. Generally strongly advised to be
True
.- idx
can be used to write an image or set of images starting at a specific position in the MRC file (which may already exist). Index of first image is 0. A negative index can be used to count backwards. If omitted, will write whole stack to file. If writing to an existing file, compression or extended MRC2014 headers are currently not supported with this option.
Returns: None
-
mrcz.
asyncReadMRC
(*args, **kwargs)[source]¶ Calls readMRC in a separate thread and executes it in the background.
Parameters: - Valid arguments are as for `readMRC()`.
Returns: - future
A
concurrent.futures.Future()
object. Callingfuture.result()
will halt until the read is finished and returns the image and meta-data as per a normal call to readMRC.
-
mrcz.
asyncWriteMRC
(*args, **kwargs)[source]¶ Calls writeMRC in a seperate thread and executes it in the background.
Parameters: - Valid arguments are as for `writeMRC()`.
Returns: - future
A
concurrent.futures.Future
object. If needed, you can callfuture.result()
to wait for the write to finish, or check withfuture.done()
. Most of the time you can ignore the return and let the system write unmonitored. An exception would be if you need to pass in the output to a subprocess.
-
class
mrcz.
readDM4
(filename, verbose=False)[source]¶ A fast DM4 file reader to strip the data out of the large movie-mode files generated by K2 detectors along with important tags. Due to the emphasis on speed it’s not a general file parser with dynamic allocation, so if Gatan changes the format a lot it will break the script.
Parameters: - filename: str
the
- verbose: bool
whether to output debugging info or not.
-
mrcz.
__version__
¶ The version of Python MRCZ.
Release Notes¶
0.5.6¶
- In 0.5.5 a for-loop was omitted which lead to every frame being the zeroth frame in the stack. For this reason, upgrading from 0.5.5 is _strongly_ recommended.
0.5.5¶
- Meta-data with keys that match those used in the header could accidently overwrite critical values, such as ‘dimensions’. Any keys in the JSON meta-dictionary that overlap with the standard values are now ignored.
- Integration tests are now performed for Python 3.8.
0.5.4¶
- Added support for serialization of Python Enum objects in JSON serialization of meta-data.
0.5.3¶
- Ricardo Righetto added the means to append frames to a stack.
- Support for Python 3.4 was dropped as it is past end-of-life by Python.org.
0.5.2¶
- Improved on serialization of non-standard (i.e. NumPy) types in JSON-ized meta-data by making use of the default callable in json.dumps. In particular deeply nested NumPy types should now serialize without erroring. Note that there is no support for complex numbers in JSON meta-data, as JSON itself does not support it by default.
0.5.1¶
Versions of MRCZ <= 0.4.1 were improperly writing the dimensions into the (Mx, My, Mz) volume fields. Added a check for the MRCZ version tag, and if an older file is found, it defaults to
slices == 1
, i.e. one 2D frame per element in the returned list. - In order to suppress the warning message, files can be read into memory andre-saved. A utility script for batch processing is provided in
utils\update_mrcz_0.5.0.py
.
0.5.0¶
- Added support for lists of 3D numpy.ndarray objects. This is largely intended to support multi-channel STEM time series. Stores the number of channels per frame in the MZ value of the MRC2014 header, which must be uniform for every ndarray in the list. Any MRCZ archive that has a ‘strides’ key in the JSON metadata will be returned as a list of arrays. - See http://www.ccpem.ac.uk/mrc_format/mrc2014.php for header details - asList keyword arguments have been removed.
- Fixed a bug in casting from float64/complex128 that was not actually casting.
- Cleaned up the code to be more PEP8 compliant.
0.4.1¶
- Improved docstrings in ioDM4.py.
- Added asyncReadDM4 function, analogous to asyncReadMRC.
0.4.0¶
- Fix a minor bug with casting for lists of arrays
- Improved uncompressed write times by not using list comphrension
- Add scaling block size for small format images (e.g. Medipix) to scale to the number of threads.
- If the passed arrays are C_CONTIGUOUS and ALIGNED, writeMRC will use blosc.compress_ptr instead of coverting the array to a bytes object which is a significant speedup.
0.3.8¶
- Auto-casts np.float64 -> np.float32 and np.complex128 -> np.complex64 but logs a warning to the user.
0.3.7¶
- Updated MANIFEST.in and setup.py to make Conda-forge happy.
0.3.6¶
- mrcz.ReliablePy must be imported explicitely now, as it has requirements that the base mrcz package does not. This file may be removed in the future if no users are using it.
0.3.5¶
- If
blosc
is not installed and the user attempts to operate with compression on anImportError
is raised. - Documentation now using Numpy docstrings.
0.3.4¶
- Add (temporarily) MRC types for uint32 and int32 to support 24-bit detectors. May break in the future, as the CCP-EM committee should make the final decision on such enumerations.
- Added handling of NumPy scalars (i.e. np.float32(1.0)) in metadata so that JSON serialization does not generate errors. Values will be case to Python int or float as appropriate.
0.3.3¶
- Removed use of star-expansion of args as it breaks Python 2.7/3.4.
0.3.2¶
- Made blosc an optional dependency due to difficulties involved in building wheels for PyPi.
- Implemented reading/writing of list of equally-shaped 2D ndarray`s instead of a single 3D `ndarray, where the list represents the Z-axis. This approach can be helpful for larger arrays that do not have to be continuous as the operating system can more easily interleave them into memory.
0.3.1¶
- Added ascii identifier label ‘MRCZ’ + <__version__> to the labels. I.e. at byte 224 in the header will appear b’MRCZ0.3.1’
0.3.0¶
- Documentation now available at http://python-mrcz.readthedocs.io/
- Added continuous integration testing with Appveyor and TravisCI, which was previously handled by c-mrcz.
- Added handling for dask.array.core.Array objects.
- numpy.ndarrays inside meta dictionaries will be converted to list objects to facilitate serialization.
- Updated license to BSD-3-clause from BSD-2-clause.
- Various bug fixes.
0.2.1-4¶
- Various bug fixes to incorporate into Hyperspy.
0.2.0¶
- Added support for asynchronous reading and writing.
0.1.4a1¶
- Fixed a bug with the machine-stamp not being converted to bytes properly.
0.1.4a0¶
- Fixed a bug in import of mrcz from ReliablePy
0.1.3a2¶
- Added ReliablePy, an interface for Relion .star and Frealign .par files.
- Fixes to maintain cross-compatibility with c-mrcz. Main functions are readMRC and writeMRC. readMRC always returns a header now.
- Added mrcz_test suite, which also tests c-mrcz if it’s found in the path.
- Fixed bugs related to mrcz_test.py
0.1.1a1¶
- Renamed ‘cLevel’ to ‘clevel’ to maintain consistency with blosc naming convention.
- Updated license from MIT to BSD 2-clause.
0.1.0dev0¶
Initial commit