Release notes

Release 0.4.3

Added a command line utility, petastorm-copy-dataset.py, which makes it easier to create subsets (columns/rows) of existing petastorm datasets.
Add option to use custom pyarrow filesystem when materializing datasets.
Limit memory usage correctly when using Reader with ProcessPool.
Added --pyarrow-serialize switch to petastorm-throughput.py benchmarking command line utility.
Faster serialization (using pyarrow.serialize) in ProcessPool implementation. Now decimal types are supported.
More information in reader.diagnostics property.
Check if a --unischema string passed to petastorm-generate-metadata is actually a Unischema instance.
Fixed race condition in ProcessPool resulting in indefinite wait on ProcessPool shutdown.
Force loading pyarrow before torch. Helps to avoid a segfault (documented in docs/troubleshoot.rst).
Fixed mnist training examples.
Make dependency on opencv optional in codecs.py

decimal.Decimal fields are decoded as decimal again, as they were before 0.4.0.
Adding a benchmark module with a petastorm-throughput.py command line utility.

Added WeightedSamplingReader: aggregates the output of multiple Readers by sampling them with specified probabilities (see the WeightedSamplingReader documentation).
Added an option to set driver memory when regenerating metadata.
petastorm-generate-metadata command line tool renamed to petastorm-generate-metadata.py
pytorch support (petastorm.pytorch.DataLoader class)
pytorch and tensorflow mnist model training
Added CompressedNdarrayCodec codec
Support passing pyarrow filesystem as Reader construction argument
Speedup serialization (use pyarrow.serialize) when ProcessPool is used.
New, experimental, implementation of reader: ReaderV2.
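
The WeightedSamplingReader entry above describes drawing from several readers with specified probabilities. The following is a minimal, library-independent sketch of that idea and not Petastorm's implementation: the function name `weighted_sample`, its parameters, and the choice to stop as soon as any source is exhausted are all assumptions made for illustration.

```python
import random
from itertools import count, islice


def weighted_sample(sources, probabilities, rng=None):
    """Yield items from several iterables, choosing the source of each item at random.

    Hypothetical helper for illustration only; it is not part of Petastorm.
    Iteration stops as soon as the chosen source has no more items.
    """
    if len(sources) != len(probabilities):
        raise ValueError("sources and probabilities must have the same length")
    rng = rng or random.Random()
    iterators = [iter(source) for source in sources]
    indices = range(len(iterators))
    while True:
        # random.choices normalizes the weights, so they need not sum to 1.
        (chosen,) = rng.choices(indices, weights=probabilities, k=1)
        try:
            yield next(iterators[chosen])
        except StopIteration:
            return


# Usage: mix two endless streams, taking roughly 90% of items from the first.
mixed = weighted_sample([count(0), count(1000)], [0.9, 0.1], rng=random.Random(0))
sample = list(islice(mixed, 1000))
```

In this sketch each source keeps its own position, so items from a given source still arrive in their original order; only the interleaving between sources is random.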