Releases: tensorflow/datasets
v4.1.0
- When generating a dataset, if download fails for any reason, it is now possible to manually download the data. See doc.
- Simplification of the dataset creation API:
  - We've made it easier to create datasets outside the TFDS repository (see our updated dataset creation guide).
  - `_split_generators` should now return `{'split_name': self._generate_examples(), ...}` (but current datasets are backward compatible).
  - All datasets inherit from `tfds.core.GeneratorBasedBuilder`. Converting a dataset to Beam now only requires changing `_generate_examples` (see example and doc, and the sketch after this list).
  - `tfds.core.SplitGenerator` and `tfds.core.BeamBasedBuilder` are deprecated and will be removed in a future version.
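To illustrate, here is a minimal sketch of a new-style builder; the dataset name, URL, features, and labels are all hypothetical placeholders:

```python
import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
  """Hypothetical dataset, for illustration only."""

  VERSION = tfds.core.Version('1.0.0')

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'image': tfds.features.Image(),
            'label': tfds.features.ClassLabel(names=['no', 'yes']),
        }),
    )

  def _split_generators(self, dl_manager):
    # New style: return a dict mapping split names to example generators.
    # `tfds.core.as_path` (added in this release) gives a pathlib-like object.
    path = tfds.core.as_path(
        dl_manager.download_and_extract('https://example.org/data.zip'))
    return {
        'train': self._generate_examples(path / 'train'),
        'test': self._generate_examples(path / 'test'),
    }

  def _generate_examples(self, path):
    # Yield (key, example) pairs; `path` is pathlib-like.
    for img_path in path.iterdir():
      yield img_path.name, {'image': img_path, 'label': 'yes'}
```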
- Better `pathlib.Path`, `os.PathLike` compatibility:
  - `dl_manager.manual_dir` now returns a pathlib-like object. Example: `text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()`
  - Note: other `dl_manager.download`, `.extract`, ... calls will return pathlib-like objects in future versions.
  - `FeatureConnector`, ... and most functions should accept `PathLike` objects. Let us know if some functions you need are missing.
  - Add `tfds.core.as_path` to create pathlib.Path-like objects compatible with GCS (e.g. `tfds.core.as_path('gs://my-bucket/labels.csv').read_text()`).
- Other bug fixes and improvements, e.g.:
  - Add a `verify_ssl=` option to `tfds.download.DownloadConfig` to disable SSL certificate validation during download (see the sketch after this list).
  - `BuilderConfig`s are now compatible with Beam datasets (#2348).
  - `--record_checksums` now assumes the new dataset-as-folder model.
  - `tfds.features.Image` can accept encoded `bytes` images directly (useful when used with `img_name, img_bytes = dl_manager.iter_archive('images.zip')`).
  - The API docs now show deprecated methods, and abstract methods to overwrite are now documented.
  - You can generate `imagenet2012` with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
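For instance, disabling SSL certificate validation might look like this (a sketch; `my_dataset` is a placeholder name):

```python
import tensorflow_datasets as tfds

# Skip SSL certificate validation for hosts with broken certificates.
download_config = tfds.download.DownloadConfig(verify_ssl=False)

builder = tfds.builder('my_dataset')  # placeholder dataset name
builder.download_and_prepare(download_config=download_config)
```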
And of course, new datasets.
Thank you to all our contributors for improving TFDS!
v4.0.1
- Fix `tfds.load` when generation code isn't present.
- Improve GCS compatibility.
Thanks @carlthome for reporting and fixing the issue.
v4.0.0
API changes, new features:
- Dataset-as-folder: datasets can now be self-contained modules in a folder with checksums, dummy data, ... This simplifies implementing datasets outside the TFDS repository.
- `tfds.load` can now load datasets without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see #2493).
- Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details).
- `tfds.testing.mock_data` does not require metadata files anymore (see the sketch after this list)!
- Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation (example).
- Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`).
- Add a new `DatasetBuilder.RELEASE_NOTES` property.
- `tfds.features.Image` now supports PNG with 4 channels.
- `tfds.ImageFolder` now supports custom shape and dtype.
- Downloaded URLs are available through `MyDataset.url_infos`.
- Add a `skip_prefetch` option to `tfds.ReadConfig`.
- `as_supervised=True` support for `tfds.show_examples` and `tfds.as_dataframe`.
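A short sketch combining some of these additions (the dataset and counts are illustrative):

```python
import tensorflow_datasets as tfds

# Mock data for tests: no metadata files are required anymore.
with tfds.testing.mock_data(num_examples=4):
  ds, ds_info = tfds.load('mnist', split='train', with_info=True)

# Generate three roughly equal subsplits of 'train'.
splits = tfds.even_splits('train', n=3)
# -> ['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']

# Rich notebook display of a few examples.
df = tfds.as_dataframe(ds.take(4), ds_info)
```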
Breaking compatibility changes:
- `tfds.as_numpy()` now returns an iterable which can be iterated over multiple times. To migrate, replace `next(ds)` with `next(iter(ds))`.
- Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`.
- Remove the `DatasetBuilder.IN_DEVELOPMENT` property.
- Remove `tfds.core.disallow_positional_args` (should use Py3 `*,` instead).
- `tfds.features` can now be saved/loaded; you may have to overwrite `FeatureConnector.from_json_content` and `FeatureConnector.to_json_content` to support this feature.
- Stop testing against TF 1.15. Requires Python 3.6.8+.
Other bug fixes:
- Better archive extension detection for `dl_manager.download_and_extract`.
- Fix `tfds.__version__` in TFDS nightly to be PEP 440 compliant.
- Fix crash when GCS is not available.
- Script to detect dead URLs.
- Improved open-source workflow, contributor guide, and documentation.
- Many other internal cleanups, bug fixes, dead code removal, py2->py3 cleanup, pytype annotations, ...
And of course, new datasets and dataset updates.
A gigantic thanks to our community, which has helped us debug issues and implement many features, especially @vijayphoenix for being a major contributor.
v3.2.1
- Fix an issue with GCS on Windows.
v3.2.0
Future breaking change:
- The `tfds.features.text` encoding API is deprecated. Please use tensorflow_text instead.
New features
API:
- Add `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets from your own data (see the sketch after this list).
- Add `tfds.ReadConfig(input_context=)` to shard datasets, for better multi-worker compatibility (#1426).
- The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
- Better usability when developing datasets outside TFDS:
  - Downloads are always cached.
  - Checksums are optional.
- Add `tfds.show_statistics(ds_info)` to display a FACETS OVERVIEW. Note: this requires the dataset to have been generated with the statistics.
- Open-source various scripts to help with deployment/documentation (generate catalog documentation, export all metadata files, ...).
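As an illustration, `tfds.ImageFolder` builds a dataset from a directory of labeled images (the path below is a placeholder):

```python
import tensorflow_datasets as tfds

# Expected layout (split/label/image):
#   /path/to/images/train/cat/img1.jpg
#   /path/to/images/train/dog/img2.jpg
#   /path/to/images/test/cat/img3.jpg
builder = tfds.ImageFolder('/path/to/images/')
print(builder.info)  # Inferred labels, splits, number of examples, ...
ds = builder.as_dataset(split='train', shuffle_files=True)
```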
Documentation:
- The catalog displays images (example).
- The catalog shows which datasets have been recently added and are only available in `tfds-nightly`.
Breaking compatibility changes:
- Fix deterministic example order on Windows when a path was used as the key (this only impacts a few datasets). Example order should now be the same on all platforms.
- Remove `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`.
Other:
- Various performance improvements for both generation and reading (e.g. use `__slots__`, fix a parallelisation bug in `tf.data.TFRecordReader`, ...).
- Various fixes (typos, type annotations, better error messages, dead links, better Windows compatibility, ...).
Thanks to all our contributors who help improve the state of datasets for the entire research community!
v3.1.0
Breaking compatibility changes:
- Rename `tfds.core.NamedSplit`, `tfds.core.SplitBase` -> `tfds.Split`. Now `tfds.Split.TRAIN`, ... are instances of `tfds.Split`.
- Remove the deprecated `num_shards` argument from `tfds.core.SplitGenerator`. This argument was ignored, as shards are automatically computed.
Future breaking compatibility changes:
- Rename `interleave_parallel_reads` -> `interleave_cycle_length` in `tfds.ReadConfig`.
- Invert the ds, ds_info argument order for `tfds.show_examples`.
- The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
Other changes:
- Testing: add support for custom decoders in `tfds.testing.mock_data`.
- Documentation: show which datasets are only present in `tfds-nightly`.
- Documentation: display images for supported datasets.
- API: add `tfds.builder_cls(name)` to access a DatasetBuilder class by name (see the sketch after this list).
- API: add `info.splits['train'].filenames` for access to the tf-record files.
- API: add `tfds.core.add_data_dir` to register an additional data dir.
- Remove most `ds.with_options` calls which were applied by TFDS. Now uses the tf.data defaults.
- Other bug fixes and improvements (better error messages, Windows compatibility, ...).
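A small sketch of the new API entry points (the paths below are placeholders):

```python
import tensorflow_datasets as tfds

# Access the DatasetBuilder class (not an instance) by name.
builder_cls = tfds.builder_cls('mnist')
builder = builder_cls(data_dir='/path/to/data')  # placeholder path

# List the tf-record files backing a generated split.
print(builder.info.splits['train'].filenames)

# Register an additional directory in which TFDS looks for datasets.
tfds.core.add_data_dir('/path/to/extra_data')  # placeholder path
```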
Thank you all for your contributions, and helping us make TFDS better for everyone!
v3.0.0
Breaking changes:
- Legacy mode `tfds.experiment.S3` has been removed.
- New `image_classification` section. Some datasets have been moved there from `images`.
- The `in_memory` argument has been removed from `as_dataset`/`tfds.load` (small datasets are now auto-cached).
- `DownloadConfig` does not append the dataset name anymore (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`).
- Tests now check that all `dl_manager.download` urls have registered checksums. To opt out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
- `tfds.load` now always returns `tf.compat.v2.Dataset`. If you're still using `tf.compat.v1`:
  - Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`.
  - Use `isinstance(ds, tf.compat.v2.Dataset)` instead of `isinstance(ds, tf.data.Dataset)`.
- `tfds.Split.ALL` has been removed from the API.
Future breaking changes:
- The `tfds.features.text` encoding API is deprecated. Please use tensorflow_text instead.
- The `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.
Features:
- `DownloadManager` is now picklable (so it can be used inside Beam pipelines).
- `tfds.features.Audio` (see the sketch after this list):
  - Supports floats as returned values.
  - Exposes the sample rate through `info.features['audio'].sample_rate`.
  - Supports encoding audio features from file objects.
- Various bug fixes, better error messages, documentation improvements.
- More datasets.
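For illustration, a sketch of declaring an audio feature and reading back its sample rate (assuming the `sample_rate` constructor argument; the rate is a placeholder):

```python
import tensorflow_datasets as tfds

# Inside a builder's _info(), declare an audio feature with its rate.
features = tfds.features.FeaturesDict({
    'audio': tfds.features.Audio(sample_rate=16000),
})

# After generation, the rate is recoverable from the dataset info:
# info.features['audio'].sample_rate  # -> 16000
```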
Thank you to all our contributors for helping us make TFDS better for everyone!
v2.1.0
New features:
- Datasets expose `info.dataset_size` and `info.download_size`. Datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can still be read with 2.1.0, however).
- Auto-caching of small datasets. The `in_memory` argument is deprecated and will be removed in a future version.
- Datasets expose their cardinality: `num_examples = tf.data.experimental.cardinality(ds)` (requires tf-nightly or TF >= 2.2.0).
- Get the number of examples in a sub-split with `info.splits['train[70%:]'].num_examples` (see the sketch below).
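Putting these together, a minimal sketch (using `mnist` as an example):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

ds, info = tfds.load('mnist', split='train', with_info=True)

print(info.download_size)  # Size of the original downloaded files.
print(info.dataset_size)   # Size of the generated tf-record files.

# Cardinality (requires tf-nightly or TF >= 2.2.0).
print(tf.data.experimental.cardinality(ds))

# Number of examples in a sub-split.
print(info.splits['train[70%:]'].num_examples)
```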
v2.0.0
- This is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
- The default versions of all datasets are now using the S3 slicing API. See the guide for details.
- The previous split API is still available, but is deprecated. If you wrote `DatasetBuilder`s outside the TFDS repository, please make sure they do not use `experiments={tfds.core.Experiment.S3: False}`. This will be removed in the next version, as well as the `num_shards` kwarg from `SplitGenerator`.
- Several new datasets. Thanks to all the contributors!
- API changes and new features:
  - `shuffle_files` defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new `read_config` parameter in `tfds.load` (see the sketch after this list).
  - `urls` kwarg renamed to `homepage` in `DatasetInfo`.
  - Support for nested `tfds.features.Sequence` and `tf.RaggedTensor`.
  - Custom `FeatureConnector`s can override the `decode_batch_example` method for efficient decoding when wrapped inside a `tfds.features.Sequence(my_connector)`.
  - Declaring a dataset in Colab won't register it, which allows re-running the cell without having to change the name.
  - Beam datasets can use a `tfds.core.BeamMetadataDict` to store additional metadata computed as part of the Beam pipeline.
  - Beam datasets' `_split_generators` accepts an additional `pipeline` kwarg to define a pipeline shared between all splits.
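A sketch of opting back in to file shuffling through the new `read_config` parameter (assuming the `shuffle_seed` option; the seed and dataset are illustrative):

```python
import tensorflow_datasets as tfds

# Iteration is now deterministic by default (shuffle_files=False).
# Opt back in to file shuffling and control it with a seed:
read_config = tfds.ReadConfig(shuffle_seed=42)
ds = tfds.load(
    'mnist',
    split='train',
    shuffle_files=True,
    read_config=read_config,
)
```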
- Various other bug fixes and performance improvements. Thank you for all the reports and fixes!
v1.3.0
Bug fixes and performance improvements.