Datalake

Properties


Methods

upload_data

upload_data(
   filepaths: Union[str, Path, list[Union[str, Path]]],
   tags: Optional[list[Union[str, Tag]]] = None, source: Union[str, DataSource,
   None] = None, max_workers: Optional[int] = None,
   error_manager: Optional[ErrorManager] = None, metadata: Union[None, dict,
   list[dict]] = None, fill_metadata: Optional[bool] = False,
   wait_for_unprocessable_data: Optional[bool] = True
)

Description

Upload one or more files as data into this datalake.
You can give tags as a list.
You can give a source for your data.

If some files fail to upload, pass an ErrorManager and check the example to see how
to retrieve the list of file paths that failed.

For more information about metadata, check https://documentation.picsellia.com/docs/metadata

Examples

from picsellia.services.error_manager import ErrorManager

source_camera_one = client.get_datasource("camera-one")
source_camera_two = client.get_datasource("camera-two")

lake = client.get_datalake()

tag_car = lake.get_data_tag("car")
tag_huge_car = lake.get_data_tag("huge-car")

lake.upload_data(filepaths=["porsche.png", "ferrari.png"], tags=[tag_car], source=source_camera_one)

error_manager = ErrorManager()
lake.upload_data(filepaths=["twingo.png", "path/unknown.png"], error_manager=error_manager)

# error_manager.errors is a list of UploadError objects describing what went wrong
error_paths = [error.path for error in error_manager.errors]

Arguments

  • filepaths (str or Path or list[str or Path]) : Filepaths of your data

  • tags (list[str or Tag], optional) : Data tags that will be given to the data. Defaults to [].

  • source (str or DataSource, optional) : Source of your data.

  • max_workers (int, optional) : Maximum number of workers used for the upload. Defaults to os.cpu_count() + 4.

  • error_manager (ErrorManager, optional) : If given, upload errors are collected in this ErrorManager so you can retrieve them afterwards.

  • metadata (dict or list[dict], optional) : Metadata to add to the given data. If a list is given, its length must match
    the number of filepaths. Defaults to no metadata.

  • fill_metadata (bool, optional) : Whether to read the EXIF tags of each image and add them to the metadata field.
    If some fields are already given in metadata, they will be overridden. Defaults to False.

  • wait_for_unprocessable_data (bool, optional) : If true, this method will wait for all data to be fully
    uploaded and processed by our services. Defaults to true.

Returns

A Data object or a MultiData object that wraps a list of Data.
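
Because a metadata list has to line up with filepaths one-to-one, it can help to build both lists from a single mapping. The following is a minimal sketch and not part of the SDK: the helper name build_upload_arguments and the metadata keys are hypothetical, and only the filepaths and metadata parameter names come from the signature above.

```python
from pathlib import Path
from typing import Union


def build_upload_arguments(
    metadata_by_path: dict[Union[str, Path], dict],
) -> dict[str, list]:
    """Return keyword arguments whose two lists are aligned index by index.

    Hypothetical helper: it is not part of the SDK.
    """
    filepaths = []
    metadata = []
    for path, meta in metadata_by_path.items():
        filepaths.append(Path(path))
        # Copy each dict so later edits to the input do not leak into the call.
        metadata.append(dict(meta))
    return {"filepaths": filepaths, "metadata": metadata}


kwargs = build_upload_arguments(
    {
        "porsche.png": {"line": "A"},  # illustrative metadata keys
        "ferrari.png": {"line": "B"},
    }
)
# Both lists have the same length, as the metadata argument requires.
assert len(kwargs["filepaths"]) == len(kwargs["metadata"])
```

The result could then be passed as lake.upload_data(**kwargs), assuming lake is a Datalake obtained as in the example above.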


find_data

find_data(
   filename: Optional[str] = None, object_name: Optional[str] = None, id: Union[str,
   UUID, None] = None
)

Description

Find a data item in this datalake.

You can find it by its filename, its object name, or its id.

Examples

my_data = my_datalake.find_data(filename="test.png")

Arguments

  • filename (str, optional) : filename of the data. Defaults to None.

  • object_name (str, optional) : object name in the S3 storage. Defaults to None.

  • id (str or UUID, optional) : id of the data. Defaults to None.

Raises

If no data matches the query, a NotFoundError is raised.
In some cases, an InvalidQueryError can be raised,
for example when the platform stores two data items matching the query (such as when a filename is duplicated).

Returns

The Data found
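
When a missing file is an expected outcome rather than a failure, the lookup can be wrapped so that "not found" becomes None. This is a sketch under assumptions: this section does not state where the SDK's exception classes are imported from, so the snippet defines a stand-in NotFoundError and a fake lookup in order to run offline. The helper find_or_none is hypothetical.

```python
from typing import Any, Callable, Optional


class NotFoundError(Exception):
    """Stand-in for the SDK exception named in the Raises section above."""


def find_or_none(
    find: Callable[..., Any],
    not_found_error: type[Exception],
    **query: Any,
) -> Optional[Any]:
    """Call find(**query) and return None when nothing matches."""
    try:
        return find(**query)
    except not_found_error:
        return None


def fake_find_data(filename: Optional[str] = None) -> str:
    """Fake lookup used only to make this sketch runnable offline."""
    known = {"test.png": "data-for-test.png"}
    if filename not in known:
        raise NotFoundError(filename)
    return known[filename]


found = find_or_none(fake_find_data, NotFoundError, filename="test.png")
missing = find_or_none(fake_find_data, NotFoundError, filename="absent.png")
```

With the real SDK you would pass my_datalake.find_data and the SDK's own exception class instead of the stand-ins. InvalidQueryError is deliberately left to propagate, since a duplicated filename usually needs a more specific query rather than a silent fallback.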


list_data

list_data(
   limit: Optional[int] = None, offset: Optional[int] = None,
   page_size: Optional[int] = None, order_by: Optional[list[str]] = None,
   tags: Union[str, Tag, list[Union[str, Tag]], None] = None,
   filenames: Optional[list[str]] = None, intersect_tags: Optional[bool] = False,
   object_names: Optional[list[str]] = None, q: Optional[str] = None,
   ids: Optional[list[Union[str, UUID]]] = None
)

Description

List data of this datalake.

If there is no data, a NoDataError exception is raised.

The returned object is a MultiData, an object that allows you to manipulate a batch of data.
You can add tags to them or feed a dataset with them.

Examples

lake = client.get_datalake()
data = lake.list_data()

Arguments

  • limit (int, optional) : if given, will limit the number of data returned

  • offset (int, optional) : if given, will return data that would have been returned
    after this offset in given order

  • page_size (int, optional) : deprecated.

  • order_by (list[str], optional) : if not empty, will order data by fields given in this parameter

  • filenames (list[str], optional) : if given, will return data whose filename equals one of the given filenames

  • object_names (list[str], optional) : if given, will return data whose object name equals one of the given object names

  • tags (str, Tag, list[Tag or str], optional) : if given, will return data that have at least one of the given tags
    by default. If intersect_tags is True, it will return data
    that have all the given tags

  • intersect_tags (bool, optional) : if True, and a list of tags is given, will return data that have
    all the given tags. Defaults to False.

  • q (str, optional) : if given, will filter data with given query. Defaults to None.

  • ids (list[str or UUID], optional) : ids of the data you're looking for. Defaults to None.

Raises

  • NoDataError : When datalake has no data, raise this exception.

Returns

A MultiData object that wraps a list of Data.
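
The difference between the default tag filter and intersect_tags=True can be modelled locally with set operations. This sketch only mirrors the behaviour described in the arguments above; it is not SDK code, and the function matches_tags and the catalog contents are made up for illustration.

```python
def matches_tags(
    data_tags: set[str], query_tags: list[str], intersect_tags: bool = False
) -> bool:
    """Local model of the tag filter described above (not SDK code)."""
    wanted = set(query_tags)
    if intersect_tags:
        # Every requested tag must be present on the data.
        return wanted <= data_tags
    # Default: at least one requested tag must be present.
    return bool(wanted & data_tags)


# Illustrative catalog: filename -> tags attached to that data.
catalog = {
    "porsche.png": {"car"},
    "truck.png": {"car", "huge-car"},
    "tree.png": {"plant"},
}
query = ["car", "huge-car"]

any_tag = sorted(n for n, tags in catalog.items() if matches_tags(tags, query))
all_tags = sorted(
    n for n, tags in catalog.items() if matches_tags(tags, query, intersect_tags=True)
)
```

Here any_tag keeps both car images, while all_tags keeps only the one carrying both tags.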


create_data_tag

create_data_tag(
   name: str
)

Description

Create a data tag used in this datalake

Examples

tag_car = lake.create_data_tag("car")

Arguments

  • name (str) : Name of the tag to create

Returns

A Tag object


get_data_tag

get_data_tag(
   name: str
)

Description

Retrieve a data tag used in this datalake.

Examples

tag_car = lake.get_data_tag("car")

Arguments

  • name (str) : Name of the tag to retrieve

Returns

A Tag object


get_or_create_data_tag

get_or_create_data_tag(
   name: str
)

Description

Retrieve a data tag used in this datalake by its name.
If tag does not exist, create it and return it.

Examples

tag = lake.get_or_create_data_tag("new_tag")

Arguments

  • name (str) : Name of the tag to retrieve or create

Returns

A Tag object
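
The practical benefit of the retrieve-or-create behaviour is that a script can be rerun without failing on an already existing tag. The toy class below models that behaviour in memory; it is an illustration only and says nothing about how the platform implements it. InMemoryTags is a hypothetical name.

```python
class InMemoryTags:
    """Toy stand-in for a datalake's tag store (not SDK code)."""

    def __init__(self) -> None:
        self._tags: dict[str, dict] = {}

    def get_or_create(self, name: str) -> dict:
        # Create the tag only when the name is not known yet...
        if name not in self._tags:
            self._tags[name] = {"name": name}
        # ...then return the stored tag in both cases.
        return self._tags[name]


store = InMemoryTags()
first = store.get_or_create("new_tag")   # creates the tag
second = store.get_or_create("new_tag")  # returns the existing tag
```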


list_data_tags

list_data_tags(
   limit: Optional[int] = None, offset: Optional[int] = None,
   order_by: Optional[list[str]] = None
)

Description

List all tags of this datalake

Examples

tags = lake.list_data_tags()
assert tag_car in tags

Arguments

  • limit (int, optional) : Limit the number of tags returned. Defaults to None.

  • offset (int, optional) : Offset to start listing tags. Defaults to None.

  • order_by (list[str], optional) : Order the tags returned by given fields. Defaults to None.

Returns

A list of Tag objects
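
The limit and offset arguments can be combined to fetch tags in batches. The sketch below assumes that offset counts items rather than pages, which this section does not state explicitly. The helper iter_in_batches and the fake listing function are hypothetical and exist only so the snippet runs without a platform connection.

```python
from typing import Any, Callable, Iterator


def iter_in_batches(list_fn: Callable[..., list], batch_size: int) -> Iterator[Any]:
    """Yield items by calling list_fn(limit=..., offset=...) repeatedly."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    offset = 0
    while True:
        batch = list_fn(limit=batch_size, offset=offset)
        yield from batch
        if len(batch) < batch_size:
            # A short (or empty) batch means there is nothing left to fetch.
            return
        offset += batch_size


ALL_TAGS = ["bike", "car", "huge-car", "plant", "truck"]


def fake_list_data_tags(limit=None, offset=None, order_by=None):
    """Offline stand-in mimicking the limit/offset parameters."""
    start = offset or 0
    end = None if limit is None else start + limit
    return ALL_TAGS[start:end]


collected = list(iter_in_batches(fake_list_data_tags, batch_size=2))
```

Against a real datalake, the first argument would be lake.list_data_tags instead of the fake function.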


create_projection

create_projection(
   data: Data, name: str, path: str, additional_info: dict = None,
   fill_metadata: bool = False
)

Description


import_bucket_objects

import_bucket_objects(
   prefixes: list[str], tags: Optional[list[Union[str, Tag]]] = None,
   source: Union[str, DataSource, None] = None
)

Description

Asynchronously import objects from your bucket whose object names begin with the given prefixes.

Arguments

  • prefixes (list[str]) : list of prefixes to import

  • tags (list[str or Tag], optional) : list of tags that will be added to data

  • source (str or DataSource, optional) : data source that will be specified on data

Returns

A Job that you can wait on until it is done.
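
To preview which objects a set of prefixes would cover before launching an import, the prefix rule can be checked locally. This is only a model of the "begins with" rule described above; the real selection happens on the platform side, and the function name and bucket contents are made up for illustration.

```python
def select_by_prefixes(object_names: list[str], prefixes: list[str]) -> list[str]:
    """Return the object names that start with at least one prefix."""
    # str.startswith accepts a tuple and checks each candidate prefix.
    return [name for name in object_names if name.startswith(tuple(prefixes))]


# Illustrative object names.
bucket = [
    "raw/2024/cam1/0001.png",
    "raw/2023/cam1/0001.png",
    "exports/report.csv",
]
selected = select_by_prefixes(bucket, ["raw/2024/", "exports/"])
```

Ending a prefix with "/" avoids accidentally matching sibling folders that share the same leading characters.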


launch_processing

launch_processing(
   processing: Processing, data: Union[list[Data], MultiData],
   parameters: dict = None, cpu: int = None, gpu: int = None
)

Description

Launch the given processing on this datalake. You can give specific cpu, gpu or parameters.
If not given, the default values specified in the Processing are used.
If the processing cannot be launched on a datalake, an error is raised before launching.

Examples

processing = client.get_processing("data auto tagging")
data = datalake.list_data(limit=10)
datalake.launch_processing(processing, data)

Returns

A Job object
