Datalake - Philosophy and infrastructure

A Datalake is a space shared by all the members of an Organization that gathers all the images (called Data) used in your Computer Vision projects.

The main purpose of the Datalake feature is to make all your Data available for visualization, structuring, and exploration.

First of all, it is important to note that an Organization can have several Datalakes.

Each Datalake is connected through a Storage Connector to a bucket on an Object Storage (hosted by a Cloud provider) where the Data visualized on the Datalake is physically stored.

When a new Picsellia Organization is created, a new dedicated bucket is created on the Picsellia Object Storage (hosted by AWS). The Datalake of the freshly created Organization is called default and is connected to this bucket, so all Data uploaded to this Datalake will be physically stored on the Picsellia Object Storage.

However, you can also decide to create a new Datalake for your Organization and connect it to your own bucket hosted by your Cloud provider. To do so, please refer to this tutorial.
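The relationship between an Organization, its Datalakes, Storage Connectors, and buckets can be pictured with a small data model. The sketch below is purely illustrative and is not the Picsellia SDK: every class, function, and field name (`StorageConnector`, `Datalake`, `Organization`, `new_organization`, `hosted_by_picsellia`) and every connector or bucket name is a hypothetical placeholder chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StorageConnector:
    """Hypothetical model: links a Datalake to one bucket on an Object Storage."""
    name: str
    bucket: str
    hosted_by_picsellia: bool


@dataclass(frozen=True)
class Datalake:
    """Hypothetical model: a Datalake sees the Data stored behind its connector."""
    name: str
    connector: StorageConnector


@dataclass
class Organization:
    """Hypothetical model: an Organization can hold several Datalakes."""
    name: str
    datalakes: List[Datalake] = field(default_factory=list)

    def add_datalake(self, datalake: Datalake) -> None:
        self.datalakes.append(datalake)


def new_organization(name: str) -> Organization:
    """Mimic Organization creation: a dedicated Picsellia-hosted bucket
    plus a Datalake called "default" connected to it."""
    connector = StorageConnector(
        name="picsellia-connector",   # placeholder name, not a real connector
        bucket=f"{name}-bucket",      # placeholder name, not a real bucket
        hosted_by_picsellia=True,
    )
    return Organization(name=name, datalakes=[Datalake("default", connector)])


if __name__ == "__main__":
    org = new_organization("acme")
    # Add a second Datalake connected to a bucket you host yourself.
    own = StorageConnector(name="acme-connector", bucket="acme-images",
                           hosted_by_picsellia=False)
    org.add_datalake(Datalake("production", own))
    for datalake in org.datalakes:
        print(datalake.name, "->", datalake.connector.bucket)
```

In this model, each Datalake points to exactly one Storage Connector, and the connector is what determines where the Data is physically stored.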

You can easily switch from one Datalake to another using the navigation bar, as shown below:

Multi-Datalake navigation

Every machine learning project begins with data, and in Computer Vision, that data is images.

There are two ways to upload your Data using Picsellia:

  1. Import Data already stored on your own Cloud Object Storage into Picsellia's Datalake and access it through Picsellia.
  2. Upload locally stored Data directly to Picsellia. In this scenario, your Data will be physically stored by Picsellia on the bucket linked to the current Datalake.

Please note that, depending on the Datalake you are using, either only one or both of these methods are available.

If you are using the default Datalake, which is created when a new Organization is initialized and is connected to the bucket created for you on the Picsellia Object Storage, you will only be able to upload Data from your local drive. The uploaded Data will be physically stored on Picsellia's Object Storage and visualized in the default Datalake through the Storage Connector created by default for your Organization, named hinokuni-storage-production.

default Datalake created for each Organization

If you create a new Datalake by following this tutorial, it will use a new Storage Connector linked to your own bucket, and you will be able to either:

  • Upload Data from a local drive as explained here. In this case, the uploaded Data will be physically stored by Picsellia on your bucket.
  • Import Data already stored on your bucket as explained here. In this case, you will be able to visualize and explore Data already physically stored on your bucket through your Picsellia Datalake.
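The availability rule described above can be summarized as a small decision function. This is a minimal sketch for illustration only, not part of the Picsellia SDK: the `UploadMethod` enum, the `available_methods` function, and the `uses_own_bucket` parameter are hypothetical names introduced for this example.

```python
from enum import Enum
from typing import Set


class UploadMethod(Enum):
    LOCAL_UPLOAD = "upload Data from a local drive"
    BUCKET_IMPORT = "import Data already stored on your own bucket"


def available_methods(uses_own_bucket: bool) -> Set[UploadMethod]:
    """Return the upload methods offered by a Datalake (hypothetical helper).

    uses_own_bucket is False for the default Datalake, whose bucket is
    hosted on the Picsellia Object Storage, and True for a Datalake whose
    Storage Connector is linked to your own bucket.
    """
    # Uploading from a local drive is available on every Datalake.
    methods = {UploadMethod.LOCAL_UPLOAD}
    # Importing existing Data requires a connector to your own bucket.
    if uses_own_bucket:
        methods.add(UploadMethod.BUCKET_IMPORT)
    return methods


if __name__ == "__main__":
    for label, own in (("default Datalake", False), ("Datalake on your own bucket", True)):
        names = sorted(method.name for method in available_methods(own))
        print(f"{label}: {names}")
```

In both cases, a local upload ends up physically stored on the bucket linked to the current Datalake; only the owner of that bucket differs.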

All users with Admin rights in a given Organization can access the Organization Settings, in particular the Storages and Datalakes tab. From this tab, you can manage the existing Datalakes and Storage Connectors. More details are available in this tutorial.

Storages and Datalakes tab