Croissant and Deep Lake: A match made in heaven

  • Croissant brings ML data exchange to the next level
  • it comes with rich metadata (think responsible AI), it is machine-readable by default, and it is fully web-compliant because it follows schema.org syntax, so atomic datapoints can be searched straight from the web
  • custom Croissant files are easy to build, which makes it a breeze for data scientists to meet new, more demanding data exchange requirements (it can also be used in enterprise setups, i.e. on in-house data assets that are never meant to be shared publicly but need to move across the data silos that exist in big companies)
  • it works with a wide set of data repositories (Kaggle, Hugging Face, OpenML, Dataverse) and can serve data to all the common ML frameworks (TensorFlow, PyTorch, Keras, JAX)
  • it also integrates with Apache Beam, which allows building scalable compute plans that dedicated cloud computing services like GCP Dataflow can execute, running data preparation on hundreds of workers in parallel
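
To make the "easy to build" point concrete, here is a minimal sketch of a Croissant-style description assembled as a plain Python dictionary. The dataset name, URL, and column names are hypothetical placeholders, the `@context` is abbreviated (a real file carries the full Croissant context), and nothing is validated here, so treat it as an illustration of the overall shape, not a complete file.

```python
import json


def build_croissant_stub(name, csv_url, columns):
    """Return a stripped-down Croissant-style JSON-LD document as a dict.

    Assumption: one CSV file, one record set, every column typed as text.
    """
    file_id = "data.csv"
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"default/{col}",
            "name": col,
            "dataType": "sc:Text",
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": col},
            },
        }
        for col in columns
    ]
    return {
        # Abbreviated context: real Croissant files list many more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": name,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "default",
                "name": "default",
                "field": fields,
            }
        ],
    }


stub = build_croissant_stub(
    name="toy_chest_xray_labels",              # hypothetical dataset name
    csv_url="https://example.org/labels.csv",  # hypothetical URL
    columns=["patient_id", "finding"],         # hypothetical columns
)
croissant_json = json.dumps(stub, indent=2)
```

The resulting JSON string is what would be saved next to the data; rich metadata such as licence, description, or responsible-AI fields would be added as further top-level keys.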

But what comes after the exchange of data?

The receiving team needs to do exploratory data analysis and to integrate the data into a scalable data landscape that can serve new consumers like GenAI applications, or simply into scalable cloud-native data storage solutions that connect gracefully to serverless model training solutions (e.g. Vertex AI, AWS SageMaker, Google Colab [PLEASE ADD MORE]).

Let’s do an example

  • load one or more very large health data assets that are already pre-curated on one of the data repositories
  • mix and match multiple assets
  • do exploratory data analysis
    • in medical imaging this means looking at pixels/voxels side-by-side with descriptive metadata (vanilla cloud services are usually bad at batch-visualising medical images)
  • filter according to inclusion and exclusion criteria
    • save the result as a separate data version (but always keep the raw data)
  • build a data pre-processing pipeline to batch process everything (e.g. background subtraction, cropping, brightness enhancement)
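
The batch pre-processing step could look roughly like the sketch below. It uses NumPy on synthetic images; the percentile-based background estimate and the min-max intensity stretch are simple stand-ins chosen for illustration (the outline does not prescribe specific algorithms), and all function names are hypothetical.

```python
import numpy as np


def subtract_background(img, percentile=5.0):
    """Subtract a global background level estimated as a low percentile."""
    background = np.percentile(img, percentile)
    return np.clip(img - background, 0.0, None)


def center_crop(img, size):
    """Crop a (size x size) window around the image centre."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


def stretch_brightness(img):
    """Rescale intensities linearly to the range [0, 1]."""
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def preprocess_batch(images, crop_size):
    """Apply the three steps to every image and stack the results."""
    out = []
    for img in images:
        img = img.astype(np.float32)
        img = subtract_background(img)
        img = center_crop(img, crop_size)
        img = stretch_brightness(img)
        out.append(img)
    return np.stack(out)


# Synthetic 12-bit images standing in for real scans.
rng = np.random.default_rng(0)
raw = [rng.integers(0, 4096, size=(64, 64)) for _ in range(4)]
batch = preprocess_batch(raw, crop_size=32)
```

Because every step is a pure function of its input, the same raw data always yields the same processed batch, which fits the "always keep the raw data" rule above.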

OR

  • build a dataloader that does the pre-processing at runtime (watch for I/O bottlenecks), followed by e.g. image augmentation, shuffling etc. (but in a reproducible way; every training run in a high-stakes domain such as healthcare needs to be 100% reproducible)
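
One way to keep runtime shuffling and augmentation reproducible is to derive every random decision from a seed that depends only on a fixed base seed and the epoch number. The sketch below uses only the standard library; the function name, the seed-mixing constant, and the flip augmentation are illustrative assumptions, not part of Croissant or Deep Lake.

```python
import random


def reproducible_batches(sample_ids, batch_size, epoch, base_seed=1234):
    """Yield batches of (sample_id, flip) pairs in a deterministic order.

    The RNG is seeded from (base_seed, epoch), so each epoch gets its own
    shuffle and augmentation draws, yet a rerun reproduces them exactly.
    """
    rng = random.Random(base_seed * 100_003 + epoch)
    order = list(sample_ids)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        # One augmentation decision per sample, e.g. a horizontal flip.
        yield [(sid, rng.random() < 0.5) for sid in chunk]


ids = list(range(10))
run_a = list(reproducible_batches(ids, batch_size=4, epoch=0))
run_b = list(reproducible_batches(ids, batch_size=4, epoch=0))
```

Two runs with the same base seed and epoch produce identical batches, while passing a different `epoch` reseeds the generator. A real loader would apply the same idea to its worker seeds.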
# ADD CODE HOW TO GET A PRE-BUILT CROISSANT FILE
# LOAD THE DATA AND PUT IT IN A DEEP LAKE OBJECT
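
A possible starting point for this step is sketched below. It assumes the `mlcroissant` package and the Deep Lake 3.x Python API (`deeplake.empty`, `create_tensor`, `append`); Deep Lake 4.x uses a different API. The record-set name, field names, and paths are hypothetical, and only text fields are handled: image fields would need an image tensor with a compression setting, which is left out to keep the sketch short.

```python
def normalise_record(record, record_set):
    """Strip the 'record_set/' prefix from keys and decode bytes to str.

    Depending on the mlcroissant version, keys may or may not carry the
    record-set prefix and text values may arrive as bytes.
    """
    prefix = record_set + "/"
    clean = {}
    for key, value in record.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        clean[name] = value.decode("utf-8") if isinstance(value, bytes) else value
    return clean


def croissant_to_deeplake(jsonld_url, record_set, text_fields, lake_path, limit=None):
    """Stream a Croissant record set into a new Deep Lake dataset.

    Assumptions: mlcroissant and Deep Lake 3.x are installed, and every
    field listed in `text_fields` can be stored as text.
    """
    # Imported lazily so normalise_record stays usable without the libraries.
    import itertools

    import deeplake  # Deep Lake 3.x API
    import mlcroissant as mlc

    source = mlc.Dataset(jsonld=jsonld_url)
    records = itertools.islice(source.records(record_set=record_set), limit)

    ds = deeplake.empty(lake_path, overwrite=True)
    with ds:
        for field in text_fields:
            ds.create_tensor(field, htype="text")
        for record in records:
            row = normalise_record(record, record_set)
            ds.append({field: str(row[field]) for field in text_fields})
    return ds


# The helper can be checked without any download (hypothetical record):
example = normalise_record({"default/finding": b"pneumonia", "default/age": 63}, "default")
```

A call would then look like `croissant_to_deeplake(url, "default", ["finding"], "./lake")`, where `url` points to the Croissant JSON-LD file published by the data repository.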

AFTER THE LOADING, LET’S TAKE A LOOK AT THE DATA (w/o installing additional viewers)

  • show how deep lake can visualise images

FILTER FOR A CERTAIN USE CASE

  • show TQL capabilities (even over a union of multiple deep lake objects)
  • put the subset on a separate “branch” (read-only)
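
A rough sketch of the filtering step, assuming the Deep Lake 3.x API where `ds.query()` runs a TQL string and `save_view()` persists the result. Note the substitution: Deep Lake 3.x keeps query results as saved views, so the sketch uses a saved view as a stand-in for the read-only "branch" mentioned above. The column names in the conditions are hypothetical.

```python
def build_tql(conditions):
    """Join filter conditions into a single TQL query string."""
    if not conditions:
        return "select *"
    return "select * where " + " and ".join(f"({c})" for c in conditions)


def save_cohort(ds, conditions, view_id):
    """Run the TQL filter and persist the result as a saved view.

    Assumption: `ds` is a Deep Lake 3.x dataset without uncommitted changes.
    """
    query = build_tql(conditions)
    view = ds.query(query)
    view.save_view(id=view_id, message=query)
    return view


# Inclusion and exclusion criteria expressed as TQL conditions
# (hypothetical column names).
query = build_tql(["age >= 18", "modality != 'CT'"])
```

Storing the query text as the view message keeps the inclusion and exclusion criteria attached to the subset they produced.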

TRAIN SOME NICE TOY EXAMPLE

# TRAINING CODE GOES HERE
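
Until the real training code is in place, a framework-free toy can stand in. The sketch below trains a logistic regression with plain NumPy on synthetic two-class data; the data, hyperparameters, and function names are illustrative assumptions. In the full example the batches would come from the Deep Lake dataset instead (Deep Lake 3.x offers `ds.pytorch()` for that).

```python
import numpy as np


def make_toy_data(n_per_class, seed):
    """Two well-separated 2-D Gaussian blobs standing in for real features."""
    rng = np.random.default_rng(seed)
    x0 = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2))
    x = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return x, y


def train_logreg(x, y, epochs=200, lr=0.1):
    """Full-batch gradient descent on the logistic loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(x, w, b):
    """Return hard 0/1 predictions."""
    return (x @ w + b > 0).astype(int)


x, y = make_toy_data(n_per_class=100, seed=0)
w, b = train_logreg(x, y)
accuracy = float((predict(x, w, b) == y).mean())
```

With a fixed seed and full-batch updates the run is deterministic, so repeating it gives the same weights, which is the reproducibility property asked for earlier.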

EVALUATE THE MODEL AND OVERLAY THE RESULTS WITH CROISSANT METADATA TO MAKE SENSE OF MODEL BLIND SPOTS

# CODE FOR SOME QUICK AND DIRTY BLIND SPOT ANALYSIS GOES HERE
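
A quick and dirty version of this analysis is to join each prediction with one metadata field and compare accuracy per group. The sketch below uses only the standard library; the record layout, the `scanner` field, and the 0.1 margin are hypothetical choices for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group value: (accuracy, sample count)} for one metadata field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        hits[group] += int(rec["prediction"] == rec["label"])
    return {g: (hits[g] / totals[g], totals[g]) for g in totals}


def blind_spots(records, group_key, margin=0.1):
    """List groups whose accuracy trails the overall accuracy by more than `margin`."""
    overall = sum(r["prediction"] == r["label"] for r in records) / len(records)
    per_group = accuracy_by_group(records, group_key)
    return sorted(g for g, (acc, _) in per_group.items() if acc < overall - margin)


# Hypothetical evaluation records: model output joined with one metadata field.
eval_records = [
    {"label": 1, "prediction": 1, "scanner": "A"},
    {"label": 0, "prediction": 0, "scanner": "A"},
    {"label": 1, "prediction": 1, "scanner": "A"},
    {"label": 1, "prediction": 0, "scanner": "B"},
    {"label": 0, "prediction": 1, "scanner": "B"},
    {"label": 0, "prediction": 0, "scanner": "B"},
]
per_scanner = accuracy_by_group(eval_records, "scanner")
weak_groups = blind_spots(eval_records, "scanner")
```

In this made-up example, scanner B is flagged because its accuracy falls well below the overall figure. With real data the sample count per group matters as well, since small groups give noisy estimates.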