awslabs/project-lakechain

Project Lakechain

Cloud-native, AI-powered, document processing pipelines on AWS.


🔖 Features

  • 🤖 Composable — Composable API to express document processing pipelines using middlewares.
  • ☁️ Scalable — Scales out of the box. Process millions of documents, and scale to zero automatically when done.
  • Cost Efficient — Uses cost-optimized architectures to reduce costs and drive a pay-as-you-go model.
  • 🚀 Ready to use - 60+ built-in middlewares for common document processing tasks, ready to be deployed.
  • 🦎 GPU and CPU Support — Use the right compute type to balance between performance and cost.
  • 📦 Bring Your Own — Create your own transform middlewares to process documents and extend Lakechain.
  • 📙 Ready Made Examples - Quickstart your journey by leveraging 50+ examples we've built for you.

🚀 Getting Started

👉 Head to our documentation, which contains all the information required to understand the project and quickly start building!

What's Lakechain ❓

Project Lakechain is an experimental framework based on the AWS Cloud Development Kit (CDK) that makes it easy to express and deploy scalable document processing pipelines on AWS using infrastructure-as-code. It emphasizes pipeline modularity, and provides 40+ ready-to-use components for prototyping complex document pipelines that can scale out of the box to millions of documents.

This project has been designed to help AWS customers build and scale different types of document processing pipelines, covering a wide array of use cases including metadata extraction, document conversion, NLP analysis, text summarization, translation, audio transcription, computer vision, Retrieval Augmented Generation pipelines, and much more!
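To make the idea of composing a pipeline out of middlewares more concrete, here is a minimal, self-contained sketch in plain TypeScript. It is not the Lakechain API: `Doc`, `TransformFn`, `Middleware`, `pipe`, and `handle` are hypothetical names invented for this illustration, nothing is deployed to AWS, and the "transcription" step only rewrites the document descriptor. It simply shows how small, single-purpose steps can be chained into a pipeline.

```typescript
// Hypothetical types for illustration only; these are NOT part of Lakechain.
interface Doc {
  url: string;
  mimeType: string;
  metadata: Record<string, string>;
}

// A transform returns a new document, or null to drop the document.
type TransformFn = (doc: Doc) => Doc | null;

class Middleware {
  private readonly name: string;
  private readonly transform: TransformFn;
  private readonly next: Middleware[] = [];

  constructor(name: string, transform: TransformFn) {
    this.name = name;
    this.transform = transform;
  }

  // Connect a downstream middleware and return it, so calls can be chained.
  pipe(target: Middleware): Middleware {
    this.next.push(target);
    return target;
  }

  // Apply this step, then forward the result to every downstream middleware.
  // Returns the documents that reached the end of the chain.
  handle(doc: Doc): Doc[] {
    const out = this.transform(doc);
    if (out === null) {
      return [];
    }
    if (this.next.length === 0) {
      return [out];
    }
    const results: Doc[] = [];
    for (const middleware of this.next) {
      for (const result of middleware.handle(out)) {
        results.push(result);
      }
    }
    return results;
  }
}

// Three toy steps: an entry point, a filter, and a stand-in "transcriber".
const trigger = new Middleware("trigger", (doc) => doc);
const audioOnly = new Middleware("audio-filter", (doc) =>
  doc.mimeType.indexOf("audio/") === 0 ? doc : null
);
const transcribe = new Middleware("transcribe", (doc) => ({
  url: doc.url.replace(/\.[^.]+$/, ".vtt"),
  mimeType: "text/vtt",
  metadata: { ...doc.metadata, source: doc.url },
}));

// Compose the pipeline by chaining the steps.
trigger.pipe(audioOnly).pipe(transcribe);

const results = trigger.handle({
  url: "s3://example-bucket/talk.mp3",
  mimeType: "audio/mpeg",
  metadata: {},
});
const skipped = trigger.handle({
  url: "s3://example-bucket/photo.png",
  mimeType: "image/png",
  metadata: {},
});
```

In this sketch, the audio document flows through all three steps, while the image is dropped by the filter. In the actual framework, each step is deployed as AWS infrastructure through the CDK; the documentation describes the real constructs and their options.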

Show me the code ❗

👇 Below is an example of a pipeline that deploys the AWS infrastructure to automatically transcribe audio files uploaded to S3, in just a few lines of code. The pipeline scales to millions of documents.





LICENSE

See LICENSE.