OmniLake

OmniLake is a Python/AWS Framework that enables the development of enterprise-grade AI applications with built-in data lineage and traceability. It provides a comprehensive solution for managing unstructured information while addressing common AI adoption challenges.

Content Support Note: OmniLake currently only supports text-based storage/retrieval/processing, support for storing and indexing image-based types is on the roadmap but not prioritized at this time.

Key Features

Built-in data lineage tracking for AI outputs
Scalable from proof-of-concept to enterprise deployment
Standardized data management with full control
Rapid deployment capabilities with minimal initial setup
Cost-effective scaling with pay-as-you-go model
Semantic search and retrieval
Customizable for specific business needs

Core Benefits

Reduces AI implementation complexity
Ensures traceability of AI-generated content
Enables quick proof-of-concept development
Provides enterprise-grade data management
Maintains control over data

Key Concepts

Definitions

Archives: Provide the system ability to retrieve data for use during a Lake Request. Can be standard "Index" type storage like the Basic and Vector built-ins, or they can be direct read-only bridges to other systems such as CRMs, Wikis, or even general web page retrieval.
Construct: Lake constructs are the re-usable components that enable the system to lookup data, process the data, and provide responses. Archives, Processors, and Responders are all examples of OmniLake constructs.
Entries: Individual pieces of content within archives
Jobs: Management of asynchronous processing tasks
Processors: These constructs support intaking one or more entries from the system, processing them in a specific way, and then providing an entry for the final response.
Responders: The final stage of a Lake Request, the responder is responsible for formulating (or not in the case of Direct) a final response using the processed results.
Sources: Tracking of data provenance
Source Types: The declared type of source, defining all of the attributes required for a source. A source type must be declared before sources of that type can be created.

Technology

AWS Services: DynamoDB, S3, EventBridge, Lambda
Vector Storage: LanceDB
AI/ML: Amazon Bedrock for embeddings and language model inference
Infrastructure: AWS CDK for deployment

Getting Started

To utilize Omnilake effectively, you should include it as a dependency for your own application. However, the framework will deploy and create all the necessary services to start using a base Omnilake deployment.

Prerequisites

Python 3.12 or higher
Poetry (Python package manager)
Da Vinci Framework
AWS CLI configured with appropriate credentials
AWS Account (Max managed policies limit must be bumped to 20)

Installation

Clone the repository:

git clone https://github.com/your-repo/omnilake.git
cd omnilake

Install dependencies using Poetry:

poetry install

Set up the development environment:

./dev.sh

This script sets up necessary environment variables and prepares your local development environment.

Usage

Here's a basic example of how to use the OmniLake client library to interact with the system:

from omnilake.client.client import OmniLake
from omnilake.client.request_definitions import (
    AddEntry,
    AddSource,
    BasicInformationRetrievalRequest,
    CreateSourceType,
    CreateArchive,
    InformationRequest,
)

# Initialize the OmniLake client
omnilake = OmniLake()

# Create a new archive
archive_req = CreateArchive(
    archive_id='my_archive',
    description='My first OmniLake archive'
)
omnilake.create_archive(archive_req)

source_type = CreateSourceType(
    name='webpage',
    description='Content that belongs to a web page',
    required_fields=['url', 'published_date'],
)
omnilake.create_source_type(source_type)

source = AddSource(
    source_type='webpage',
    source_arguments={
        'url': 'https://example.com/about',
        'published_date': '2024-24-12',
    }
)
source_result = omnilake.add_source(source)

source_rn = source_result.response_body['resource_name']

# Add an entry to the archive
entry_req = AddEntry(
    archive_id='my_archive',
    content='This is a sample entry in my OmniLake archive.',
    sources=[source_rn],
    original_source=source_rn # Indicates whether the content is original content of the source location
)
result = omnilake.add_entry(entry_req)

print(f"Entry added with ID: {result.response_body['entry_id']}")

# Request information
info_req = InformationRequest(
    goal='Summarize the contents of the archive',
    retrieval_requests=[
        BasicInformationRetrievalRequest(
            archive_id='my_archive',
            max_entries=10,
        )
    ]
)
response = omnilake.request_information(info_req)

print(f"Information request submitted. Job ID: {response.response_body['job_id']}")

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
assets		assets
omnilake		omnilake
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.py		app.py
cdk.context.json		cdk.context.json
cdk.json		cdk.json
dev.sh		dev.sh
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OmniLake

Key Features

Core Benefits

Key Concepts

Definitions

Technology

Getting Started

Prerequisites

Installation

Usage

About

Releases

Packages

Contributors 2

Languages

License

caylent/omnilake

Folders and files

Latest commit

History

Repository files navigation

OmniLake

Key Features

Core Benefits

Key Concepts

Definitions

Technology

Getting Started

Prerequisites

Installation

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages