Random Stuff from Week of Jan 27 2025
Jan 27 2025
I took my video processor and worked on the next obvious optimizations:
- Async transcription while extracting all the video frames
- Increased the number of simultaneous requests to OpenAI for processing video frames + transcript chunks to 100 (I'm tier 3, and apparently that means my rate limit is 5k requests per minute, so this is way below the limit)
- At 100 simultaneous requests, using my 2x2 grids of four 512x512 frames (sampled at 1 FPS), and assuming each request includes 10 grid images (40 seconds of video + transcript), I'm able to process 4,000 seconds of video at a time. That's a little over 1 hour of video (quick arithmetic below).
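To double-check that math, the constants below just restate the assumptions above (100 concurrent requests, 10 grid images per request, 4 frames per grid, 1 frame per second):
# Back-of-the-envelope check of how much video one batch of requests covers
CONCURRENT_REQUESTS = 100  # simultaneous requests to OpenAI
GRIDS_PER_REQUEST = 10     # 2x2 grid images per request
FRAMES_PER_GRID = 4        # each grid holds four 512x512 frames
SECONDS_PER_FRAME = 1      # frames sampled at 1 FPS

seconds_per_request = GRIDS_PER_REQUEST * FRAMES_PER_GRID * SECONDS_PER_FRAME  # 40
seconds_per_batch = CONCURRENT_REQUESTS * seconds_per_request                  # 4000
print(seconds_per_batch / 60)  # ~66.7 minutes of video per batch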
The async transcription using AssemblyAI's Python SDK looks like this:
import assemblyai as aai

# Determine if there's audio in the video (helper defined elsewhere in my project)
has_audio = does_video_have_audio(video_path_file)
transcriber = aai.Transcriber()
transcript_future = None
if has_audio:
    print("Extracting audio from video")
    # Extract audio from the video (only if there's audio)
    extract_audio_from_video(video_path_file, audio_path_file)
    # Transcribe audio and get back sentences
    print("Transcribing audio asynchronously")
    transcript_future = transcriber.transcribe_async(audio_path_file)
else:
    print("No audio found in video")

sentences = []
video_details = get_video_details(video_path_file)
# Extract images from the video (this usually takes longer than the transcription)
image_files = process_video(video_path_file, images_path_dir, processing_fps, video_frame_size=video_frame_size, combine_video_frames=combine_video_frames)
# Wait for the transcription to finish
if has_audio and transcript_future:
    transcript = transcript_future.result()
    print("Transcription complete")
    sentences = transcript.get_sentences()
That worked! I was able to process a 26-minute video in just under 2 minutes for ~$1.30, including transcription costs, input tokens, and the output tokens for the notes the AI took for me.
I initially tried using asyncio to run these two pieces concurrently (extract + transcribe the audio, and extract the video frames), but both involve blocking file I/O and writing to disk, which asyncio doesn't parallelize on its own. But this approach works just fine: we kick off the async transcription job, extract the frames (which usually takes longer than generating a transcript), then check the Future object to see if it's done processing.
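For what it's worth, if I ever do want both blocking jobs running truly concurrently, the usual workaround (as I understand it) is to push them onto worker threads rather than relying on asyncio alone. A rough sketch, where extract_frames and transcribe_audio are hypothetical stand-ins for my actual helpers:
from concurrent.futures import ThreadPoolExecutor

# Run both blocking jobs on worker threads and wait for both results.
# extract_frames and transcribe_audio are stand-ins, not my real function names.
with ThreadPoolExecutor(max_workers=2) as pool:
    frames_future = pool.submit(extract_frames, video_path_file, images_path_dir)
    transcript_future = pool.submit(transcribe_audio, audio_path_file)
    image_files = frames_future.result()
    transcript = transcript_future.result()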
Jan 28 2025
Can't remember at what point this week I started on this new idea, but it happened. I realized I was basically building an ingestion engine for videos for retrieval use cases. I was also reading an AI video LLM paper and wanted to ask an AI some questions about it; in fact, there were several papers I wanted to chat about.
So I decided to see if I could build a simple database of text + embeddings, then use an LLM with RAG to help me chat about these papers. I spent quite a bit of time looking over potential solutions, some paid and some free and open source, and ultimately decided to go with an open source one.
I found Unstructured and their docs and eventually decided to use their ingestion engine. It can take files (mine were going to be PDFs) and handle partitioning, cleaning, and chunking the data into a local JSON file. But I wanted to put the chunks somewhere I could easily query, so at first I tried doing that manually with SQLite and sqlite-vec.
I did something like this in a previous app using Supabase and Vercel's ai-sdk. The use case was simple: the user wants to save some part of their conversation for future reference, the AI takes that text, generates vector embeddings (which basically turn the text into numbers that encode semantic meaning), and stores it in Supabase (after enabling pgvector). So you have the plaintext stored with its vector in the same row.
Then you give your LLM the ability to use tools (call functions), and if the user asks something the LLM doesn't know, it calls this "knowledge retrieval" tool with some details (expressed as a string) about what context it needs, generates a vector embedding from that string, then does a hybrid search in the database using both the plain text AND the vector embeddings.
Likely one of those will yield some kind of result, the LLM gets the results back from the database as plaintext, and now it magically has context.
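For the retrieval half, the rough shape of that "knowledge retrieval" tool looked something like the sketch below. The original app was TypeScript with the ai-sdk, so this is a hedged Python translation; the memories table, the conn connection, and the embed() helper are all hypothetical names, not the real code:
# Hedged sketch of the "knowledge retrieval" tool's hybrid search.
# Assumes a Postgres table memories(content text, embedding vector) with pgvector
# enabled, an open psycopg connection `conn`, and an embed() helper that returns
# a list of floats (e.g. from text-embedding-3-small).
def retrieve_knowledge(query: str, limit: int = 5) -> list[str]:
    query_vector = "[" + ",".join(str(x) for x in embed(query)) + "]"
    with conn.cursor() as cur:
        # Plain-text match
        cur.execute(
            "SELECT content FROM memories WHERE content ILIKE %s LIMIT %s",
            (f"%{query}%", limit),
        )
        text_hits = [row[0] for row in cur.fetchall()]
        # Vector similarity match (pgvector's <=> operator is cosine distance)
        cur.execute(
            "SELECT content FROM memories ORDER BY embedding <=> %s::vector LIMIT %s",
            (query_vector, limit),
        )
        vector_hits = [row[0] for row in cur.fetchall()]
    # De-duplicate while keeping order; this plaintext goes back to the LLM as tool output
    return list(dict.fromkeys(text_hits + vector_hits))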
So I wanted to repeat this, except that I just needed the LLM to retrieve context, not store any. I then came across ChromaDB, which was exactly what I needed. So I used Unstructured's ingestion pipeline to parse a PDF file into chunks in a pretty smart way, and then those chunks get stored in ChromaDB.
Below is the code for the ingestion engine.
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.connectors.local import (LocalIndexerConfig, LocalDownloaderConfig, LocalConnectionConfig, LocalUploaderConfig)
from unstructured_ingest.v2.processes.connectors.chroma import ChromaUploadStagerConfig, ChromaConnectionConfig, ChromaUploaderConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.embedder import EmbedderConfig
from chromadb import PersistentClient

CHROMA_DB_PATH = "./chroma_db"

if __name__ == "__main__":
    # Make sure the Chroma collection exists before the pipeline tries to upload to it
    chroma_client = PersistentClient(path=CHROMA_DB_PATH)
    chroma_client.get_or_create_collection("video-llm-new")

    Pipeline.from_configs(
        context=ProcessorConfig(
            disable_parallelism=False,
        ),
        # Read local PDFs from ./docs
        indexer_config=LocalIndexerConfig(input_path="./docs"),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Partition with the hi_res strategy (slower, but better layout detection)
        partitioner_config=PartitionerConfig(
            strategy="hi_res",
        ),
        # Chunk by title, max 1000 characters per chunk with a small overlap
        chunker_config=ChunkerConfig(
            chunking_strategy="by_title",
            chunk_max_characters=1000,
            chunk_overlap=20
        ),
        # Generate embeddings with OpenAI's text-embedding-3-small
        embedder_config=EmbedderConfig(
            embedding_provider="openai",
            embedding_model_name="text-embedding-3-small",
            embedding_api_key=os.getenv("OPENAI_API_KEY"),
        ),
        # Upload the chunks + embeddings into the local ChromaDB collection
        stager_config=ChromaUploadStagerConfig(),
        destination_connection_config=ChromaConnectionConfig(
            path=CHROMA_DB_PATH
        ),
        # uploader_config=LocalUploaderConfig(output_dir="./ingest-output")
        uploader_config=ChromaUploaderConfig(collection_name="video-llm-new")
    ).run()
And here is some simple test code to query the DB with plain text (under the hood, Chroma embeds the query string with the configured embedding function and does a vector similarity search):
import chromadb
import os
import chromadb.utils.embedding_functions as embedding_functions

# Use the same embedding function that was used at ingestion time
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("video-llm-new", embedding_function=openai_ef)

result = collection.query(
    query_texts=["Long Short-Term Memory"],
    n_results=5,
)
print(result)
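The result comes back as a dict of parallel lists (one inner list per query text), so pulling out the top chunks and their distances looks something like this:
# Each key ("documents", "distances", ...) maps to one list per query text
for doc, distance in zip(result["documents"][0], result["distances"][0]):
    print(f"{distance:.4f}  {doc[:80]}")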
I spent quite a bit of time before I came up with this simple process, but doing RAG well is tough. Imagine building a system that needs to:
- Ingest many different file types
- Break the file down logically into chunks that make sense (it's not ideal to store the entire file as one single vector embedding)
  - And I think there are smarter systems that even extract and store things like tables and images
- Clean those chunks to remove potentially garbage data or weird formatting that could cause problems later
- Then generate embeddings for each chunk
- Then store the embeddings + text in a database
- Then write the interface that lets you intelligently search that database for relevant context (sketched below)
Not an easy task to do really well. The hardest part is really the ingestion and chunking of many different file types.
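For that last step, my version of the interface is basically a thin wrapper around the Chroma query above; a minimal sketch (the function name and the formatting are my own, not from any library):
def retrieve_context(question: str, n_results: int = 5) -> str:
    """Query the ChromaDB collection and format the top chunks for the LLM prompt."""
    result = collection.query(query_texts=[question], n_results=n_results)
    chunks = result["documents"][0]
    # Join the chunks into one context block that gets stuffed into the prompt
    return "\n\n---\n\n".join(chunks)

context = retrieve_context("What does the paper say about Long Short-Term Memory?")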
Things to know about vector embeddings
One thing I noticed about vector embeddings is that they really do encode semantic meaning. I ran a test to see if I could pull relevant information out of a paper I had processed, searching for a section that talked about something called "Long Short-Term Memory", or LSTM.
I found that if I generated embeddings for "LSTM" and queried ChromaDB for relevant pieces, it found the section. But if I generated embeddings for "lstm" (lowercase), it didn't return the same section of the paper and almost seemed not to know what I was searching for.
But I guess that's why it's good to do a hybrid search. Just something interesting I noticed. And it probably changes from one embedding API to another. I've been using OpenAI's text-embedding-3-small because it's super cheap and will probably work well enough for what I'm doing. But how one embedding API interprets "Apple" vs "apple" might differ from another.
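A quick way to see this for yourself is to embed both strings and compare them directly; a small sketch using the OpenAI Python SDK (cosine similarity computed by hand, and the actual numbers will vary by model):
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Compare how close the two casings land in embedding space
print(cosine_similarity(embed("LSTM"), embed("lstm")))
print(cosine_similarity(embed("LSTM"), embed("Long Short-Term Memory")))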
Jan 29 2025 and Jan 30 2025
Started building the web app with HTML, CSS, and JS. Got the basics working: simple styling and structure, streaming responses, message history, etc.
Continued building the app: wired up LiteLLM and streaming tool calls to search our ChromaDB storage for context. Streaming tool calls is kind of a pain, but it makes sense now after going through it.
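The gist of why it's a pain: the tool call arrives as a stream of argument fragments that have to be accumulated before the tool can actually run. A rough sketch of what I mean, using LiteLLM's OpenAI-style streaming format (the tool definition and model name here are placeholders, not my actual app code):
import json
import litellm

# Placeholder tool definition for the ChromaDB search; names are illustrative
tools = [{
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the ChromaDB collection for relevant context",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = litellm.completion(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What does the paper say about LSTMs?"}],
    tools=tools,
    stream=True,
)

# Tool call arguments arrive as fragments across chunks, so accumulate them by index
tool_calls = {}
for chunk in response:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    for tc in delta.tool_calls or []:
        entry = tool_calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function.name:
            entry["name"] = tc.function.name
        if tc.function.arguments:
            entry["arguments"] += tc.function.arguments

# Once the stream ends, the accumulated JSON string can be parsed and the tool run
for call in tool_calls.values():
    args = json.loads(call["arguments"])
    print("\nTool call:", call["name"], args)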
I'll document this more over the weekend.