Data Loaders
These data loaders ingest data into MongoDB's vector store using specific embedding models. They are designed to support the construction of generative AI applications.
Loaders with their usage
The following data loaders can currently be used with the MAAP Framework. The usage block for each loader shows the values that the config.yaml file requires to work with that loader. Data can be ingested from multiple sources by providing the details of each.
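As a sketch of a multi-source setup, the following config.yaml fragment combines the PDF and Web loaders described below under a single ingest list. The path and URL are placeholder values chosen for illustration, not values taken from the framework.

```yaml
ingest:
  # First source: a local PDF file (placeholder path)
  - source: pdf
    source_path: '/path/to/document.pdf'
    chunk_size: 1000
    chunk_overlap: 100
  # Second source: a web page (placeholder URL)
  - source: web
    source_path: 'https://example.com/docs'
    chunk_size: 1000
    chunk_overlap: 100
```

Each list entry carries its own chunk_size and chunk_overlap, so the settings can differ per source if needed.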
1. Confluence Loader
Used to load and ingest content directly from Confluence spaces by specifying the necessary credentials and configuration.
Usage:
ingest:
  - source: confluence
    space_names:
    confluence_base_url:
    confluence_username:
    confluence_token:
    chunk_size: 1000
    chunk_overlap: 100
2. Docx Loader
Utilized for extracting and processing content from Microsoft Word documents.
Usage:
ingest:
  - source: docx
    source_path:
    chunk_size: 1000
    chunk_overlap: 100
3. PDF Loader
Designed for loading and extracting text from PDF files.
Usage:
ingest:
  - source: pdf
    source_path:
    chunk_size: 1000
    chunk_overlap: 100
4. PPT Loader
Facilitates the extraction of content from PowerPoint presentations for further processing.
Usage:
ingest:
  - source: ppt
    source_path:
    chunk_size: 1000
    chunk_overlap: 100
5. Sitemap Loader
Used for loading and processing sitemap files, typically for SEO purposes and navigation structure embedding.
Usage:
ingest:
  - source: sitemap
    source_path:
    chunk_size: 1000
    chunk_overlap: 100
6. Web Loader
Extracts and processes content from web pages or HTML files.
Usage:
ingest:
  - source: web
    source_path:
    chunk_size: 1000
    chunk_overlap: 100
7. YouTube Channel Loader
Ingests content from a specified YouTube channel by channel ID, suitable for processing large amounts of video data.
Usage:
ingest:
  - source: youtube-channel
    channel_id:
    chunk_size: 1000
    chunk_overlap: 100
8. YouTube Loader
Extracts content from individual YouTube videos using their video ID or URL.
Usage:
ingest:
  - source: youtube
    video_id_or_url: <video_id_or_url>
    chunk_size: 1000
    chunk_overlap: 100
9. YouTube Search Loader
Facilitates content extraction from the results of YouTube searches based on specified queries.
Usage:
ingest:
  - source: youtube-search
    query: <query>
    chunk_size: 1000
    chunk_overlap: 100
10. Folder MIME type Loader
Loads and processes content from a specified folder on the local file system. It automatically detects and processes the supported file types: PDF, PPTX, DOCX, and TXT.
The file_type filter is optional. If it is not provided, the loader processes every file of a supported type in the folder.
Usage:
ingest:
  - source: 'folder'
    source_path: '/path/to/folder'
    file_type: <'pdf' | 'txt' | 'pptx' | 'docx'>
    chunk_size: 1000
    chunk_overlap: 100
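For illustration, a folder entry that omits the optional file_type filter might look like the fragment below; per the description above, the loader would then process every supported file in the folder. The folder path is a placeholder.

```yaml
ingest:
  - source: 'folder'
    source_path: '/path/to/folder'
    # file_type omitted: all supported file types in the folder are processed
    chunk_size: 1000
    chunk_overlap: 100
```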
11. LlamaIndex Loader
Loads and processes content from either a single file or from all files within a specified folder. It uses LlamaParser for the loading, so it supports a wide array of document types.
The 'parsingInstructions' parameter lets you give the loader natural-language instructions about what it is loading and how to load it. For instance, you can tell the loader to summarize or format the document(s) being processed.
The 'folderProcessing' parameter tells the loader whether source_path points to a folder. It defaults to false.
Usage:
ingest:
  - source: 'llama-index-loader'
    source_path: <'path/to/file' | 'path/to/folder'>
    chunk_size: 2000
    chunk_overlap: 200
    parsingInstructions: ""
    folderProcessing: <true | false>
    language: <"af" | "az" | "bs" | "cs" | "cy" | "da" | "de" | "en" | "es" | "et" | "fr" | "ga" | "hr" | "hu" | "id" | "is" | "it" | "ku" | "la" | "lt" | "lv" | "mi" | "ms" | "mt" | "nl" | "no" | "oc" | "pi" | "pl" | "pt" | "ro" | "rs_latin" | "sk" | "sl" | "sq" | "sv" | "sw" | "tl" | "tr" | "uz" | "vi" | "ar" | "fa" | "ug" | "ur" | "bn" | "as" | "mni" | "ru" | "rs_cyrillic" | "be" | "bg" | "uk" | "mn" | "abq" | "ady" | "kbd" | "ava" | "dar" | "inh" | "che" | "lbe" | "lez" | "tab" | "tjk" | "hi" | "mr" | "ne" | "bh" | "mai" | "ang" | "bho" | "mah" | "sck" | "new" | "gom" | "sa" | "bgc" | "th" | "ch_sim" | "ch_tra" | "ja" | "ko" | "ta" | "te" | "kn">
    resultType: <"text" | "markdown">
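A filled-in entry for processing a whole folder could look like the following sketch. The folder path and the instruction text are illustrative assumptions; the parameter names and allowed values are those listed in the usage above.

```yaml
ingest:
  - source: 'llama-index-loader'
    source_path: 'path/to/folder'
    chunk_size: 2000
    chunk_overlap: 200
    # Illustrative instruction text; replace with guidance suited to your documents
    parsingInstructions: "These are product manuals. Summarize each section."
    # Set to true because source_path points to a folder
    folderProcessing: true
    language: "en"
    resultType: "markdown"
```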