

Set the following environment variables:

- PERSIST_DIRECTORY: the folder you want your vectorstore in
- MODEL_PATH: path to your GPT4All or LlamaCpp supported LLM
- MODEL_N_CTX: maximum token limit for the LLM model
- MODEL_N_BATCH: number of tokens in the prompt that are fed into the model at a time. The optimal value differs a lot depending on the model (8 works well for GPT4All, and 1024 is better for LlamaCpp)
- EMBEDDINGS_MODEL_NAME: SentenceTransformers embeddings model name (see the SentenceTransformers documentation)
- TARGET_SOURCE_CHUNKS: the number of chunks (sources) that will be used to answer a question

A hypothetical environment-file sketch using these variables appears at the end of this section.

This repo uses a state of the union transcript as an example.

Instructions for ingesting your own dataset

Put any and all of your files into the source_documents directory, then run the ingest script to ingest all the data (a hedged command sketch follows below). The output will look like this:

```
Using embedded DuckDB with persistence: data will be stored in: db
Loaded 1 new documents from source_documents
Ingestion complete! You can now run privateGPT.py to query your documents
```

Ingestion will create a db folder containing the local vectorstore and takes 20-30 seconds per document, depending on the size of the document. You can ingest as many documents as you want; all of them will be accumulated in the local embeddings database. If you want to start from an empty database, delete the db folder.

Note: during the ingest process, no data leaves your local environment.

Note: because of the way langchain loads the SentenceTransformers embeddings, the first time you run the script it will require an internet connection to download the embeddings model itself.
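A minimal sketch of the ingest-and-query workflow. The script name ingest.py is an assumption based on the usual privateGPT layout (this section names only privateGPT.py); adjust it to match your checkout:

```shell
# Assumption: the ingest script is named ingest.py
python ingest.py

# After "Ingestion complete!" appears, query your documents:
python privateGPT.py
```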
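And a hypothetical environment file tying together the variables described above. Every value here is an illustrative placeholder, not a recommended setting; in particular the model path and embeddings model name are assumptions you should replace with your own:

```shell
# All values are illustrative placeholders, adjust for your setup
PERSIST_DIRECTORY=db                      # folder for the local vectorstore (matches the output above)
MODEL_PATH=models/your-model.bin          # hypothetical path to a GPT4All or LlamaCpp model
MODEL_N_CTX=1000                          # maximum token limit for the LLM
MODEL_N_BATCH=8                           # prompt tokens fed in at a time (8 suits GPT4All)
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2    # a SentenceTransformers model name
TARGET_SOURCE_CHUNKS=4                    # chunks (sources) used to answer a question
```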
