ChatGPT has certainly changed the way we search for information. However, it is a cloud-based service that has no access to your private data.
There are options (usually paid) that let you upload your own data to the cloud and query it through a ChatGPT-like interface. Because your data must be uploaded, it is still processed in the cloud, which can be a deal breaker for some people.
Fortunately, there are free tools that let you ingest your own data and query it through a ChatGPT-like interface. One such tool is PrivateGPT.
PrivateGPT is a tool that lets you run a large language model (LLM) locally and query your own documents with it. LLMs are powerful AI models that can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
There are many reasons why you might want to use privateGPT. For example, you might want to use it to:
  • Generate text that is tailored to your specific needs
  • Translate languages more accurately
  • Write creative content that is more original
  • Answer your questions in a more informative way
PrivateGPT gives you these benefits:
  • Privacy: Your data stays on your own machine, so you don't have to worry about it being shared with anyone else.
  • Control: You have full control over the ingestion process, so you decide exactly which documents the model can draw on.
  • Cost: Running LLMs in the cloud can be expensive and resource-hungry. PrivateGPT sidesteps this by running a pre-trained model locally against your own data.
privateGPT requires Python 3.

Download and Install

You can find PrivateGPT on GitHub at this URL: https://github.com/imartinez/privateGPT.git
The project documentation covers installation and usage, but I will provide the steps specifically for a macOS system. The process should be very similar on Linux.
  • Clone the git repo locally
git clone https://github.com/imartinez/privateGPT.git
  • Install the required modules
cd privateGPT
pip3 install -r requirements.txt
  • Download the LLM model and place it in a directory of your choice. In the case below, I'm putting it into the models directory.
mkdir models
cd models
## download ggml-gpt4all-j-v1.3-groovy.bin into this directory (the download link is published by the GPT4All project), then:
cd ..
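Optionally, instead of downloading the model manually, you can script the download. The sketch below uses only the Python standard library; the URL is a placeholder, so substitute the current download link published by the GPT4All project, and adjust the filename if you use a different model.
import urllib.request

# Placeholder URL -- replace with the actual GPT4All download link for this model.
MODEL_URL = "https://example.com/ggml-gpt4all-j-v1.3-groovy.bin"
urllib.request.urlretrieve(MODEL_URL, "models/ggml-gpt4all-j-v1.3-groovy.bin")
print("Model saved to models/ggml-gpt4all-j-v1.3-groovy.bin")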
  • Rename example.env to .env and edit the variables appropriately.
mv example.env .env
vi .env ## change the settings below as appropriate
PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
Note: because of the way langchain loads the SentenceTransformers embeddings, the first time you run the script it will require an internet connection to download the embeddings model itself.
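For reference, here is a minimal sketch of how settings like these are typically read at runtime, assuming the python-dotenv package is available (check requirements.txt); the variable names and defaults match the example .env above.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
model_type = os.environ.get("MODEL_TYPE", "GPT4All")
model_path = os.environ.get("MODEL_PATH", "models/ggml-gpt4all-j-v1.3-groovy.bin")
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-MiniLM-L6-v2")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))

print(f"Using {model_type} model at {model_path}")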

Instructions for ingesting your own dataset

Put any and all of your files into the source_documents directory.
The supported extensions are:
  • .csv: CSV
  • .docx: Word Document
  • .doc: Word Document
  • .enex: EverNote
  • .eml: Email
  • .epub: EPub
  • .html: HTML File
  • .md: Markdown
  • .msg: Outlook Message
  • .odt: Open Document Text
  • .pdf: Portable Document Format (PDF)
  • .pptx: PowerPoint Document
  • .ppt: PowerPoint Document
  • .txt: Text file (UTF-8)
Run this command to ingest the documents
python ingest.py
It will create a db folder containing the local vectorstore, which will take 20–30 seconds per document, depending on the size of the document.
You can ingest as many documents as you want, and all will be accumulated in the local embeddings database.
If you want to start from an empty database, delete the db folder.
Note: During the ingest process, no data leaves your local environment. You can ingest without an internet connection, except for the first time you run the ingest script when the embeddings model is downloaded.
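To give a sense of what happens during ingestion, here is a simplified sketch of the flow described above (not the actual ingest.py source): load documents, split them into chunks, embed the chunks with SentenceTransformers, and persist a local Chroma vectorstore. The chunk sizes are illustrative, and only .txt files are handled here for brevity; the real script dispatches on file extension to the appropriate loader for each supported format.
import os
from dotenv import load_dotenv
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

load_dotenv()
persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-MiniLM-L6-v2")

# Load every .txt file from source_documents (only text files, for brevity).
documents = []
for name in os.listdir("source_documents"):
    if name.endswith(".txt"):
        documents.extend(TextLoader(os.path.join("source_documents", name), encoding="utf8").load())

# Split into overlapping chunks so each one fits comfortably in the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed the chunks with SentenceTransformers and persist the vectorstore locally.
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
db = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)
db.persist()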

Ask questions to your documents, locally!

Run this command to start privateGPT to begin querying your data:
python privateGPT.py
[Screenshot: a sample privateGPT query session]
In the sample session above, I used PrivateGPT to query some documents I had loaded for a test. My objective was to retrieve information from those documents. However, as shown in #3 above, PrivateGPT did not remember the previous question I had asked about MacGPT.
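Under the hood, querying works roughly along these lines. The sketch below is a simplified approximation, not the actual privateGPT.py source: it reloads the persisted vectorstore, wraps it as a retriever, and answers each question with a local GPT4All model through a langchain RetrievalQA chain.
import os
from dotenv import load_dotenv
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

load_dotenv()
persist_directory = os.environ.get("PERSIST_DIRECTORY", "db")
model_path = os.environ.get("MODEL_PATH", "models/ggml-gpt4all-j-v1.3-groovy.bin")
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME", "all-MiniLM-L6-v2")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))

# Reload the local vectorstore created by ingest.py and expose it as a retriever.
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
retriever = db.as_retriever()

# Local GPT4All model answers questions using the retrieved document chunks.
llm = GPT4All(model=model_path, n_ctx=model_n_ctx, verbose=False)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

while True:
    query = input("\nEnter a query: ")
    if query in ("exit", "quit"):
        break
    print(qa.run(query))
Because each question is answered independently by the chain, there is no memory of earlier prompts, which matches the behavior noted in the observations below.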

Observation

  • Data ingestion is fast, but querying that data is much slower.
  • privateGPT is much slower than ChatGPT on my M1 MacBook Pro.
  • The output is not as good as ChatGPT's, and answers can be cut off.
  • It doesn't have a memory of previous chat prompts.

Final Thoughts

Although privateGPT is not as fast or capable as ChatGPT, it still works reasonably well. It is particularly useful if you need to keep your data private. The software is incredibly user-friendly and can be set up and running in just a matter of minutes.