ChatGPT has indeed changed the way we search for information. However, it is a cloud-based platform that does not have access to your private data.
There are options (usually paid) that let you upload your own data to the cloud and query it through a ChatGPT-like interface. Because your data has to be uploaded, it is still processed in the cloud, which could be a deal breaker for some people.
Fortunately, there are free tools that let you ingest your own data and query it through a ChatGPT-like interface. One such tool is PrivateGPT.
PrivateGPT is a tool that lets you use large language models (LLMs) on your own data, entirely on your own machine. LLMs are powerful AI models that can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way.
There are many reasons why you might want to use privateGPT. For example, you might want to use it to:
- Generate text that is tailored to your specific needs
- Translate languages more accurately
- Write creative content that is more original
- Answer your questions in a more informative way
PrivateGPT gives you these benefits:
- Privacy: PrivateGPT lets you use an LLM on your own data without that data ever leaving your machine or being shared with others.
- Control: PrivateGPT gives you full control over what is ingested, so you decide exactly which data the LLM can draw on.
- Cost: Training or hosting LLMs in the cloud can be expensive and requires a lot of computing resources. PrivateGPT sidesteps this by running a pre-trained model locally on hardware you already have.
privateGPT requires Python 3.
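If you are not sure which version you have installed, you can check before starting:
python3 --version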
Download and Install
You can find PrivateGPT on GitHub at this URL: https://github.com/imartinez/privateGPT.git
There is documentation available that provides the steps for installing and using privateGPT, but I will provide the steps specifically for a macOS system. The process should be very similar for Linux.
- Clone the git repo locally
git clone https://github.com/imartinez/privateGPT.git
- Install the required modules
cd privateGPT
pip3 install -r requirements.txt
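Optionally (my own preference rather than a privateGPT requirement), you can keep the dependencies isolated in a virtual environment before running the install:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt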
- Download the LLM model and place it in a directory of your choice. In the case below, I’m putting it into the models directory.
mkdir models
cd models
## download ggml-gpt4all-j-v1.3-groovy.bin into this directory (see the privateGPT README for the download link)
cd ..
- Rename example.env to .env and edit the variables appropriately.
mv example.env .env
vi .env
## change the settings below as appropriate
PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000
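For context, privateGPT picks these settings up as environment variables when its scripts run. The snippet below is only an illustrative sketch of how settings like these are typically read with python-dotenv; it is not the project's actual code:
from dotenv import load_dotenv
import os

load_dotenv()  # reads the .env file in the current directory
persist_directory = os.environ.get("PERSIST_DIRECTORY")
model_type = os.environ.get("MODEL_TYPE")
model_path = os.environ.get("MODEL_PATH")
embeddings_model_name = os.environ.get("EMBEDDINGS_MODEL_NAME")
model_n_ctx = int(os.environ.get("MODEL_N_CTX", 1000))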
Note: because of the way langchain loads the SentenceTransformers embeddings, the first time you run the script it will require an internet connection to download the embeddings model itself.
Instructions for ingesting your own dataset
Put any and all of your files into the source_documents directory; an example copy command is shown after the list below.
The supported extensions are:
- .csv: CSV
- .docx: Word Document
- .doc: Word Document
- .enex: EverNote
- .eml: Email
- .epub: EPub
- .html: HTML File
- .md: Markdown
- .msg: Outlook Message
- .odt: Open Document Text
- .pdf: Portable Document Format (PDF)
- .pptx: PowerPoint Document
- .ppt: PowerPoint Document
- .txt: Text file (UTF-8)
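For example, assuming your documents live in ~/Documents/notes (an illustrative path, substitute your own), you could copy the PDFs in with:
cp ~/Documents/notes/*.pdf source_documents/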
Run this command to ingest the documents:
python ingest.py
It will create a db folder containing the local vectorstore. Ingestion takes roughly 20–30 seconds per document, depending on its size.
You can ingest as many documents as you want, and all will be accumulated in the local embeddings database.
If you want to start from an empty database, delete the db folder.
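On macOS or Linux that is simply:
rm -rf db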
Note: During the ingest process, no data leaves your local environment. You can ingest without an internet connection, except for the first time you run the ingest script when the embeddings model is downloaded.
Ask questions to your documents, locally!
Run this command to start privateGPT and begin querying your data:
python privateGPT.py
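For the curious: under the hood, privateGPT wires these pieces together with langchain. The sketch below is a simplified, from-memory illustration of that kind of pipeline (a local Chroma vectorstore, the GPT4All model, and a RetrievalQA chain); it is not the project's actual script, which also handles things like streaming output and source attribution:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

# values matching the .env settings above
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)
llm = GPT4All(model="models/ggml-gpt4all-j-v1.3-groovy.bin", n_ctx=1000)

# "stuff" simply packs the retrieved chunks into the prompt
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())
print(qa.run("What does my document say about MacGPT?"))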
In my sample session, I used PrivateGPT to query a few documents I had loaded for a test. My objective was to retrieve information from them. However, as the third exchange showed, PrivateGPT did not remember the previous question I had asked about MacGPT.
Observations
- Data ingestion is fast, but querying that data is much slower
- privateGPT is much slower than ChatGPT on my M1 MacBook Pro.
- The output is not as good as ChatGPT's, and answers can be cut off
- It doesn’t have a memory of previous chat prompts
Final Thoughts
Although privateGPT is not as fast or capable as ChatGPT, it still works reasonably well. It is particularly useful if you need to keep your data private. The software is very user-friendly and can be up and running in just a matter of minutes.