AI Q&A System

GitHub: https://github.com/liuchang0826NW/mktg-case-qa-system

I. Introduction:

The purpose of this project is to develop an Artificial Intelligence (AI) Question & Answer (Q&A) system, powered by ChatGPT, LangChain, Pinecone, and Gradio, for the marketing cases from Kellogg School of Management. The goal is to make retrieval of case-specific information easier and more efficient, thereby aiding students, researchers, and business professionals in their learning and decision-making.

II. Methodology:

The system was designed and developed as a four-stage pipeline: text extraction and splitting, text embedding, vector storage, and question-time retrieval.

A. Text Extraction and Preprocessing:

The project's first step involved using LangChain, an open-source library, to read PDF files of the cases and split them into multiple text pieces. LangChain was chosen for its capacity to handle diverse formats and its effectiveness in extracting useful text from complex documents.
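The splitting step can be illustrated with a minimal sketch. It is a plain-Python stand-in, not the LangChain splitter the project uses: the function name `split_text` and the `chunk_size` and `overlap` values are assumptions chosen for illustration. It cuts the text into fixed-size character windows that overlap, so a sentence falling on a boundary still appears whole in at least one piece.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that are only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


pieces = split_text("abcdefghij", chunk_size=4, overlap=1)
print(pieces)
```

Larger pieces carry more context per retrieval hit, while smaller pieces make matches more precise; the right balance depends on the cases and is not stated in the source.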

B. Text Embedding:

The next step in the pipeline was to use OpenAI's Embedding API to create embeddings for each piece of text.
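To make the contract of this step concrete (text in, fixed-length numeric vector out), here is a toy sketch. It is not OpenAI's Embedding API and does not capture meaning the way a learned embedding does: `toy_embed` and its `dim` parameter are hypothetical, and the hashing trick is swapped in only so the example runs without network access or API keys.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a unit-length vector by hashing tokens into buckets.

    Illustrative stand-in only; the project calls OpenAI's Embedding API.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim  # stable bucket per token
        vec[index] += 1.0

    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize so vectors are comparable; an empty text stays all zeros.
    return [v / norm for v in vec] if norm else vec


vector = toy_embed("brand positioning for a new product", dim=16)
print(len(vector))
```

Whatever model produces the vectors, every text piece and every question must be embedded with the same model so that their vectors live in the same space.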

C. Vector Storage:

After the embeddings were created, they were stored in Pinecone, a vector database. Pinecone was selected for its scalability, performance, and ability to handle high-dimensional data, which together support efficient storage and retrieval of text embeddings.
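The shape of a stored record can be sketched with a small in-memory stand-in. This is not the Pinecone client: the class `InMemoryVectorStore`, the record ID format, and the metadata keys are assumptions for illustration. It mirrors two ideas that carry over to a vector database: every vector must match the index dimension, and writing an existing ID overwrites the old record (an "upsert").

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for a vector index; keeps records in a dict."""
    dimension: int
    records: dict[str, tuple[list[float], dict]] = field(default_factory=dict)

    def upsert(self, record_id: str, vector: list[float], metadata: dict) -> None:
        """Insert a record, or overwrite the one already stored under record_id."""
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )
        self.records[record_id] = (list(vector), dict(metadata))

    def fetch(self, record_id: str) -> tuple[list[float], dict]:
        """Return the (vector, metadata) pair stored under record_id."""
        return self.records[record_id]


store = InMemoryVectorStore(dimension=3)
# Keeping the original text in the metadata lets retrieval return readable passages.
store.upsert("case1-chunk0", [0.1, 0.2, 0.3], {"source": "case1.pdf", "text": "..."})
print(store.fetch("case1-chunk0")[1]["source"])
```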

D. Question Processing and Text Retrieval:

Once the cases were stored in Pinecone, the system was programmed to take a user's question, compute its embedding, and then find the most closely related text pieces by calculating the cosine similarity between the question's embedding and the stored text embeddings. This approach enabled the system to deliver contextually relevant responses to the user's queries.
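The ranking described above can be sketched in a few lines. In the project this search runs inside Pinecone; the version below is a brute-force illustration, and the names `cosine_similarity` and `top_k` as well as the sample vectors are assumptions. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return cos(theta) = (a . b) / (|a| * |b|), or 0.0 if either vector is zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best matches."""
    scored = [(piece_id, cosine_similarity(query, vec)) for piece_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for piece_id, score in top_k([1.0, 0.0], stored, k=2):
    print(piece_id, round(score, 3))
```

Scoring every stored vector costs time proportional to the number of pieces, which is one reason to delegate the search to a vector database as the collection grows.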

III. Demo:

The system successfully extracted, embedded, and retrieved relevant text from the Kellogg School of Management marketing cases. Using OpenAI's Embedding API for embedding and Pinecone for storage allowed the system to handle a large volume of data, deliver accurate responses, and scale as needed.

See the video below:

IV. Future Improvements:

While the system has already shown promising results, there are several avenues for improvement and expansion.

Fine-tuning the GPT model: To improve the system's performance, the GPT model could be fine-tuned on the specific text of the marketing cases to better understand their unique context and terminology.

Data Granularity and Cleaning: Before the embedding step, it would be beneficial to refine the granularity of the text pieces so that each one carries a meaningful, contextually coherent unit of information. A thorough data cleaning process should also be applied to remove irrelevant or erroneous text, thereby improving the quality of the text embeddings.
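As a rough idea of what such a cleaning pass might look like, here is a minimal sketch. The specific rules (rejoining words hyphenated across line breaks, dropping lines that contain only a page number, discarding very short or duplicate pieces) and the names `clean_chunk`, `clean_chunks`, and `min_chars` are assumptions about typical PDF extraction noise, not steps taken from the project.

```python
import re


def clean_chunk(text: str) -> str:
    """Normalize one text piece extracted from a PDF (hypothetical rules)."""
    # Rejoin words split by a hyphen at a line break. Caveat: this also merges
    # genuinely hyphenated words that happen to break across lines.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop lines that contain nothing but digits, such as bare page numbers.
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    # Collapse all runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def clean_chunks(chunks: list[str], min_chars: int = 20) -> list[str]:
    """Clean every piece, then drop very short pieces and exact duplicates."""
    seen = set()
    cleaned = []
    for chunk in chunks:
        text = clean_chunk(chunk)
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


print(clean_chunk("mar-\nketing   plan\n12\nnext steps"))
```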