Effective Data Chunking and Querying with Pinecone and GPT-4o
After releasing the first part of this article, I kept running different queries against the vector database. I was unhappy with the results, so I wanted to optimize the data ingestion into Pinecone a bit (or at least try to).
Improving the data quality
First, I tried to strip image links, redundant new lines, separators, and similar noise from the markdown so that the files passed to Pinecone are more readable.
So, I created a small function with regular expressions (thanks, ChatGPT!) to preprocess the markdown extracted by Firecrawl before passing it to Pinecone.
def clean_markdown(md_text):
    """Cleans Markdown text by removing images and dividers."""
    import re
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)  # Remove ...
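The snippet above is cut off after its first substitution, so the rest of the author's function is not visible here. As a rough illustration of the cleanup described (image links, separators, extra new lines), here is a minimal sketch. Only the image pattern comes from the article; the function name `clean_markdown_sketch`, the divider pattern, and the blank-line handling are assumptions of mine, not the author's code.

```python
import re


def clean_markdown_sketch(md_text: str) -> str:
    """Hypothetical cleaner: strips images, divider lines, and extra blank lines."""
    # Remove inline images such as ![alt](url); same pattern as the article's snippet.
    text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    # Assumption: a divider is a line made only of 3+ identical '-', '*', or '_'
    # characters. Note this also removes setext heading underlines like '---'.
    text = re.sub(
        r"^[ \t]*([-*_])(?:[ \t]*\1){2,}[ \t]*$", "", text, flags=re.MULTILINE
    )
    # Drop trailing spaces and tabs left at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of 3+ newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "# Title\n\n![logo](https://example.com/logo.png)\n\n---\n\n"
    "Some text.\n\n\n\nMore text.\n"
)
cleaned = clean_markdown_sketch(sample)
```

The order matters a little: images and dividers are removed first, because deleting them leaves empty lines behind that the final newline-collapsing step then tidies up.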
Copyright of this story solely belongs to hackernoon.com.