Effective Data Chunking and Querying with Pinecone and GPT-4o

by Pierluigi Vinciguerra July 17th, 2025

In our previous article, we saw how to scrape this newsletter with Firecrawl and transform the posts into markdown files that can be loaded into a VectorDB in Pinecone.

After publishing the first part, I kept testing the VectorDB with different queries. I was unhappy with the results, so I decided to optimize (or at least try to optimize) the data ingestion into Pinecone.

Improving the data quality

First of all, I tried to strip image links, extra newlines, separators, and other noise from the markdown so that the files passed to Pinecone are more readable.

So, I created a small function with regular expressions (thanks, ChatGPT!) to preprocess the markdown extracted by Firecrawl before passing it to Pinecone.

def clean_markdown(md_text):
    """Cleans Markdown text by removing images and dividers."""
    import re
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)  # Remove ...
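
The listing above is cut off after its first substitution. As a rough illustration of the cleaning step described in the text (images, separators, extra newlines), here is a minimal self-contained sketch. The function name `clean_markdown_sketch`, the divider pattern, and the blank-line pattern are assumptions made for illustration, not the original code; only the image-removal regex is taken from the article.

```python
import re


def clean_markdown_sketch(md_text: str) -> str:
    """Hypothetical cleaner: strips images, divider lines, and extra blank lines."""
    # Remove inline images of the form ![alt](url) (regex from the article).
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)
    # Assumption: treat lines made of three or more -, * or _ as dividers.
    md_text = re.sub(
        r"^[ \t]*([-*_])(?:[ \t]*\1){2,}[ \t]*$", "", md_text, flags=re.MULTILINE
    )
    # Assumption: collapse any run of blank lines into a single blank line.
    md_text = re.sub(r"\n\s*\n", "\n\n", md_text)
    return md_text.strip()


sample = "# Title\n\n![img](http://x/y.png)\n\n---\n\nSome text.\n\n\n\nMore text."
cleaned = clean_markdown_sketch(sample)
print(cleaned)
```

Note that the divider pattern would also remove a `---` line used as a setext heading underline, so it may need adjusting depending on how the scraped markdown formats its headings.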

Copyright of this story solely belongs to hackernoon.com.
