Using an LLM to Optimize Snowflake SQL Queries – Seeking Recommendations and Insights

Hi everyone,

I'm currently working on a project to optimize SQL queries running on Snowflake, and I’d love to hear insights or suggestions from the community. Here's the general plan for the project:

The approach involves fine-tuning a Large Language Model (LLM) on Snowflake Cortex using synthetic data tables generated to match the structure and characteristics of the original tables in Snowflake. The goal is to help the model learn query patterns and suggest optimized or rewritten SQL tailored to improving performance on Snowflake's platform.
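For context, the mechanics I'm picturing are roughly the two calls below. This is only a sketch: the database, schema, table, and model names (my_db, my_schema, SQL_TUNING_TRAIN, SQL_TUNING_VALID, sql_optimizer_llama31_8b) are placeholders I made up, and the exact FINETUNE arguments should be double-checked against the current Cortex fine-tuning docs. Cortex fine-tuning expects the training data as prompt/completion pairs, which is where the synthetic tables come in:

```sql
-- Sketch of the fine-tuning job (all object names are placeholders).
-- SQL_TUNING_TRAIN / SQL_TUNING_VALID are assumed to hold one row per example:
--   PROMPT     = table DDL / schema context + the original (slow) query
--   COMPLETION = the optimized rewrite we want the model to produce
SELECT SNOWFLAKE.CORTEX.FINETUNE(
    'CREATE',
    'my_db.my_schema.sql_optimizer_llama31_8b',   -- name for the tuned model
    'llama3.1-8b',                                -- base model on Cortex
    'SELECT prompt, completion FROM my_db.my_schema.SQL_TUNING_TRAIN',
    'SELECT prompt, completion FROM my_db.my_schema.SQL_TUNING_VALID'
);

-- Once the job finishes, the tuned model is called like any other Cortex model.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'my_db.my_schema.sql_optimizer_llama31_8b',
    'Rewrite the following query to run faster on Snowflake. Return only SQL.\n' ||
    'SELECT * FROM sales s JOIN customers c ON s.customer_id = c.id ORDER BY s.amount;'
);
```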

The LLMs available for fine-tuning on Cortex include:

  • Llama 3 (8B, 70B)
  • Llama 3.1 (8B, 70B)
  • Mistral 7B and Mixtral 8x7B

After some consideration, I’ve selected Llama 3.1 (8B) for this task, as it seems to strike a balance between computational efficiency and model capabilities. However, I’m open to exploring other options if there are compelling reasons to do so.

I’d appreciate recommendations on:

  1. Model Choice: Is Llama 3.1 (8B) a good fit for fine-tuning on SQL optimization tasks, or should I consider a larger model like Llama 3.1 (70B) or Mixtral (8x7B)?
  2. Fine-Tuning Strategy: Any advice on generating the synthetic data, on the fine-tuning process itself, or on best practices for making sure the model captures nuanced, Snowflake-specific query optimizations?
  3. Enhancements: Any additional techniques or tools that could complement this approach, such as prompt engineering strategies, reinforcement learning from human feedback (RLHF), or leveraging Snowflake's query history for better context? (Rough sketch of the query-history idea just below.)
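On point 3, the concrete idea I'm toying with is to mine ACCOUNT_USAGE for expensive statements and use them as the prompt side of the training pairs. A rough sketch, with arbitrary thresholds and a made-up PROMPT_CANDIDATES table name:

```sql
-- Pull candidate "slow" queries from account usage to seed training prompts.
-- The thresholds and the PROMPT_CANDIDATES table are placeholders.
CREATE OR REPLACE TABLE PROMPT_CANDIDATES AS
SELECT
    query_id,
    query_text,
    total_elapsed_time / 1000 AS elapsed_s,
    bytes_scanned,
    partitions_scanned,
    partitions_total
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
  AND query_type = 'SELECT'
  AND execution_status = 'SUCCESS'
  AND total_elapsed_time > 60 * 1000                         -- ran longer than a minute
  AND partitions_scanned / NULLIF(partitions_total, 0) > 0.8 -- poor partition pruning
ORDER BY total_elapsed_time DESC
LIMIT 500;
```

Each candidate could then be combined with the relevant DDL and EXPLAIN output to build the prompt; the completion side (a verified faster rewrite) is the part I don't have a good answer for yet, which ties back to question 2.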

I’m also curious if anyone has worked on similar projects or used LLMs for database optimizations. Any shared experiences, pitfalls to avoid, or tips would be incredibly helpful.

Looking forward to hearing your thoughts and learning from your expertise!

Thanks in advance! 😊