🧀 BigCheese.ai

Ask HN: How does deploying a fine-tuned model work?

A Hacker News discussion explores how to deploy a fine-tuned model such as Llama and use it in an app. Commenters discuss whether the model needs a GPU running continuously or whether it can be hosted on an ordinary web server. Suggested solutions include serverless AI platforms that handle infrastructure and GPU reservations, quantization to reduce memory use and speed up inference, and request queues that keep the GPU from being overloaded.
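
The queue management idea can be illustrated with a minimal Python sketch. It is not taken from the discussion: `fake_generate` is a hypothetical stand-in for a real model call, and the limits (one job at a time, 100 pending requests) are assumed values. The point is only that a semaphore caps how many requests reach the GPU at once, while the rest wait or are rejected.

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; sleeps instead of using a GPU."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


class InferenceQueue:
    """Caps concurrent inference jobs and rejects work when too much is pending."""

    def __init__(self, max_concurrent: int = 1, max_pending: int = 100):
        # Assumed limits: one job on the GPU at a time, 100 requests in flight.
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_pending = max_pending
        self._pending = 0      # requests waiting or running
        self._running = 0      # requests currently holding a GPU slot
        self.peak_running = 0  # highest concurrency observed so far

    async def submit(self, prompt: str) -> str:
        if self._pending >= self._max_pending:
            raise RuntimeError("queue full, try again later")
        self._pending += 1
        try:
            async with self._slots:  # wait here until a GPU slot is free
                self._running += 1
                self.peak_running = max(self.peak_running, self._running)
                try:
                    return await fake_generate(prompt)
                finally:
                    self._running -= 1
        finally:
            self._pending -= 1


async def main():
    # Create the queue inside the running event loop.
    queue = InferenceQueue(max_concurrent=1)
    results = await asyncio.gather(*(queue.submit(f"q{i}") for i in range(5)))
    return queue, results


if __name__ == "__main__":
    queue, results = asyncio.run(main())
    print(results, "peak concurrency:", queue.peak_running)
```

In a real app the same pattern would sit between the web server and whatever actually runs the model, so that a burst of requests waits in line rather than exhausting GPU memory.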

  • Serverless AI platforms like Replicate can help deploy models.
  • Cost-efficient GPU options include on-demand and reserved instances.
  • CPU inference is possible but slower than GPU inference.
  • Cloud services like AWS offer GPU resources for compute tasks.
  • A local gaming rig can serve as a low-cost alternative to cloud GPUs.
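
Whether a model fits on a cloud GPU or a gaming rig depends largely on how much memory its weights need, which is where quantization comes in. The Python sketch below is a rough back-of-the-envelope estimate, not a figure from the discussion: the 7-billion-parameter count is an assumed example, and the calculation covers weights only, ignoring the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # assumed example size, not a value from the discussion
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is often the difference between needing a data-center GPU and fitting on a consumer card.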