AsiaTokenFund

Perplexity AI Leverages NVIDIA Inference Stack to Handle 435 Million Monthly Queries

By Terrill Dicki | December 6, 2024 | Blockchain

Perplexity AI utilizes NVIDIA’s inference stack, including H100 Tensor Core GPUs and Triton Inference Server, to manage over 435 million search queries monthly, optimizing performance and reducing costs.





Perplexity AI, a leading AI-powered search engine, is successfully managing over 435 million search queries each month, thanks to NVIDIA’s advanced inference stack. The platform has integrated NVIDIA H100 Tensor Core GPUs, Triton Inference Server, and TensorRT-LLM to efficiently deploy large language models (LLMs), according to NVIDIA’s official blog.

Serving Multiple AI Models

To meet diverse user demands, Perplexity AI operates over 20 AI models simultaneously, including variations of the open-source Llama 3.1 models. Each user request is matched with the most suitable model by smaller classifier models that determine user intent. These models are deployed across GPU pods, each managed by an NVIDIA Triton Inference Server, ensuring efficiency under strict service-level agreements (SLAs).
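The routing idea described above can be sketched as follows. This is a hypothetical illustration, not Perplexity's actual code: the pool names, latency targets, and keyword heuristic standing in for the classifier model are all invented for this example.

```python
# Hypothetical sketch of intent-based model routing: a lightweight
# classifier labels the request, and the request is dispatched to the
# model pool serving that intent. All names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class ModelPool:
    name: str            # e.g. a Llama 3.1 variant
    max_latency_ms: int  # assumed SLA target for this pool

POOLS = {
    "simple_qa": ModelPool("llama-3.1-8b", 300),
    "reasoning": ModelPool("llama-3.1-70b", 800),
    "long_form": ModelPool("llama-3.1-405b", 2000),
}

def classify_intent(query: str) -> str:
    """Stand-in for the small classifier model that determines user intent."""
    if len(query.split()) > 30:
        return "long_form"
    if any(w in query.lower() for w in ("why", "explain", "compare")):
        return "reasoning"
    return "simple_qa"

def route(query: str) -> ModelPool:
    """Match the request with the pool deemed most suitable."""
    return POOLS[classify_intent(query)]

print(route("What is the capital of France?").name)  # llama-3.1-8b
```

In a production system the keyword heuristic would be replaced by an actual small classifier model, but the dispatch structure is the same.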

The pods are hosted within a Kubernetes cluster, featuring an in-house front-end scheduler that directs traffic based on load and usage. This ensures consistent SLA adherence, optimizing performance and resource utilization.
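Perplexity's front-end scheduler is in-house and not public, but the core of load-based routing can be sketched in a few lines. Everything here, including the pod names and in-flight counts, is an assumption made for illustration.

```python
# Illustrative least-loaded scheduler: route the next request to the
# pod reporting the fewest in-flight requests. A real scheduler would
# also weigh queue depth, model placement, and SLA budgets.
from typing import Dict

def pick_pod(in_flight: Dict[str, int]) -> str:
    """Return the pod with the lowest current load."""
    return min(in_flight, key=in_flight.get)

pods = {"pod-a": 12, "pod-b": 7, "pod-c": 9}
print(pick_pod(pods))  # pod-b
```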

Optimizing Performance and Costs

Perplexity AI employs a comprehensive A/B testing strategy to define SLAs for varied use cases. This process aims to maximize GPU utilization while maintaining target SLAs, optimizing inference serving costs. Smaller models focus on minimizing latency, while larger, user-facing models like the Llama 3.1 8B, 70B, and 405B variants undergo detailed performance analysis to balance costs and user experience.
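The tuning loop described above amounts to: measure latency and throughput at several configurations, then pick the one that maximizes GPU throughput while still meeting the latency SLA. A minimal sketch, with entirely made-up measurements:

```python
# Hedged sketch of SLA-driven configuration selection. The measured
# (batch_size, p90_latency_ms, tokens_per_sec) points below are invented;
# in practice they would come from A/B tests on real traffic.
measurements = [
    (1,  120,  450),
    (4,  210, 1500),
    (8,  380, 2600),
    (16, 720, 4100),
]

def best_config(sla_ms: float):
    """Highest-throughput configuration whose p90 latency meets the SLA."""
    ok = [m for m in measurements if m[1] <= sla_ms]
    return max(ok, key=lambda m: m[2]) if ok else None

print(best_config(500))  # (8, 380, 2600)
```

Larger batches raise utilization (and throughput) at the cost of latency, which is exactly the tension the SLA resolves.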

Performance is further enhanced by parallelizing model deployment across multiple GPUs, increasing tensor parallelism to achieve lower serving costs for latency-sensitive requests. This strategic approach has enabled Perplexity to save approximately $1 million annually by hosting models on cloud-based NVIDIA GPUs rather than relying on third-party LLM API services.
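The self-hosting economics can be illustrated with a back-of-envelope calculation. The prices, token volume, and fleet size below are assumptions chosen so the gap lands near the article's ~$1 million annual figure; they are not Perplexity's actual numbers.

```python
# Back-of-envelope comparison (made-up numbers): hosting models on cloud
# GPUs vs paying a third-party LLM API per token.
MONTHLY_TOKENS = 200e9     # assumed tokens served per month
API_PRICE_PER_1M = 1.00    # assumed $ per 1M tokens at a hosted API
GPU_HOURLY = 2.50          # assumed $ per cloud GPU hour
NUM_GPUS = 64              # assumed fleet size
HOURS_PER_MONTH = 730

api_cost = MONTHLY_TOKENS / 1e6 * API_PRICE_PER_1M   # $200,000/mo
self_cost = GPU_HOURLY * NUM_GPUS * HOURS_PER_MONTH  # $116,800/mo
annual_savings = (api_cost - self_cost) * 12
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```

The point of the exercise is the shape of the trade-off: per-token API pricing scales linearly with volume, while a fixed GPU fleet amortizes better as query volume grows.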

Innovative Techniques for Enhanced Throughput

Perplexity AI is collaborating with NVIDIA to implement disaggregated serving, a method that separates the compute-intensive prefill phase and the memory-bound decode phase of inference onto different GPUs, significantly boosting throughput while adhering to SLAs. This flexibility allows Perplexity to utilize various NVIDIA GPU products to optimize performance and cost-efficiency.
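Conceptually, disaggregated serving splits one request across two GPU pools: a prefill pool processes the whole prompt once and hands off its KV cache, and a decode pool generates tokens one at a time. The toy model below only sketches that handoff; it is not NVIDIA's or Perplexity's implementation.

```python
# Toy model of disaggregated serving: prefill and decode as separate
# stages that could run on separate GPU pools, linked by the KV cache.
def prefill(prompt: str) -> dict:
    """Prefill pool: process the full prompt in one compute-heavy pass."""
    return {"kv_cache": f"<kv for {len(prompt.split())} tokens>"}

def decode(state: dict, max_new_tokens: int) -> list:
    """Decode pool: generate tokens one at a time from the KV cache."""
    return [f"tok{i}" for i in range(max_new_tokens)]

state = prefill("Why does separating prefill and decode help throughput?")
tokens = decode(state, max_new_tokens=3)
print(tokens)  # ['tok0', 'tok1', 'tok2']
```

Because the two phases stress GPUs differently (prefill is compute-bound, decode is memory-bandwidth-bound), splitting them lets each pool run hardware and batch sizes suited to its phase.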

Further improvements are anticipated with the upcoming NVIDIA Blackwell platform, promising substantial performance gains through technological innovations, including a second-generation Transformer Engine and advanced NVLink capabilities.

Perplexity’s strategic use of NVIDIA’s inference stack underscores the potential for AI-powered platforms to manage vast query volumes efficiently, delivering high-quality user experiences while maintaining cost-effectiveness.

Image source: Shutterstock

