AI Infrastructure: Tutorial & Best Practices

Your Guide to Generative AI Infrastructure

AI infrastructure is the combination of hardware, software, networking, and system processes needed to develop, deploy, and maintain AI applications. It allows engineers and researchers to ingest large quantities of data, train and release machine learning models, and integrate AI capabilities into APIs and software products. A computational environment that supports the full machine learning cycle is just as important as the AI models themselves.

In this article, we discuss the main areas to consider in the field of AI infrastructure as well as best practices for data storage and processing, training and inference hardware, and model deployment and hosting.

Summary of key AI infrastructure concepts

| Concept | Description |
| --- | --- |
| Data storage and processing | In the era of AI, data plays a pivotal role in training models. Traditional databases (MongoDB, Cassandra, etc.) and specialized vector databases (such as Pinecone, Weaviate, and Qdrant) are crucial for storing and accessing diverse data types, facilitating tasks ranging from traditional ML to advanced semantic search and retrieval-augmented generation (RAG). Platforms like Nexla simplify the integration of external data sources, vector databases, and traditional databases, enabling the seamless construction of RAG workflows without extensive coding efforts, thus making them valuable for AI practitioners. |
| Training and inference hardware | Modern AI, especially large models, relies heavily on GPUs due to their optimized batched matrix multiplication, which enables faster processing than CPUs. While CPUs are still relevant for data processing and training smaller neural network models, GPUs become essential for larger models exceeding approximately 50 million parameters, particularly for generative models and pretraining on extensive datasets. Key GPU options include NVIDIA’s T4, A10, A100, and H100, which vary in terms of their capabilities and costs. Model inference is computationally less intensive than training, and with optimization libraries, even large language models can run on consumer-grade laptops, offering the possibility of offline use. |
| Model deployment and hosting | Deploying trained AI models for end-user utilization involves traditional solutions like Docker for containerization and scheduling within Kubeflow. Alternatively, managed services such as Amazon Bedrock and OctoML offer fully hosted solutions for large language models (LLMs) and other models of various modalities. These services handle scaling based on usage and include inference-time optimizations, making deployment more straightforward. |

Data storage and processing

A popular catchphrase says that “data is the new oil,” and nowhere is this more true than in the world of AI, where virtually every advancement over the last decade would not have been possible without the terabytes of data we have digitized. However, it’s not enough to just have data—it’s essential to be able to store it in ways that make training AI models easy.

Because so much information already lives in established data stores such as MongoDB, Cassandra, DynamoDB, S3, and BigQuery, they will almost certainly be key sources of training data for AI models. The tabular information held in these systems is immensely valuable for training traditional ML models (e.g., linear regression, random forest), while the speech, images, and text they hold are critical for powering deep learning applications.

To make this data easily accessible to your machine learning workflow, use cloud services like S3 and BigQuery, which allow you to upload or download training data with some coding. Alternatively, with a data management platform like Nexla, practitioners can integrate data from external, internal, cloud, and on-prem services and make it available in an object storage service like S3, so it can be used in low-code environments like Jupyter Notebooks.

With the advent of semantic search (matching items in a database to a query based on similar meaning rather than an exact text match), a new kind of database has become critical to the AI workflow: vector databases. Unlike traditional databases, these are designed specifically to store a particular kind of machine learning output known as a vector (an ordered list of numbers).

These vectors are notable because they encode information about an image, a piece of text, a speech waveform, etc. For example, if we provide the sentence “I have a dog” to a language model, we can extract a set of numbers after the last full layer of the model that corresponds to what the model believes the sentence means. Embedding models are trained specifically to produce vectors that capture the meaning of text, images, audio, etc. as faithfully as possible, but you can use any pre-trained deep neural network (including LLMs) to output these vectors.

Transforming text into vectors (source: Nexla)
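To make the idea of comparing vectors concrete, here is a minimal sketch in Python. The `toy_embed` function is a hypothetical stand-in for a real embedding model: it only hashes words into buckets, so it captures word overlap rather than meaning. The names `toy_embed` and `cosine_similarity` and the vector size of 256 are assumptions made for illustration.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size (an assumption); real models use hundreds or thousands of dimensions


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Map text to a unit-length vector by hashing each word into a bucket.

    Hypothetical stand-in for an embedding model: it reflects word overlap
    only, but the resulting vectors are compared the same way real ones are.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of two vectors; equals cosine similarity when both are unit length."""
    return sum(x * y for x, y in zip(a, b))


query = toy_embed("I have a dog")
print(cosine_similarity(query, toy_embed("I have a dog and a cat")))
print(cosine_similarity(query, toy_embed("Quarterly revenue grew")))
```

With a real embedding model in place of `toy_embed`, the comparison step stays the same: the closer two vectors point in the same direction, the more similar the model considers the inputs.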

In the world of LLMs, vector databases have become crucial to power a new use case called retrieval-augmented generation (RAG). The goal of RAG is to provide an LLM with external knowledge it did not see in training to help guide it to a correct answer. For example, if we want GPT-4 to answer questions about a company’s internal employee FAQ, we would have to provide the model with a document containing this information. However, we don’t always know in advance which external knowledge source to provide, and that is where semantic search and vector databases come in. Because embedding models capture meaning, we can match a user query with a document that discusses similar concepts and add that document to the LLM’s prompt to help it answer the question. These documents are stored as vectors in vector databases, which makes a fast, efficient, and reliable vector database essential for RAG applications.

Transforming text into vectors and storing in a vector database (source: Nexla)
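The retrieval step of RAG can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: `toy_embed` is a hypothetical hashed bag-of-words stand-in for an embedding model, `InMemoryVectorStore` stands in for a real vector database, and the FAQ entries are made up for the example.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for an embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryVectorStore:
    """Minimal stand-in for a vector database: keeps (vector, document) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, document: str) -> None:
        self._items.append((toy_embed(document), document))

    def search(self, query: str, top_k: int = 1) -> list[str]:
        """Return the top_k documents whose vectors are most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: sum(x * y for x, y in zip(q, item[0])),
            reverse=True,
        )
        return [doc for _, doc in ranked[:top_k]]


def build_rag_prompt(question: str, store: InMemoryVectorStore, top_k: int = 1) -> str:
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(store.search(question, top_k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up FAQ entries, used only to exercise the sketch.
store = InMemoryVectorStore()
store.add("Employees accrue 20 days of paid vacation per year.")
store.add("The office kitchen is restocked every Monday.")
store.add("Expense reports are due by the fifth of each month.")

print(build_rag_prompt("How many vacation days do employees get per year?", store))
```

The resulting prompt string is what would be sent to the LLM. A real vector database replaces the linear scan in `search` with an approximate nearest-neighbor index so that lookups stay fast across millions of documents.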

When choosing a vector database, Pinecone, Weaviate, and Qdrant are all excellent choices. However, when looking at the big picture (traditional and vector databases, connections to external data sources, data pipelines, etc.), a service like Nexla can prove valuable. Creating RAG workflows requires connecting to vector databases, foundation models, and unstructured data sources to build richer context for prompt engineering. Combined with its extensive connectivity to traditional databases, this allows AI practitioners to build the data infrastructure required to train and augment AI models without extensive data engineering effort.

Training and inference hardware

Modern AI is notorious for its substantial computational demands, especially the large models in use today. Unlike traditional software programs, these models rely predominantly on GPUs rather than CPUs because GPUs are optimized for batched matrix multiplication. Essentially, contemporary AI models function as colossal multiplication engines, and the ability to process numerous calculations simultaneously allows GPUs to run them dramatically faster than CPUs can.
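To see why this workload parallelizes so well, consider a single fully connected layer written out in plain Python. This is a sketch for illustration only; the function names `dense_layer` and `multiply_add_count` and the example sizes are assumptions, and real frameworks hand this work to optimized GPU kernels rather than Python loops.

```python
def dense_layer(inputs: list[list[float]], weights: list[list[float]]) -> list[list[float]]:
    """Multiply a batch of input rows by a weight matrix.

    inputs:  batch_size x in_features
    weights: in_features x out_features
    Each output cell is a dot product that does not depend on any other cell,
    which is what allows the work to be spread across many GPU cores at once.
    """
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in inputs]


def multiply_add_count(batch_size: int, in_features: int, out_features: int) -> int:
    """Number of multiply-add operations in one forward pass of the layer."""
    return batch_size * in_features * out_features


print(dense_layer([[1, 2], [3, 4]], [[1, 0, 2], [0, 1, 3]]))
print(multiply_add_count(batch_size=32, in_features=4096, out_features=4096))
```

A single 4096 by 4096 layer applied to a batch of 32 inputs already requires more than half a billion multiply-adds, and large models stack many such layers.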

CPUs still have a place in AI infrastructure, particularly in data processing (as discussed in the previous section) and running traditional machine learning algorithms. In general, if your AI algorithm is supported by the Python library scikit-learn, training on a CPU is fine because it is likely not very computationally demanding to train.

For small neural network models (<50 million parameters), training on a CPU is still viable and can be done in a reasonable amount of time, but the speedup from a GPU becomes large enough that you should seriously consider using one. Given the advancements of modern deep learning, this category can include models for:

  • Traditional NLP (classification, sentiment analysis, NER)
  • Speech (speaker diarization, automatic speech recognition)
  • Computer vision (classification, object detection, semantic segmentation, etc.)

GPUs become necessary once you begin considering neural networks larger than ~50 million parameters. In general, GPUs are also necessary to train or fine-tune almost every generative model and to pre-train a neural network on large amounts of data (e.g., if you are an e-commerce company that wants to train a model to understand all 14 million products in your catalog). Thanks to Nexla’s integrations with data stores and large foundation LLMs like Falcon and the GPT series, pre-training from scratch will likely be less relevant unless you are working in a niche domain where existing LLMs do not yet perform well.
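To apply the ~50 million parameter rule of thumb, you first need a parameter count. The sketch below counts parameters for a plain fully connected network and compares the result with that threshold. The function names and the example layer sizes are assumptions chosen for illustration, and other architectures (convolutional, transformer) count parameters differently.

```python
GPU_THRESHOLD_PARAMS = 50_000_000  # rule of thumb from this article, not a hard limit


def count_mlp_parameters(layer_sizes: list[int]) -> int:
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def suggested_hardware(n_params: int) -> str:
    """Apply the threshold: above it a GPU is needed, below it a CPU can still work."""
    return "GPU" if n_params > GPU_THRESHOLD_PARAMS else "CPU or GPU"


small = count_mlp_parameters([784, 512, 10])
large = count_mlp_parameters([4096, 8192, 8192, 4096])
print(small, suggested_hardware(small))
print(large, suggested_hardware(large))
```

Deep learning frameworks can report this count directly for an existing model, so in practice the helper is only needed for back-of-the-envelope planning.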

Four GPU types that any practitioner focusing on building an effective AI infrastructure should consider are NVIDIA’s T4, A10, A100, and H100 GPUs. The table below explains when it is best to use each GPU model.

| GPU type | Best for |
| --- | --- |
| T4 | Cost minimization for smaller models, especially on GCP. Can train up to 4 billion parameter models per GPU (has 16 GB VRAM) without specialized optimization. |
| A10 | Cost minimization on AWS and Azure (more expensive than the T4, but also 3x faster, so it can end up cheaper overall). Can train up to 13 billion parameter models per GPU (has 24 GB VRAM). |
| A100/H100 | Training the largest models quickly (40-80 GB VRAM). Most cloud companies don’t currently offer the H100 (check CoreWeave or Lambda Labs for this), but if you can get access to it, it is preferred over the A100. Most expensive option of the three. |

Compared to training, model inference (having the model create predictions on new data in the wild) is far less computationally expensive, because no gradients or optimizer state need to be held in memory. For instance, while an A10 GPU allows you to train a 13 billion parameter LLM with LoRA, the lower memory requirements of inference allow the same card to generate text live with a 30 billion parameter LLM once its weights are quantized to 4 bits. And it gets even better: with inference optimization libraries like Optimum, ONNX Runtime, TensorRT, TVM, llama.cpp, and vLLM, you can even use large language models to generate text on a consumer-grade laptop. In other words, you can run your own (albeit slightly worse) ChatGPT without even needing the internet.
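A rough way to reason about what fits on a given card is to estimate the memory taken by the model weights alone at different numeric precisions. The sketch below does only that; the function name and the 7 billion parameter example are assumptions for illustration. It deliberately ignores activations and the attention cache at inference time, and gradients and optimizer state at training time, all of which add to the total.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the weights only, in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical 7 billion parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(7e9, precision), "GB")
```

Comparing these figures with a card's VRAM gives a quick lower bound: if the weights alone do not fit, the model cannot run on that card at that precision, which is why quantization matters so much for inference on smaller hardware.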


Model deployment and hosting

The final step is deploying your trained AI model so that end users can use it. While traditional solutions such as Docker for containerization and Kubeflow (the machine learning toolkit for Kubernetes) for scheduling are tried-and-true methods for all types of AI models, there has been a recent proliferation of managed services that offer to do this for you. Excellent starting points include Amazon Bedrock, a fully managed service that hosts LLMs like Anthropic’s Claude and Meta’s Llama 2, and OctoML, which offers hosting for a wide variety of models across modalities, including Stable Diffusion (image), Whisper (speech), and Mistral (text).

One advantage of using fully managed hosting services like these is that they handle scaling up and down access to cloud GPUs depending on how much your models are being used. Another is that they include many of the inference-time speed and memory optimizations previously discussed right out of the box.
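The scaling decision these services make on your behalf can be illustrated with a small sketch. This is not how any particular provider implements autoscaling; the function name, the per-replica capacity, and the replica limits are all hypothetical values chosen for the example.

```python
import math


def replicas_needed(
    requests_per_second: float,
    capacity_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Number of model replicas to run for the current load, clamped to the allowed range."""
    wanted = math.ceil(requests_per_second / capacity_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical traffic levels, assuming each replica serves 5 requests per second.
for load in (0, 12, 100):
    print(load, "req/s ->", replicas_needed(load, capacity_per_replica=5), "replicas")
```

When you self-host with Docker and Kubernetes, a policy like this is something you configure and tune yourself; with a managed service it is handled for you.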

| Service | Use case |
| --- | --- |
| Docker | Containerization |
| Kubeflow (Kubernetes) | Scheduling |
| Bedrock | Hosting, fully managed LLM service, and dynamic scaling |
| OctoML | Hosting, speed optimization, and dynamic scaling |

If you use a model that is only accessible via an API, like GPT-4 or Cohere’s models, the API provider takes care of hosting for you. That means you won’t have to worry about deploying or hosting the model at all (beyond deploying the overall application, which would follow SaaS best practices). Of course, doing this means you have to send your data to these API providers and ensure that you don’t exceed their rate limits.
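One common way to stay within a provider's rate limits is to retry with exponential backoff when a request is rejected. The sketch below shows the pattern in isolation. `RateLimitError` and `call_with_backoff` are hypothetical names; a real SDK exposes its own exception type, typically tied to an HTTP 429 response, and `request_fn` stands for whatever function actually calls the provider.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error type; real SDKs define their own (often for HTTP 429)."""


def call_with_backoff(request_fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call request_fn, waiting longer after each rate-limit rejection before retrying."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))


# Demo with a fake request that is rejected twice before succeeding.
calls = {"count": 0}


def fake_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("too many requests")
    return "response text"


print(call_with_backoff(fake_request, base_delay=0.01))
```

The `sleep` parameter is injectable so the waiting behavior can be replaced in tests; in an application you would leave it at its default.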

The operationalization of AI applications 

This article focused on the infrastructure components necessary to host an AI application; however, operating an AI application in production requires more than hosting it and storing its data. It requires several support systems.

In the diagram below, the machine learning (ML) code is the small box in the center. Its size relative to the surrounding boxes highlights how much systems infrastructure is required to support a machine learning application in production, including tools to monitor the model, verify data integrity, and manage aspects such as access control. Read this guide’s chapters to learn more about operationalizing AI applications.

The machine learning code shown at the center of the diagram is surrounded by the systems required to operate in production (source).

To build effective AI applications, it is essential to have an AI infrastructure that manages data, model training and inference, and deployment. In this article, we explained the main considerations for each of these areas, along with best practices and recommended vendors for building each component. With this guide as a starting point, your enterprise should be able to build and release AI models rapidly.
