Multi-chapter guide | Your Guide to Generative AI Infrastructure

AI Infrastructure: Tutorial & Best Practices

AI infrastructure is the combination of hardware, software, networking, and system processes needed to develop, deploy, and maintain AI applications. It allows engineers and researchers to ingest large quantities of data, train and release machine learning models, and integrate AI capabilities into APIs and software products. A computational environment that supports the full machine learning cycle is just as important as the AI models themselves.

In this article, we discuss the main areas to consider in the field of AI infrastructure as well as best practices for data storage and processing, training and inference hardware, and model deployment and hosting.

Summary of key AI infrastructure concepts

Concept	Description
Data storage and processing	In the era of AI, data plays a pivotal role in training models. Traditional databases (MongoDB, Cassandra, etc.) and specialized vector databases (such as Pinecone, Weaviate, and Qdrant) store and serve diverse data types for tasks ranging from traditional ML to semantic search and retrieval-augmented generation (RAG). Platforms like Nexla simplify the integration of external data sources, vector databases, and traditional databases, so RAG workflows can be built without extensive coding.
Training and inference hardware	Modern AI, especially large models, relies heavily on GPUs due to their optimized batched matrix multiplication, which enables faster processing than CPUs. While CPUs are still relevant for data processing and training smaller neural network models, GPUs become essential for larger models exceeding approximately 50 million parameters, particularly for generative models and pretraining on extensive datasets. Key GPU options include NVIDIA’s T4, A10, A100, and H100, which vary in terms of their capabilities and costs. Model inference is computationally less intensive than training, and with optimization libraries, even large language models can run on consumer-grade laptops, offering the possibility of offline use.
Model deployment and hosting	Making trained AI models available to end users involves traditional solutions like Docker for containerization and Kubeflow for scheduling. Alternatively, managed services such as Amazon Bedrock and OctoML offer fully hosted solutions for large language models (LLMs) and models of other modalities. These services handle scaling based on usage and include inference-time optimizations, making deployment more straightforward.

Data storage and processing

A popular catchphrase says that “data is the new oil,” and nowhere is this more true than in the world of AI, where virtually every advancement over the last decade would not have been possible without the terabytes of data we have digitized. However, it’s not enough to just have data—it’s essential to be able to store it in ways that make training AI models easy.

Because so much information already lives in existing data stores such as MongoDB, Cassandra, DynamoDB, S3, and BigQuery, they will almost certainly be key sources of training data for AI models. The tabular information in these stores is immensely valuable for training traditional ML models (e.g., linear regression, random forest). The unstructured data they hold, such as speech, images, and text, is just as critical for powering deep learning applications.

To make this data easily accessible to your machine learning workflow, use cloud services like S3 and BigQuery, which let you upload or download training data with a small amount of code. Alternatively, with a data management platform like Nexla, practitioners can integrate data from external, internal, cloud, and on-prem services and deliver it to an object storage service like S3, so it can be used in low-code environments like Jupyter Notebooks.

With the advent of semantic search (matching items in a database to a query based on similar meaning instead of an exact text match), a new kind of database has become critical to the AI workflow: vector databases. Unlike traditional databases, these are designed specifically to store a particular kind of machine-learning output known as a vector (an ordered list of numbers).

These vectors are notable because they encode information about an image, a piece of text, a speech waveform, and so on. For example, if we provide the sentence “I have a dog” to a language model, we can extract a set of numbers after the last full layer of the model that corresponds to what the model believes the sentence means. A specific kind of model called an embedding model is trained primarily to produce vectors that capture the meaning of text, images, audio, etc. as well as possible, but you can use any pre-trained deep neural network (including LLMs) to output these vectors.

Transforming text into vectors (source: Nexla)
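
To make the idea of comparing meanings concrete, here is a minimal sketch of cosine similarity, a common way to compare two vectors. The three-dimensional vectors below are hypothetical and hand-picked for illustration; a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-picked for illustration only.
# A real embedding model would compute these from the text.
embeddings = {
    "I have a dog": [0.9, 0.1, 0.0],
    "My puppy is playful": [0.8, 0.3, 0.1],
    "Quarterly revenue rose": [0.0, 0.2, 0.9],
}

query = embeddings["I have a dog"]
scores = {text: cosine_similarity(query, vec) for text, vec in embeddings.items()}

# Print sentences from most to least similar to the query.
for text, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.3f}  {text}")
```

With these made-up vectors, the sentence about a puppy scores much closer to the query than the sentence about revenue, which is the behavior semantic search relies on.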

In the world of LLMs, vector databases have become crucial to powering a new use case called retrieval-augmented generation (RAG). The goal of RAG is to provide an LLM with external knowledge it did not see in training to help guide it to a correct answer. For example, if we want GPT-4 to answer questions about a company’s internal employee FAQ, we would have to provide the model with a document containing this information. However, we don’t always know in advance which external knowledge source to give the LLM, and that is where semantic search and vector databases come in. Because embedding models capture meaning, we can match a user query with a document that discusses similar concepts and add that document to the LLM’s prompt to help it answer the question. These documents are stored as vectors in a vector database, so a fast, efficient, and reliable vector database is invaluable for powering RAG applications.

Transforming text into vectors and storing in a vector database (source: Nexla)
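
The retrieval step of RAG can be sketched in a few lines. This is a simplified illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model (it only matches shared words, not meaning), the in-memory `index` list stands in for a vector database, and the FAQ entries are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical FAQ entries; the list below stands in for a vector database.
documents = [
    "Employees accrue 20 days of paid vacation per year.",
    "The office is open from 8am to 6pm on weekdays.",
    "Expense reports must be submitted within 30 days.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query):
    """Prepend the retrieved context to the user question."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How many days of paid vacation do employees get?"))
```

The resulting prompt would then be sent to the LLM. Swapping `embed` for a real embedding model and `index` for a vector database keeps the same overall shape.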

Pinecone, Weaviate, and Qdrant are all excellent choices for a vector database. However, when looking at the big picture (traditional and vector databases, connections to external data sources, data pipelines, etc.), a service like Nexla can prove valuable. Creating RAG workflows requires connecting to vector databases, foundation models, and unstructured data sources to build richer context when prompt engineering. Combined with its extensive connectivity with traditional databases, this lets AI practitioners build the data infrastructure required to train and augment AI models without extensive data engineering effort.

Training and inference hardware

Modern AI is notorious for its substantial computational demands, especially the large models in use today. Unlike traditional software programs, these models rely predominantly on GPUs rather than CPUs because GPUs are optimized for batched matrix multiplication. Essentially, contemporary AI models function as colossal multiplication engines, and the ability to process many calculations simultaneously allows GPUs to run AI models dramatically faster than CPUs can.
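
As a small illustration of why matrix multiplication dominates, the sketch below pushes a batch of two inputs through a single layer in pure Python. The weights are hypothetical, and a real framework would run this on optimized CPU or GPU kernels rather than nested lists.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# A batch of 2 inputs with 3 features each.
batch = [
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
]

# Hypothetical weights of a layer mapping 3 features to 2 outputs.
weights = [
    [0.5, -1.0],
    [1.0, 0.0],
    [0.0, 2.0],
]

# Every output value is an independent dot product, so all of them
# can be computed in parallel, which is what GPUs are built for.
outputs = matmul(batch, weights)
print(outputs)  # [[2.5, 5.0], [1.0, 0.0]]
```

Each entry of the output is independent of the others, which is why hardware that performs thousands of multiply-adds at once speeds up this workload so much.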

CPUs still have a place in AI infrastructure, particularly in data processing (as discussed in the previous section) and running traditional machine learning algorithms. In general, if your AI algorithm is supported by the Python library scikit-learn, training on a CPU is fine because it is likely not very computationally demanding to train.

For small neural network models (<50 million parameters), training on a CPU is also still viable and can be done in a reasonable amount of time, but the speedup from using a GPU becomes large enough that you should start seriously considering the use of one. Given the advancements of modern deep learning, this category can also include:

Traditional NLP (classification, sentiment analysis, NER)
Speech (speaker diarization, automatic speech recognition)
Computer vision (classification, object detection, semantic segmentation, etc.)

GPUs become necessary once you begin considering neural network / deep learning methods larger than ~50 million parameters. Also, in general, GPUs are necessary to train or fine-tune almost every generative model and are necessary if you want to pre-train a neural network on large amounts of data (e.g., if you are an e-commerce company wanting to train a model to understand all 14 million products in your catalog). Thanks to Nexla’s integrations with data stores and large foundational LLMs like Falcon and the GPT series, the latter use case will likely not be as relevant unless you are working with niche domains where existing LLMs do not perform well yet.
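
To check where a model sits relative to the roughly 50 million parameter threshold, you can count its parameters. The sketch below does this for a plain fully connected network; the layer sizes are hypothetical, and other architectures (convolutions, attention) need their own counting rules.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of each layer, input first.
    """
    total = 0
    for inputs, outputs in zip(layer_sizes, layer_sizes[1:]):
        total += inputs * outputs + outputs  # weight matrix plus bias vector
    return total

# Approximate rule of thumb from the text above, not a hard limit.
GPU_THRESHOLD = 50_000_000

# Hypothetical small classifier: 784 inputs, two hidden layers, 10 classes.
small_mlp = [784, 512, 256, 10]
parameter_count = count_dense_parameters(small_mlp)

print(f"{parameter_count:,} parameters")  # 535,818 parameters
print("CPU training is plausible" if parameter_count < GPU_THRESHOLD else "Consider a GPU")
```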

Four GPU types that any practitioner focusing on building an effective AI infrastructure should consider are NVIDIA’s T4, A10, A100, and H100 GPUs. The table below explains when it is best to use each GPU model.

GPU type	Best for
T4	Cost minimization for smaller models, especially on GCP. Can train models of up to 4 billion parameters per GPU (16 GB VRAM) without specialized optimization.
A10	Cost minimization on AWS and Azure (more expensive than the T4, but also 3x faster, so it can end up cheaper overall). Can train models of up to 13 billion parameters per GPU (24 GB VRAM).
A100/H100	Training the largest models quickly (40-80 GB VRAM). Most cloud companies don’t currently offer the H100 (check CoreWeave or Lambda Labs for this), but if you can get access to it, it is preferred over the A100. Most expensive option of the three.

Compared to training, model inference (having the model make predictions on new data in the wild) is far less computationally expensive. For instance, while an A10 GPU allows you to train a 13 billion parameter LLM with LoRA, the lower memory requirements of inference allow you to generate text live with a 30 billion parameter LLM on the same GPU, provided the weights are stored at reduced precision (at 16 bits per weight, 30 billion parameters would occupy about 60 GB, more than the A10’s 24 GB). And it gets even better: With specific inference optimization libraries like Optimum, ONNX, TensorRT, TVM, llama.cpp, and vLLM, you can even use large language models to generate text on a consumer-grade laptop. In other words, you can run your own (albeit slightly worse) ChatGPT without even needing the internet.
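
A rough back-of-the-envelope check can tell you whether a model's weights fit on a given GPU. The sketch below counts weight memory only and assumes the bytes-per-parameter values shown; real usage is higher because of activations, the attention cache, and (for training) gradients and optimizer state.

```python
# Assumed storage cost per parameter at common precisions.
BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_parameters, precision):
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_parameters * BYTES_PER_PARAMETER[precision] / 1e9

def weights_fit(num_parameters, precision, vram_gb):
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_memory_gb(num_parameters, precision) <= vram_gb

A10_VRAM_GB = 24  # from the GPU table above

for precision in ("fp16", "int8", "int4"):
    size = weight_memory_gb(30_000_000_000, precision)
    verdict = "fits" if weights_fit(30_000_000_000, precision, A10_VRAM_GB) else "does not fit"
    print(f"30B parameters at {precision}: {size:.0f} GB, {verdict} in {A10_VRAM_GB} GB")
```

Under these assumptions, a 30 billion parameter model needs about 60 GB at 16-bit precision but about 15 GB at 4-bit precision, which is why quantization matters so much for running large models on smaller GPUs.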

Model deployment and hosting

Finally, there is the matter of deploying your trained AI model so that end users can use it. Traditional solutions such as Docker for containerization and Kubeflow (a machine learning toolkit for Kubernetes) for scheduling workloads are tried-and-true methods for all types of AI models, but there has been a recent proliferation of managed services that offer to do this for you. Excellent starting points include Amazon Bedrock, a fully managed service that hosts LLMs like Anthropic’s Claude and Meta’s Llama 2, and OctoML, which offers hosting for a wide variety of models across modalities, including Stable Diffusion (image), Whisper (speech), and Mistral (text).

One advantage of using fully managed hosting services like these is that they handle scaling up and down access to cloud GPUs depending on how much your models are being used. Another is that they include many of the inference-time speed and memory optimizations previously discussed right out of the box.
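
To give a feel for what usage-based scaling involves, here is a minimal sketch of a scaling rule. It is purely illustrative and is not how Bedrock or OctoML implement scaling; the function name, the per-replica capacity, and the replica limits are all assumptions.

```python
import math

def desired_replicas(requests_per_second, capacity_per_replica,
                     min_replicas=1, max_replicas=8):
    """Pick how many model replicas to run for the current load.

    capacity_per_replica is the (assumed) number of requests per second
    that one GPU-backed replica can serve.
    """
    needed = math.ceil(requests_per_second / capacity_per_replica)
    # Clamp to the allowed range so we never scale to zero or without bound.
    return max(min_replicas, min(max_replicas, needed))

# Hypothetical load levels, with each replica handling 5 requests per second.
for load in (0, 12, 100):
    print(f"{load} req/s -> {desired_replicas(load, 5)} replicas")
```

A managed service runs logic of this general kind for you, along with provisioning the GPUs behind each replica.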

Service	Use case
Docker	Containerization
Kubeflow (Kubernetes)	Scheduling
Bedrock	Hosting, fully managed LLM service, and dynamic scaling
OctoML	Hosting, speed optimization, and dynamic scaling

If you use a model that is only accessible via an API, like GPT-4 or Cohere’s models, the API provider takes care of hosting for you. That means you won’t have to worry about deploying or hosting your model at all (beyond deploying the overall application, which would follow SaaS best practices). Of course, doing this means you have to send your data to these API providers and ensure that you don’t exceed their rate limits.
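
One common way to stay within rate limits is to retry with exponential backoff. The sketch below is a generic illustration: `RateLimitError` and `flaky_request` are hypothetical stand-ins, since each provider's client library raises its own error type, and the delays are example values.

```python
import time

class RateLimitError(Exception):
    """Hypothetical error representing a provider's rate-limit response."""

def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request(), doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)

# Simulated API call that is rate limited twice, then succeeds.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("too many requests")
    return "ok"

# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In a real application you would leave `sleep` at its default and pass a function that performs the actual API call.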

The operationalization of AI applications

This article focused on the infrastructure components necessary to host an AI application; however, operating an AI application in production requires more than hosting it and storing its data. It requires several support systems.

In the diagram below, the machine learning (ML) code is the small box in the center. The surrounding boxes highlight the complexity of the systems infrastructure required to support a machine learning application in production, including tools to monitor the model, verify data integrity, and manage aspects such as access control. Read this guide’s chapters to learn more about operationalizing AI applications.

The machine learning code shown at the center of the diagram is surrounded by the systems required to operate in production (source).

Conclusion

To build effective AI applications, it is essential to have an AI infrastructure that manages data, model training and inference, and deployment. In this article, we explained the main considerations for each of these areas, along with best practices and recommended vendors for building each component. With this guide as a starting point, your enterprise should be able to build and release AI models rapidly.

Navigate Chapters:

Continue reading this series

Chapter 1

AI Infrastructure: Tutorial & Best Practices

Learn about the key concepts and best practices for data storage, processing, training, inference hardware, and model deployment and hosting in the field of AI infrastructure.

Chapter 2

Large Language Models (LLMs) Tutorial

Learn how Large Language Models revolutionized Natural Language Processing and their best practices, use cases, and challenges.

Chapter 3

Vector Embedding Tutorial & Example

Learn how vector embeddings are used to convert non-numeric data into vectors for machine learning.

Chapter 4

Vector Databases: Tutorial, Best Practices & Examples

Learn about the significance, types, use cases, challenges, and best practices of vector databases, with an exploration of popular solutions like Pinecone, Milvus, Redis, and MongoDB.

Chapter 5

Retrieval-Augmented Generation (RAG) Tutorial & Best Practices

Learn how retrieval-augmented generation (RAG) combines traditional AI language models with dynamic external data to improve machine understanding and responses.

Chapter 6

LLM Hallucination—Types, Causes, and Solution

Learn about LLM hallucination, why it happens and how you can use data to improve LLM reliability and ethical use.

Chapter 7

Prompt Engineering vs. Fine-Tuning—Key Considerations and Best Practices

Learn about how fine-tuning and prompt engineering work, their impact on customization and accuracy in specialized tasks, and how to choose between the two.

Chapter 8

Model Tuning—Key Techniques and Alternatives

Learn how to improve the performance of your machine learning or large language model through hyperparameter tuning techniques. OpenAI tutorial included.

Chapter 9

Prompt Tuning vs. Fine-Tuning—Differences, Best Practices and Use Cases

Learn prompt tuning vs. fine-tuning in customizing large language models. Explore parameter adjustments, input format, challenges, real-world examples and more.

Chapter 10

Data Drift in LLMs—Causes, Challenges, and Strategies

Learn about how data drift impacts LLM output quality over time and the need for continuous data integration and re-training to minimize the impact.

Chapter 11

LLM Security—Vulnerabilities, User Risks, and Mitigation Measures

Learn about all aspects of LLM security—from model design to prompt-based and user-based risks. Implement best practices to protect users and your organization.

Chapter 12

LLMOps—Benefits, Implementation, and Best Practices

Learn what LLMOps is and why it is different from MLOps. Learn how it works in the LLM lifecycle, implementation details, and best practices for LLM developers.
