Dive into the essentials of Natural Language Processing and Large Language Models. This course offers a hands-on approach to NLP workflows, tokenization, and the Hugging Face ecosystem, building your expertise step-by-step. Perfect for software engineers eager to master NLP.
The main goal of every part of the course is to familiarize participants with NLP workflows and models, including LLMs, from a software engineering perspective, introducing mathematical theory only when strictly necessary.
The course was designed to be taken as a whole: each day builds on the knowledge gained in the previous one(s), steadily expanding the participant’s understanding of NLP in general. However, sufficiently prepared or experienced participants should be able to skip certain days if necessary.
This day is a general introduction to NLP and the context in which its tasks are performed. The main goals are to cover the steps that should be performed before feeding data to a neural network, to discuss how AI models work with textual data, and to introduce participants to the Hugging Face ecosystem and its libraries, which are a staple of NLP workflows.
The day can be divided into two parts: an introduction to the various tools and the Hugging Face ecosystem, which is more of a high-level demonstration than a challenge, and more advanced sections such as tokenization, training custom tokenizers, word embeddings, and the Transformer architecture, some with more theory than others. All newly introduced terms and techniques are explained from scratch, leaving no one in the dark.
Because of the introductory character of the first part, its difficulty is low, rising to medium in the more advanced sections mentioned above.
This day is best suited for beginners with little to no NLP experience, since participants get to know a bit of everything: from tokenization, through the transformers library and its use cases, to the whole Hugging Face ecosystem.
However, even advanced users already acquainted with the presented tools may find some of the more theoretical sections interesting and valuable, since understanding them is vital for successful participation in the following days of the course.
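Before moving on, here is a minimal taste of the two levels of the Hugging Face API this day revolves around: the high-level pipeline and the tokenizer underneath it. This is our own sketch using public example checkpoints, not necessarily the exact models used in the course.

```python
# A minimal sketch of the transformers workflow introduced on day 1.
# The checkpoints are public Hub examples, not necessarily the course's.
from transformers import AutoTokenizer, pipeline

# High level: a ready-made pipeline picks a default model for the task
classifier = pipeline("sentiment-analysis")
print(classifier("This course looks great!"))

# Lower level: tokenization, the step that turns text into model inputs
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("This course looks great!")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```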
On this day we take another step into the world of NLP, focusing on one of the most versatile tasks in the field: text classification. We learn about the metrics with which you can measure a model’s performance, discover ways to work with imbalanced datasets, and most importantly explore different ways to classify text. Throughout the various stages of the workshop we introduce ways to deal with imbalanced datasets, both during training and during evaluation.
The difficulty of this part of the course varies between medium (for sections like the introduction of metrics, Masked Language Modeling, or working with datasets) and hard (managing class imbalance, using SHAP, and especially implementing an MLP with the PyTorch framework).
In this part of the course we expect the more advanced participants to thrive, as we introduce more complex ways to work with neural networks, including devising network architectures with PyTorch.
Less experienced users should also find plenty of interest here, such as new metrics, fine-tuning models with the transformers API, or working with datasets.
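To make the two hardest ingredients of this day concrete, here is a minimal sketch of an MLP classifier in PyTorch combined with class weighting, one common way to counter dataset imbalance; the architecture and numbers are illustrative, not the course’s exact notebook.

```python
# A sketch of two day-2 themes: an MLP in PyTorch and a class-weighted loss
# for imbalanced data. Sizes and counts are illustrative only.
import torch
import torch.nn as nn

n_features, n_classes = 768, 3  # e.g. sentence embeddings as input

mlp = nn.Sequential(
    nn.Linear(n_features, 128),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(128, n_classes),
)

# Weight each class inversely to its frequency to counter imbalance
class_counts = torch.tensor([900.0, 80.0, 20.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)

x = torch.randn(16, n_features)          # dummy batch of inputs
y = torch.randint(0, n_classes, (16,))   # dummy labels
loss = loss_fn(mlp(x), y)
loss.backward()
```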
In this workshop, we dive deep into various token classification problems, highlighting their main challenges, potential pitfalls, and strategies to overcome them. We cover data preprocessing techniques and evaluation methods tailored to specific tasks, including leveraging several new libraries.
This workshop’s difficulty can safely be rated as medium, as it mostly leverages what participants should already know from the transformers library and pure Python, while providing more detail on the theoretical side of things.
We expect this part of the course to appeal to both beginner and advanced participants, as we go deeper into possible use cases of NLP and discover another of its tasks, while mostly using the transformers API that was described in detail in the previous parts of the course.
For those more interested in the technicalities of model training and evaluation, we introduce monitoring tools and a new evaluation library.
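As a hedged illustration of the ingredients above, the sketch below runs a ready-made token classification (NER) pipeline and scores IOB-tagged sequences with seqeval through the evaluate library; we assume these as typical choices, and the course may use different tools.

```python
# Token classification with a default NER pipeline, plus entity-level
# evaluation with seqeval via the `evaluate` library (assumed tooling).
from transformers import pipeline
import evaluate

ner = pipeline("token-classification", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))

# seqeval scores whole entities, not single tokens, on IOB-tagged data
seqeval = evaluate.load("seqeval")
predictions = [["O", "B-ORG", "I-ORG", "O", "B-LOC"]]
references = [["O", "B-ORG", "I-ORG", "O", "B-LOC"]]
print(seqeval.compute(predictions=predictions, references=references))
```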
We continue our journey through the NLP field with transformer models. This time we focus on the variants that leverage the full transformer architecture, also known as seq2seq models. We will explore two of the tasks these models thrive in, question answering and text summarization, learn how to measure the quality of model-generated text, and try to create a multi-task model able to perform both of these tasks.
Given its similarity to the previous workshop, this one is also of moderate difficulty. Like the former, it contains significant theoretical content while also touching upon important technical aspects of the systems solving these tasks, once again relying heavily on the transformers library.
As mentioned in the ‘level’ section, due to the similarities to day 3’s workshop, we expect this part of the course to be enjoyable and valuable for participants at all levels of experience.
While keeping the formula similar to the previous workshop, we provide new information by introducing a whole new set of NLP problems, classical and novel metrics for evaluating the models solving those tasks, and text generation methods that are important for yet another architecture variant of Transformers.
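To ground the above, here is a small sketch of one of the two tasks, summarization with a public seq2seq checkpoint, scored with ROUGE as a typical metric for generated text; both the model and the metric are our illustrative assumptions.

```python
# Summarization with a seq2seq model, evaluated with ROUGE (illustrative
# choices; the course may use other checkpoints and metrics).
from transformers import pipeline
import evaluate

summarizer = pipeline("summarization", model="t5-small")
text = (
    "Transformers are a family of neural networks built around attention. "
    "Their encoder-decoder variants, known as seq2seq models, excel at "
    "tasks such as summarization and question answering."
)
summary = summarizer(text, max_length=25, min_length=5)[0]["summary_text"]

rouge = evaluate.load("rouge")
reference = "Seq2seq transformers excel at summarization and question answering."
print(rouge.compute(predictions=[summary], references=[reference]))
```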
Venturing away from the two previous days, we introduce participants to the concept of LLMs, prompt engineering, and the zero-shot and few-shot learning techniques, moving from fine-tuning a smaller, dedicated model for each task to using one LLM with fitting prompts for all of them.
We compare the results achieved by these two approaches on various tasks from the previous days and explore the newly introduced aspects of LLMs.
This workshop acts as an introduction to the wide field of LLMs, so its difficulty level is low. We aim to make the transition from smaller models to LLMs as gentle and easy as possible, slowly but surely building the foundation for the more advanced techniques and aspects of leveraging the biggest neural networks.
While we hope that everyone can gain some useful knowledge from this part of the course, it is mostly aimed at those who have barely touched upon Large Language Models in general, as this workshop takes participants a step down from using the web UIs of LLMs to tweaking a model’s hyperparameters and trying some new, albeit simple, prompting techniques.
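The difference between the two prompting regimes introduced here fits in a few lines; the sketch below uses a small public model purely as a placeholder for whichever LLM the workshop employs.

```python
# Zero-shot vs. few-shot prompting, sketched with a small open model
# (the checkpoint is a placeholder, not necessarily the course's LLM).
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")

zero_shot = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: I loved it.\nSentiment:"
)

few_shot = (
    "Review: Terrible food.\nSentiment: negative\n"
    "Review: Wonderful staff.\nSentiment: positive\n"
    "Review: I loved it.\nSentiment:"
)

print(generate(zero_shot, max_new_tokens=5)[0]["generated_text"])
print(generate(few_shot, max_new_tokens=5)[0]["generated_text"])
```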
In this workshop, we refresh and expand participants’ knowledge of the fundamental concepts of LLMs, talking in detail about the techniques introduced on the fifth day of the first session, as well as adding some new ideas and methods, all with the aim of communicating with LLMs more efficiently and thus increasing our control over their work.
Treating this workshop as a continuation and expansion of the fifth day of the first session, its difficulty increases along the notebook, from easy revision of material from the former part of the course to medium when new concepts are introduced or already familiar ones are covered in more detail.
Once again using the last workshop of the first session as a reference point, we consider this part of the course suitable for those with limited knowledge in the field of LLMs. It should prove especially useful for those who skipped the previous workshop altogether.
That being said, we did our best to develop this course in such a way that even more advanced participants could learn something new from every day and every session.
In this workshop, we explore various techniques and strategies for effective prompt crafting, ensuring that our communications harness the full potential of these powerful tools. We can split those approaches into two main categories based on their intent.
As we continue our journey into the intricacies of artificial intelligence, Day 2 shifts focus to the art and science of Prompt Engineering. This critical skill set involves crafting specific inputs that guide Large Language Models (LLMs) to generate desired outputs with higher precision and relevance. Prompt Engineering is not merely about asking questions; it’s about formulating them in a way that aligns closely with the model’s training and capabilities. Understanding this can significantly enhance the quality of interactions with LLMs, enabling more accurate and contextually appropriate responses.
As we move on to increasingly complex prompt engineering techniques, the level also increases: from fairly easy (e.g. CoT prompting), through intermediate (like Tab-CoT), to really hard (see RAG and ReAct).
Since the difficulty and sophistication of the described techniques progress through the workshop, every participant should find their own niche and learn something new. While remaining within the scope of the transformers library on the technical side, this part of the course focuses more on theory, since we introduce a lot of prompt engineering techniques.
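As a tiny taste of the easiest technique on that scale, here is what Chain-of-Thought prompting boils down to: it is purely prompt text, so the sketch needs no special API, and the example question and reasoning are our own.

```python
# Chain-of-Thought (CoT) prompting is plain prompt engineering: show (or
# ask for) intermediate reasoning before the answer. Content is illustrative.
question = "A shop sells pens at 3 for $2. How much do 12 pens cost?"

# Zero-shot CoT: just append a reasoning trigger
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

# Few-shot CoT: demonstrate a worked-out reasoning chain first
few_shot_cot = (
    "Q: A box holds 4 balls. How many balls are in 5 boxes?\n"
    "A: Each box holds 4 balls, and there are 5 boxes, "
    "so 5 * 4 = 20 balls. The answer is 20.\n"
    f"Q: {question}\nA:"
)

print(zero_shot_cot)
print(few_shot_cot)
```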
Diving deeper into the field of LLMs, in this particular workshop we explore several new aspects of it: LLMs acting as agents, and how to evaluate and fine-tune LLMs, since their size and complexity vastly differentiate them from small, task-specific models, creating both theoretical and technical challenges.
Due to the introduction of new frameworks and tools, as well as because we define new, more complex tasks, the difficulty of this workshop is estimated as hard.
Since we present new, more sophisticated and technical territory, we recommend this part of the course to the more advanced participants, especially those with experience in ML frameworks such as LangChain.
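To show what “LLMs acting as agents” means mechanically, here is a schematic, framework-free sketch of the agent loop: the model alternates reasoning and tool calls until it commits to an answer. The llm callable, the output format, and the calculator tool are all hypothetical placeholders of ours, not LangChain’s actual API.

```python
# A schematic, framework-free agent loop. `llm` is any completion function;
# the "Action:"/"Final:" protocol and the tool are hypothetical placeholders.

def calculator(expression: str) -> str:
    return str(eval(expression))  # toy tool; never eval untrusted input

TOOLS = {"calculator": calculator}

def run_agent(llm, question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits either "Action: tool[input]" or "Final: ..."
        step = llm(transcript)
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition("[")
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```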
In this workshop, we dig even deeper into the fine-tuning of LLMs, focusing on the efficiency of the process and on another important factor, alignment, aiming to enhance this very important part of LLM deployment.
With the introduction of yet another set of tools and approaches that build upon the knowledge gained in the prior days (especially the third day of the second session), the level remains the same: this workshop is hard, especially for those with limited coding experience and those who skipped the previous day.
Once again matching the difficulty of the previous workshop, we must emphasize that this part of the course was prepared with advanced participants in mind, those who aim to customize their LLM-based solutions.
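One widely used technique for efficient fine-tuning is LoRA, which trains small low-rank adapters instead of all model weights. The sketch below uses the peft library with illustrative hyperparameters and a placeholder checkpoint; we assume LoRA here as a representative method, not as the workshop’s exact recipe.

```python
# A minimal LoRA setup with the `peft` library (illustrative values).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the updates
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction is trainable
```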
On the final day of the second session, we move even lower in the technology stack of LLM-based systems.
We will focus especially on the very feature that makes them so powerful: their size. One of the main reasons for their great performance also creates challenges for inference, sometimes making it excruciatingly slow. The number of an LLM’s parameters also makes it difficult to store and manage all those weights in memory.
We will explore a range of optimizations, from hardware-specific to conceptual techniques. These methods will help boost the performance of our models and enable us to efficiently serve even larger models using the same hardware setup.
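One representative memory optimization, sketched under the assumption that the day touches on quantization: loading the weights in 8-bit via transformers and bitsandbytes (a CUDA GPU is required, and the checkpoint is a placeholder).

```python
# Loading a model with 8-bit quantized weights to cut memory usage.
# Assumes `bitsandbytes` is installed and a CUDA GPU is available.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                           # placeholder checkpoint
    quantization_config=quant_config,
    device_map="auto",                # place layers on available devices
)
```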
As we get closer and closer to the hardware beneath the ML, slowly venturing into the MLOps field, the level of difficulty increases again, making the gist of this workshop hard to grasp, especially on the technical side.
We recommend this particular workshop to those deeply interested in building and deploying systems that leverage LLMs. It is also important to mention that previous experience with the aforementioned tools is crucial.
In this session as a whole, we venture into a specific, highly sophisticated application of LLMs called Retrieval Augmented Generation (RAG). We will learn in detail what this fancy-sounding term really means, how we can leverage such a solution in the real world, what the difference is between using a plain LLM and a RAG system, what components such a tool consists of, and how to build it. The initial workshop aims to introduce participants to the concept of RAG, its main components, and the LangChain library, one of the commonly used tools for building such systems.
The whole workshop has an introductory character, but the knowledge gained in the previous sessions is highly recommended. Experience with LangChain is not obligatory, although those who have it will find the discussed topics easier to understand. Overall, compared to the earlier sessions, the difficulty sits somewhere between easy and medium.
The entire session is aimed at people who want to learn about RAG: how to build its main and additional components, how to improve the system and overcome its challenges, and most importantly, what advantages it provides over the casual use of LLMs.
This initial day is the introduction to the topic, which should bring all participants up to speed and allow them to squeeze the most out of the following, more specific workshops in this session.
The characteristic feature of the whole session is that after the initial introduction, you learn something new each day; while those topics are separate and participants could benefit from taking them individually, it is much better to approach them as one elaborate RAG guidebook.
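The core RAG idea fits in a few lines. The sketch below is deliberately framework-free: a toy retriever plus a prompt that injects the retrieved context, with llm standing in for any generation function; real systems (e.g. those built with LangChain) replace the toy scoring with vector search.

```python
# The RAG pattern in miniature: retrieve context, then generate with it.
# `llm` is a placeholder callable; the word-overlap scoring is a toy
# stand-in for the vector search a real system would use.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def rag_answer(llm, query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)
```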
After getting to know the basics of Retrieval Augmented Generation on the first day, we are ready to dive deeper into its architecture. We have already built a demo from pre-prepared parts, but what if those elements don’t satisfy our needs? In this workshop, we explore the issues that arise when you don’t have your own data, introduce a different, more efficient vector store, and show which evaluation metrics might be useful when assessing the retrieval module.
The difficulty level of the workshop is low. Participants with prior knowledge of the metrics’ mathematical formulas or the Weaviate SDK may complete the day quicker, but it is easily manageable for those without such background. Once again, experience from the second session with prompt engineering and handling LLMs in general may come in handy.
As we continue the journey into the RAG domain, we learn a bit more about the retrieval part of the system: an alternative way to obtain training and/or testing data, how to enhance it with metadata, and finally how to evaluate it separately from the RAG system as a whole.
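As an example of evaluating the retriever separately from the whole system, here is one standard metric the day might rely on, recall@k: the share of truly relevant documents that make it into the top-k results (the specific metric choice is our assumption).

```python
# recall@k: what fraction of the relevant documents appear in the top-k
# retrieved results. A standard retrieval metric, shown on toy data.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

# Toy example: two relevant docs, one of them retrieved in the top 3
print(recall_at_k(["d3", "d1", "d7"], relevant={"d1", "d2"}, k=3))  # 0.5
```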
This day brings more on how to boost our retriever, this time from a different perspective.
We will start with building a simple embedding-based retrieval which will serve as a baseline. We will then evaluate it on the test set and analyze what kind of errors it makes. This will help us understand the limitations of the simple retrieval model and the data we are working with. We will also explore how the choice of chunking strategy can affect the retrieval performance. Next, we will go back to the basics and learn about lexical search and when it can be used to improve retrieval. Then, we will add another component to our retrieval pipeline: the reranker. We will learn how to use it and how it can improve the retrieval performance. Finally, we will come back to the embedding-based retrieval and see how to fine-tune it to further improve the performance.
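Of the stages above, the reranker is the easiest to show in isolation. The sketch below uses a public cross-encoder from sentence-transformers to rescore first-stage candidates jointly with the query; the checkpoint is a common public choice, not necessarily the workshop’s.

```python
# Reranking first-stage candidates with a cross-encoder, which scores each
# (query, document) pair jointly. The checkpoint is a public example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do we handle long documents in retrieval?"
candidates = [
    "Chunking splits long documents into overlapping windows.",
    "The cafeteria opens at nine in the morning.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
print(ranked[0][0])  # the chunking sentence should rank first
```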
From now on, things get more serious, but only just a bit. Once again, it is the introduction of new terms, concepts, and ideas that participants might find difficult to grasp at first. The coding part is easily manageable by anyone who participated in the previous workshops of this session. There are also references to the first session of the training, so it should be much easier for those who participated in the whole training cycle.
Similarly to the previous day, this is another part of broadening the knowledge we gained since the first workshop of the session. Here we introduce new components and showcase how to adjust them to our needs. All of those who want to customize RAG for their own needs could benefit greatly from this workshop.
This time, we focus on yet another part of the system: the generator.
We will start with an overview of the metrics used to evaluate the generated responses. In particular, we will focus on how to use Large Language Models (LLMs) to analyze the generated responses and compare them to the ground truth. Then, we will do a short recap and build a simple RAG system which will serve as a baseline. We will evaluate it on the test set and analyze what kind of errors it makes. In the next stage, we will explore the context created from documents returned by the retrieval model and analyze how the quality of the context affects the generation performance. Next, we will fine-tune the generator model to align it to the expected answers and improve the generation performance. Finally, we will explore several other extensions to the RAG model.
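The LLM-as-judge idea mentioned above reduces to a carefully structured prompt. A schematic sketch, with llm as a placeholder for any generation function and the grading scale our own choice:

```python
# LLM-as-judge: ask a model to grade a generated answer against the ground
# truth. `llm` is a placeholder callable; the rubric is illustrative.
def judge(llm, question: str, answer: str, ground_truth: str) -> str:
    prompt = (
        "You are grading a candidate answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {answer}\n"
        "Rate the candidate from 1 (wrong) to 5 (fully correct) and justify "
        "briefly. Respond as: score: <n>, reason: <text>"
    )
    return llm(prompt)
```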
Due to the usage of LoRA, this might be the most code-heavy workshop of this session. The complexity of the other, more advanced RAG components discussed here also increases the overall difficulty. All in all, it is safe to say that this is the hardest workshop in the third session; its level is estimated as medium bordering on hard.
Yet again, we learn something new about the concept, though, out of all the workshops in this session, this one is most suitable for the more experienced RAG users due to the complexity of the introduced components, which, while not exactly essential, bring the system to a whole new level of user-friendliness, usability, and security.
Despite the inclusion of self-supervised learning in the process of training Large Language Models (LLMs), these largest models, as well as the vast majority of other, smaller ones that solve various NLP tasks, still require labeled data to produce meaningful results. More often than not, the problems start at the very beginning, when gathering the data. Then another set of impediments arises: who should annotate these texts, do we have to hire experts, how many of them, what tools should we use, and so on. Or maybe we can use an LLM to do our bidding and be done with the problem quicker and cheaper? Today we address some of those issues and show what the data annotation process can look like.
This workshop is considered beginner-friendly, as it mainly focuses on showcasing tools for data annotation and describing the terms connected to it. Regarding the programming aspects, we do not anticipate any difficulties for the participants.
This day is dedicated to those who seek to streamline the data gathering process, especially the annotation part. However, as the topic is broad, we expect that everyone will find something of interest here.
This day focuses on tools and techniques for monitoring and optimizing the training process of machine learning models. We introduce MLflow, an open-source tool that allows us to monitor fine-tuning experiments. We present the experiment card as the basic means of documenting the progress of searching for a better model. Then we move on to methods of training optimization and present Optuna as one of the tools that automate this procedure. We conclude the day with an introduction to the model registry.
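A minimal sketch of how the two tools fit together, with Optuna proposing hyperparameters and MLflow recording every trial; train_and_score is a hypothetical stand-in for the actual fine-tuning routine.

```python
# Optuna drives the hyperparameter search; MLflow logs each trial.
# `train_and_score` is a hypothetical stand-in for real fine-tuning.
import mlflow
import optuna

def train_and_score(lr: float, batch_size: int) -> float:
    return 1.0 - abs(lr - 1e-4) * 100  # fake score so the sketch runs

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    with mlflow.start_run(nested=True):
        mlflow.log_params({"learning_rate": lr, "batch_size": batch_size})
        score = train_and_score(lr, batch_size)
        mlflow.log_metric("eval_score", score)
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```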
New day, new tools, and once again the main challenge of the workshop is to understand and leverage them to the greatest extent possible. Thus, the difficulty will be perceived differently by the participants based on their previous experiences with such tools. On average, we expect a medium level of difficulty.
Anyone interested in neural networks can find some interesting bits here, but the main beneficiaries of the workshop should be those directly involved in the very process of training or fine-tuning models.
In the entire NLP pipeline, before modeling we have to process and clean the data, and after training we should analyze and understand our models. This workshop focuses on data-centric AI, data quality testing, as well as resulting model testing and analysis. Understanding data quality, problems and errors is crucial to design proper preprocessing. These factors also influence models, causing lower quality and stability, often related to bias and unfairness. Spotting those problems is particularly challenging for large text corpora, which cannot be analyzed manually.
Similarly to the previous days in the last two sessions, this workshop introduces many new concepts and tools like CleanLab and Giskard, thus the participants should focus on exploring these, learning how to use them and how to recognize scenarios in which leveraging them would be beneficial. The coding side isn’t particularly demanding, but due to the multitude of novelties the difficulty level is estimated at medium.
This workshop was created to optimize the data mining phase, as well as the later quality assurance processes that take place after the models are trained. All tools and techniques introduced on this day can benefit users dedicated to obtaining the best possible data and understanding how it affects model quality and interpretability, including data engineers and researchers on R&D teams.
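To make the data-quality theme tangible, here is a sketch of CleanLab’s label-issue detection: given out-of-sample predicted probabilities and the possibly noisy labels, it flags likely annotation errors (the toy arrays are ours).

```python
# Flagging likely label errors with CleanLab. `pred_probs` would normally
# come from cross-validated model predictions; here it's toy data.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 0, 1, 0, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.15, 0.85],
    [0.80, 0.20],
    [0.95, 0.05],  # confident this "1" is actually a "0": likely mislabel
    [0.70, 0.30],
    [0.25, 0.75],
])

issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues)  # indices of suspected label errors, worst first
```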
After gathering data, creating and deploying a model, a whole new phase begins: post-deployment. It covers versioning, monitoring, retraining, updating, and more. Here we focus on monitoring, which helps us to detect changes in data and model behavior. This is also known as drift monitoring, and can be divided into data drift and concept (model) drift. These are particularly challenging to get right in NLP applications, where we have text data, embeddings, multimodal datasets, and LLMs.
This workshop is expected to be the most challenging of the session in terms of the concepts and solutions introduced, mainly due to new tools, but also due to the phenomena of data drift and concept drift.
This part of the course aims to answer the following question: what to do after the model is trained and deployed? The conclusions should be of value to everyone involved in the training and deployment processes, but especially to data scientists responsible for data mining and cleansing and to engineers responsible for model instantiation and its training or fine-tuning, as data and concept drift primarily affect those areas during further support of the deployed solution.
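A bare-bones illustration of one data-drift signal for text, comparing the average embedding of production inputs with the training distribution; real monitoring tools compute far richer statistics, and this numpy sketch only conveys the idea.

```python
# A toy data-drift signal for text: cosine distance between the mean
# embedding of training data and of production data. Illustrative only.
import numpy as np

def embedding_drift(train_emb: np.ndarray, prod_emb: np.ndarray) -> float:
    a, b = train_emb.mean(axis=0), prod_emb.mean(axis=0)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine  # 0 = same direction; larger = more drift

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 384))          # e.g. sentence embeddings
prod = rng.normal(loc=0.3, size=(200, 384))  # shifted distribution
print(embedding_drift(train, prod))
```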
Large Language Models (LLMs) are a type of machine learning model for natural language processing. They are trained on large amounts of text data and can generate human-like text based on the input they are given.
A private LLM is a large language model utilized exclusively by a specific organization. This guarantees data security and privacy, as the model and its associated data are not shared with other entities.
Is my data secure? Yes, especially when you use private LLMs. These models are not shared with other entities, ensuring your data remains secure and complies with your stringent data policies.
Can LLMs be integrated with our existing systems? Yes, LLMs can be seamlessly integrated with clients’ environments such as databases, websites, mobile apps, messaging apps, customer support platforms, and more.
To start implementing LLMs, reach out to us at Cognitum. We’ll discuss your specific needs and how our solutions can help you achieve your goals.
A Generative AI application is a type of artificial intelligence application that creates new content. It learns the patterns and structure of its input training data and then generates new data with similar characteristics.