Presentation: Taking LLMs out of the Black Box: A Practical Guide to Human-in-the-Loop Distillation

Montani: I'll be talking about taking large language models out of the black box, and a few practical tips that hopefully you can apply in your work today. Some of you might know me from my work on spaCy, which is an open-source library for natural language processing in Python. spaCy was really designed from day one to be used in real products and be used in production. That also means we had to do a lot of the boring software development stuff like backwards compatibility, making sure we don't break people's code.

This actually had a pretty nice unintended side effect more in recent times, which is that ChatGPT is actually really good at writing spaCy code. You can try that out if you want. We also developed Prodigy, which is a modern annotation tool for creating training and evaluation data for machine learning developers and machine learning models. Prodigy is fully scriptable in Python, so it allows for a lot of nice, semi-automated workflows, including, actually, a lot of the ideas that I'll be talking about in my talk.

As an industry developing software, we have really built up a lot of best practices over the years and a lot of ideas of how we want our software to behave and what's good software development. One of those is, we want our software to be modular. We want to have components that we can work on independently. We want to have building blocks that we can work with. We also want our tools and our software to be transparent. We want to understand what's going on, be able to look under the hood, and also debug if something is going wrong. That also means we want to be able to explain to others why our system is doing something, or, in the case of a machine learning model, why we get a specific result.

Many solutions also need to be data private. Who works in a field where data privacy is crucial? That's an important topic. We want systems to be reliable, not just randomly go down. Whatever we build also needs to be affordable, whatever that means in a specific context. Maybe you're working on a budget, maybe you have no money. If we're building something, it needs to fit. This introduces a lot of problems if we're looking at new approaches and new technologies. If we're working with black box models, we can't really look under the hood, and we can't really understand what's going on. For a lot of them, if we're using an API, we don't even really know how they work. A lot of that is not public.

We have these big monoliths that do a whole thing at once. That can be very challenging. Because a lot of the newer models are very large, which is also what makes them so good, it's not really efficient or sustainable to run them yourself in-house, which means we need to consume them via a third-party API. That also means we need to send our data to someone else's server. We're at the mercy of an API provider. If it goes down, it goes down. If it's slow, that's how it is. The costs can also really add up if you're using a model via an API at runtime. How can we fix this, and how can we still use all these exciting new technologies that are really good, while also maintaining the best practices that we've built up over years of developing software? That's what I'm going to cover in my talk.

To maybe start off with a practical example, here's something that's maybe familiar to you and maybe similar if you're working in NLP, to something that you were tasked with in your job. Let's say you work for an electronics retailer or a business producing phones, and you're getting lots of reviews in from people, and you've collected them all, and now you want to analyze what people are saying about your product. First, what you want to do is you want to find mentions of your products in these reviews and find those that are relevant. You also want to link them to your catalog that you already have, where you have all meta information about your products. Then you want to extract what people are saying, so you have different categories, like battery, camera, performance, design, and you want to extract whether people like these or not, which is also often referred to as aspect-oriented sentiment analysis.

Then, finally, if you have all that information, you want to add those results to a database, maybe the database that also has your catalog. As you can see, there are some parts of this that are actually quite straightforward. We have, battery life is incredible. That's easy. You can definitely extract that. There are other parts that aren't necessarily so straightforward, because it's language, and language can often be vague. Here, the reviewer says, never had to carry a power bank before, and now I need it all the time. From the context that we have about the language and the world, we know that this means that the battery life is bad. It's not explicit in the text. That's really a use case that benefits from machine learning. There are a lot of larger models these days that are really good at that, because they're trained on such a vast volume of text, and they're really good at understanding this type of context in the text.

Using in-context learning, you could basically prompt the model. You can provide it examples of your text, and it can respond with an answer, for example, of the sentiment you're interested in. Of course, if you think about what you actually need here, you have a specific use case, and you really only need a small subset of what the model is able to do. You really need the context. You're not interested in talking to it about the weather. There's really a very specific thing that you want. The question is, what if we could just take that out? What if we can extract only the part that we're interested in into a model that can also then be much smaller and much more specific, because all we want it to do is predict those sentiment categories.
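As a rough sketch of what that in-context learning step could look like in code, here is one way to prompt a hosted model for the aspect sentiment; the provider, model name, and prompt wording are illustrative assumptions, not something the talk prescribes.

```python
# A minimal in-context learning sketch for aspect sentiment, assuming an
# OpenAI-compatible API. The model name and prompt format are placeholders.
from openai import OpenAI

ASPECTS = ["battery", "camera", "performance", "design"]

PROMPT_TEMPLATE = """You will read a phone review.
For each aspect ({aspects}), answer with one of: positive, negative, none.
Respond with one "aspect: label" pair per line and nothing else.

Review: {review}"""

def classify_review(review: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = PROMPT_TEMPLATE.format(aspects=", ".join(ASPECTS), review=review)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

raw = classify_review("Never had to carry a power bank before, and now I need it all the time.")
```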

While there has been a lot of talk about in-context learning, because that's new, and it's also what a lot of the research focuses on, it doesn't mean that transfer learning is somehow outdated or has been replaced. It's simply a different technique. You might be familiar with models like BERT and their embeddings and all kinds of different local variants, like CamemBERT, and these embeddings encode a lot of very important and very relevant contextual information that you can initialize your model with, and add a task-specific network on top. The thing is, if we're looking at the research, we'll actually see that even the classic BERT base is still very competitive and achieves very good results compared to zero-shot or few-shot in-context learning. What's also important to keep in mind here is that these are all academic benchmarks. These are calculated based on datasets we can't control. That's the idea of a benchmark.
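As a hedged sketch of the transfer learning setup just described, this initializes from BERT-base embeddings and adds a task-specific classification head on top; the label names and hyperparameters are illustrative, not the exact setup from the talk.

```python
# Transfer learning sketch: pretrained BERT-base encoder plus a task-specific
# classification head, one independent score per aspect (multi-label).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ASPECT_LABELS = ["battery", "camera", "performance", "design"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(ASPECT_LABELS),
    problem_type="multi_label_classification",
)

inputs = tokenizer("The battery life is incredible.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # head is untrained here: fine-tune on your own data
scores = torch.sigmoid(logits)[0]
print(dict(zip(ASPECT_LABELS, scores.tolist())))
```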

If we're already getting these very promising and good results using benchmark datasets, we'll probably be able to achieve even better results if we have control over the data, which we do in our use case. What we want to do is, if we want our large generative model to do a specific thing, we start out with our text, and we start out with a prompt based on a prompt template specific to the thing we want to extract. Then we pass that to the model. What we get back is raw output, usually in the form of an answer, depending on the prompt we gave it.

Then we can use a corresponding parser that matches the prompt template in order to parse out the task-specific output. In our case, we're not after a conversation. We're after structured data that expresses the categories that we've already defined. What we also want to do is, we don't just want to match the large language model's performance. Ideally, we want to do even better, because we have the ability to do that. If the model makes mistakes, we want to correct them. What we can do is pass our task output to an annotation workflow and really create a dataset that's very specific to what we're looking to extract. Using transfer learning, we can use that to distill a task-specific model that only performs the one task we're interested in, using what we already have in the weights of the large generative model.
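A minimal parser corresponding to the prompt template could look like this; the expected "aspect: label" answer format is an assumption carried over from the earlier prompt sketch.

```python
# Parse the model's raw text answer into task-specific structured data.
from typing import Dict, Optional

ASPECTS = {"battery", "camera", "performance", "design"}
LABELS = {"positive": True, "negative": False, "none": None}

def parse_response(raw: str) -> Dict[str, Optional[bool]]:
    result: Dict[str, Optional[bool]] = {aspect: None for aspect in ASPECTS}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # ignore anything that is not an "aspect: label" pair
        aspect, label = (part.strip().lower() for part in line.split(":", 1))
        if aspect in ASPECTS and label in LABELS:
            result[aspect] = LABELS[label]
    return result

print(parse_response("battery: negative\ncamera: positive\nperformance: none\ndesign: none"))
# {'battery': False, 'camera': True, 'performance': None, 'design': None}
```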

Close the Gap Between Prototype and Production.

This is both pretty exciting and very promising, but one thing we've definitely seen in practice is that a lot of projects these days really get stuck in this phase that I also call the prototype plateau. You start working. It's all very exciting, and it's working. Then, when it comes to actually shipping the system that you've built, you realize that it doesn't work. There are a lot of reasons for that, and also solutions that are really essential to keep in mind before you start building. In order to close the gap between prototype and production, one essential thing is you want to standardize your inputs and outputs. You want to have the same workflow during prototyping as you have during production.

If your prototype takes random human-generated text that you type in and outputs a human-readable text response, but your production system needs structured data, then you're going to have a problem, and it might not actually translate so well. You also want to start with an evaluation. Just like when you're writing software, you write tests; for a machine learning model, the equivalent is an evaluation. You want examples where you know the answer, that you can check, so you actually know whether your system is improving or not. That's something that's often glossed over.

A lot of people, if they're excited about building something, might go by just a vibes-based evaluation. Does it feel good? If you actually want to assess how your model is performing, you want an evaluation with accuracy scores that you can compare. Of course, accuracy is not everything. Especially if you're coming from a research background, you can be very focused on just optimizing those scores. A high accuracy is useless if the model isn't actually doing what you want it to do, and isn't useful in your application. In an applied context, you don't just want to optimize for a score, you also want to test, is it actually useful? Whatever that means in your context. That also requires working on your data iteratively, just like with code.
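A tiny evaluation harness in this spirit could be as simple as the following sketch: a handful of examples where the answer is known, and an accuracy score you can track over time. The gold examples here are made up for illustration.

```python
# Minimal accuracy evaluation against a small gold set, per aspect.
def accuracy(gold: list[dict], predicted: list[dict], aspect: str) -> float:
    correct = sum(1 for g, p in zip(gold, predicted) if g[aspect] == p[aspect])
    return correct / len(gold)

gold = [
    {"text": "Battery life is incredible.", "battery": True},
    {"text": "Now I need a power bank all the time.", "battery": False},
]
predicted = [{"battery": True}, {"battery": True}]

print(f"battery accuracy: {accuracy(gold, predicted, 'battery'):.2f}")  # 0.50
```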

The first idea you have is usually not what you ship to production, and the same goes for data. You want to have a workflow where you can quickly try things out, and ideally also tooling to help with that, so you don't need to schedule large meetings and spend hours to try out every idea you have. Finally, we're working with language here, and that's really important to keep in mind. While as developers we really like to fit things neatly into boxes, language doesn't work that way. It's usually vaguely gesturing at things. There's a lot of ambiguity in language that we have to keep in mind; it's not just data, it's not just vectors.

On the other hand, we can also use that to our advantage. There's a lot in the language that helps us express things and get our point across, and that generalizes across language very well. If we could identify these parts, we can actually use that to our advantage in the application, and make the problem easier for our model. These are also things that we really thought about a lot when developing our tools. Because I think if you're building developer tools, that's one of the problems you want to address. How can we make it easier for people to standardize workflows between prototype and production, and actually ship things and not just get stuck in the prototype plateau?

Here's an example of a prototype we might build for an application. We have a large generative model, and what we can do, and something that we've actually built with spaCy LLM, is have a way to prompt the model, transform the output, and parse it into structured data. Even while you're trying things out, with no data required, you can use a large language model to create the structured data for you, and what you get out in the end is an object that contains that structured data. You can, of course, ship this to production the way it is, but you can also work on replacing the large generative model at development time so that at runtime you end up with distilled task-specific components that perform only the parts that you're interested in, and that are fully modular and also transparent, and usually much smaller and faster as well. The output in that case is also the same. You're still getting this structured, machine-facing object that you can standardize on.
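As a hedged sketch of what the runtime side can look like with spaCy: the distilled, task-specific pipeline is loaded like any other trained pipeline and returns the same structured, machine-facing object. The package name below is hypothetical.

```python
# Runtime usage of a distilled, task-specific spaCy pipeline (name is made up).
import spacy

nlp = spacy.load("en_spacephone_reviews")  # hypothetical distilled pipeline package
doc = nlp("Never had to carry a power bank before, and now I need it all the time.")

print(doc.cats)  # e.g. {"battery_positive": 0.01, "battery_negative": 0.97, ...}
print(doc.ents)  # product mentions, if the pipeline also has an entity recognizer
```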

As I said before, of course we don't just want to match what the large generative model is doing. We actually want to do better. We want to correct its mistakes. For that, we need a human at some point in the loop. That's a very important step here. To give you an example of how that works, we start off with a model and all the weights it has available. As a first step, as I mentioned before, we want to have a continuous evaluation. We need a way to figure out our baseline. What are we up against? What's the performance we get out of the box without doing anything?

Otherwise, you'll have no idea whether what you're doing actually makes a difference or not. Now we can use all the weights we have available in that model and prompt it, and it will return whatever data we ask for, using everything that it has available. We can pipe that forward into an annotation environment where we can look at just the exact structured data, make corrections, move through that data very quickly, and create a dataset that's really specific to the task, like the aspect-oriented sentiment predictions, for instance. With transfer learning, we can create a component that only performs that. Of course, here it comes in handy that we have our evaluation, because we want to do that until our distilled model meets and ideally also exceeds that baseline. I'll show you some examples of this later, but you might be surprised how easily you can actually do this and apply it yourself.

First, how do we access our human? Going back to that practical example, we have one of these reviews of someone who rated our Nebula phone, "meh". As an example, the type of structured data we're after is something like this. For simplicity, for this example, I'll only focus on assuming we have binary values for those categories. Of course, in some cases, you might want to define some other schema and have a scale of like, how much do people like the battery life, and so on? That's the structured data, that's our output. If we're presenting that to the human, a naive approach would be, let's just show the human the text, give the human the categories, and then let them correct it.
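As a hedged reconstruction of that slide, the structured record with binary values could look something like this in Python; the field names and values are illustrative assumptions, not taken from the talk.

```python
# Illustrative structured output for one review, with binary values per aspect.
review_record = {
    "text": "Meh.",              # placeholder for the Nebula review text
    "product": "Nebula",         # mention to be linked against the catalog
    "aspects": {
        "battery": False,        # negative
        "camera": True,          # positive
        "performance": True,
        "design": False,
    },
}
```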

If you're looking at this, you'll see that this doesn't actually capture these null values. We have no distinction here between a negative response, or no mention of that aspect at all. We can extend this a bit and collect whether it's positive or negative, and have the large generative model make the selection for us. That means you can move through the examples very quickly, and all you have to do is correct the model if it makes a mistake. The big problem we have is that humans are humans, and have a lot of disadvantages. One of them is that humans actually have a cache too and a working memory. If you ask a human to constantly in their head iterate over your label scheme and every aspect that you're interested in, humans are actually quite bad at that.

You'll find that humans really lose focus, end up making mistakes, and humans are very bad at consistency. What you can do instead is help the human, and the human's cache, by making multiple passes over the data, one per category or aspect. While it might seem like a lot more work at first, because you're looking at the same example multiple times, and you're collecting a lot more decisions, it can actually be much faster, because you reduce the cognitive load on the human. I'll show you an example of this later, where a team actually managed to increase their speed by over 10 times by doing this. You have your human, you have a model that helps you create the data, and you're collecting a task-specific dataset that doesn't just match the few-shot or zero-shot baseline, but actually improves upon it.
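A minimal sketch of the "one pass per aspect" idea: instead of asking the annotator to juggle the whole label scheme at once, emit one simple binary question per review and per aspect. The dict-based task format loosely mirrors the streams annotation tools like Prodigy work with, but is kept generic here.

```python
# Generate one binary annotation task per (aspect, review) pair.
ASPECTS = ["battery", "camera", "performance", "design"]

reviews = [
    "Battery life is incredible.",
    "The camera struggles in low light.",
]

def binary_tasks(reviews, aspects):
    for aspect in aspects:          # outer loop: one full pass over the data per aspect
        for text in reviews:
            yield {"text": text, "label": aspect,
                   "question": f"Is the {aspect} mentioned positively?"}

for task in binary_tasks(reviews, ASPECTS):
    print(task)
```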

To give you some examples of how this works and how this can look in practice, this is a case study we did based on a workshop we held at PyData in New York. The task here was, we want to stream in data from the cooking subreddit and extract dishes, ingredients, and equipment from it. We did that together with the group, and also discussed the data while we were doing it. We used a GPT model during the annotation process to help create the data. In the workshop, we were actually able to beat the few-shot LLM baseline of 74%, which is actually pretty good out of the box without any training data. We beat that in the workshop and created a task-specific model that performed the same or even better, and that model was also more than 20 times faster.

If you look at the stats here, we have a model that's 400 megabytes, which is pretty good. You can totally run that yourself; it runs on your laptop, and it runs at over 2,000 words per second, so really fast. For the data development time, we calculated how long it would have taken a single person to create all the data for it. That's about eight hours. That's one standard workday. If you think about other things you spend time on as part of your work, you probably spend more time trying to get CUDA installed or trying to get your GPU running. It's really not true anymore that data development is this absolutely tedious task; even a single developer can do this in a workday. That was very promising.

That also inspired the next case study we did, which was with a company called S&P Global. What they're doing, in this project, is they're extracting commodities trading data in real-time. If crude oil is traded somewhere, they extract the price, the participants, the location, and a wide range of other attributes, and they provide that as a structured feed to their customers in real time. Of course, this is information that can really significantly impact the economy and move markets. Their environment is a high security environment. I actually went to visit them in London a while ago, and even within their office, it's very highly segregated.

They have this glass box that the analysts sit in; you can only access it with a specific card. It's incredibly important that everything they do runs in-house, and that no third party gets to see it before it's published, which is also part of the promise of the data product. That's why their customers are using it. What they did was they moved the dependency on the large language model to development and used that to create data for them. This, plus some optimizations of how they actually present the questions to the human, including having simple, often binary questions and making multiple passes over the data, made the whole process more than 10 times faster using a human and the model in a loop.

They currently have eight pipelines in production, probably even more by now. This was a very successful project. If you look at the stats again, they're achieving very high accuracy. The models are 6 megabytes per pipeline. If you let that sink in, this is really tiny. You can train that on your laptop really easily. They run super fast, at over 16,000 words per second, so they're really a great fit for processing these insights in real time and as quickly as possible. Again, on data development time: with a single person, that's under two workdays, or with two people, you can create the data needed for a distilled task-specific pipeline in about a day. Totally doable, even if you don't have that many resources.

How did they do it? What's the secret? One of them is, if you're thinking about developing AI solutions, they're really code plus data, and just like you refactor code, you can also refactor your data. Refactoring code is probably something you do all the time and are very familiar with. The same really applies to your data development process. There are different aspects of this. One big part of refactoring is breaking down a large problem into individual components, factoring out the different steps, and creating reusable functions. That's something we've really embraced as a best practice, and it has a lot of advantages. You can do the same for your machine learning problem and your models. As part of that, the goal is that you can make your problems easier.

Again, you do this with code a lot: you try to reduce the complexity, and you're allowed to do that, to have an easier system and make it easier for the model as well. One part of that is factoring out business logic, and separating logic that's really specific to your application from logic that's general purpose, that maybe applies to any language and doesn't need any external knowledge. I'll show you an example of that later. Again, that's something you do in your code already, and that works well. You can apply that same idea to your data process.

Part of refactoring is also reassessing dependencies. Do you need to pull in this massive library at runtime when you only use one function from it, or can you replace that? Is there something you can compile at development time so you don't need to use it at runtime? The same is true for machine learning models. Can you move the dependency on the really complex, expensive, and maybe opaque model to development, and have a much cleaner and operationally simpler production environment? Finally, choosing the best techniques: you decide how a specific problem is best solved, and you have this massive toolbox of skills and techniques available, and you pick the one that's the best fit for the task at hand.

One thing people really easily forget is that you are allowed to make your problem easier. This is not a competition. This is not academia. You're allowed to reduce the operational complexity, because less operational complexity means that less can go wrong. When I started programming, I didn't know very much. Of course, what I built was all pretty simple. Then as I got more experience, I learned about all of these new things, and of course, wanted to apply them. My code became a lot more complex. Also, if I'm looking back now, back then, I didn't really write comments because it felt like a sign of weakness. If I found an especially complex solution to my problem, commenting meant that I'm admitting that this was hard, so I didn't do that, which also makes it even harder to figure out what was going on and what I was thinking at the time.

Then, of course, with more experience my code also became much more straightforward, and I was able to pick the best techniques to get the job done and actually solve it most efficiently, instead of coming up with the most complex and interesting solution. I think it's easy to forget this, because we are in a field that is heavily influenced by academia. In research, what you're doing is you're really building a Commons of Knowledge. You also want to compare the things you're building using standard evaluations. If you're comparing algorithms, everyone needs to evaluate them on the same thing, otherwise, we can't compare them. You also standardize everything that's not the novel thing that you are researching and publishing. Even if what you're standardizing isn't the best possible solution or isn't efficient, it doesn't matter. It needs to be standardized so you can focus on the novel thing you're exploring.

On the other hand, if you're building an application and working in applied NLP, what you're doing is you're basically learning from that Commons of Knowledge that was built by academia and provided, and basically pick what works best, and follow some of the latest ideas. You also align your evaluation to project goals. You're not using benchmarks. Your evaluation basically needs to tell you, does this solve the problem, and is this useful in my product or project, or not? You also do whatever works. Whatever gets the job done, you can take advantage of. If that means it's less operationally complex, then that's great.

One big part, as I mentioned, of the refactoring process is separating out what's business logic and what's general-purpose logic. That can be quite tricky, and really requires engaging with your data and your problems. Here we have our SpacePhone review again. If we're looking at that, we can basically break down the two different types of logic in this pseudocode formula. We have the classification task, which is our model that really predicts and processes the language itself. Then we have the business logic which is specific to our application and which can build on top of that.

To give you some examples here, general-purpose classification in our example would be stuff like: what are the products that are mentioned? What's the model of this phone? Is it a phone? Are we comparing the phone to something else? That requires no outside context; it's really inherent to the language, not to our specific problem. Then, on the other hand, we have stuff like our catalog reference. That's external. Nothing in the text tells us that. We also have things like, does it have a touch screen? Is it worse than the iPhone 13? Whether it's the latest model is something that can change tomorrow. We have information that can really change over time.

While we can include that in the model and in the predictions we make, we'll end up with a system that's immediately outdated, that we constantly need to retrain, and a problem that's a lot harder for the model to build some reasoning around, because we have nothing in the text that tells us that, whereas what we do have is we have our catalog reference, we have dates, we have things we can do math with. This process can be very powerful, but of course, it really is absolutely specific to your problem and requires engaging with it.
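A hedged sketch of that split in code: the model handles the general-purpose language part, and the business logic (the catalog lookup, the "is it the latest model?" question) is layered on top, where it can change without retraining. All names and data here are hypothetical.

```python
# Separate general-purpose predictions from application-specific business logic.
from datetime import date

CATALOG = {
    "Nebula": {"catalog_id": "SKU-1042", "released": date(2023, 9, 1)},
    "Nebula Pro": {"catalog_id": "SKU-1077", "released": date(2024, 3, 1)},
}

def analyze(review_text: str, model) -> dict:
    prediction = model(review_text)      # general purpose: product mention + aspect sentiment
    product = prediction["product"]
    entry = CATALOG.get(product, {})     # business logic: external catalog reference
    latest = max(CATALOG, key=lambda name: CATALOG[name]["released"])
    return {
        **prediction,
        "catalog_id": entry.get("catalog_id"),
        "is_latest_model": product == latest,  # can change tomorrow without retraining
    }
```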

To give you an example of this idea in context, this is a more recent case study that we did. What they're doing is they've processed one year's worth of support tickets and usage questions from different platforms, and they want to extract actionable insights. For example, how can we better support our support engineers in answering questions? What are things that we could add to our docs? Also questions like, how are people adopting new capabilities? How many people have upgraded to the latest version? Are people still stuck on an older version? What are potential problems there, and so on? While these things don't necessarily sound like particularly sensitive information, if you think about it, support tickets can actually include a lot of potentially sensitive data, like paths and details of people's setup.

They're working in a high security environment and on a hardened offline machine, so whatever they're building needs to run internally, and it also needs to be rerun whenever they have new tickets and new data coming in. It needs to be very efficient. Another key feature of this project was that it needs to be easy to adapt to new scenarios and new business questions. What the latest version is changes. Capabilities change. The things people are doing change. It needs to be easy to answer different questions that maybe weren't intended when the system was built. Of course, you can do these things as end-to-end prediction tasks, but that means that every time something changes, you need to redo your entire pipeline. Whereas if you can factor out general-purpose capabilities from product-specific logic, it becomes a lot easier to add extraction logic for any other future problems and future questions on top.

A very simple example of this is, you have things like the software version, which is very specific business logic, whereas extracting numbers is general purpose and makes it a lot easier for the model. If you have that, you can add your business logic on top to determine, is this a version of the software? Is this a link to the docs? And so on. I've linked the case study, [website] They've done some pretty interesting things. They also have a pipeline that's super fast, and they're working on adding a conversational output on top. I hope we'll be able to publish more on that, because it's a very cool project that really shows the importance of data refactoring.
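A minimal sketch of that split, with made-up patterns: extracting numbers and URLs is general purpose, while deciding that a dotted number is a software version or that a link points at the docs is business logic layered on top.

```python
# General-purpose extraction first, business rules on top.
import re

VERSION_PATTERN = re.compile(r"\b\d+\.\d+(?:\.\d+)?\b")
URL_PATTERN = re.compile(r"https?://\S+")

def extract(ticket: str) -> dict:
    numbers = VERSION_PATTERN.findall(ticket)   # general purpose
    urls = URL_PATTERN.findall(ticket)          # general purpose
    return {
        "versions": numbers,                               # business rule: dotted numbers are versions
        "docs_links": [u for u in urls if "/docs/" in u],  # business rule: assumed docs URL shape
    }

print(extract("Upgraded to 3.7.1, but https://example.com/docs/setup still shows 3.6."))
```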

What you can see here, is, as developers, we really love to put things into clear boxes and have this idea of like, if we can just have this one model that can do everything, wouldn't that be great? Unfortunately, reality doesn't really work that way. Reality isn't an end-to-end prediction problem. It's actually very nuanced and very complex. Human-in-the-loop distillation and going from a much larger general-purpose model to a much smaller and more efficient task-specific model really is a refactoring process.

You refactor your code, you refactor your data, and that requires engaging with it. Iteration, which, again, is very heavily influenced by the tooling you use, can be a huge help in getting you past that prototype plateau and closing the gap between prototype and production. Because I think at the moment, we're seeing a lot of prototypes being built, but a lot of them also don't make it into production, and that's sad. If we standardize and align our workflows with better tooling, we're actually able to build a prototype and translate that directly into a production system.

Again, you are allowed to make your problems easier. I think with other aspects of software development, we've really learned that making things less operationally complex is better, because it means less can go wrong. If something goes wrong, it becomes a lot easier to diagnose. If you can apply that to machine learning, that's incredibly helpful, and as a result, you also get systems that are much cheaper, much smaller, much faster, entirely private, and much easier to control. There's no need to give up on these best practices, and it's totally possible.

Also, we're working with data here, and as soon as you start engaging with that, you will immediately come across edge cases and things you haven't considered, and ambiguities in the language that are very hard to think of upfront. It's very important to engage with your data, and also to have a process in place that lets you iterate and make changes as needed. I also highly recommend having little workshops internally, like the one we did at PyData, where you can have long discussions about whether Cheetos are a dish or not, or whether the forehead is part of your face. All of these questions are important, and if you can't make a consistent decision, no AI model is magically going to save you and be able to do it for you.

Finally, there's really no need to compromise on software development best practices and data privacy, as you've seen in the talk. Moving dependencies to development really changes the calculation. We can be more ambitious than that, and we should be. We shouldn't stop at having a monolithic model. We can take it one step further and really make the best use of new technologies to allow us to do things that we weren't able to do before, while not making our overall system worse in the process and throwing out a lot of best practices that we've learned. It's absolutely possible. It's really something you can experiment with and apply today.

Participant 1: You mentioned in your talk that with model assistance it's totally feasible and quick to create data in-house. In your experience, how many examples do you think you need in order to create good results?

Montani: Of course, it always depends. You'll be surprised how little you might actually need. Often, even just starting with a few hundred individual examples that are good can really beat the few-shot baseline. It also depends on the amount you choose. It depends on what accuracy figures you want to measure. Do you want to just measure whole numbers? Do you want to measure accuracies like [website] That introduces a different magnitude. I think if you look at some of the case studies I linked, we've also done some experiments where we basically took an existing dataset and trained on small portions of the data, then compared when we beat the LLM baseline.

Often, even just using under 10% of the dataset already gave us really good results. It's really not a lot. I think if you start doing that, you'll really be surprised how little you need with transfer learning. That also, of course, means that what you're doing needs to be good. Garbage in, garbage out. That's also the reason why it's more important than ever to have a process that gives you high-quality data, because you can get by with very little, but it needs to be good.
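A hedged sketch of the kind of experiment mentioned here: train on growing portions of an existing dataset and note when the score passes a fixed few-shot LLM baseline (0.74 is the workshop figure from earlier; the simple classifier is a placeholder, not the talk's setup).

```python
# Learning-curve experiment: how much data does it take to beat the LLM baseline?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def learning_curve(train_texts, train_labels, eval_texts, eval_labels, llm_baseline=0.74):
    for fraction in (0.05, 0.1, 0.25, 0.5, 1.0):
        n = max(2, int(len(train_texts) * fraction))  # assumes each slice contains both classes
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(train_texts[:n], train_labels[:n])
        score = clf.score(eval_texts, eval_labels)
        status = "beats baseline" if score > llm_baseline else "below baseline"
        print(f"{fraction:>5.0%} of data ({n} examples): accuracy {score:.2f} ({status})")
```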

Participant 2: Do you have any guidelines when it comes to comparing structured outputs? In the beginning, it seems like a very simple task, but if you start nesting it, in particular if you have lists on both sides, trying to figure out which entities you're missing can just become so complex. How do you actually get it down to maybe just 5 or 10 numbers, instead of 100, of like, I'm missing the token in entity 5 and missing entity 6 completely?

Montani: There are different ways you can evaluate this. Some evaluations really look at the token level. Others look at the whole entities. Also, there's a difference in how you calculate it if something is missing. Is that false, or do you count partial matches? That's a whole other can of worms in itself. More generally, I think it comes back to that refactoring idea: if you have these entities, is this actually a problem where boundaries are important? Some people often go for named entity recognition because you're like, I can have these spans of text that give me what I want. If you take a step back, in a lot of cases, it actually turns out that you're not even really interested in the spans. You're interested in, does this text contain my phone?

Then that becomes a text classification task, which is generally a lot easier and also gives you better results, because you're not actually comparing boundaries, which are really very sensitive. That's what makes named entity recognition hard. It's very hard to do that consistently. I think refactoring can also help there. Or if you have nested categories, take a step back: do I need these nested categories? Can I maybe come up with a process where I focus on the most important top level first, and then maybe drill down into the subcategories? Or, in that S&P case study, they realized that there are actually some types of information that are relatively straightforward: if we know that it's of this category, we can deterministically decide which sublabels apply, for example. I think it really ties into the refactoring point.
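A small sketch of that refactoring: if all you care about is whether the text mentions your phone, span-level NER annotations can be collapsed into a document-level classification label. The annotation format below is assumed for illustration.

```python
# Collapse span annotations into a document-level label.
def spans_to_doc_label(example: dict, label: str = "PRODUCT") -> dict:
    has_product = any(span["label"] == label for span in example.get("spans", []))
    return {"text": example["text"], "contains_product": has_product}

ner_example = {
    "text": "The Nebula battery dies in half a day.",
    "spans": [{"start": 4, "end": 10, "label": "PRODUCT"}],
}
print(spans_to_doc_label(ner_example))
# {'text': 'The Nebula battery dies in half a day.', 'contains_product': True}
```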

Often, the first label scheme you come up with is usually not the best. You want to pick something that's easy for the machine learning model, and not necessarily translating your business question one to one into a label scheme. That's usually where a lot of the problems happen.

Participant 3: What's the process to find the baseline? Because in my life, it's very hard to find the baseline.

Montani: What the case study companies did is, you create evaluation data. Let's say you have the text categories, is this about battery life, or is the battery life positive? Then you first have evaluation data where you know the correct answer. Then you basically let the LLM predict those, and then you compare it. For example, with spaCy LLM, in that case, you get the exact same output. You can evaluate that pipeline the same way you would evaluate any other model. Or you can try a few-shot approach. Basically, the idea is you let the model make the predictions, and then compare the output to examples where you know the answer, and that gives you the baseline.
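A hedged sketch of that baseline workflow: run both the LLM-backed prototype pipeline and the distilled pipeline over the same gold examples and score them with the same function, so the numbers are directly comparable. The pipeline names are hypothetical, and both are assumed to fill doc.cats the same way.

```python
# One scoring function, applied to the few-shot LLM pipeline and the distilled model.
import spacy

def score(nlp, gold, threshold=0.5):
    correct = 0
    for text, label, expected in gold:
        doc = nlp(text)
        correct += (doc.cats.get(label, 0.0) >= threshold) == expected
    return correct / len(gold)

gold = [("Battery life is incredible.", "battery_positive", True)]

baseline = score(spacy.load("en_llm_prototype"), gold)       # hypothetical LLM-backed pipeline
distilled = score(spacy.load("en_distilled_textcat"), gold)  # hypothetical trained pipeline
print(f"LLM baseline: {baseline:.2f}  distilled: {distilled:.2f}")
```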

Participant 3: For example, you have a multivariable problem when you're trying to evaluate, for example, risks, and you have so many different points, and you don't have reliable data to find the baseline.

Montani: That's also why I think creating data is essential. If you're building something that's reliable, you can't really get around creating a good evaluation. I've also heard people say, we can just pass it to some other LLM to evaluate. It's like, you're still stuck in the cycle. You can build software and not write tests. That's legal. You don't want to do that. Even if you're doing something that's completely unsupervised at the end, you want a good evaluation that also actually matches what you're doing, not just some benchmark dataset. I think that is super essential.

Then once you have that, it lets you calculate a baseline. It lets you test things. I always recommend, do something really basic, like a regular expression, and benchmark that, just to have some comparison. Or do something really simple, because if you find out, I already get really good results on that, does it actually make sense? Or what's my machine learning model up against? I think it's such a key part of it. I think people should talk about it more. Yes, do it properly.

Participant 4: Would you also say that the approach that you described here would work in a similar way, if you basically use the model to then interact with individuals, and maybe, for example, respond based on the comment about the products, and directly interact back.

Montani: How would that look like as an example, if you're building a chat interface?

Participant 4: In this case, I think there was an evaluation, so you don't need to really chat. You just need to maybe respond and say, "Thank you for the evaluation. We're working on improving the battery", or something like that.

Montani: Here you have the model, and you have a prompt, like, here are the categories. Respond with the categories, and whether each is positive or negative. Then you try to get the model to respond as structured as possible, and then you also parse that out so you really get a label: true, false, or none.

Participant 4: Would you, for these kinds of use cases, also use in-house trained LLM, or use the bigger ones on the market?

Montani: You can do both. One thing that's nice here is that, since you're moving the dependency to development instead of runtime, it actually becomes a lot more feasible to run your own open-source LLMs. If you're not relying on it at runtime, it's actually affordable and efficient to just do it in-house, and you can fine-tune it in-house, or you just use something off the shelf, or you use an API that you have access to. I think having the dependency during development is the key and really changes things, so you can use whatever. You're not using the LLM to create any data itself. You're using it to add structure to your data.

Participant 5: Once you have an LLM running in production, do you have any tips on what I can check to see how the data works with the model, and when to retrain it?

Montani: What do you mean by how the data works with the model?

Participant 5: How is the model performing in production, and based on the new data that is coming in, can I have an automated retraining of the same model? Any tips on that?

Montani: I think that ties back into the evaluation as well. Even if you have your model running, you want to capture, what does it output? Whatever your context is. Then have a human look at it and see, is this correct, or is this not correct? How does this change over time? Because you also easily can have the problem of data drift, if the input data changes, if the model changes, which is also a problem you have if you have an API and then the model just changes. That changes a lot.

I think having a QA process in place where you really store what is your model doing at runtime, and then review, is it doing the correct thing? Do that regularly, iterate on that, and also see how is that changing over time as things change. That's kind of the thing of evaluation. You never get out of it. It's not something you do once and then forget about it. You constantly need to iterate and constantly do it if you're actually interested in really getting reliable feedback of how your system is doing.
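A minimal sketch of such a QA loop: store what the model outputs at runtime, sample a slice of it regularly for human review, and track how the agreement changes over time. The storage format and sample size here are arbitrary illustrative choices.

```python
# Log runtime predictions and sample them for periodic human review.
import json
import random
from datetime import datetime, timezone

LOG_PATH = "predictions.jsonl"  # hypothetical log location

def log_prediction(text: str, cats: dict) -> None:
    record = {"time": datetime.now(timezone.utc).isoformat(), "text": text, "cats": cats}
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def sample_for_review(n: int = 50) -> list:
    with open(LOG_PATH, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    return random.sample(records, min(n, len(records)))  # hand these to a human reviewer
```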


Presentation: Zero Waste, Radical Magic, and Italian Graft – Quarkus Efficiency Secrets

Cummins: I'm Holly Cummins. I work for Red Hat. I'm one of the engineers who's helping to build Quarkus. Just as a level set before I start, how many of you are Java folks? How many of you are using Quarkus? How many of you have not even heard of Quarkus? I've worked on Java for most of my career. I'm here to talk about Java. I want to actually start by talking a little bit about Rust. I'm not a Rust developer. I have never developed Rust. I'm not here to criticize Rust, but actually I'm going to start by criticizing Rust. Of course, Rust has so many amazing capabilities. It's so well engineered. It's a really well-loved language. It is incredibly efficient, but Rust does have a problem.

There's a reason I have never learned Rust, which is, Rust has a reputation for being really hard to learn, and I am lazy. This is something that you see everywhere in the community. People talk about how hard Rust is. It's too difficult to be widely adopted. Even people who really advocate strongly for Rust will talk about how hard it is. I love the title of this article, "Why Rust is Worth the Struggle". They start by saying, with Rust, you approach it with trepidation, because it's got this notoriously difficult learning curve. I love this, "Rust is the hardest language up to that time I've met".

When people talk about Rust, people will tell you that Rust doesn't have garbage collection, and that's one of the things that makes it efficient. I have some questions about that. If we start with the assumption that not having garbage collection makes a language performant, which is wrong, but if we start with that assumption, what happens if we add garbage collection to Rust? Now at this point, all of the people who are Rust developers are sort of screaming quietly in the corner, going, why would you do that? What happens if you do that? It turns out, if you do that, Rust becomes much easier to use. They added a layer of garbage collection on top of Rust, and then they had a bunch of volunteers do a coding task. The people who had the garbage collected version were more likely to complete the task, and they did it in a third of the time.

Now I think we really need to rethink the efficiency of Rust, because Rust is very efficient in terms of its computational resources. If you can make something so much easier to use by adding garbage collection, is that really an efficient language? Rust maybe is not so efficient. There's always this tradeoff of, you've got your human efficiency and your machine efficiency, and with Rust, they've really gone all in on the machine efficiency at the expense of human efficiency. That's the tradeoff. I don't like that tradeoff. In fairness to Rust, I think the Rust folks don't like that tradeoff either, which is why they have all of the things like the really powerful compiler. That's something that we'll come back to as well.

The question is, can we do better? This is where Quarkus comes in. Quarkus is a Java framework. The programming model will be very familiar to you. We have integrations with the libraries that you're almost certainly already using, like Hibernate, like RESTEasy, but it's got some really nice characteristics. One of those, and this is probably the thing that people think of when they think of Quarkus, is that Quarkus applications start really fast. You can run Quarkus with GraalVM as a natively compiled binary, or you can run it on OpenJDK. Either way, it starts really fast. If you run it with GraalVM, it actually starts faster than an LED light bulb. Just to give you a sense of how instantaneous the start is. Quarkus applications also have a really low memory footprint. When we used to run on dedicated hardware, this didn't really matter.

Now that we run in the cloud where memory footprint is money, being able to shrink our instances and have a higher deployment density really matters. If you compare Quarkus to the cloud native stack that you're probably all using, if you are architecting for Java, we are a lot smaller. You can fit a lot more Quarkus instances in. It's not just when you compare it to other Java frameworks. When you compare Quarkus even to other programming languages, you can see that we're competing with Go in terms of our deployment density. [website] has a higher deployment density than old-school Java, but it's not as good as Quarkus. This is cool.

There's another thing that Quarkus is quite good at which we don't talk about so much, and I wish we would talk about it more, and that's throughput. If you look at your traditional cloud native stack, you might get about 3000 requests per second. If you are taking Quarkus with the GraalVM native compilation, the throughput is a little bit lower, same order of magnitude, but it's lower. This is your classic tradeoff. You're trading off throughput against footprint. This is something that I think we're probably all familiar with in all sorts of contexts. With native compilation, you get a really great startup time, you get a great memory footprint, but at the expense of throughput.

Many years ago, I worked as a Java performance engineer, and one of the questions we always got was, I don't like all of this stuff, this JIT and that kind of thing, couldn't we do ahead-of-time compilation? The answer was, at that time, no, this is a really terrible idea. Don't do ahead-of-time compilation. It will make your application slower. Now the answer is, it only makes your application a little bit slower, and it makes it so much more compact. Native compilation is a pretty reasonable choice, not for every circumstance, but for some use cases, like CLIs, like serverless. This is an awesome tradeoff, because you're not losing that much throughput. This is a classic tradeoff. This is something that we see. I just grabbed one example, but we see this sort of tradeoff all the time: do I optimize my throughput or do I optimize my memory? Depends what you're doing.

Let's look at the throughput a little bit more, though, because this is the throughput for Quarkus native. What about Quarkus on JVM? It's actually going faster than the alternative, while having a smaller memory footprint and a faster startup time. That's kind of unexpected, and so there is no tradeoff, we just made it better. Really, we took this tradeoff that everybody knows exists, and we broke it. Instead of having to choose between the two, you get both, and they're both better. I always try and think of a name for this double win. I've tried a few. I've tried 2FA.

Someone suggested I should call it the überwinden. I don't speak German, and so it sounded really cool to me, but it's become clear to me now that the person who suggested it also didn't speak German, because whenever I say it to a German person, they start laughing at me. German's a bit like Rust. I always felt like I should learn it, and I never actually did. You may think, yes, this isn't realistic. You can't actually fold a seesaw in half. You can't beat the tradeoff. It turns out you can fold a seesaw in half. There are portable seesaws that can fold in half.

How does this work? What's the secret? Of course, there's not just one thing. It's not like this one performance optimization will allow you to beat all tradeoffs. There's a whole bunch of things. I'll talk about some of the ones that I think are more interesting. Really, with a lot of these, the starting point is, you have to challenge assumptions. In particular, you have to challenge outdated assumptions, because there were things that were a good idea 5 years ago, things that were a good idea 10 years ago, that now are a bad idea. We need to keep revisiting this knowledge that we've baked in. This, I was like, can I do this? Because I don't know if you've heard the saying, when you assume you make an ass of you and me, and this is an African wild ass.

The first assumption that we need to challenge is this idea that we should be dynamic. This one I think is a really hard one, because anybody knows being dynamic is good, and I know being dynamic is good. I was a technical reviewer for the book, "Building Green Software", by Anne. I was reading through, and I kept reading this bit where Anne and Sarah would say, "We need to stop doing this because it's on-demand". I was thinking, that's weird. I always thought on-demand was good. I thought on-demand made things efficient. This is sort of true. Doing something on-demand is a lot better than doing it when there's no demand, and never will be a demand. When you do something on-demand, you're often doing it at the most expensive time. You're often doing it at the worst time. You can optimize further, and you can do something when it hurts you least.

This does need some unlearning, because we definitely, I think, all of us, we have this idea of like, I'm going to be really efficient. I'm going to do it on-demand. No, stop. Being on-demand, being dynamic is how we architected Java for the longest time. Historically, Java frameworks, they were such clever engineering, and they were optimized for really long-lived processes, because we didn't have CI/CD, doing operations was terrible. You just made sure that once you got that thing up, it stayed up, ideally, for a year, maybe two years.

Of course, the world didn't stay the same. What we had to do was we had to learn how to change the engine while the plane was flying, so we got really good at late-binding. We got really good at dynamic binding, so that we could change parts of the system without doing a complete redeployment. Everything was oriented towards, how can I reconfigure this thing without restarting it? Because if I restart it, it might never come up again, because I have experience of these things.

We optimized everything. We optimized Java itself. We optimized all of the frameworks on top of it for dynamism. Of course, this kind of dynamism isn't free, it has a cost. That cost is worth paying if you're getting something for it. Of course, how do we run our applications now? We do not throw them over the wall to the ops team who leave it up for a year, we run things in the cloud.

We run things in containers, and so our applications are immutable. That's how we build them. We have it in a container. Does anybody patch their containers in production? If someone mentioned to you, I patch my containers in production, you'd be like, "What are you doing? Why are you doing that? We have CI/CD. Just rebuild the thing. That's more secure. That's the way to do it". Our framework still has all of this optimization for dynamism, but we're running it in a container, so it's completely pointless. It is waste. Let's have a look at how we've implemented this dynamism in Java. We have a bunch of things that happen at build time, and we have a bunch of things that happen at runtime.

Actually, the bunch of things that happen at build time, it's pretty small. It's pretty much packaging and compilation to bytecode, and that is it. All of the other excitement happens at runtime. The first thing that happens at runtime is the files are loaded. Config files get parsed. Properties files get parsed. The YAML gets parsed. The XML gets parsed. Then once we've done that, then there's classpath scanning, there's annotation discovery. Quite often, because things are dynamic, we try and load classes to see if we should enable or disable features. Then we keep going. Then, eventually the framework will be able to build this metamodel.

Then, after that, we do the things that are quite environment specific. We start the thread pools. We initialize the I/O. Then eventually, after all of that, we're ready to do work. We've done quite a lot of work before we did any work, and this is even before we consider any of the Java elements, like the JIT. What happens if we start this application more than once, then we do all of that work the first time. We do it again the second time. We do it again the third time. We do it again the fourth time, and there's so much work each time. It's a little bit like Groundhog Day, where we're doing the same work each time. Or it's a little bit like a goldfish, where it's got this 30-second memory, and the application has no memory of the answers that it just worked out and it has to do the same introspection each time.

Let's look at some examples. In Hibernate, it will try and bind to a bunch of internal services. For example, it might try and bind to JTA for your transactions. The first thing it does is it doesn't know what's around it, so it says, ok, let me do a reflective load of an implementation. No, it's not there. Let me try another possible implementation. No, it's not there. Let me try another implementation. No, it's not there. It keeps going. Keeps going. It keeps going. Of course, each time it does this reflective load, it's not just the expense of the load, each time a class not found exception is thrown. Throwing exceptions is expensive, and it does this 129 times, because Hibernate has support for a wide range of possible JTA implementations. It does that every single time it starts. This isn't just JTA, there are similar processes for lots of internal services. We see similar problems with footprint.

Again, with Hibernate, it has support for lots of databases, and so it loads the classes for these databases. Then eventually, hopefully, they're never used, and the JIT works out that they're not used, and it unloads them, if you're lucky. Some classes get loaded and then they never get unloaded. For example, the XML parsing classes, once they're loaded, that's it. They're in memory, even if they never get used again. This is that same thing. It's that really sort of forgetful model. There's a lot of these classes. For example, for the Oracle databases, there's 500 classes, and they are only useful if you're running an Oracle database. It affects your startup time. It affects your footprint. It also affects your throughput.

If you look, for example, at how method dispatching works in the JVM, if you have an interface and you've got a bunch of implementations of it. When it tries to invoke the method, it kind of has to do quite a slow path for the dispatch, because it doesn't know which one it's going to at some level. This is called a megamorphic call, and it's slow. If you only have one or two implementations of that interface, the method dispatching is fast. By not loading those classes in the first place, you're actually getting a throughput win, which is quite subtle but quite interesting. The way you fix this is to initialize at build time.

The idea is that instead of redoing all of this work, we redo it once at build time, and then at runtime we only do the bare minimum that's going to be really dependent on the environment. What that means is, if you start repeatedly, you've got that efficiency because you're only doing a small amount of work each time. That is cool. Really, this is about eliminating waste. As a bonus with this, what it means is that if you want to do AOT, if you want to do native in GraalVM, you're in a really good place. Even if you don't do that, even if you're just running on the JVM as a normal application, you've eliminated a whole bunch of wasted, repeated, duplicate, stupid work.
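As a language-agnostic illustration of the "do the discovery once at build time" idea, here is a Python analogy rather than how Quarkus actually implements it: a build step probes which optional backends are importable and writes the answer down, and the runtime just reads that manifest instead of re-probing on every start.

```python
# Build-time initialization analogy: probe once, record the result, reuse at runtime.
import importlib.util
import json

CANDIDATE_BACKENDS = ["psycopg2", "mysqlclient", "sqlite3"]  # illustrative names

def build_step(manifest_path: str = "backends.json") -> None:
    available = [m for m in CANDIDATE_BACKENDS if importlib.util.find_spec(m) is not None]
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(available, f)  # the expensive probing happens once, at build time

def runtime_start(manifest_path: str = "backends.json") -> list:
    with open(manifest_path, encoding="utf-8") as f:
        return json.load(f)  # no scanning, no speculative loading at startup
```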

Really, this is about doing more upfront. The benefits that you get are, it speeds up your start. It shrinks your memory footprint. Then, somewhat unexpectedly, it also improves your throughput. What this means is that, all of the excitement, all of the brains of the framework is now at build time rather than at runtime, and there's lots of frameworks.

One of the things that we did in Quarkus was we noted, we have to make the build process extensible now. You have to be able to extend Quarkus, and extensions have to be able to participate in the build process, because that's where the fun is happening. I think with anything that's oriented around performance, you have to have the right plug-points so that your ecosystem can participate and also contribute performance wins. What we've done in Quarkus is we have a framework which is build steps and build items, and any extension can add build steps and build items.

Then, what we do is, build steps get declared, and then an extension can declare a method that says, I take in this build item, and I output that build item. We use that to dynamically order the build to make sure that things happen at the right time and everything has the information that it needs. The framework automatically figures out what order it should build stuff in. Of course, if you're writing an extension, or even if you're not, you can look to see what's going on with your build, and you can see how long each build step is taking, and get the introspection there.
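
As a rough sketch of what that looks like from an extension author's point of view, assuming the standard Quarkus deployment dependencies are present; the processor and feature names here are invented:

```java
import io.quarkus.deployment.annotations.BuildStep;
import io.quarkus.deployment.builditem.FeatureBuildItem;

// An invented extension processor. Methods annotated with @BuildStep run at
// build time; the build items they consume (parameters) and produce (return
// values) are what Quarkus uses to order the build automatically.
public class MyExtensionProcessor {

    @BuildStep
    FeatureBuildItem feature() {
        // Registers an illustrative feature name that shows up in the startup log.
        return new FeatureBuildItem("my-extension");
    }
}
```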

Some of you are probably thinking, if you move all of the work to build time, and I, as a developer, build locally a lot, that sounds kind of terrible. What we've done to mitigate this is we've got this idea of live coding. I've been in the Quarkus team for about two years. When I joined the team, I always called live coding, hot reload. Every time my colleagues would get really annoyed with me, and they'd be like, it's not hot reload, it's different from hot reload. I think I now understand why. We have three levels of reload, and the framework, which knows a lot about your code, because so much excitement is happening at build time, it knows what the required level of reload is. If it's something like a config file, we can just reload the file, or if it's something like CSS or that kind of thing. If it's something that maybe affects a little bit more of the code base, we have a JVM agent, and so it will do a reload there. It will just dynamically replace the classes.

Or, if it's something pretty invasive that you've changed, it will do a full restart. You can see that full restart took one second, so even when it's completely bringing the whole framework down and bringing it back up again, as a developer, you didn't have to ask it to do it, and as a developer, you probably don't even notice. That's cool. I think this is a really nice synergy here, where, because it starts so fast, it means that live coding is possible. Because as a developer, it will restart, and you'll barely notice. I think this is really significant, because when we think about the software development life cycle, it used to be that hardware was really expensive and programmers were cheap.

Now, things have switched. Hardware is pretty cheap. Hardware is a commodity, but developers are really expensive. I know we shouldn't call people resources, and people are not resources, but on the other hand, when we think about a system, people are resources. Efficiency is making use of your resources in an optimum way to get the maximum value. When we have a system with people, we need to make sure that those people are doing valuable things, that those people are contributing, rather than just sitting and watching things spin.

How do you make people efficient? You should have a programming language that's hard to get wrong, idiot proof. You want strong typing and you want garbage collection. Then, it's about having a tight feedback loop. Whether you're doing automated testing or manual testing, you really need to know that if you did get it wrong despite the idiot proofing, you find out quickly. Then, typing is boring, so we want to do less typing. Java gives us those two, the strong typing and the garbage collection. I just showed that tight feedback loop. What about the typing? With Quarkus, we've looked at the performance, but then we've also really tried to focus on developer joy and making sure that using Quarkus is delightful and fast. One of the things that we do to enable this is indexing. Indexing seems like it's actually just a performance technique, but we see it gives a lot of interesting benefits in terms of the programming model.

Most frameworks, if it's doing anything framework-y and non-trivial, it needs to find all of the classes. It needs to find all of the interfaces that have some annotation, because everything is annotations, because we've learned that everything shouldn't be XML. You also really often have to find all of the classes that implement or extend some class. Annoyingly, even though this is something that almost every Java library does, Java doesn't really give us a lot of help for this. There's nothing in the reflection package that does this. What we've done is we have a library called Jandex, which is basically offline reflection. It's really fast. It indexes things like the annotations, but it also indexes who uses you. You can start to see, this could be quite useful.
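
A small sketch of what using Jandex directly can look like, assuming the org.jboss:jandex library is on the classpath; the class and annotation names being queried are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;

import org.jboss.jandex.ClassInfo;
import org.jboss.jandex.DotName;
import org.jboss.jandex.Index;
import org.jboss.jandex.Indexer;

public class JandexExample {

    public static void main(String[] args) throws IOException {
        Indexer indexer = new Indexer();

        // Index a class file without loading the class into the JVM.
        try (InputStream in = JandexExample.class
                .getResourceAsStream("/com/example/MyService.class")) {
            indexer.index(in);
        }
        Index index = indexer.complete();

        // "Offline reflection": find implementors and annotated classes from
        // the index, with no class loading and no runtime reflection.
        Collection<ClassInfo> impls = index.getAllKnownImplementors(
                DotName.createSimple("com.example.EventHandler"));
        impls.forEach(c -> System.out.println("implementor: " + c.name()));

        index.getAnnotations(DotName.createSimple("jakarta.inject.Singleton"))
                .forEach(a -> System.out.println("annotated: " + a.target()));
    }
}
```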

What kind of things can we do with the index? What we can do is we can go back and we can start challenging more assumptions about what programming looks like, and we can say, what if developers didn't have to do this and that, and this and that? As an example, a little example, I always find it really frustrating when I'm doing logging that I have to initialize my logger, and I have to say, Logger.getLogger, whatever the call is, and tell it what class it's in. I only half the time know what class I'm programming in, and I get this wrong so often because I've cut and pasted the declaration from somewhere else.

Then there's this mistake in the code base, and the logging is wrong. I was like, why do I have to tell you what class you're in when you should know what class you're in, because you're the computer, and I'm just a stupid person? What we've done with Quarkus is exactly that. You don't have to declare your logger. You can just use the static call, [website], and it will have the correct logging with the correct class information. This is so little, but it just makes me so happy. It's so nice. I think this is a good general principle of like, people are stupid and people are lazy. Don't make people tell computers things that the computer already knows, because that's just a waste of everybody's time, and it's a source of errors. When I show this to people, sometimes they like it, and go, that's cool.

Sometimes they go, no, I don't like that, because I have an intuition about performance, I have an intuition about efficiency, and I know that doing that kind of dynamic call is expensive. It's not, because we have the Jandex index, so we can, at build time, use Jandex to find everybody who calls that [website], inject a static field in them, initialize the static field correctly. Because it's done at build time, you don't get that performance drag that you get with something like aspects. Aspects were lovely in many ways, but we all stopped using them, and one of the reasons was the performance of them was a bit scary. We assume that we can't do this thing that we really want to do because we assume it's expensive, it's not anymore. It gets compiled down to that. You can see that that is pretty inoffensive code. I don't think anybody would object to that code in their code base.
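
As a concrete sketch of the two styles side by side: the PaymentService class is invented, and the Log variant only works inside a Quarkus application because it relies on that build-time rewriting:

```java
import org.jboss.logging.Logger;

import io.quarkus.logging.Log;

public class PaymentService {

    // Traditional style: easy to copy-paste the wrong class name into getLogger().
    private static final Logger LOG = Logger.getLogger(PaymentService.class);

    void payTraditional() {
        LOG.info("processing payment");
    }

    void payWithQuarkusLog() {
        // No logger declaration needed: at build time, Quarkus finds this call,
        // injects a static logger for the enclosing class, and initializes it
        // with the correct class name.
        Log.info("processing payment");
    }
}
```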

Let's look at a more complex example. With Hibernate, obviously, Hibernate saves you a great deal of time, but you still end up with quite a bit of boilerplate in Hibernate, and repeated code. Things like, if I want to do a listAll query, you have to declare that for every entity. It's just a little bit annoying. You think, couldn't I just have a superclass that would have all of that stuff that's always the same? What we can do with Hibernate, if you have your repository class, what we can do is we can just get rid of all of that code, and then we can just have a Panache repository that we extend.

That's the repository pattern, where you have a data access object because your entity is a bit stupid. For me, I find an active record pattern a lot more natural. Here I just have my entity, and everything that I want to do with my entity is on the entity. That's normally not possible with normal Hibernate, but with Hibernate with Panache, which is something that the Quarkus team have developed, you can do that. Again, you've got that superclass, so you don't have to do much work, and it all works. One interesting thing about this is it seems so natural. It seems like, why is this even hard?
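
A sketch of the two Panache styles being described, with an invented Person entity; this mirrors the documented Panache patterns rather than any particular production code:

```java
import java.util.List;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.persistence.Entity;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import io.quarkus.hibernate.orm.panache.PanacheRepository;

// Active record style: the entity itself carries its operations. The id field,
// persist(), listAll(), and friends come from the PanacheEntity supertype.
@Entity
public class Person extends PanacheEntity {
    public String name;

    public static List<Person> findByName(String name) {
        return list("name", name); // inherited static query helper
    }
}

// Repository style: a separate data access object with the same inherited
// operations (listAll(), findById(), persist(), ...) from PanacheRepository.
@ApplicationScoped
class PersonRepository implements PanacheRepository<Person> {
}
```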

Of course, I can inherit from a superclass and have the brains on the superclass. With how Hibernate is working, it's actually really hard. If I was to implement this from scratch, I might do something like, I would have my PanacheEntity, and then it would return a list. The signature can be generic. It's ok to say, it just returns a list of entities. In terms of the implementation, I don't actually know what entity to query, because I'm in a generic superclass. It can't be done, unless you have an index, and unless you're doing lots of instrumentation at build time. Because here what you can do is you use the superclass as a marker, and then you make your actual changes to the subclass, where you know what entity you're talking to. This is one of those cases where we broke the tradeoff: the machine efficiency of having the index enabled the human efficiency of the nice programming model.

Some people are probably still going, no, I have been burned before. I used Lombok once, and once I got into production, I knew that magic should be avoided at all cost. This is something that the Quarkus team have been very aware of. When I was preparing for this talk, I asked them, under the covers, what's the difference between what we do and something like Lombok? Several of the Quarkus team started screaming. They know that, with this, what you want is you want something that makes sense to the debugger, and you want something where the magic is optional. Like that logging, some of my team really like it.

Some of my team don't use it because they want to do it by hand. Panache, some people really like it. Some of the team just use normal Hibernate. All of these attributes are really optional. They're a happy side effect. They're not like the compulsory thing. I think again, this is a question of efficiency. What we see with a lot of these frameworks, or some of these low-code things, is they make such good demos, but then as soon as you want to go off the golden path, off the happy path, you spend so long fighting it that you lose any gain that you maybe had from that initial thing. Really, we've tried to optimize for real use, not just things that look slick in demos.

The Common Factor Behind Performance Improvements.

I've talked about a few of the things that we do, but there's a lot of them. When I was preparing this talk, I was trying to think, is there some common factor that I can pull out? I started thinking about it. This is my colleague, Sanne Grinovero. He was really sort of developer zero on Quarkus. He did the work with Hibernate to allow Hibernate to boot in advance. This is my colleague, Francesco Nigro. He's our performance engineer, and he does some really impressive performance fixes. This is another colleague, this is Mario Fusco. He's not actually in the Quarkus team. He tends to do a lot of work on things like Drools, but he's given us some really big performance fixes too.

For example, with Quarkus and Loom, so virtual threads, we had really early support for virtual threads back when it was a tech preview. What we found was that virtual threads, you hope that it's going to be like a magic go faster switch, and it is not, for a number of reasons. One of the reasons is that some libraries interact really badly with virtual threads, and so some libraries will do things like pinning the carrier thread. When that happens, everything grinds to a halt. Jackson had that behavior. Mario contributed some PRs to Jackson that allowed that problem in Jackson to be solved, so that Jackson would work well with virtual threads.

I was looking and I was like, what is that common factor? What is it? I realized they're Italian. This is a classic example of confirmation bias. I decided the key to our performance was being Italian. Without even realizing it, I looked for the Italians who'd done good performance work. When we do a Quarkus release, we give out a T-shirt that says, I made Quarkus. On the most recent release, we gave out 900 T-shirts. There's a lot of contributors. A lot of people have done really cool engineering on Quarkus, only some of them were Italian. You don't have to be Italian to be good at performance, in case anybody is feeling anxious. The title of this talk is Italian graft, and so being Italian is optional, but the graft part is not. This stuff is work. When you're doing that kind of performance optimization, you have to be guided by the data, and you have to do a lot of graft. You measure, because you don't want to do anything without measuring.

Then you find some tiny improvement, and you shave it off. Then you measure and you find some tiny improvement, and you shave a little bit of time off. You measure, and then you find some tiny improvement. This was very much what we saw in this morning's talk as well. It was in C rather than Java, but it was the same thing. If I'm going to profile, then I'm going to find some tiny optimization that I'm going to do. You keep going and you keep going. It's not easy, so it needs a lot of skill, and it also needs a lot of hard work. I mentioned Francesco, our performance engineer, and he really is like a dog with a bone. When he sees a problem, he'll just go and go. I think a lot of the rest of us would have gone, "Ooh", and he just keeps going. He has this idea that what he offers to the team is obsession as a service. You need people like that.

I want to give one example. We run the TechEmpower benchmarks, and what we found was we were behaving really unexpectedly badly when there was a large number of cores. With a small number of cores, our flame graph looked as we hoped. When it was a lot of cores, all of a sudden, our flame graph had this really weird shape, and there was this flat bit, and we're like, what's going on there? Why is no work happening in this section of the flame graph? Again, many people would have gone, what a shame. To find out, Francesco and Andrew Haley, another colleague, read 20,000 lines of assembler. What they found was worth it. They found the pattern that was causing the scalability problem, and the pattern was checking if something is an instanceof.

At this point, hopefully some of you are screaming as well and going, I think there's a lot of that. That's not a weird, obscure pattern, that is a very common pattern. Once Franz had found the problematic pattern, he started to look at what other libraries might be affected. We found Quarkus was affected. Netty was affected. Hibernate was affected. Camel was affected. The Java Class library was affected. This was a really big, really bad bug. He found actually that there was an existing bug, but nobody had really realized the impact of it. I think this is partly because it happens when you've got like 32 cores, when you've got like 64 cores. We're now much more often running at that kind of scale. It's a cache pollution problem.

The problem is, when you do this check, the cache that is used for this check is shared across all of the cores. If you've got a lot of code running in parallel, basically the cache just keeps getting corrupted, and then you just keep having to redo the work. This was a bad problem. This was not like that saving 2%. This is one of the TechEmpower benchmarks, and this was running before the fix and running after the fix. You can see we went from [website] million requests per second to [website] million requests per second. That's just a small benchmark, but it was a huge improvement.
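
To make the kind of pattern concrete, here is a purely illustrative sketch, not the actual Quarkus, Netty, or Hibernate patches: a hot, concurrent type check against an interface, and one possible shape of workaround that keeps the common path away from the shared per-class cache:

```java
// Illustrative only -- not the actual patches. The problematic shape is a hot,
// highly concurrent type check against an interface (a "secondary" supertype),
// which goes through a single-entry per-class cache shared across cores.
interface InternalBuffer {
}

final class PooledBuffer implements InternalBuffer {
}

class Dispatcher {

    // Before: under heavy parallelism with mixed checks, the shared cache keeps
    // being overwritten and the check keeps falling back to the slow lookup.
    static boolean isInternalSlow(Object buffer) {
        return buffer instanceof InternalBuffer;
    }

    // After (one possible workaround): test the overwhelmingly common concrete
    // class first, so the hot path rarely touches that cache, and only fall
    // back to the interface check for the rare general case.
    static boolean isInternalFast(Object buffer) {
        return buffer.getClass() == PooledBuffer.class
                || buffer instanceof InternalBuffer;
    }
}
```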

What we did was, Franz wrote a little tool, because not every instanceof call is problematic. It depends on various factors. He wrote a tool that would go through and detect the problematic case. We ran it through the whole code base, and we started doing the fixes. It's very sad, because this is fixed in the JVM now, but only on the sort of head, so people won't get the benefit of the fix for quite a while. We had code that was, for example, like this. Then after the fix, you can see we had to do all of this stuff.

Again, you don't need to necessarily read the code, but you can just see that the throughput is a lot higher, but the code is a lot longer, so it's again exactly the same as Alan's talk. You have this tradeoff. I love it for this one, because the developer did the PR and then they basically apologized for the code that they're doing in the PR. I'm not a fan of the fix. It's not idiomatic. It's difficult to maintain, but it gives us so much more throughput that we have to do it. Again, it's that tradeoff of machine efficiency against human efficiency. Only in this case, it's not everybody else's efficiency, it's just my team's efficiency. This is what Anne was talking about when she said, you really want your platform to be doing the hard, grotty, nasty work so that you can have the delightful performance experience. We do the nasty fixes so that hopefully other people don't have to.

Another thing to note about efficiency is it's not a one-time activity. It's not like you can have the big bang, and you can go, yes, we doubled the throughput, or halved the cost. Life happens, and these things just tend to backslide. A while ago, Matt Raible was doing some benchmarking, and he showed that this version of Quarkus was much slower than the previous version. We thought, that's odd. That's the opposite of what we hoped would happen. Then we asked, "Are we measuring our performance?" Yes. "Don't we look to see if we're getting better or worse?" Yes. "What happened?" What it is, is, if you look at that bit of the chart, is the performance getting better or worse there? It looks like the performance is getting much better. If you look at it over the longer picture, you can see that actually it's probably getting a little bit worse, because we had this really big regression that masked a series of smaller regressions.

We had a change detection algorithm that was parametric, and it meant that we missed this regression. We did the work and we fixed it, and we fixed a lot. It was very cool. That was another engineer who was not Italian, called Roberto Cortez. One of the things that Roberto did, which just makes me smile, is, again, about the assumptions. We do a lot of string comparison in config. Config tends to be names based, and the way any normal human being would do a string comparison is you start at the first character, and then you go. The interesting bit is always at the end. Roberto worked out that if you go from the other end, the config comparison is much faster. I would recommend you all to have a Francesco, to have a performance engineer. You can't have Francesco, he's ours, but you need to find your own. It does need investment.
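
A minimal sketch of that reversed-comparison idea, assuming nothing about the actual implementation: configuration keys tend to share long prefixes, so comparing from the last character usually finds a difference much sooner:

```java
// Illustrative sketch, not the real implementation. Config keys share long
// prefixes (e.g. "quarkus.datasource.jdbc..."), so scanning from the end
// usually hits a difference much sooner than scanning from the start.
final class ConfigKeys {

    static boolean equalsFromTheEnd(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = a.length() - 1; i >= 0; i--) {
            if (a.charAt(i) != b.charAt(i)) {
                return false; // differences cluster near the end for config names
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(equalsFromTheEnd("quarkus.http.port", "quarkus.http.host")); // false
        System.out.println(equalsFromTheEnd("quarkus.http.port", "quarkus.http.port")); // true
    }
}
```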

I've got one last tradeoff I want to talk about. This is the efficient languages track, but we really do have a green focus here. There's this classic tradeoff with sustainability between doing the stuff that we want to do and saving the planet. In general, historically, we have always tended to do the stuff we want to do rather than save the planet. I think there is some hope here. I've started talking about something called the vrroooom model. Naming is the hardest problem in computer science, because I didn't think to actually do a Google before I did the name. It turns out there is a vroom model, which is a decision model. That's with a slightly different spelling than I did. I did 3r's and 2o's and stuff, which was another terrible mistake.

My vrroooom model, the one that doesn't involve sexy cars, I really started thinking about this when I looked at the paper. We were talking about this before, and Chris noted, you know that stupid paper that compares the programming languages, and there's a lot of problems with this paper. What I want to show you is not the details of it, but something that I noticed, which is, it has a column for energy and it has a column for time, and they look kind of almost the same.

If you plot it, you can confirm that this trend line is basically straight. It means languages that go fast have a low carbon footprint. We see this with Quarkus. With Quarkus on this graph, we benchmarked the energy consumption of Quarkus native, Quarkus on JVM, the other framework on JVM, the other framework on native. What we did was we had a single instance, and we just fired load at it until it ran out of throughput. The shorter lines are where it ran out of throughput earlier. Lower is better; lower means a lower carbon footprint. You can see that there's, again, this really strong correlation. Quarkus on JVM has the lowest carbon footprint of any of these options because it has the highest throughput. It's the win-win again, that you get to have the really fast language and have the nice programming model and also save the world. We beat the tradeoff.

I just love this that instead of having this opposition between machine efficiency and human efficiency, the one helps us gain the other. If you start with efficient languages, you really need to consider both machine efficiency and human efficiency. When you're looking at your machine efficiency, you need to challenge your assumptions. Only do work once, obviously. Move work to where it hurts the least. Index. Indexes are so cheap, they're so good, they solve so many problems. Unfortunately, this isn't a one-off activity. You do need that continued investment in efficiency. Then, when you look at your human efficiency again, same thing, you need to challenge your assumptions. You need to get those feedback loops as small as you can. Don't make people tell the computer what the computer already knows, because that's a waste of everybody's time.

Presentation: Unveiling the Tech Underpinning FinTech's Revolution

Grzesik: The core question that we wanted to start with is, how is it that some organizations can actually deliver software and do it quite well and do it consistently, and some can't? We will not explore the can't part; we will explore the yes-can part, how it is possible, and what the shared experiences are that Wojtek and I have from the places we worked at. The answer, usually, is not something on the surface, so it's not trivial, which you probably know. One thing that we found to be at the core is something that comes down to culture.

The Invisible Engine of Success: Culture (Westrum's Model).

Ptak: Speaking of culture, we really wanted to talk to you about one of the concepts that fits well into the topic, Westrum's organizational culture model. We really wanted to use Westrum's culture model. It's well ingrained into DevOps as well. If you look for DORA and Westrum, you will find really good reading materials on the DORA website regarding the Westrum culture model. We really wanted to touch briefly on it. You can think about what kind of culture model you are in. Westrum spent his scientific career studying organizations that really work well, and that's also how we know about his work in the DevOps realm. The model consists of three types of organizations. The first is pathological. These are the types of organizations where several attributes tell you that, actually, they're really pathological.

One of them would be that, for instance, cooperation is really low between different teams, different departments. We know Conway's Law, of course. Messengers, meaning the people, whistleblowers, they will be shot on sight. Meaning, of course, we will reject them, and so on. This is the type of organization that is most likely to die in the current world because, of course, collaboration between teams is very low. They do not innovate very well. The second type of organization that Westrum described was bureaucratic. This is the type of organization which is really rule oriented. We have rules and we stick to the rules, process heavy, bureaucratic heavy. Looking at the same capabilities, you can guess what it means. Collaboration and cooperation will be quite modest. Messengers, they will be neglected.

Probably, if the rules approve and they play by the rules, it's fine, but they will most likely be neglected. Very critically, of course, if there is an incident, we look for justice, usually, in this type of organization.

Grzesik: Its responsibilities are narrowing, so the scope that people take onto themselves narrows further down, until eventually it ends up like the previous one, where nobody is responsible for anything. Here there is still some responsibility, but not too much.

Ptak: Then there is the third type of organization, the generative one, which we really wanted to look into, and why they're successful. These are the ones that you really want to work for. Very high collaboration and cooperation. Messengers, we train people to be whistleblowers. We train people to look for opportunities to learn. We look at every failure as an opportunity for learning. We really train people to do it.

Grzesik: Organizations seek signals wherever they appear, and they want it. They do it consciously because they absolutely want to be aware of what's happening, and they want to make decisions based on that.

My name is Andrzej Grzesik. I've been a head of engineering and a principal engineer. I now build distributed systems at a firm called LMAX. It's an exchange that does high-performance Java, the nanosecond-counting kind of Java. I like building systems. I'm proud to be a Java champion. I also run a conference and a Java User Group, and speak at conferences. I like technology for some strange reason.

Ptak: I'm Wojtek. I'm former CTO. I also had my own organization. Now I'm an engineering executive working with Revolut for 2 years, almost exactly. I'm responsible for Revolut Business. If you use Revolut Business, I'm the guy to talk to about the bugs, if you have any, of course. I'm also co-host of a community initiative called CTO Morning Coffee, where we really want to train the next generation of engineering leaders.

Revolut has a family of products. You probably know retail. It's very popular in England. As far as I know, we're number one in England. Business is also, as far as I remember, the number one B2B solution. We also have an app called Trader coming very soon in a reshaped form; that will also be a separate application, as will the junior version of Revolut. We're definitely experiencing hyper growth.

Two years with Revolut, and we have grown two and a half times since I joined. We have over 40 million retail customers. The business side, for instance, is growing almost 70% year-over-year now, and really accelerating. That's critical for me, because it sets the context: I'm working in an organization that grows really fast, and the growth is actually accelerating. At the same time, it's very rapid product development. Usually, teams will have at least several deploys per day to production. Lead time for changes, one of the DORA metrics, is usually way less than three hours.

Grzesik: When I joined Revolut, I was head of backend engineering. Backend was 120 people; when I left it was 400, in under 3 years. That's quite a growth. I'm quite proud of the things that were built there, and of all the examples about Revolut that Wojtek is going to speak about.

Ptak: Coming back to the Westrum organization model, there is actually a practical first thing that we want to tell you so you can recognize the type of organization you're in: how to measure the culture. Ask your team the following questions, rated from strongly disagree, through neither agree nor disagree, to strongly agree.

Grzesik: If you're a leader, run a survey across all of your teams in all departments, and you'll get signals. Those questions, like "information is actively sought", how do people rate them, from 1 to 5, 1 to 7, whatever scale you want, something that gives you a range. Then, if you notice good spots, good; if you notice bad spots, maybe teams, maybe departments, maybe some areas, you will have places to begin.

Ptak: Hopefully you recognize Gene Kim, one of the people who really started the DevOps revolution. He has a podcast, and we definitely recommend it. There are two episodes with Dr. Westrum.

With companies like that, let's talk about Conway's Law and scaling the architecture. As we discussed, we are at the hyper growth scale. How to scale? Usually what would happen is you see something like this. What do you see?

Grzesik: A famous picture of teams, complexity, services, whatever you call it is there. Something that's not immediately visible is the amount of connections, and how do you get from one far end to another? That's actually a problem that organizations, as they scale up, get into. That's something that I like to call problem of knowledge discovery. How do we know what we know?

How do we know who knows what? How do we know who the good people are to ask a question about a service, about how the code is organized? Who are the people to get approval from? All of those questions. What services do we have? How reliable are they?

Ptak: If I have an incident, why do I have it? How many services that I was dependent on are in the chain? Between the database and my endpoint, how many services are truly there?

Grzesik: If you have a payment flow that needs to be processed, which services are on the critical path, and so on. For the backend services, there was even a talk about Spotify's Backstage at QCon. Backstage, if you haven't been there, it looks like that. It's a catalog of services which has a plug-in architecture, which gives an information radiation solution to the problem. That's very nice and awesome because it allows services to be discovered, so people can know what services there are in the organizations. What's the description? What's the role? What's the responsibility? What's the materiality? Which means, what happens if it goes down? How essential is it to business operation? What are the SLOs, SLAs? Aspirational and contractual expectations of quality? How do we monitor? How do we get there? Upstream, downstream dependencies.

Basically, what's the shape of the system? If you have more than a certain number of services, you want that; otherwise, it's hard to fish out from the code. Anything else that's relevant. Backstage solves it for many people. Backstage has plugins, but not everybody uses Backstage. What does Revolut use?

Ptak: Revolut has its own solution, and I'm going to talk about a couple of points which are significant. It's called Tower. It gives us technology governance, so everything is described in code. It's trackable. It's fully auditable. It's fully shareable. It looks like that, nice interface. I can go there, look for any service, any database, pretty much any component in the infrastructure, and get all of the details which we discussed. Including the Slack channels of the teams, including the Confluence pages with the documentation, SLOs, SLAs, logs, CI/CD pipelines. I know everything. Even if we have this massive amount of services, I know exactly where to look for the information. Regarding the dependencies, here it is.

For instance, I can get all of the dependencies. We're also working on an extension which will allow us to understand the event-based dependencies as well, so the asynchronous ones. That's a big problem in a large distributed system. That's actually coming very soon. As a leader of the team, I can also understand all of the problems that I have through several scorecards. We can also define our own scorecards, so I can, for instance, ensure that I have full visibility. What teams? How do they work? How do they actually maintain the services?

Coming back to our example, what else do you see?

Grzesik: We have a beautiful picture and we have a system, but as we build, as we have more services or we have more components, we have a system which is complex, because the business reality that we're dealing with is complex. Now that we've introduced more moving parts, more connections, we've made a complex system even more complicated. Then, how can we deal with that? There is a tool that we all very much agree that is a way to go forward with that, and that tool is systems thinking.

Ptak: Systems thinking is a helpful model to understand the whole system that we're actually talking about, for instance, a FinTech banking solution. Complexity, as Andrzej stated, can come from, for instance, compliance, [inaudible 00:13:47]. Complication is something we're inviting. There are two critical definitions that I really wanted to touch on. In systems thinking we have one definition, which is randomness. Randomness of the system means that we cannot really predict it.

Grzesik: It's things beyond our control. Things that will manifest themselves in a different random way that we have to deal with, because they are part of our team.

Ptak: They're unpredictable. Or we see it as noise in the data. As I introduced, there is complexity, which is there by design. For instance, onboarding: in the business, we're present in over 30 countries. Onboarding any business is complex by definition. We cannot simplify it. It's complex because, for instance, you need to support all of the jurisdictions. You need to make sure that you're compliant with all of the rules. That is very well described in several books. The one that we're using for this example is Weinberg. Weinberg is a super prolific author, so a lot of books. That one comes from "An Introduction to General Systems Thinking". He proposed a model where there are three types of complexity in our systems.

Grzesik: The very first one, the easiest one, is organized simplicity. That's the low randomness, low complexity realm of well understood things. These can be things that we handle with our known stack and known services, problems that we know how to solve. They are business as usual. They are trivial. There is nothing magical there. There shouldn't be anything magical there. If we keep the number of them low, and we keep them at bay, they are not going to complicate our lives too much.

Ptak: If you make things more complicated, as you can see on the axis, so introduce randomness, you will get to the realm of unorganized complexity. You will get a high randomness. If you have many moving parts, and each of them introduce some randomness, they sum up, multiply even sometimes. The problem is that actually the system gets really unorganized. Our job is to make sure that we get to the realm of organized complexity.

Grzesik: Which is where our system organizes. Business flows are going to use this technology in a creative way to solve a problem. That's what we do when we build systems, not only technical, but in the process and people and interactions and customer support sense of things, so that a firm can operate and people can use it, and everybody is happy.

How Do We Introduce Randomness into Our Systems?

There is a problem: as the system grows, it's going to have a broader surface area, and that's normal because it's bigger. Which means there's going to be randomness that happens there, and then there is some randomness that people want to introduce, like having multiple stacks for every single service.

Grzesik: Can I put yet another approach to solving the problem that we have, because I like the technology for it?

Ptak: Can I get another cloud provider? You know where this is going. How do we actually introduce that randomness into our systems? How do we make our system complicated, and therefore slower? Because you need to manage the randomness. From our perspective, as we discussed, we see three really critical sources of randomness, where you invite problems into your organization. The first is the number of frameworks and tools that you have. If you allow each team to have their own stack, the randomness and the complication of the overall system, all of the dots that you can see connected, go through the roof.

Grzesik: Then you have problems like, there is a team that's used to Java that has to read a Kotlin service; maybe they will be ok. Then they have to look at a Rust service and a Go service, and then, how do I even compile it? What do I need to run it? That gets complex. If there is a database that I know how it works, I have a mental model for its consistency and scaling. Then somebody used something completely different. It becomes complex. Then there is an API that always speaks REST. Somebody puts a different style of API in there, and then you have another model to deal with. There is that complexity, which is sometimes not really life changing, but it just adds on.

Ptak: Another thing is differences in processes. A lot of organizations will understand agile as, let people choose whatever they want to do, make sure that they deliver. A lot of people will actually have their own processes. The more processes, the more different they are, the bigger problem we have, the bigger randomness. Same with the skills.

Grzesik: Same with skills. Both of those areas mean that the answer to, how do we solve a problem, or how do we reach a solution to a problem in our area, will differ across teams. That means that it's harder to transfer learnings, and that means that you have to find two answers in any organization rather than one, and you can't just apply the same pattern in every single place. If you have a team that follows TDD, you know that you're going to get tests. If there is a team that would like to do testing differently, then the quality of tests might differ across solutions. What we are advocating, having seen what happens when you don't, is standardizing and automating everything.

Ptak: You start to take care of things that are not essential to your business. You start to use the energy of your teams not to build stuff, not to build your products, not to scale, but to solve problems that are really not essential to your business. As somebody put it, we should be focusing on the right things.

What's the Revolut approach? Simplified architecture by design. We try to really reduce the randomness. Simplicity standards are enforced, so you cannot just use whatever you want. We enforce a certain set of technologies; I'll touch on it. It's enforced by our tooling. We really optimize for a very short feedback loop. First, to talk about architecture. It's designed, supported by the infrastructure and our architecture framework, as a service topology. Every service has its own predefined type. Every definition contains how it should behave, how it should be exposed, what it should be integrated with. How does it integrate? If and how does it integrate with the database, and so on? The critical ones are the frontend service, so the resource definitions.

The flow service, which is the business logic orchestration. The state service, which is the domain model. That actually gives us the comfort that we know exactly what to expect when you open a service. You know exactly what will be there, how it will be structured, and so on. Revolut's architecture is super simple in this way. It's really simple. It's to the level where it's really vanilla Java services with our own internal DDD framework, deployed on Kubernetes. That's all. We use some networking services from Google. The persistence layer, that's interesting. Postgres is used as the database, as the event store and as the event stream. We have our own in-house solution. Why? Why not use Kafka and so on, Pub/Sub? Because we know exactly how to deal with databases.

We know exactly how to monitor them, how to scale them. If you introduce into a very critical domain like banking a technology that is not exactly built for these purposes, you introduce randomness, and you will need to build workarounds around that randomness. Data analytics, of course, has its own set of capabilities. Here is an example, coming back to my screenshot. These things are enforced.

Grzesik: A stack of Java services, with CI/CD; what is it? Is it a template?

Ptak: It's a definition, and we know exactly what to expect. The whole CI/CD and monitoring will be preset. When you define your service, everything will be preset for you. You don't need to worry about things that shouldn't be a problem for you. Otherwise you're not solving a business problem, but worrying about, for instance, CI/CD or monitoring. You need to focus on the business logic. That's what we optimize for our engineers.

Grzesik: We have information being radiated. We have things that are templated. We have a simple architecture. Then, still, how do you do it well? How can you answer that question?

Ptak: We wanted to go through heuristics of trouble. We really want to ask you to see how painful it is for you. We have some examples that you probably can hear in your teams. The first one would be.

Grzesik: Why do people commit to our code base? Have you ever heard it?

Ptak: That would be a sign of blurred boundaries: no clear ownership, conflicting architecture drivers that lead to "I don't care" solutions.

Grzesik: That's this randomness that we mentioned before. It's ok for people to commit to other services. Absolutely, it's a model that the corporation I work at uses. I think your place also uses that. The thing is, somebody should be responsible. Somebody should review it. That's the gap here. If somebody commits without thinking, that's going to be strange.

Ptak: Another thing that you can hear in the teams. I wonder when you last heard it: "it's them", whenever something happens.

Grzesik: "It's them. This incident is not ours. They've added this. They should fix it". If you connect it with Westrum's bureaucratic model, this is exactly how it manifests in a place. In the grand scheme of things, if everybody works in the same business and everybody wants shared success of the business, this is not the right attitude. How do you notice? By those comments, in Slack, maybe in conversations, maybe by the water cooler, if you still go to the office.

Ptak: It's a lack of ownership. We see the blame culture, fear of innovation, and actually good people will quit. The problem is that the others will stay. That's a big problem for the organization. The same goes, of course, for deployable modules and teams; that's also key to understand regarding ownership. Another thing that you may hear: let me fix it.

Grzesik: There is a person, or maybe a team, that they're amazing because they fix all the problems. They are so engaged. They run on adrenaline. They almost maybe never sleep, which fixes the problem. You've met them, probably. The problem that they generate is they create knowledge silos, because they know how things work, nobody else does. They also reduce ownership because, if we break something, somebody else is going to fix that. That's not great. Because of how intensively they work, they risk burnout, which is a human thing, and it happens. Then somebody can operate at this pace and at this level for maybe a couple of months, maybe a couple of years. Eventually something happens, and they are no longer there. Maybe they decided to go on holidays in Hawaii for a couple of months: this happens, a sabbatical.

Ptak: God bless you when you have the incident.

Grzesik: Then you have a problem, and then what do we do? We don't know, because that person or that team is the one that knows.

Ptak: Very connected. You have an incident, and you've heard, contact them on Slack. The problem is, you have components which are owned by a central team, and only the central team keeps the knowledge, for themselves. It's always like a hero's guild. They will be forcing their own perspective. They will be reducing the accountability and ownership of the teams. They will actually be the bottleneck in your organization.

Another very famous one: you have a bug or an incident, you contact the team, and you hear, create a Jira ticket. That's painful for all of you. That's a good sign of siloed teams and conflicting priorities. It means that there is very low collaboration. We don't plan together. We don't understand each other's priorities. We very often duplicate efforts. How many times have you seen in your organizations, they won't be able to build it, we need it, so we'll build our own, or we'll use our own, whatever. There goes the randomness. Another one that you may hear is that you ask the team to deliver something and they say, I need to actually build a workaround in our framework, because we're not allowed to do it with our technology.

Grzesik: The problem that we have is not, how do you sort a list, but how do you sort a list using this technology, that language, using that version of the library, because it's restricting on this database?

Ptak: That's actually when technology becomes a constraint. It's a very good sign that the randomness is really high. You are constrained by the technology choices that you made. There are probably too many moving parts.

Grzesik: Another aspect might be how it manifests. People will say, I'm an engineer. I want some excitement in my life. I'm going to learn another library, learn another language. The purpose of the organization is to build software well, and you can challenge that perspective. People can be proud of how well, how fast, and how free of bugs their software delivery is, but it requires work from the team. This is a very good signal. Another one is hammer operators. If you've met people who will solve every single problem using the same framework and the same tool, some technology that they are very fond of, even if it doesn't fit, or even if they made the technology choices before knowing what the problem really is, that's a sign of a constraint being built and being implemented.

Practical Tips for Increasing Collaboration and Ownership.

Ptak: Actually, there is some good news for you, so we're not made to suffer. Practical tips from the trenches: how to increase collaboration and ownership.

Grzesik: We know all the bad signals, or what things we can look for, so that we know that something is slightly wrong, or maybe there is a problem brewing in the organization. The problem is, it's not going to manifest itself immediately. It's going to manifest itself in maybe a year, maybe two years down the line. Some people will have gone. Some people will have moved on. We will have a place that slows down and cannot deliver, maybe introduces bugs. Nobody wants that. How do we prevent it?

Ptak: The first one is, make sure that you form teams around boundaries. For instance, in Revolut, every team is a product team with their own responsibilities, very clear ownership, and most likely a service they own. I would recommend, if you know DDD, going into the strategic patterns and, of course, using business context. That's very useful. The second thing is, there is a lot of implicit knowledge in the organization; make it as explicit as possible.

Grzesik: Put ADRs out there. Put designs out there. Don't use emails to transfer design decisions or design discussions. Put it in a wiki. It's written down, and it's also asynchronous. In the distributed organization that we work in, that makes it possible for people to ask questions and comment, and to know what the reasoning was, and what was decided and why.

Ptak: If you're a leader, I would encourage you to do continuous refactoring of ownership. Look for the following signs. Is there peer review confusion: who should review it, how should we review it?

Grzesik: How can you measure that? The time to review is long, because nobody feels empowered, or feels they are the correct person, to review it.

Ptak: Another one would be hard-to-assign bugs, the ones that are being moved between teams. We do have them, but we really try to measure it and understand which parts of the apps have this problem.

Grzesik: I don't have that problem, because the place I work in does pair programming. There is no need to do PR reviews. If you do pair programming, you actually get instant review as you pair, which is awesome.

Ptak: Of course, incidents with unclear ownership. Look for these signals. How to deal with situations where you really don't know who should own the thing. There are a couple of strategies which I would recommend. Again, clear domain ownership. Then the second one, if we still cannot do it, is proximity of the issue. We can say, it's close to this team's responsibilities. They're the best to own it.

Grzesik: Or, they are going to be affected, or the product that they are responsible for is going to be affected. Maybe it's time to refactor and actually put it under their umbrella.

Ptak: Sometimes we can have central teams or platform teams who can own it, or in the worst-case scenario, we can agree on the ownership rotation, but do not leave things without an owner.

Grzesik: Sometimes something will go wrong. Of course, never in the place we work at, never in the place you work at, but in the hypothetical organization in which something happens.

Ptak: I would disagree. I would wish for things to go wrong, because that's actually the best opportunity to learn.

Grzesik: It's a learning opportunity. There is a silent assumption that you do post-mortems. If you do have an incident, do a post-mortem. Some of the things that we can say about post-mortems, for example, first, let's start with a template.

Ptak: I know it might be basic, but it's really essential: teach your team to own the post-mortems. Have a very clear and easy to use template. That's ours, actually. We have, for instance, the essential sections: the impact, with the queries or links to logs that I can use. We have several metrics to measure, to see how we improve: mean time to detection, mean time to acknowledge, mean time to recovery. We do root cause analysis, 5 Whys. The essential thing is that we will be challenged on that, and I will come back to that.

Grzesik: Also, what is not here is, who's at fault? There is no looking for a victim.

Ptak: Because they're blameless. We try to make them blameless.

Grzesik: It should also be accurate, which means it should give a story. It could be a criminal story or a science fiction story, depending on your take on literature, but it should give a story about what happened. How did we get there? What could we potentially do differently, this is the actionable part, and we have to do it rapidly. Why? Because human memory is what it is. We forget things. We forget context. The more it lingers, the more painful it becomes.

Ptak: We come back to Westrum pathological organizations. You can recall that probably that won't work in such an organization. Couple of tips that I would have. Create post-mortems for all incidents. Actually, with my teams, we're also doing almost incidents, near miss incidents. When we almost got an incident in production, amazing opportunity to learn. We keep post-mortems trackable. There is actually a whole wiki that contains all of the post-mortems' links, searchable, taggable, very easy to track, to understand what happened actually, and how we could improve the system.

Grzesik: I also keep them in the wiki. If your risk team or somebody in the organization says that post-mortems should be restricted to only the people that actually took part, or that maybe they shouldn't be public knowledge: maybe you're leaders, maybe you're empowered, in the correct position to fight it, or escalate it to your CTOs. This is the source of knowledge. This is a source of learning. It's absolutely crucial not to allow that to happen, because that's what people will learn from and that's what influences people's further designs.

Ptak: Drill deeper. Root causes, we actually peer review our post-mortems, and we actually challenge them. It's a very good learning opportunity for everyone, actually. I would encourage this as a great idea.

Grzesik: A very practical attitude. Find the person who is naturally very inquisitive. It can be a devil's advocate kind of attitude. They are going to ask questions. They are not going to ask questions when people are trying to describe what happened, but they will ask the uneasy questions. That's a superpower, sometimes, in moderation. If you have such an individual, expose them to some of the post-mortems, figure out a way of working together. That attitude is absolutely very useful.

Ptak: The two last items: track action items. The worst thing is to create a post-mortem and let it die, or make it bureaucratic, I do it for my boss. And celebrate improvements. It might be very obvious knowledge, but if you want to improve the organization and improve the architecture, the Reverse Conway Maneuver, I would actually recommend post-mortems as one of the things that really teach people to own things and to understand them. It may be basic for some, but it's actually very useful.

Grzesik: The systems that we write will have dependencies, internally and externally. That's something that we need to worry about. Making it explicit means knowing what they are. Then, that also means that you can have a look at, how is my dependency upstream and downstream doing? What are their expectations and aspirations in terms of quality? Do we have circular dependencies? You might discover that you do if you have, for example, a very big event-driven system and you've never drawn what the loops are, or which services certain processes flow through. You might get there. Then it's obviously harder to work.

Also, if you know which of your dependencies are critical, then you can follow their evolution. You can see what's happening. You can maybe review PRs, or maybe just look at the design reviews that people in those services do. Of course, talking to them. In a distributed, very big team, you talk at scale, to an extent, which means RFCs, design reviews, architecture decision records, whatever you want to call them, same thing. DDD integration patterns offer some ideas here. Since I mentioned ADRs or RFCs, what we found works really well are some very specific ways of doing them.

Ptak: Challenge yourself. We call it a sparring session or an SDR review. You invite people who are really good at being a devil's advocate, and you, on purpose, have them review your RFC or ADR, to make sure that it's the best it can be.

Grzesik: I will recommend The Psychology of Making Decisions talk if you want to make your RFCs better, because it mentions a lot of things that we could have included here, but we don't have to, because they were already covered.

Ptak: In a large organization with a lot of dependencies, there is a question of how to make sure that you involve the right people with the right knowledge. Of course, they can challenge you, because you might be dependent on a system that they know, and you want them to challenge whether you have taken into consideration all of the problems in that system. What can help you? What's the tool that can help you? It's the RACI model. Use it also for the RFCs.

Grzesik: What is the RACI model? RACI model is grouping or attributing different roles with regards to a problem, to a category of being responsible, accountable, consulted, and informed. Who is responsible? The person who needs to make sure that something happens. A team lead, a head of area, somebody like that. Accountable, who is going to get the blame if it doesn't get done and if it doesn't get done well. Again, team lead, head of area, maybe CTO, depending on the problem. Consulted, who do you need to engage? Maybe security. Maybe ops. Maybe another team that you're going to build a design with.

Grzesik: Those are the people you will invite into an ADR. Then the people who are informed are the ones who will learn what the consequences are. If they want to come, sure they can, but they don't have to. Which means, once you've done a few ADRs using this model and know which people to invite, you almost have a template, not only for the document and what an ADR should look like, but also for who the people to engage are and what kind of interaction you expect from every single group. Look at the benefits. What are they?

Ptak: Some of the benefits: clear, explicit collaboration and communication patterns. It really improves decision making. We don't do maybes; we know exactly who to involve and who to consult with. It really facilitates ownership. It's very clear who should own it, who should be involved, and who should be told about the changes. I would encourage you to review it regularly. A very typical example from Revolut would be: responsible is usually the team owning a feature. Normally, I would be accountable. For consulted, we make sure that, for instance, other departments, other heads of engineering, other teams, or the CTO, if it's a massive change, are consulted. Informed can be, for instance, a department or the whole business, so we know exactly how to announce any changes.
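To make the RACI idea concrete, here is a minimal sketch of a RACI assignment for an ADR expressed as data; the role and team names are invented placeholders, not how Revolut actually sets this up.

    # Minimal sketch: a RACI assignment attached to an ADR / RFC.
    # All names below are hypothetical placeholders.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RaciAssignment:
        responsible: List[str]          # make sure it happens (e.g. team lead, head of area)
        accountable: str                # answers for the outcome (one person)
        consulted: List[str]            # must be engaged before the decision (e.g. security, ops)
        informed: List[str] = field(default_factory=list)  # only learn the consequences

    adr_raci = RaciAssignment(
        responsible=["feature-team-lead"],
        accountable="head-of-engineering",
        consulted=["security-team", "platform-ops"],
        informed=["customer-support", "finance"],
    )

    # Responsible, accountable, and consulted parties get an explicit invite to the review;
    # informed parties only receive the final decision.
    reviewers = adr_raci.responsible + [adr_raci.accountable] + adr_raci.consulted
    print(reviewers)

Keeping the template as data also makes it easy to reuse the same set of invitations for the next ADR.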

Grzesik: We know how to make awesome designs.

Ptak: We know how to execute well. Now, your architecture needs scaling.

Ptak: You have the system, and now we need to talk about the thing that we would call architecture refactoring. Once again, we've got heuristics of pain. The first sentence that you may hear is, "It takes forever to build."

Grzesik: If people on your team say the deploy pipeline is slow, hours rather than minutes, or a high percentage of builds are red, those are the signals. What is the consequence? The feedback loop slows down, and the time to deploy slows down, which means small changes will not get into production quickly. There is then a tendency to cool down, slow down, become a bit bureaucratic, and maybe run the tests again and again, because some of them will be flaky or intermittent, whatever you want to call them. That's a problem to track.
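Those two signals, red-build rate and pipeline duration, are easy to put a number on. The sketch below does it for a handful of invented build records; in practice the records would come from whatever CI system is in use, and the thresholds are arbitrary examples.

    # Minimal sketch: quantify the signals mentioned above from build records.
    # The build records and thresholds are invented examples.
    from statistics import median

    builds = [
        {"status": "red", "minutes": 95},
        {"status": "green", "minutes": 88},
        {"status": "red", "minutes": 102},
        {"status": "green", "minutes": 91},
    ]

    red_rate = sum(b["status"] == "red" for b in builds) / len(builds)
    typical_duration = median(b["minutes"] for b in builds)

    print(f"red builds: {red_rate:.0%}, typical pipeline: {typical_duration:.0f} min")
    if red_rate > 0.2 or typical_duration > 60:
        print("signal: the feedback loop is slowing down")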

Ptak: You may hear something like this: it's Monday and people are crying, for whatever reason.

Grzesik: People who dread going back to work.

Ptak: That's usually a sign. Simplification, of course: a system that is hard to maintain is one where simple changes are difficult. There is a very steep learning curve, a steep onboarding curve. You need to go through all of the code, because you don't know whether, if you pull the spaghetti from one side of the plate, the meatball will fall off the other side. That's typical of a system that is hard to maintain.

Grzesik: Where do these signals come from? Senior and lead engineers, people who have been developing software, know in the back of their head that something should be a simple change. Then they learn that something isn't right, that it is actually more complex. They've spent their third week doing something very trivial. That's a signal. It's very hard to pick up on a day-to-day basis, because we want to solve the problem. We want to get it done, whatever it is, and then we'll do the next thing we want to get done. Surveys will capture that. Or ask the people you onboard after a couple of weeks, maybe months: what is their gut reaction? Is it nice to work with? Is it nice to reason about? Do they get what's happening?

Ptak: Another quote that you may hear: there is a release train to catch. That's a very good sign of slow time to production. We have forced synchronization of changes; we need teams to collaborate to release something. That means we have infrequent releases, which means we're really slow and not innovating quickly enough. It forces the synchronization of changes between teams, which means they're actually not working on the most crucial things at the time. Another sign, another quote you may hear: we're going to crash.

Grzesik: Performance issues, slow response times, frequent crashes, the number of errors in services. Of course, you can throw more services at it, spin up more, to work around it. Eventually it will, hopefully, lead to a decision to scale the architecture. We need to scale, not by adding more regions or more clouds, but by doing something different.

Ptak: How to scale and refactor architecture: some tips. Of course, every organization finds itself in this situation. The crucial thing is, when you have a large system with many moving components, as we've shown you, there is a temptation, when you want to refactor something, to decide, "This time we'll make it perfect." Every greenfield project: "This time, it will work." Usually, what you really want to do is redo, for instance, the CI/CD, the patterns, the infrastructure: "We can now rebuild the whole thing and it will be shiny." The problem is, it doesn't work.

Grzesik: What do you do instead? You might have heard of the theory of constraints. You might have been doing software optimization.

Ptak: Let's apply it. The first thing is, you need to identify the biggest pain of all.

Grzesik: Some examples: tests taking too long, CI/CD taking too long, and so on. You can definitely come up with more. You pick one, and then you work on it, which is formally called exploiting the constraint.

Ptak: You focus everything on it. You ignore the rest, and you fix it, but to the level where you actually remove it, and even get ahead of it. It not only stops being a constraint; you remove it as a constraint for a longer period of time.

Grzesik: Then, the very critical last element: you rebuild the list from scratch, because the previous order will most likely no longer apply. How do you then use it?
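The loop described above can be summarized in a few lines. The sketch below only illustrates the shape of the process; the pain points, scores, and the re-scoring step are invented, and in reality step 3 is a fresh conversation with the teams rather than a calculation.

    # Minimal sketch of the theory-of-constraints loop: identify the biggest pain,
    # exploit that constraint, then rebuild the list from scratch.
    # Pain points and scores are invented examples.

    def reassess(previous: dict, fixed: str) -> dict:
        # Stand-in for re-scoring the pain points after a fix.
        return {name: score for name, score in previous.items() if name != fixed}

    pains = {"build time": 9, "flaky tests": 7, "slow code review": 5}

    while pains:
        constraint = max(pains, key=pains.get)          # 1. identify the biggest pain
        print(f"focus everything on: {constraint}")     # 2. exploit the constraint
        pains = reassess(pains, fixed=constraint)       # 3. the old order no longer applies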

Ptak: There is a second approach that we can take. Let's say you don't have one very large pain point. It's not like you're crying, but you really want to optimize for something that you know you will need. It's called a fitness function. We really recommend looking at several examples. It can be, for instance: we want builds to be 10% quicker by the end of the quarter. That's how it works: you define a metric and you devote everything to improving it. Then you work on it. What you can also do is combine the two approaches. Let me tell you how we did it. For the last two quarters, we've been working on Revolut business modularization.

The biggest pain was build time. It took over an hour for us to build it, and over two hours to release it to production. For us, that was way too slow. It was massive: over 2000 endpoints, nearly 500 consumers, nearly 20 teams involved in the project. That's exactly how we applied the theory of constraints: we focused on the build time. Now every team has its own service. We reduced build times by 75%. Massive. We optimized for one thing only; we haven't, for instance, refactored the architecture.
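A fitness function like the build-time example above can live directly in the pipeline as a small check that fails the build when the metric regresses. The sketch below is an assumption-laden illustration: the 15-minute target and the BUILD_MINUTES environment variable are made up, not how Revolut measures it.

    # Minimal sketch: a build-time fitness function that fails the pipeline
    # when the build gets slower than the target. Target and variable name
    # are hypothetical.
    import os
    import sys

    TARGET_MINUTES = 15.0

    def check_build_time(minutes: float, target: float = TARGET_MINUTES) -> None:
        if minutes > target:
            sys.exit(f"fitness function failed: build took {minutes:.1f} min, target is {target:.0f} min")
        print(f"fitness function passed: {minutes:.1f} min (target {target:.0f} min)")

    if __name__ == "__main__":
        # e.g. a CI step times the build and exports BUILD_MINUTES before calling this script
        check_build_time(float(os.environ.get("BUILD_MINUTES", "0")))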

The Cold Shower Takeaways for Uncomplicating Architecture.

Grzesik: You've probably heard this: a bad system will beat a good person every time. If the organization you work in, or one you worked in previously, has shown you that, it's a learning point. Which means the question that we keep asking ourselves when we try to design how teams, processes, and systems work in the places we work at is: can we build a system, or how do we build a system, that supports people in doing the correct thing and keeps giving the right signals?

Ptak: To build an antifragile system, because that's what we're talking about, learn from nature. Nature has its own ways. Apply stress to your organization, to your system, to your architecture. Look for things to simplify, unify, and automate. We gave you several tools for how to do it. Learn from nature: you actually want to apply stress, to the architecture and to your teams. That's very crucial.

Grzesik: A crucial element of that is a short cycle time. Humans have a limited attention span, which means that if we can observe both the change and its effect, we're going to learn from it. If it takes months or years from one to the other, that's hard, and we're probably going to do it less.

Ptak: We gave you several tips on how to build an organizational growth mindset. Definitely look at them. They may seem basic, but if you do them well and connect them, they will actually lead to teams improving themselves and owning things more effectively, and you will be able to do what is famously called the Reverse Conway Maneuver. Make sure that ownership is explicit. You really want to look for the things that nobody owns, or where the ownership is wrong, and refactor the ownership. We gave you several tips. That's how you can work on the different Westrum factors.

Grzesik: If you do that, you can have a system in which a team, or a pair, a couple of engineers, can make a change, can decide that they need to deploy to production, and do it. Which means, if they notice something, they take the correct action at the lowest possible level of complexity. Then the learning is already built in.

Ptak: A lot of companies would say, "We're agile, and we can do all kinds of things." I would say, don't. Impose constraints. If you really want to be fast, if you really want to scale in a hyperscale fashion, to scale up, impose constraints on the architecture and the tooling, so people focus on the right things. Remove everything that is non-essential, so you reduce the randomness and the complication of the system. For the most critical part, remain focused. A lot of organizations, of course, understand agile very wrongly. For instance, in retrospectives, teams are really good at explaining why they haven't delivered. That's the other side of the spectrum that we could be hearing.


Market Impact Analysis

Market Growth Trend

Year    Growth Rate
2018    7.5%
2019    9.0%
2020    9.4%
2021    10.5%
2022    11.0%
2023    11.4%
2024    11.5%

Quarterly Growth Rate

Quarter    Growth Rate
Q1 2024    10.8%
Q2 2024    11.1%
Q3 2024    11.3%
Q4 2024    11.5%

Market Segments and Growth Drivers

Segment                Market Share    Growth Rate
Enterprise Software    38%             10.8%
Cloud Services         31%             17.5%
Developer Tools        14%             9.3%
Security Software      12%             13.2%
Other Software         5%              7.5%


Competitive Landscape Analysis

Company       Market Share
Microsoft     22.6%
Oracle        14.8%
SAP           12.5%
Salesforce    9.7%
Adobe         8.3%

Future Outlook and Predictions

The landscape covered in this presentation is evolving rapidly, driven by technological advancements and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:

Year-by-Year Technology Evolution

Based on current trajectory and expert analyses, we can project the following development timeline:

2024: Early adopters begin implementing specialized solutions with measurable results
2025: Industry standards emerging to facilitate broader adoption and integration
2026: Mainstream adoption begins as technical barriers are addressed
2027: Integration with adjacent technologies creates new capabilities
2028: Business models transform as capabilities mature
2029: Technology becomes embedded in core infrastructure and processes
2030: New paradigms emerge as the technology reaches full maturity

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:

[Maturity curve diagram: adoption/maturity plotted against time/development stage, spanning Innovation, Early Adoption, Growth, Maturity, and Decline/Legacy, with emerging tech, current focus areas, and mature solutions marked. Interactive diagram available in full report.]

Innovation Trigger

  • Generative AI for specialized domains
  • Blockchain for supply chain verification

Peak of Inflated Expectations

  • Digital twins for business processes
  • Quantum-resistant cryptography

Trough of Disillusionment

  • Consumer AR/VR applications
  • General-purpose blockchain

Slope of Enlightenment

  • AI-driven analytics
  • Edge computing

Plateau of Productivity

  • Cloud infrastructure
  • Mobile applications

Technology Evolution Timeline

1-2 Years
  • Technology adoption accelerating across industries
  • Digital transformation initiatives becoming mainstream
3-5 Years
  • Significant transformation of business processes through advanced technologies
  • New digital business models emerging
5+ Years
  • Fundamental shifts in how technology integrates with business and society
  • Emergence of new technology paradigms

Expert Perspectives

Leading experts in the software dev sector provide diverse perspectives on how the landscape will evolve over the coming years:

"Technology transformation will continue to accelerate, creating both challenges and opportunities."

— Industry Expert

"Organizations must balance innovation with practical implementation to achieve meaningful results."

— Technology Analyst

"The most successful adopters will focus on business outcomes rather than technology for its own sake."

— Research Director

Areas of Expert Consensus

  • Acceleration of Innovation: The pace of technological evolution will continue to increase
  • Practical Integration: Focus will shift from proof-of-concept to operational deployment
  • Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
  • Regulatory Influence: Regulatory frameworks will increasingly shape technology development

Short-Term Outlook (1-2 Years)

In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing software dev challenges:

  • Technology adoption accelerating across industries
  • Digital transformation initiatives becoming mainstream

These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.

Mid-Term Outlook (3-5 Years)

As technologies mature and organizations adapt, more substantial transformations will emerge in how software is designed, delivered, and operated:

  • Significant transformation of business processes through advanced technologies
  • New digital business models emerging

This period will see significant changes in architecture and operational models, with increasing automation and integration between previously siloed functions. Organizations will shift from reactive to proactive postures.

Long-Term Outlook (5+ Years)

Looking further ahead, more fundamental shifts will reshape how software development is conceptualized and practiced across digital ecosystems:

  • Fundamental shifts in how technology integrates with business and society
  • Emergence of new technology paradigms

These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and an evolution in how organizations approach software engineering as a fundamental business function rather than a purely technical discipline.

Key Risk Factors and Uncertainties

Several critical factors could significantly impact the trajectory of software dev evolution:

  • Technical debt accumulation
  • Security integration challenges
  • Maintaining code quality

Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.

Alternative Future Scenarios

The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:

Optimistic Scenario

Rapid adoption of advanced technologies with significant business impact

Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.

Probability: 25-30%

Base Case Scenario

Measured implementation with incremental improvements

Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.

Probability: 50-60%

Conservative Scenario

Technical and organizational barriers limiting effective adoption

Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.

Probability: 15-20%

Scenario Comparison Matrix

Factor                     Optimistic      Base Case      Conservative
Implementation Timeline    Accelerated     Steady         Delayed
Market Adoption            Widespread      Selective      Limited
Technology Evolution       Rapid           Progressive    Incremental
Regulatory Environment     Supportive      Balanced       Restrictive
Business Impact            Transformative  Significant    Modest

Transformational Impact

Technology is becoming increasingly embedded in all aspects of business operations. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.

The convergence of multiple technological trends, including artificial intelligence, quantum computing, and ubiquitous connectivity, will create both unprecedented challenges and innovative capabilities.

Implementation Challenges

Technical complexity and organizational readiness remain key challenges. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.

Regulatory uncertainty, particularly around emerging technologies like AI, will require flexible architectures that can adapt to evolving compliance requirements.

Key Innovations to Watch

Artificial intelligence, distributed systems, and automation technologies are leading innovation. Organizations should monitor these developments closely to maintain competitive advantages.

Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.

Technical Glossary

Key technical terms and definitions to help understand the technologies discussed in this article.

Understanding the following technical concepts is essential for grasping the full implications of the technologies and practices discussed in this article. These definitions provide context for both technical and non-technical readers.

CI/CD (intermediate)

algorithm (intermediate)

interface (intermediate): Well-designed interfaces abstract underlying complexity while providing clearly defined methods for interaction between different system components.

scalability (intermediate)

platform (intermediate): Platforms provide standardized environments that reduce development complexity and enable ecosystem growth through shared functionality and integration capabilities.

agile (intermediate)

encryption

framework (intermediate)

API (beginner): APIs serve as the connective tissue in modern software architectures, enabling different applications and services to communicate and share data according to defined protocols and data formats.
[Diagram: how APIs enable communication between different software systems]
Example: Cloud service providers like AWS, Google Cloud, and Azure offer extensive APIs that allow organizations to programmatically provision and manage infrastructure and services.

Kubernetes (intermediate)

cloud computing

middleware

DevOps (intermediate)

microservices