

Realising scientists are the real superheroes


Meet Edgar Duéñez-Guzmán, a research engineer on our Multi-Agent Research team who’s drawing on knowledge of game theory, computer science, and social evolution to get AI agents working better together.

What led you to working in computer science? I've wanted to save the world ever since I can remember. That's why I wanted to be a scientist. While I loved superhero stories, I realised scientists are the real superheroes. They are the ones who give us clean water, medicine, and an understanding of our place in the universe. As a child, I loved computers and I loved science. Growing up in Mexico, though, I didn't feel like studying computer science was feasible. So, I decided to study maths, treating it as a solid foundation for computing and I ended up doing my university thesis in game theory. How did your studies impact your career? As part of my PhD in computer science, I created biological simulations, and ended up falling in love with biology. Understanding evolution and how it shaped the Earth was exhilarating. Half of my dissertation was in these biological simulations, and I went on to work in academia studying the evolution of social phenomena, like cooperation and altruism. From there I started working in Search at Google, where I learned to deal with massive scales of computation. Years later, I put all three pieces together: game theory, evolution of social behaviours, and large-scale computation. Now I use those pieces to create artificially intelligent agents that can learn to cooperate amongst themselves, and with us. What made you decide to apply to DeepMind over other companies? It was the mid-2010s. I’d been keeping an eye on AI for over a decade and I knew of DeepMind and some of their successes. Then Google acquired it and I was very excited. I wanted in, but I was living in California and DeepMind was only hiring in London. So, I kept tracking the progress. As soon as an office opened in California, I was first in line. I was fortunate to be hired in the first cohort. Eventually, I moved to London to pursue research full time.

What surprised you most about working at DeepMind? How ridiculously talented and friendly people are. Every single person I’ve talked to also has an exciting side outside of work. Professional musicians, artists, super-fit bikers, people who appeared in Hollywood movies, maths olympiad winners – you name it, we have it! And we’re all open and committed to making the world a better place. How does your work help DeepMind make a positive impact? At the core of my research is making intelligent agents that understand cooperation. Cooperation is the key to our success as a species. We can access the world's information and connect with friends and family on the other side of the world because of cooperation. Our failure to address the catastrophic effects of climate change is a failure of cooperation, as we saw during COP26. What’s the best thing about your job? The flexibility to pursue the ideas that I think are most important. For example, I’d love to help use our technology to better understand social problems, like discrimination. I pitched this idea to a group of researchers with expertise in psychology, ethics, fairness, neuroscience, and machine learning, and then created a research programme to study how discrimination might originate in stereotyping.

How would you describe the culture at DeepMind? DeepMind is one of those places where freedom and potential go hand-in-hand. We have the opportunity to pursue ideas that we feel are crucial and there’s a culture of open discourse. It’s not uncommon to infect others with your ideas and form a team around making them a reality. Are you part of any groups at DeepMind? Or other activities? I love getting involved in extracurriculars. I’m a facilitator of Allyship workshops at DeepMind, where we aim to empower participants to take action for positive change and encourage allyship in others, contributing to an inclusive and equitable workplace. I also love making research more accessible and talking with visiting students. I’ve created publicly available educational tutorials for explaining AI concepts to teenagers, which have been used in summer schools across the world. How can AI maximise its positive impact? To have the most positive impact, the benefits simply need to be shared broadly, rather than captured by a small number of people. We should be designing systems that empower people, and that democratise access to technology. For example, when I worked on WaveNet, the new voice of the Google Assistant, I felt it was cool to be working on a technology that is now used by billions of people in Google Search and Maps. That's nice, but then we did something even better. We started using this technology to give their voice back to people with degenerative disorders, like ALS. There are always opportunities to do good, we just have to take them.


We’re partnering with six education charities and social enterprises in the United Kingdom (UK) to co-create a bespoke education programme to help tac...

The pursuit of AI education—past, present, and future


Before DeepMind, I worked for a social purpose startup that increased access to mental healthcare. Then I got a job at a university alongside academics and students. At that point, I realised I was looking for a ‘Goldilocks’ role that brought together everything I loved about these different environments – the speed and excitement of a tech startup, impact-focussed goals, and the fascination of working with brilliant researchers. It seemed impossible to combine all these things. Then, enter stage left: DeepMind.

During the interview process, I was surprised by how much the interviewers wanted to get to know me and how I related to the culture DeepMind was building. People from all different backgrounds, disciplines, and approaches find their way to DeepMind, and having such an open discussion about the environment I’d join as a new employee made me feel at home.

I believe in ensuring that people from all walks of life, and especially those from underrepresented communities and backgrounds, are able to contribute to the development of AI. I’ve had the chance to work on some really special projects at DeepMind, but the scholarships programme is – by far – the most personally rewarding programme I’ve ever been involved in. Every academic year, we get to see the new crop of talented AI scholars become part of an international community of students and mentors. It’s an incredibly special moment – and I can’t wait to see what they all achieve in the future!


Show and Tell


Natural Language Processing and Computer Vision used to be two completely different fields. At least back when I started learning machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, led to a completely different world. Over time, AI has become more and more advanced, and the intersection between multiple fields of study, including the two I just mentioned, has become increasingly common.

Today, many language models have the capability to generate images based on a given prompt. That’s one example of the bridge between NLP and Computer Vision. But I guess I’ll save it for an upcoming article as it is a bit more complex. Instead, in this article I am going to discuss the simpler one: image captioning. As the name suggests, this is essentially a technique where a model accepts an image and returns a text that describes the input image.

One of the earliest papers on this topic is the one titled "Show and Tell: A Neural Image Caption Generator", written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the deep learning model proposed in the paper using PyTorch. Note that I won’t actually demonstrate the training process here as that’s a topic on its own. Let me know in the comments if you want a separate tutorial on that.

Generally speaking, image captioning can be done by combining two types of models: one specialized in processing images and another capable of processing sequences. I believe you already know what kind of models work best for the two tasks – yes, you’re right, those are a CNN and an RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called the encoder), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called the decoder).

The paper discusses how the authors attempted to do so using GoogLeNet (also known as Inception V1) for the encoder and an LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper it seems like the architecture used in the encoder is adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.

Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].

Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject and merge, as mentioned in [3]. In the case of the Show and Tell paper, the authors used pre-inject, a method where the features extracted by the encoder are treated as the 0th word in the caption. Later in the inference phase, we expect the decoder to generate a caption based solely on these image features.

Figure 2. The four methods that can be used to connect the encoder and the decoder parts of an image captioning model [3]. In our case we are going to use the pre-inject method (b).
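
To make the pre-inject idea concrete, here is a tiny shape-only sketch. The tensor sizes follow the configuration we will use later in this article, and the variable names are just for illustration:

import torch

image_features = torch.randn(1, 1, 512)    # encoder output, treated as a single "word"
word_embeddings = torch.randn(1, 30, 512)  # embedded ground-truth caption
lstm_input = torch.cat([image_features, word_embeddings], dim=1)
print(lstm_input.shape)  # torch.Size([1, 31, 512]) -- the image acts as the 0th token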

Now that we understand the theory behind the image captioning model, we can jump into the code!

I’ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. Look at the Codeblock 1 below to see the modules I use.

# Codeblock 1
import torch                                       #(1)
import torch.nn as nn                              #(2)
import torchvision.models as models                #(3)
from torchvision.models import GoogLeNet_Weights   #(4)

Let’s break down these imports quickly: the line marked with #(1) is used for basic operations, line #(2) is for initializing neural network layers, line #(3) is for loading various deep learning models, and #(4) is the pretrained weights for the GoogLeNet model.

Talking about the parameter configuration, EMBED_DIM and LSTM_HIDDEN_DIM are the only two parameters mentioned in the paper, which are both set to 512 as shown at line #(1) and #(2) in the Codeblock 2 below. The EMBED_DIM variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, LSTM_HIDDEN_DIM is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. Thus, at line #(3) I set the NUM_LSTM_LAYERS variable to 1.

# Codeblock 2
EMBED_DIM = 512        #(1)
LSTM_HIDDEN_DIM = 512  #(2)
NUM_LSTM_LAYERS = 1    #(3)
IMAGE_SIZE = 224       #(4)
IN_CHANNELS = 3        #(5)
SEQ_LENGTH = 30        #(6)
VOCAB_SIZE = 10000     #(7)
BATCH_SIZE = 1

The next two parameters are related to the input image, namely IMAGE_SIZE ( #(4) ) and IN_CHANNELS ( #(5) ). Since we are about to use GoogLeNet for the encoder, we need to match its original input shape (3×224×224). Not only for the image, but we also need to configure the parameters for the caption. Here we assume that the caption length is no more than 30 words ( #(6) ) and the number of unique words in the dictionary is 10000 ( #(7) ). Lastly, the BATCH_SIZE parameter is used because by default PyTorch processes tensors in a batch. Just to keep things simple, the number of image-caption pairs within a single batch is set to 1.

It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let’s see what the GoogLeNet architecture looks like using the following code.
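
Loading the pretrained model and printing it directly is enough for this purpose; something along the lines of the snippet below will do the job.

# Codeblock 3
googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)
print(googlenet)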

The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the fc layer marked with #(1) in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 ( EMBED_DIM ). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size will exactly match the token embedding dimension, allowing it to be treated as part of our word sequence.

# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  .
  .
  .
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)   #(1)
)

Now let’s actually load and modify the GoogLeNet model, which I do in the InceptionEncoder class below.

# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):  #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)
        self.googlenet.fc = nn.Linear(in_features=1024,        #(3)
                                      out_features=EMBED_DIM)  #(4)

        if fine_tune == True:  #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False
            for param in self.googlenet.fc.parameters():
                param.requires_grad = True

The first thing we do in the above code is to load the model using models.googlenet() . It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass GoogLeNet_Weights.IMAGENET1K_V1 into the weights parameter, as shown at line #(2) in Codeblock 4a. Next, at line #(3) we access the classification head through the fc attribute, where we replace the existing linear layer with a new one having the output dimension of 512 ( EMBED_DIM ) ( #(4) ). Since this GoogLeNet model is already trained, we don’t need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you’re not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case it’s the last fully-connected layer), while keeping the weights of the existing layers frozen. To do so, I implement a flag named fine_tune at line #(1) which lets the model perform fine-tuning whenever it is set to True ( #(5) ).
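
As a quick sanity check (not part of the original implementation), we can count the trainable parameters under both settings; the helper and variable names below are just for illustration:

def count_trainable(model):
    # Sum the number of elements of every parameter that will receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

encoder_ft = InceptionEncoder(fine_tune=True)   # everything trainable
encoder_tl = InceptionEncoder(fine_tune=False)  # only the new fc layer trainable

print(count_trainable(encoder_ft))  # several million parameters (the whole GoogLeNet)
print(count_trainable(encoder_tl))  # 524800 -- just the new 1024->512 linear layer plus bias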

The forward() method is pretty straightforward since what we do here is simply pass the input image through the modified GoogLeNet model. See Codeblock 4b below for the details. Additionally, here I also print out the tensor dimensions before and after processing so that you can better understand how the InceptionEncoder model works.

# Codeblock 4b
    def forward(self, images):
        print(f'original\t: {images.shape}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.shape}')
        return features

To test whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224×224. You can see in the resulting output that our image now becomes a single-dimensional feature vector with the length of 512.

# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)

# Codeblock 5 Output
original        : torch.Size([1, 3, 224, 224])
after googlenet : torch.Size([1, 512])

Now that we have successfully implemented the encoder, we are going to create the LSTM decoder, which I demonstrate in Codeblocks 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer ( #(1) ), the LSTM layer itself ( #(2) ), and a standard linear layer ( #(3) ). The first one ( nn.Embedding ) is responsible for mapping every single token into a 512 ( EMBED_DIM )-dimensional vector. Meanwhile, the LSTM layer processes the sequence of embedded tokens, and each of its outputs will be mapped into a 10000 ( VOCAB_SIZE )-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.

# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM,
                            hidden_size=LSTM_HIDDEN_DIM,
                            num_layers=NUM_LSTM_LAYERS,
                            batch_first=True)
        #(3)
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM,
                                out_features=VOCAB_SIZE)

Next, let’s define the flow of the network using the following code.

# Codeblock 6b
    def forward(self, features, captions):  #(1)
        print(f'features original\t: {features.shape}')
        features = features.unsqueeze(1)  #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.shape}')
        captions = self.embedding(captions)  #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)  #(4)
        print(f"after concat\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)  #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)  #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions

You can see in the above code that the forward() method of the LSTMDecoder class accepts two inputs: features and captions , where the former is the image that has been processed by the InceptionEncoder , while the latter is the caption of the corresponding image serving as the ground truth ( #(1) ). The idea here is that we are going to perform the pre-inject operation by prepending the features tensor to captions using the code at line #(4) . However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a single dimension at the 1st axis of the image features ( #(2) ). Meanwhile, the shape of the captions tensor will align with our requirement right after being processed by the embedding layer ( #(3) ). Once the features and captions have been concatenated, we then pass this tensor through the LSTM layer ( #(5) ) before it is eventually processed by the linear layer ( #(6) ). Look at the testing code below to better understand the flow of the two tensors.

# Codeblock 7
lstm_decoder = LSTMDecoder()
features = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)
captions = lstm_decoder(features, captions)

In Codeblock 7, I assume that features is a dummy tensor that represents the output of the InceptionEncoder model ( #(1) ). Meanwhile, captions is the tensor representing a sequence of tokenized words, where in this case I initialize it as random numbers ranging between 0 and 10000 ( VOCAB_SIZE ) with the length of 30 ( SEQ_LENGTH ) ( #(2) ).

We can see in the output below that the features tensor initially has the dimension of 1×512 ( #(1) ). This tensor shape changed to 1×1×512 after being processed with the unsqueeze() operation ( #(2) ). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the captions tensor, its shape changed from 1×30 ( #(3) ) to 1×30×512 ( #(4) ), indicating that every single word is now represented as a 512-dimensional vector.

# Codeblock 7 Output
features original : torch.Size([1, 512])        #(1)
after unsqueeze   : torch.Size([1, 1, 512])     #(2)
captions original : torch.Size([1, 30])         #(3)
after embedding   : torch.Size([1, 30, 512])    #(4)
after concat      : torch.Size([1, 31, 512])    #(5)
after lstm        : torch.Size([1, 31, 512])    #(6)
after linear      : torch.Size([1, 31, 10000])  #(7)

After the pre-inject operation is performed, our tensor now has the dimension of 1×31×512, where the features tensor becomes the token at the 0th timestep in the sequence ( #(5) ). See the following figure to better illustrate this idea.

Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].

Next, we pass the tensor through the LSTM layer, which in this particular case leaves the output tensor dimension unchanged. However, it is important to note that the tensor shapes at lines #(5) and #(6) in the above output are actually specified by different parameters. The dimensions appear to match here only because EMBED_DIM and LSTM_HIDDEN_DIM were both set to 512. Normally, if we use a different value for LSTM_HIDDEN_DIM , then the output dimension is going to be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the probability of every possible token being predicted ( #(7) ).
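
To see that the last dimension after the LSTM is governed by the hidden size rather than the embedding size, here is a small hypothetical check with a 256-unit LSTM (this layer is not part of our model, it only illustrates the point):

# Standalone LSTM with a hidden size that differs from EMBED_DIM.
lstm_256 = nn.LSTM(input_size=EMBED_DIM, hidden_size=256,
                   num_layers=NUM_LSTM_LAYERS, batch_first=True)
dummy = torch.randn(BATCH_SIZE, 31, EMBED_DIM)
out, _ = lstm_256(dummy)
print(out.shape)  # torch.Size([1, 31, 256]) instead of [1, 31, 512]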

At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them together in the ShowAndTell class below.

# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)  #(1)
        self.decoder = LSTMDecoder()  #(2)

    def forward(self, images, captions):
        features = self.encoder(images)  #(3)
        print(f"after encoder\t: {features.shape}")
        captions = self.decoder(features, captions)  #(4)
        print(f"after decoder\t: {captions.shape}")
        return captions

I think the above code is pretty straightforward. In the __init__() method, we only need to initialize the InceptionEncoder as well as the LSTMDecoder models ( #(1) and #(2) ). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the fine_tune parameter to True . Theoretically speaking, fine-tuning works better than transfer learning if you have a relatively large dataset, since it re-adjusts the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead – but that’s just the theory. It’s definitely a good idea to experiment with both options to see which works best in your case.

Still with the above codeblock, we configure the forward() method to accept image-caption pairs as input. With this configuration, we basically design this method such that it can only be used for training purposes. Here we initially process the raw image with the GoogLeNet inside the encoder block ( #(3) ). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence ( #(4) ). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is going to be used to compute gradients through backpropagation, which determines how the weights in the network are updated.
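
Although the training process itself is outside the scope of this article, a single supervised training step could look roughly like the sketch below. Keep in mind that the loss function, optimizer, learning rate, and the way the 31 output timesteps are aligned with the 30 ground-truth tokens are all illustrative choices on my part, not details taken from the paper.

# Illustrative training-step sketch (not from the paper).
model = ShowAndTell()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # arbitrary learning rate

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))

outputs = model(images, captions)  # shape: (1, 31, 10000)
# Use the first SEQ_LENGTH timesteps as predictions for the 30 caption tokens.
loss = criterion(outputs[:, :SEQ_LENGTH, :].reshape(-1, VOCAB_SIZE),
                 captions.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()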

It is important to know that we cannot use the forward() method to perform inference, so we need a separate one for that. In this case, I am going to implement the code specifically for inference in the generate() method below.

# Codeblock 8b
    def generate(self, images):  #(1)
        features = self.encoder(images)  #(2)
        print(f"after encoder\t\t: {features.shape}\n")

        words = []  #(3)

        for i in range(SEQ_LENGTH):  #(4)
            print(f"iteration #{i}")

            features = features.unsqueeze(1)
            print(f"after unsqueeze\t\t: {features.shape}")

            features, _ = self.decoder.lstm(features)
            print(f"after lstm\t\t: {features.shape}")

            features = features.squeeze(1)  #(5)
            print(f"after squeeze\t\t: {features.shape}")

            probs = self.decoder.linear(features)  #(6)
            print(f"after linear\t\t: {probs.shape}")

            _, word = probs.max(dim=1)  #(7)
            print(f"after max\t\t: {word.shape}")

            words.append(word.item())  #(8)

            if word == 1:  #(9)
                break

            features = self.decoder.embedding(word)  #(10)
            print(f"after embedding\t\t: {features.shape}\n")

        return words  #(11)

Instead of taking two inputs like the previous one, the generate() method takes a raw image as the only input ( #(1) ). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block prior to actually generating the subsequent tokens ( #(2) ). Next, we allocate an empty list for storing the token sequence to be produced later ( #(3) ). The tokens themselves are generated one by one, so we wrap the entire process inside a for loop, which is going to stop iterating once it reaches at most 30 ( SEQ_LENGTH ) words ( #(4) ).

The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the forward() method of the LSTMDecoder class back in Codeblock 6b. The first difference you might notice is the squeeze() operation ( #(5) ), which is basically just a technical step to be done such that the subsequent layer does the linear projection correctly ( #(6) ). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next ( #(7) ), and append it to the list we allocated earlier ( #(8) ). The loop is going to break whenever the predicted index is a stop token, which in this case I assume to be at index 1 of the probs vector ( #(9) ). Otherwise, if the model does not find the stop token, it is going to convert the last predicted word into its 512 ( EMBED_DIM )-dimensional vector ( #(10) ), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence will be returned once the loop is completed ( #(11) ).

We are going to simulate the forward pass for the training phase using Codeblock 9 below. Here I pass two tensors through the show_and_tell model ( #(1) ), each representing a raw image of size 3×224×224 ( #(2) ) and a sequence of tokenized words ( #(3) ). Based on the resulting output, we can confirm that our model works properly, as the two input tensors successfully passed through the InceptionEncoder and the LSTMDecoder parts of the network.

# Codeblock 9
show_and_tell = ShowAndTell()  #(1)
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(3)
captions = show_and_tell(images, captions)

# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])

Now, let’s assume that our show_and_tell model is already trained on an image captioning dataset, and thus ready to be used for inference. Look at the Codeblock 10 below to see how I do it. Here we set the model to eval() mode ( #(1) ), initialize the input image ( #(2) ), and pass it through the model using the generate() method ( #(3) ).

# Codeblock 10
show_and_tell.eval()  #(1)
images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)  #(3)

The flow of the tensor can be seen in the output below. Here I truncate the resulting output because it just shows the same token generation process repeated 30 times.

# Codeblock 10 Output
after encoder   : torch.Size([1, 512])

iteration #0
after unsqueeze : torch.Size([1, 1, 512])
after lstm      : torch.Size([1, 1, 512])
after squeeze   : torch.Size([1, 512])
after linear    : torch.Size([1, 10000])
after max       : torch.Size([1])
after embedding : torch.Size([1, 512])

iteration #1
after unsqueeze : torch.Size([1, 1, 512])
after lstm      : torch.Size([1, 1, 512])
after squeeze   : torch.Size([1, 512])
after linear    : torch.Size([1, 10000])
after max       : torch.Size([1])
after embedding : torch.Size([1, 512])
.
.
.

To see what the resulting caption looks like, we can just print out the generated_tokens list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back to the words corresponding to these numbers.

# Codeblock 11 Output
[5627, 3906, 2370, 2299, 4952, 9933, 402, 7775, 602, 4414, 8667, 6774, 9345, 8750, 3680, 4458, 1677, 5998, 8572, 9556, 7347, 6780, 9672, 2596, 9218, 1880, 4396, 6168, 7999, 454]
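
Just to illustrate what that post-processing step looks like, here is a hypothetical sketch. In practice the idx2word mapping comes from the vocabulary built during preprocessing; here I simply fake one:

# Hypothetical sketch: idx2word is a placeholder vocabulary made up for illustration.
idx2word = {i: f'word_{i}' for i in range(VOCAB_SIZE)}
caption = ' '.join(idx2word[token] for token in generated_tokens)
print(caption)  # e.g. "word_5627 word_3906 word_2370 ..."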

With the above output, we’ve reached the end of our discussion on image captioning. Over time, many other researchers have attempted to improve on this approach. So, I think in an upcoming article I will discuss the state-of-the-art methods on this topic.

Thanks for reading, I hope you learn something new today!

By the way, you can also find the code used in this article here.

[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. arXiv. [website] [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. arXiv. [website] [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. arXiv. [website] [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. [website]/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. [website] [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. [website]?v=htNmFL2BG34 [Accessed November 16, 2024].


Market Impact Analysis

Market Growth Trend

Year         2018   2019   2020   2021   2022   2023   2024
Growth rate  23.1%  27.8%  29.2%  32.4%  34.2%  35.2%  35.6%

Quarterly Growth Rate

Quarter      Q1 2024  Q2 2024  Q3 2024  Q4 2024
Growth rate  32.5%    34.8%    36.2%    35.6%

Market Segments and Growth Drivers

Segment                      Market Share  Growth Rate
Machine Learning             29%           38.4%
Computer Vision              18%           35.7%
Natural Language Processing  24%           41.5%
Robotics                     15%           22.3%
Other AI Technologies        14%           31.8%

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity:

(Hype-cycle chart: AI/ML, Blockchain, VR/AR, Cloud, and Mobile plotted from Innovation Trigger through Peak of Inflated Expectations, Trough of Disillusionment, and Slope of Enlightenment to Plateau of Productivity.)

Competitive Landscape Analysis

Company       Market Share
Google AI     18.3%
Microsoft AI  15.7%
IBM Watson    11.2%
Amazon AI     9.8%
OpenAI        8.4%

Future Outlook and Predictions

The AI technology landscape is evolving rapidly, driven by technological advancements and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:

Year-by-Year Technology Evolution

Based on current trajectory and expert analyses, we can project the following development timeline:

2024Early adopters begin implementing specialized solutions with measurable results
2025Industry standards emerging to facilitate broader adoption and integration
2026Mainstream adoption begins as technical barriers are addressed
2027Integration with adjacent technologies creates new capabilities
2028Business models transform as capabilities mature
2029Technology becomes embedded in core infrastructure and processes
2030New paradigms emerge as the technology reaches full maturity

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:

(Maturity curve diagram: adoption/maturity plotted against development stage, from Innovation and Early Adoption through Growth and Maturity to Decline/Legacy; interactive diagram available in full report.)

Innovation Trigger

  • Generative AI for specialized domains
  • Blockchain for supply chain verification

Peak of Inflated Expectations

  • Digital twins for business processes
  • Quantum-resistant cryptography

Trough of Disillusionment

  • Consumer AR/VR applications
  • General-purpose blockchain

Slope of Enlightenment

  • AI-driven analytics
  • Edge computing

Plateau of Productivity

  • Cloud infrastructure
  • Mobile applications

Technology Evolution Timeline

1-2 Years
  • Improved generative models
  • Specialized AI applications
3-5 Years
  • AI-human collaboration systems
  • Multimodal AI platforms
5+ Years
  • General AI capabilities
  • AI-driven scientific breakthroughs

Expert Perspectives

Leading experts in the AI tech sector provide diverse perspectives on how the landscape will evolve over the coming years:

"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."

— AI Researcher

"Organizations that develop effective AI governance frameworks will gain competitive advantage."

— Industry Analyst

"The AI talent gap remains a critical barrier to implementation for most enterprises."

— Chief AI Officer

Areas of Expert Consensus

  • Acceleration of Innovation: The pace of technological evolution will continue to increase
  • Practical Integration: Focus will shift from proof-of-concept to operational deployment
  • Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
  • Regulatory Influence: Regulatory frameworks will increasingly shape technology development

Short-Term Outlook (1-2 Years)

In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI tech challenges:

  • Improved generative models
  • Specialized AI applications
  • Enhanced AI ethics frameworks

These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.

Mid-Term Outlook (3-5 Years)

As technologies mature and organizations adapt, more substantial transformations will emerge in how AI systems are approached and implemented:

  • AI-human collaboration systems
  • Multimodal AI platforms
  • Democratized AI development

This period will see significant changes in system architecture and operational models, with increasing automation and integration between previously siloed functions. Organizations will shift from reactive to proactive postures.

Long-Term Outlook (5+ Years)

Looking further ahead, more fundamental shifts will reshape how AI is conceptualized and implemented across digital ecosystems:

  • General AI capabilities
  • AI-driven scientific breakthroughs
  • New computing paradigms

These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach AI as a fundamental business function rather than a technical discipline.

Key Risk Factors and Uncertainties

Several critical factors could significantly impact the trajectory of AI tech evolution:

Ethical concerns about AI decision-making
Data privacy regulations
Algorithm bias

Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.

Alternative Future Scenarios

The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:

Optimistic Scenario

Responsible AI driving innovation while minimizing societal disruption

Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.

Probability: 25-30%

Base Case Scenario

Incremental adoption with mixed societal impacts and ongoing ethical challenges

Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.

Probability: 50-60%

Conservative Scenario

Technical and ethical barriers creating significant implementation challenges

Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.

Probability: 15-20%

Scenario Comparison Matrix

Factor                   Optimistic      Base Case     Conservative
Implementation Timeline  Accelerated     Steady        Delayed
Market Adoption          Widespread      Selective     Limited
Technology Evolution     Rapid           Progressive   Incremental
Regulatory Environment   Supportive      Balanced      Restrictive
Business Impact          Transformative  Significant   Modest

Transformational Impact

Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.

The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.

Implementation Challenges

Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.

Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.

Key Innovations to Watch

Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.

Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.

Technical Glossary

Key technical terms and definitions to help understand the technologies discussed in this article.

Understanding the following technical concepts is essential for grasping the full implications of the technologies discussed in this article; they provide context for both technical and non-technical readers.

  • platform (intermediate): Platforms provide standardized environments that reduce development complexity and enable ecosystem growth through shared functionality and integration capabilities.
  • computer vision (intermediate)
  • NLP (intermediate)
  • transfer learning (intermediate)
  • embeddings (intermediate)
  • machine learning (intermediate)
  • neural network (intermediate)
  • deep learning (intermediate)
  • algorithm (intermediate)