Okay, let's see if it's working. We have an image being uploaded.
Could you describe this image please?
We get back a description. The way this works is we convert the File into a data URL and send it as a part alongside the user's text.
Convert a File to a data URL:
```ts
const fileToDataURL = (file: File) => {
  return new Promise<string>((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result as string);
    reader.onerror = reject;
    reader.readAsDataURL(file);
  });
};
```
We send two parts in sendMessage: a text part with the user's input, and a file part whose url is the data URL (a hosted URL also works) and whose mediaType is the IANA media type taken from the File:

```ts
sendMessage({
  parts: [
    {
      type: 'text',
      text: input,
    },
    {
      type: 'file',
      mediaType: file.type,
      url: await fileToDataURL(file),
    },
  ],
});
```
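To see where that call lives, here's a minimal sketch of a client component wiring it together. It assumes AI SDK v5's useChat hook from @ai-sdk/react with its default /api/chat endpoint, plus hypothetical input and file state; the component name and markup are illustrative, not from the lesson:

```tsx
'use client';

import { useChat } from '@ai-sdk/react';
import { useState, type FormEvent } from 'react';

// fileToDataURL is the helper defined above.

export function Chat() {
  const { sendMessage } = useChat(); // messages, status, etc. are also available
  const [input, setInput] = useState('');
  const [file, setFile] = useState<File | null>(null);

  const handleSubmit = async (e: FormEvent<HTMLFormElement>) => {
    e.preventDefault();
    if (!file) return;

    // Send the text and the image as two parts of one message.
    sendMessage({
      parts: [
        { type: 'text', text: input },
        {
          type: 'file',
          mediaType: file.type,
          url: await fileToDataURL(file),
        },
      ],
    });

    setInput('');
    setFile(null);
  };

  return (
    <form onSubmit={handleSubmit}>
      <input
        type="file"
        accept="image/*"
        onChange={(e) => setFile(e.target.files?.[0] ?? null)}
      />
      <input
        value={input}
        onChange={(e) => setInput(e.target.value)}
        placeholder="Ask about the image..."
      />
      <button type="submit">Send</button>
    </form>
  );
}
```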
On the server, we convert UI messages, call the model, and stream the response back to the UI:
```ts
const modelMessages: ModelMessage[] =
  convertToModelMessages(messages);

const streamTextResult = streamText({
  model: google('gemini-2.0-flash'),
  messages: modelMessages,
});

const stream = streamTextResult.toUIMessageStream();

return createUIMessageStreamResponse({
  stream,
});
```
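Putting that together, a rough sketch of the full route handler might look like this. The file location, request parsing, and handler shape are assumptions (a Next.js-style POST handler); the AI SDK calls are the ones shown above:

```ts
// app/api/chat/route.ts (assumed location; adjust to your setup)
import {
  convertToModelMessages,
  createUIMessageStreamResponse,
  streamText,
  type ModelMessage,
  type UIMessage,
} from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(req: Request) {
  // UI messages arrive from the client, including the file part
  // carrying the data URL and media type.
  const { messages }: { messages: UIMessage[] } = await req.json();

  // Convert UI messages (parts, data URLs) into model messages.
  const modelMessages: ModelMessage[] = convertToModelMessages(messages);

  // Call the model with the multimodal messages.
  const streamTextResult = streamText({
    model: google('gemini-2.0-flash'),
    messages: modelMessages,
  });

  // Stream the result back to the UI as a UI message stream.
  const stream = streamTextResult.toUIMessageStream();
  return createUIMessageStreamResponse({ stream });
}
```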
Just a little bit of front-end work lets us pass an image directly to the LLM. This multimodal flow is straightforward with the AI SDK.
If you're wondering about other modalities (transcribing audio, generating images, etc.), the AI SDK likely supports them too.
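For instance, image generation with the AI SDK looks roughly like this. This is only a sketch: experimental_generateImage is an experimental API, and the provider and model name here are assumptions, so check the current docs before relying on it:

```ts
import { experimental_generateImage as generateImage } from 'ai';
import { openai } from '@ai-sdk/openai';

// Generate an image from a text prompt (model choice is an assumption).
const { image } = await generateImage({
  model: openai.image('dall-e-3'),
  prompt: 'A watercolor painting of a lighthouse at dawn',
});

// `image` holds the generated image data (e.g. as base64).
```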
Nice work!