While testing open-source Gemma-4 and Qwen 3.6 large language models (LLMs) for my hobbyist OpenClaw setup, I was both amazed and genuinely alarmed by the sheer power of their AI image analysis.
Just a few weeks after OpenClaw went viral, the “cloud-first AI” era was disrupted when Google published the powerful, resource-efficient, and open-source Gemma-4 and Alibaba their Qwen 3.6 LLMs. These LLMs are free, run locally without cloud access, and keep your data private. Especially the Gemma-4 E2B model runs in only 6GB of memory with small CPUs often found in inexpensive edge devices like surveillance cameras. The permissive Apache 2.0 open-source license will further support a widespread use of those LLMs, as companies can use those models for free in their own commercial products and services.
This sounded promising for me who spent far too much money on Anthropic’s Claude Opus 4.x API tokens to run OpenClaw.
AI “Super-Power”: Turning Photos into Intelligence
Let’s take a hands-on look at the AI image analysis of a typical city travel photo: the entrance hall of São Bento train station in Porto, Portugal.

I tested the open source Gemma-4 E4B LLM from Google, because it can run locally on small personal computers or even on my Pixel 8 smartphone. It also supports agentic AI workflows with function-calling, JSON support and multimodal input via text, images, audio and video. Perfect for me and local OpenClaw Agentic AI usage.
In addition, I tested the new open-source Qwen 3.6 35B LLM from Alibaba because it is the largest of the new models from Google and Alibaba. I expected it to produce the most detailed AI image analysis results of those new models, which turned out to be true after directly comparing its analysis results with the largest Gemma-4 LLM.
Both of the tested LLMs are mixture of expert (MoE) LLMs, which typically have lower memory and CPU power requirements.
Image Analysis with the small Gemma-4 E4B
Here is the Gemma-4 LLM image analysis result of the São Bento train station travel photo above. The LLM has a file size of only approx. 8GB, which together with some LLM working space and Linux resources easily fit into my NVIDIA RTX 3060 12GB VRAM.
LLM Prompt:
create a detailed description of the image and a summary
Click here if you are interested in the Gemma-4 LLM thinking steps …
Thought for 8.31 seconds
1. Analyze the Request:
The user wants a “detailed description of the image” and a “summary.”
2. Examine the Image:
I need to thoroughly observe all elements in the photograph.
- Setting/Location: It appears to be a large, grand, historical indoor space—likely a train station or major terminal (suggested by the departure boards).
- Architecture: High ceilings, ornate moldings, pillars, and arches. The walls are covered in extensive, detailed blue-and-white murals/frescoes (suggesting classical or revival style). There is a central, large arched window structure framing clocks and information screens.
- Focal Points(Foreground/Midground):
- Multiple digital display boards (departure/arrival/train times) are visible on the left side, indicating transportation services.
- A prominent circular clock face is mounted within an archway below a large window structure.
- The floor is polished stone or tile, reflecting light.
- People: There are numerous people scattered throughout the hall—travelers, staff, general public. They are dressed in casual, modern clothing. The crowd density is moderate.
- Lighting: Bright, artificial lighting, supplemented by natural light filtering through the large arched windows/skylight structure. The overall mood is bustling but orderly.
3. Draft the Detailed Description (Section 1):
Structure this descriptively, moving from general ambiance to specific details (architecture, function, activity).
- Focus on Scale and Style: Grand, historical, mural work.
- Focus on Function: Transportation hub, evident by the departure boards.
- Detailing Elements: Clocks, arches, murals, signage (departure boards).
- People/Atmosphere: Busy, active, sense of travel.
4. Draft the Summary (Section 2):
This needs to synthesize the key elements into a concise overview. What is the core essence? It’s a busy scene in an architecturally magnificent public space dedicated to transit.
5. Review and Refine:
Ensure the tone is objective, detailed, and accurately reflects the visual information. (Self-Correction: Make sure to mention the mix of old architecture and modern technology—the murals vs. digital boards.)
6. Final Output Generation.
This matches the structured response provided below.
LLM Answer:
🖼️ Detailed Image Description
The image captures the expansive interior of what appears to be a grand, historical train station or major public terminal. The scene conveys a sense of bustling activity within an architecturally magnificent setting that blends classical elements with modern technology.
Architecture and Decor: The space is characterized by monumental scale and ornate detail. The walls are covered in elaborate blue-and-white murals (frescoes) depicting scenes that suggest historical or civic life, adding a rich cultural depth to the environment. Massive arched structures dominate the center, framing large window sections and housing prominent round clocks.
In the foreground and midground, the functional elements of transit are visible: multiple digital departure boards display real-time information (train times, destinations, etc.). The flooring is polished stone or tile, reflecting the bright overhead lighting. To the right, another grand archway leads further into the terminal space. The overall aesthetic suggests late 19th or early 20th-century architecture that has been meticulously preserved and updated for modern use.
Focal Points: The most prominent focal points are the large, centrally mounted clocks and the digital display boards on the left. These elements anchor the image by clearly defining the location’s function as a transportation hub—a place where time is regulated and travel occurs. The mixture of classical murals with brightly lit LED screens creates a striking juxtaposition of permanence and constant movement.
People and Activity: The hall is populated by a moderate number of people—travelers, commuters, and general public. They are scattered across the wide floor space, walking in various directions, indicating the typical flow of activity at a station. The people add scale to the enormous architecture and imbue the scene with life.
Lighting and Atmosphere: The lighting is bright and even, provided by a combination of large natural light sources (visible through the arches) and internal artificial fixtures. The overall atmosphere is one of controlled chaos—busy, but orderly; grand, yet functional.
📜 Summary
This image captures a dynamic scene within a majestically preserved historical train station. The terminal features breathtaking architecture characterized by monumental arched ceilings, extensive blue-and-white murals, and large circular clocks that define the passage of time. Despite its grandeur, the space is highly utilitarian, featuring prominent modern digital display boards providing real-time transit information. The hall is moderately busy with travelers moving about, illustrating the constant rhythm of a major transportation hub where historical beauty meets the demands of modern global travel.
62.28 tok/sec, 1062 tokens, 2.24s duration
Verdict: The open-source Gemma-4 E4B LLM provided an astonishingly detailed analysis of the travel photo in a total time of less than 11 seconds. Now, let’s buckle up for AI image analysis with the larger open-source Qwen-3.6 LLM.
Image analysis with larger Qwen 3.6 35B A3B
Here is the Qwen 3.6 LLM image analysis result of the São Bento train station photo above.
LLM Prompt:
create a detailed description of the image and a summary
Click here if you are interested in the Qwen 3.6 LLM thinking steps …
Thought for 1 minutes 20 seconds
The user wants a detailed description and a summary of the provided image.
1. Detailed Description Strategy:
- Setting: Identify the location. It looks like a grand train station. The architecture (arches, tiles) suggests Portuguese style, specifically São Bento Station in Porto. I should mention the tile work (azulejos).
- Foreground/Midground (People): Describe the crowd. People are walking, standing, looking at screens. They are dressed in casual, cool-weather clothing (jackets, long pants).
- Key Elements (Center):
- Departures/Arrivals Board: A large digital screen hanging from the ceiling. It lists destinations like “Marco de Canaveses,” “Braga,” “Aveiro,” “Penafiel.” It has a blue side (Departures) and a green side (Arrivals).
- Clock: A large, classic clock mounted high up on the wall, centered above the screen. It says “PORTO” below it.
- Windows: Large arched windows behind the clock and screen. They have a grid pattern and yellow/gold framing.
- Background/Walls:
- Tiles: The walls are covered in intricate blue and white tile murals (azulejos). One mural on the left depicts a landscape with figures. Another on the right shows historical scenes.
- Architecture: High vaulted ceilings with decorative molding. Stone arches framing the windows and doorways.
- Floor: Checkered tile floor (black and white/grey).
2. Summary Strategy:
- Core Subject: A busy interior of a historic train station (likely São Bento in Porto, Portugal).
- Atmosphere: Bustling but orderly. A mix of transit and tourism (people looking at art).
- Key Visuals: The contrast between the modern digital departure board and the classical tile work/clock.
Drafting the Description:
- Start with the overall scene: High angle, wide shot of a train station hall.
- Focus on the central architectural feature: The large arched window and the clock.
- Describe the tiles: Mention the blue and white azulejos covering the walls, depicting scenes.
- Describe the information board: Digital screen with blue (departures) and green (arrivals) lists. Note specific destination names (Braga, Aveiro).
- Describe the people: Scattered throughout, some looking at the board, some walking. Casual clothing.
- Describe the floor: Checkered pattern.
- Describe the ceiling: Ornate, yellow and white molding.
Drafting the Summary:
- “This image captures the bustling main hall of a grand, historic train station, recognizable as the São Bento Station in Porto, Portugal.”
- “The scene is dominated by the contrast between modern transit information and classical art.”
- “A large digital departure and arrival board hangs centrally, while the walls are adorned with magnificent blue and white tile murals.”
- “Travelers and tourists mill about on a checkered floor, beneath high arched windows and a large clock.”
Refining the details:
- Be specific about the clock face (Roman numerals).
- Mention the text “PARTIDAS / DEPARTURES” and “CHEGADAS / ARRIVALS”.
- Note the “SAIDA” (Exit) sign on the right arch.
Final Polish: Combine these observations into a coherent narrative structure. Ensure the tone is descriptive and objective.
LLM Answer:
Detailed Description:
The image captures the grand, high-ceilinged interior of a historic train station, identifiable by its architecture and tile work as the São Bento Station in Porto, Portugal. The perspective is from the floor level, looking slightly upward toward a central architectural focal point.
Central Elements:
Dominating the upper center is a large, arched window with a grid of small panes framed in gold-tinted glass. Mounted directly above the window is a large, black-framed analog clock with Roman numerals, reading approximately 10:10. Below the clock, the name “PORTO” is inscribed. Suspended beneath the clock and window is a large, dual-sided digital information board. The left side is blue, labeled “PARTIDAS / DEPARTURES,” listing destinations like Marco de Canaveses, Braga, and Aveiro. The right side is green, labeled “CHEGADAS / ARRIVALS.”
Architecture and Walls:
The walls are covered in intricate azulejos (traditional blue and white Portuguese ceramic tiles). To the left of the central arch, a large mural depicts a rural landscape with figures and a boat. To the right, similar murals show historical scenes and figures. The architecture features massive stone arches and columns supporting a ceiling with ornate, cream-colored molding and yellow accents. On the far right, an archway is labeled “SAIDA” (Exit) above the doorway.
The People and Floor:
The floor is paved with a geometric pattern of light and dark tiles. The hall is filled with a mix of travelers and tourists. In the foreground, people are dressed in cool-weather clothing like jackets, hoodies, and jeans.
- On the left, a few people stand near the wall, looking toward the center.
- In the center, a man in a red and white striped shirt stands with his back to the camera, looking at the departure board.
- To the right, groups of people are walking or standing in clusters, some looking toward the right side of the hall.
- Further back, near the right archway, more people are visible, some taking photos or looking toward the exit.
Summary:
This image depicts the bustling main hall of the historic São Bento Station in Porto. The scene highlights a striking contrast between the station’s classical artistry—specifically the vast blue-and-white tile murals covering the walls and the ornate, arched ceiling—and modern transit functionality, represented by the large digital departure and arrival boards. Travelers and tourists alike populate the checkered floor, checking schedules and moving through the grand, light-filled space.
9.35 tok/sec, 1321 tokens, 163.50s duration
LLM: Qwen-3.6 MoE 35B A3B Q4_K_M GGUF
Download size: 20.55GB
Verdict: In my view, the image analysis of the Qwen 3.6 LLM is breathtaking:
- The exact location of São Bento Station in Porto was determined: historic train station, identifiable by its architecture and tile work as the São Bento Station in Porto, Portugal
- Cool weather detection: They are dressed in casual, cool-weather clothing (jackets, long pants).
- Understanding about what the people are doing: People are walking, standing, looking at screens. …
- On the left, a few people stand near the wall, looking toward the center.
- In the center, a man in a red and white striped shirt stands with his back to the camera, looking at the departure board.
- To the right, groups of people are walking or standing in clusters, some looking toward the right side of the hall.
- Further back, near the right archway, more people are visible, some taking photos or looking toward the exit.
- Train destinations such as Marco de Canaveses, Braga, Aveiro, and Penafiel were detected.
- Description of the ceiling: Ornate, yellow and white molding and the floor: The floor is paved with a geometric pattern of light and dark tiles.
- The details of the motifs shown on the wall were identified: The walls are covered in intricate blue and white tile murals (azulejos). One mural on the left depicts a landscape with figures. Another on the right shows historical scenes.
- The Atmosphere of the scene: Bustling but orderly. A mix of transit and tourism (people looking at art).
- In my opinion, the LLM’s image summary above is a good example of a human-written summary by a professional analyst.
Unfortunately, LLM did only partially fit into my NVIDIA 12GB VRAM, so PP (prompt processing) and TG (token generation) executed very slow in more than 4 minutes.
On a Mac Mini M5 Pro with 48+ GB high-bandwidth unified RAM for executing the LLM inference with Ollama, I would expect less than 60s duration for the complete image analysis.
The Gemma-4 26B A4B LLM performed slightly less well in the image analysis.
Rundown
Those three insights matter most to me:
- The ability of free and local large language models (LLMs) to recognize people, animals, and objects in an image, as well as understand people’s actions, attire, and state of mind, is both promising and scary. So is their ability to understand the interconnections of details within an image and the greater context of an image.
- Powerful local AI image analysis based on open-source large language models (LLMs) with minimal memory and CPU requirements will likely have a significant impact on our society because this analysis will run on inexpensive edge devices, such as surveillance cameras. These LLMs will also accelerate the adoption of the Artificial Intelligence of Things (AIoT). In addition, the Apache 2.0 open-source license of the Gemma-4 LLMs allows companies to use the powerful LLMs free of charge for commercial use.
- The growing ability of AI to analyze photos with unprecedented detail using large language models (LLMs) is a wake-up call for digital privacy. In my view, storing personal photos in big tech clouds, such as Apple iCloud and Google Photos, provides a disturbingly clear window into one’s private life. As an EU citizen, I will switch to a private European cloud, such as Nextcloud, to keep my personal life truly private. This change will extend beyond just my photos to include my files, calendar, and other private data.
The Eight Key Concepts of OpenClaw and Infographic
The Spy in the Fridge: The security Risks of AIoT
Comments are welcome
Constructive comments (via the comment function at the bottom of this page) are greatly appreciated and suitable changes and additions to this blogpost will be taken into account. All statements in this blog post reflect the personal opinion of the author, which may not always be accurate due to incomplete information and are not factual claims.
Please note that comments are subject to manual review to prevent spam, which may cause a delay in their display.
Shortlink
For easier sharing and access to this blog post, use this short link: