Skip to content

Volume 3: VLM / DINO Guide

Estimated time: 30–40 minutes Goal: Understand the capabilities and boundaries of VLM and DINO, and learn to define new detection rules just by writing a sentence Prerequisites: Completed the Scenario Configuration Guide (Volume 2); familiar with basic algorithm configuration workflows No extra equipment needed: This volume uses built-in demo videos and the image testing feature — no camera required

Why This Chapter Matters

A Problem You've Definitely Encountered

When working through the 18 built-in algorithms in Volume 2, you've probably noticed they cover the most common detection scenarios — hard hats, fire, falls, pedestrian flow, and so on. But your customers will inevitably ask for things those models don't cover:

  • "Is there garbage in the hallway?"
  • "Is the fire cabinet door closed?"
  • "Is the garbage bin full?"
  • "Has the construction barricade fallen over?"
  • "Has the city wall been vandalized?"

With a traditional custom model approach, each of these requests typically means:

StepEffortTimeline
Collect on-site images500–2,000 photos1–2 weeks
Data labelingManual bounding box / class annotation1–2 weeks
Model training + tuningRequires GPU servers + ML engineers1–2 weeks
Edge deployment + adaptationModel conversion + performance tuning3–5 days
On-site validation + iterationRe-collect and retrain if accuracy is insufficientUncertain

A single custom scenario takes at least a month from requirement to deployment, sometimes longer. And your customer might have 5 or 10 such requests at the same time.

Two Large Model Tools at Your Disposal

CosmoEdge includes two built-in large model capabilities, each designed for different types of long-tail requirements:

VLM (Visual State Judgment)DINO (Open-Vocabulary Detection)
In a nutshellJudges thestate of a scene (yes/no, open/closed)Findswhere targets are in the frame (draws bounding boxes)
InputWrite a prompt (a question)Write atarget name
OutputYES / NOBounding boxes + confidence scores (just like CV models)
Typical use case"Is there garbage in the hallway?" → YES"garbage" → marks garbage locations on screen
On-screen behaviorNo bounding boxes — results appear in the alarm panelBounding boxes overlaid directly on video
Analysis speed~2–3 sec per frame~1–2 sec per frame

Quick rule of thumb:

  • Want to know "is it or isn't it?" → Use VLM
  • Want to know "where is it?" → Use DINO

Neither requires model training or data labeling. Write a sentence and you're up and running.

Capability Boundaries (Please Read This)

Before getting started, understand what these tools can and cannot do.

✅ Good Fit for Large Models

Scenario TypeToolExample
State judgment (yes/no, present/absent, open/closed)VLM"Is the fire cabinet door closed?" → YES/NO
Simple classification (pick one of a few states)VLM"Is the indicator light red, green, or yellow?"
Open-vocabulary detection (find a specific object)DINO"fire extinguisher" → marks its position on screen
Rare / long-tail scenarios (no training data available)VLM / DINO"Wall vandalism marks" — impossible to build a training dataset
Low-frequency inspection (a few seconds between checks is fine)VLM / DINOAnalyzing one frame every 2.5 seconds is sufficient for patrol use
Quick feasibility checkVLM image testUpload sample event photos, get batch judgment results

❌ Poor Fit for Large Models

Scenario TypeReasonRecommended Alternative
Real-time detection (< 100 ms response)Large models need 1–3 sec/frameUse the CV models from Volume 2
Precise counting (how many people/vehicles — VLM not suitable)Large models aren't good at spatial quantificationUse detection + tracking scenario tasks
Multi-channel concurrencyLarge models are resource-intensiveCV models support 16 concurrent channels
Action recognition (continuous motion / state transitions)Single-frame judgment has limited capabilityUse specialized behavior analysis models

Bottom line: Large models are best for "low-frequency, broad-coverage, long-tail" scenarios. They don't replace the CV models from Volume 2 — they fill the gaps those models can't cover. The three approaches complement each other; they don't compete.

Part 1: VLM Visual State Judgment

VLM's core capability: Look at an image and answer a yes-or-no question.

How VLM Works

plain
Video frame

Extract one frame every few seconds ← Frame rate (configurable)

Crop to your region of interest ← ROI selection (you draw a box on screen)

Send the cropped image + your question to VLM ← Prompt (the sentence you write)

VLM answers: YES or NO (or your custom options)

If the answer triggers an alarm condition → Event record + alarm notification

Scenario 1: River Floating Debris Detection (Full Walkthrough)

Requirement: River management staff need to know if there's floating debris, and if so, automatically alert the cleanup crew. Why traditional approaches struggle: Floating debris takes countless forms (leaves, plastic bags, bottles, miscellaneous junk). Training a single detection model to cover all debris types is prohibitively expensive. VLM's advantage: No need to enumerate every possible shape — if a human can tell "there's debris here," VLM can too.

1.1 Configure the Prompt (Core Step)

  1. Navigate to the River Floating Debris Detection algorithm.

    Click Scenario TasksRiver Floating Debris DetectionPipeline Orchestration to enter the orchestration page.

A VLM pipeline typically involves four stages:

  • Video Decode: Decode the RTSP stream (make sure the target event is visually discernible in the frame).
  • Data Preprocessing: Frame sampling, ROI cropping, and other preprocessing for the large model input.
  • Vision Language Model: Feed the preprocessed image and prompt to the VLM for inference.
  • Event Reporting: When the VLM output triggers the reporting logic, the result is sent to the system.
  1. Edit the prompt

    Click the Vision Language Model node to expand its configuration panel. Edit the prompt, then click Save.

Parameter reference:

ParameterPurposeNotes
Select Base ModelChoose the multimodal model type
Frame RateHow many frames per second to sample for analysis (fps; the lower the value, the longer the interval between analyses)Large models are compute-intensive — use a smaller value to lengthen the interval
Advanced Prompt ModeEnable full prompt configurationWhen on, you write a complete prompt sentence; when off, just enter the detection target
PromptEnter your prompt text
Generation StyleControls output randomnessStrict = low randomness; Standard = normal; Creative = high randomness; Custom = manual control

Recommended settings:

Additional Features

Third-party OpenAI Model Integration

  1. Select OpenAI Interface

  1. Configure Third-party Model

1.2 Select the VLM Algorithm and Bind a Channel

  1. Download the demo video. Video URL: github.com/cosmo-wander-ai/cosmo-edge/releases/tag/v1.0-videos.

  2. Upload the demo video

    Click Video SourcesAdd → Upload an offline video → Enter Scenario Task Assignment.

Under All Services, select River Floating Debris Detection and click Add Region.

💡 The large model's default detection region is the full frame.

  1. Draw a Region of Interest (ROI)

    Just like in Volume 2, you can draw a region to focus VLM on a specific area. The more precise the region, the less background noise, and the more accurate the judgment.

  1. Configure detection parameters

    Switch to the Parameter Settings tab, set the alarm interval, and click Save to save the configuration and start the service.

1.3 View Detection Results

  1. Click Live PreviewSelect ChannelAlgorithm Overlay.

    Select River Floating Debris Detection as the overlay algorithm.

Alarm notification (the large model takes about 30 seconds to initialize on first load — subsequent inference is unaffected):

Results analysis:

  • The video feed plays normally (VLM does not overlay bounding boxes on the video — this is expected behavior).
  • The alarm panel on the right shows detection events:
    • When floating debris is present → alarm triggers, with a captured snapshot
    • When no debris is present → no alarm

⚠️ Visual difference from Volume 2 Volume 2's CV models draw real-time bounding boxes on the video. VLM does not draw boxes — its results appear in the alarm panel and event records. This is not a bug; it's how the two technologies differ in output:

  • CV models tell you "what is where" (bounding boxes)
  • VLM tells you "what's the state of the scene" (YES/NO)

1.4 Event Query

Click Event CenterDetection/Analysis and filter by your criteria to review alarm records.

Each VLM alarm includes:

  • Captured snapshot: The exact frame the VLM analyzed
  • Judgment result: YES (debris present) or NO (no debris)
  • Timestamp: When the judgment was made

Click an alarm image to view details. The alarm information appears in the upper right.

💡 Difference from Volume 2 In Volume 2, you selected fixed-purpose algorithms like "No Hard Hat." Here, you're selecting a general-purpose VLM engine — what it can do depends entirely on the prompt you write next.

You've now completed your first VLM scenario configuration.

Scenario 2: City Wall Vandalism Detection (Validating a Brand-New Requirement)

Requirement: A cultural heritage organization needs to monitor an ancient city wall and detect signs of human vandalism. What makes this unique: No AI vendor offers an off-the-shelf "city wall vandalism detection" model. With a traditional custom approach, just defining "what counts as vandalism" would take weeks of back-and-forth. Learning focus: Not the operational steps (same as Scenario 1), but how to use image testing to quickly validate feasibility and how to iterate on prompts.

2.1 First: Can VLM Even Do This?

For a brand-new custom requirement, don't jump straight to connecting a video feed and creating a scenario task — events may occur infrequently in video, making it hard to verify results quickly. Start with the Image Test feature for rapid prompt validation.

  1. Create a City Wall Vandalism Detection VLM Image Analysis task.

    Click Scenario TasksNew Task.

Set data source type to Image Analysis.

Click OK to save.

Click Pipeline OrchestrationOperationsAdd ComponentVision Language Model to add the VLM node.

The entire pipeline only needs one Vision Language Model node.

Click the Vision Language Model node, configure the prompt, and click Save.

Prompt: Determine whether the person in the image appears to be vandalizing the wall.

The above covers all the steps needed to orchestrate a VLM pipeline.

  1. Upload images for quick validation

    Click Image Analysis to open the image inference page. Select the PIC City Wall Vandalism Detection algorithm.

Upload the images to analyze.

Click Start Analysis. (Note: the image testing service is created at this point and needs a few seconds to load the model.)

Analysis complete:

Results include both "detected" and "not detected" outcomes. Click through for detailed information.

Click an image marked as Target Detected to view confidence scores and logical results.

Result interpretation:

  • If the result matches expectations → Judgment is correct; the prompt works.
  • If the result is unexpected or inconsistent → Adjust the prompt (see next section).

💡 The Value of Image Testing No video feed needed. No scenario task to configure. No waiting around for events to happen in a video stream. Just upload real-world event photos and get instant results. This is the lowest-cost way to validate whether VLM can handle a given scenario. We recommend image testing before every new scenario goes live. For complex or high-stakes scenarios, prepare a meaningful set of positive and negative sample images and measure precision and recall.

2.2 Prompt Iteration: When the First Version Isn't Good Enough

In practice, the first version of your prompt may not be accurate enough. Here are common adjustment strategies:

Case 1: Missed detections (vandalism present but not caught)

Make the question more specific by describing visual characteristics:

plain
❌ "Is there a problem with the city wall?"         ← Too vague
✅ "Does the wall surface show cracks, missing sections, or graffiti?"  ← Specific visual features

Case 2: False positives (no vandalism but alarm triggered)

Narrow the scope or adjust the ROI to exclude interference:

plain
❌ "Has the wall been damaged?"                ← "Damaged" is too broad — natural weathering might trigger it
✅ "Are there visible signs of deliberate carving or chiseling on the wall surface?"  ← Focus on "deliberate" + "specific form"

Case 3: Unstable results

  • Check if the ROI is too large, capturing excessive irrelevant background.
  • Simplify the prompt — avoid overly complex descriptions.
  • Test with multiple images to confirm consistency.

General prompt iteration workflow:

plain
Write a prompt → Validate with image testing (positive + negative samples)
    ↓ Not good enough
Analyze the cause (missed detections / false positives / instability) → Adjust prompt → Re-validate
    ↓ Results look good
Connect to a video feed and configure as a persistent scenario task

Keys to Writing a Good VLM Prompt

#RuleExplanation
1Ask closed-ended questionsOnly ask questions that can be answered with "yes/no" or "A/B/C" — not open-ended questions
2Ask about visible thingsVLM can only analyze image content — make sure your question refers to something visible in the frame

VLM — Good Prompts vs. Bad Prompts

✅ Good Prompt❌ Bad PromptWhat's Wrong
"Is there garbage in the hallway?""Describe the hallway situation"Open-ended; uncontrollable output
"Is there standing water on the ground?""Is the ground wet?""Wet" isn't a visual feature; hard to judge
"Is the fire cabinet door closed?""Is there a problem with the fire cabinet?"Too vague — what counts as "a problem"?
"Is the indicator light red or green?""How many indicator lights does the device have?"VLM isn't good at counting
"Are there visible signs of deliberate carving on the wall?""Is the wall broken?""Broken" is too broad

2.3 After Validation: Configure as a Live Video Scenario Task

Once the prompt passes image testing, set it up on a live video channel following the same steps as Scenario 1:

  1. Bind the VLM algorithm to the city wall monitoring channel.
  2. Enter the validated prompt.
  3. Draw an ROI around the main wall area.
  4. Set the frame rate (for heritage patrol, 0.1 = one analysis every 10 seconds is typically sufficient).
  5. Go to Live Preview to verify results.

These steps were covered in detail in Scenario 1 and are not repeated here.

Part 2: DINO Open-Vocabulary Detection

In the first two scenarios, VLM answered "is it or isn't it" questions — it tells you the scene's state but doesn't tell you where targets are in the frame.

Some scenarios require a different approach:

  • "I want to find all fire extinguishers in the frame and see where they are."
  • "I want to mark all cats, dogs, or unusual objects on screen."
  • "I want the system to draw boxes around every tool in the construction zone."

These use cases don't need YES/NO — they need bounding boxes, just like the CV models from Volume 2, but without training.

That's what DINO does: you write a target name, and the system automatically finds and boxes all matching objects in the frame.

How DINO Works

plain
Video frame

Extract one frame every few seconds ← Frame rate (configurable)

Send the frame + your target name to DINO ← Prompt (e.g., "garbage")

DINO finds all matching targets and outputs bounding boxes + confidence scores

Bounding boxes are overlaid on the video ← Same OSD display as CV models

If targets are detected → Event record + alarm notification

Scenario 3: Open-Vocabulary Object Detection (Streamlined)

Requirement: Detect and mark "person" and "garbage" in a hallway camera feed. Compared to VLM hallway garbage: VLM only tells you "is there garbage or not"; DINO tells you "where exactly the garbage is in the frame."

DINO's configuration flow is very similar to VLM. Below covers only the key differences.

3.1 Configure the Prompt

VLM prompts are questions; DINO prompts are target names (or multiple names separated by periods).

  1. Go to Algorithm ManagementDINO Hallway Garbage DetectionPipeline OrchestrationDetection Vision Language Model → Configure the prompt → Save and exit.

    Enter the targets you want to detect:

plain
person.garbage

Note: Separate targets with periods (English dots).

⚠️ Use English for DINO prompts

3.2 Bind the DINO Algorithm

  1. Go to Video SourcesScenario Task Assignment.

  1. Under Service Assignment, select a DINO-type algorithm and click Add Region.

  1. Adjust the region

    Position the detection region appropriately. In this scenario, ensure it covers the platform area.

  1. Configure parameters, then Save and start the service.

3.3 View Results

Go to Live Preview and enable the algorithm overlay.

DINO bounding boxes may appear with a slight delay — this is normal.

Detection alarm results:

3.4 View Event Reports

Go to Event Center to review alarms.

This concludes the DINO open-vocabulary detection workflow.

3.5 Common DINO Prompts Quick Reference

ChineseEnglish PromptNotes
person
car / vehicle"vehicle" has broader coverage
垃圾garbage / trash
灭火器fire extinguisher
cat
dog
箱子box / package
椅子chair
烟雾smoke
火焰fire / flame

When in doubt, try common English nouns. DINO supports an extremely wide range of categories — most everyday objects can be detected.

Choosing the Right Tool: CV Model / VLM / DINO

Now that you've completed this volume, you've mastered all three AI analysis capabilities the system provides. When facing a new detection requirement, follow this decision tree:

plain
What do you need?

├─ Is it covered by one of the 18 built-in algorithms from Volume 2?
│   └─ ✅ Yes → Use the CV model (most mature, fastest, supports multi-channel concurrency)

├─ Not covered by built-in algorithms. I need to —
│   │
│   ├─ Judge a state (present/absent, yes/no, open/closed)
│   │   └─ → Use VLM — write a question
│   │
│   ├─ Find where a target is (need bounding boxes)
│   │   └─ → Use DINO — write a target name
│   │
│   └─ Need real-time + precise + high concurrency
│       └─ → Requires a custom-trained model (contact us)

Full comparison:

CV Models (Volume 2)VLM (This Volume)DINO (This Volume)
Rule sourcePre-trained fixed modelA question you writeA target name you write
OutputBounding boxesYES / NOBounding boxes
SpeedReal-time (25 fps)~2–3 sec/frame~1–2 sec/frame
Concurrency16 channelsSingle-channel serialSingle-channel serial
Switching scenariosSwap modelsChange one sentenceChange one word
Best forHigh-frequency mainstream scenariosLong-tail state judgmentLong-tail object detection
On-screen displayReal-time bounding boxesAlarm panel onlyBounding boxes (slower)
Alarm panelDetected target + confidenceYES/NO judgment resultDetected target + confidence
Pipeline performance statsOverlaid on video (Decode/OSD/Encoder — low latency)Not on videoOverlaid on video (Decode/OSD/Encoder — high latency)

💡 Think of VLM and DINO as the "universal backup" for CV models. Your system first covers mainstream scenarios with the 18 mature scenario tasks. For anything those can't handle, VLM and DINO step in on demand — no waiting for model training, ready to go the same day.

Appendix

DINO Prompt Rules

#RuleNotes
1Use English nounsThe current version is more stable with English
2Separate multiple targets with periodse.g.,person.garbage.fire extinguisher
3Be specificfire extinguisher works better than red thing

VLM Reference Scenario Templates

ScenarioReference PromptOutputROI Suggestion
Overflowing trash bin"Is the garbage bin full or overflowing?"YES / NOFrame the trash bin area
Fire cabinet door status"Is the fire equipment cabinet door closed?"YES / NOFrame the cabinet door area
Door status"Is the door currently open?"YES / NOFrame the door area
Equipment indicator light"What color is the equipment indicator light?"RED / GREEN / YELLOWFrame the indicator light
Hallway obstruction"Is the fire escape hallway blocked by objects?"YES / NOFrame the hallway floor
Workstation status"Is there someone working at this workstation?"YES / NOFrame the workstation area

New Scenario Validation Checklist

Whether using VLM or DINO, confirm the following before any new scenario goes live:

  • [ ] Validated with Image Testing using at least 3 positive samples (images that should trigger alarms)
  • [ ] Validated with Image Testing using at least 3 negative samples (images that should NOT trigger alarms)
  • [ ] Tested the same image 3 consecutive times with consistent results (stability check)
  • [ ] ROI has been drawn to focus on the specific area of interest, excluding irrelevant background
  • [ ] Frame rate has been set to a reasonable value for the scenario

What's Next

GoalRead
Look up parameter details or get troubleshooting help→ Reference Manual
Learn about model porting (deploy your own models to the device)→ Reference Manual — Model Porting Overview

Released under the Apache 2.0 License.