Combining foundation model segmentation, metric 3D geometry, and interoperability standards to generate structured indoor spatial intelligence.
Indoor environments—buildings, tunnels, underground structures—lack the spatial intelligence infrastructure available for outdoor spaces. Three practical barriers limit progress:
Bandwidth. Dense 3D geometry can reach tens of gigabytes per floor—impractical for transmission over degraded or portable networks during active incidents.
Labeled data. Training 3D scene understanding models requires large annotated datasets. Manual point cloud labeling is time-intensive—a single large building can take weeks of skilled effort—and existing indoor datasets have minimal coverage of safety-critical infrastructure (standpipes, AEDs, electrical shutoffs, fire alarm panels).
Interoperability. No common vocabulary exists for describing building features across agencies and regions.
Prior work on Point Cloud City Open3D-ML evaluated 3D ML models for public safety point cloud labeling using NIST PSIAP datasets. That effort revealed two findings:
-
Labeled indoor data is scarce. Training robust 3D segmentation models requires substantial annotated data that public safety datasets lack.
-
Point clouds and images have complementary strengths. Point clouds represent geometry precisely but are difficult to label semantically. Images are well-suited to semantic tasks—foundation models trained on internet-scale image data can identify objects that no 3D model has seen.
INSIGHT bridges this gap: use foundation model understanding on 2D images, then transfer semantics to 3D geometry through registered depth data.
INSIGHT implements a semantic lifting pipeline: 2D images with registered depth data are processed through foundation model segmentation to produce structured 3D building intelligence.
RGB Images + Depth Data
↓
SAM3 (text-prompted 2D segmentation)
↓
Geometric lifting (project masks into 3D via depth)
↓
Instance fusion (merge detections across viewpoints)
↓
┌────────────────────────────────────────┐
│ │
▼ ▼
Pointcept Training Data ISO 19164 Scene Graphs
(full geometry + labels) (compact, transmittable)
SAM3's text-prompted segmentation identifies objects in 2D images ("fire extinguisher," "electrical panel," "exit sign") without task-specific training. Registered depth data provides metric 3D coordinates per pixel. Projecting 2D masks through depth produces semantically-labeled 3D segments that combine foundation model understanding with calibrated sensor accuracy.
The system produces two outputs:
-
Pointcept-compatible point clouds with per-point semantic labels and instance IDs for training 3D models.
-
Lightweight scene graphs encoding building structure as a queryable hierarchy—compressing tens of gigabytes of source geometry to single-digit megabytes, small enough to transmit over constrained networks.
Objects appear in multiple camera views. INSIGHT maintains a global instance registry, merging detections when 3D centroids fall within a spatial threshold (default: 0.5 meters). Result: one instance per physical object, with geometry aggregated from all viewing angles.
Scene graph output implements ISO 19164 (IndoorGML) concepts: buildings contain floors, floors contain features, features have geometric and semantic properties. The class taxonomy aligns with IFC (Industry Foundation Classes) naming conventions.
Standardized vocabularies enable:
- CAD/BIM export compatibility
- Cross-agency data sharing
- Aggregated training datasets
- Dispatch and field application integration
Different disciplines require different information. Scene graphs support query-time filtering via filter_classes_for_responder():
| Role | Priority Features |
|---|---|
| Firefighter | Doors, stairs, exit signs, windows, fire extinguishers, standpipes, fire hose cabinets, sprinklers, fire alarm panels, fire alarm pulls, electrical panels, gas shutoffs |
| Hazmat | Gas shutoffs, water shutoffs, electrical panels, doors, windows |
| Police/Tactical | Doors, windows, stairs, elevators, columns, walls |
| EMS | Doors, stairs, elevators, ramps, AEDs |
| Search & Rescue | Doors, stairs, windows, columns, walls, ceilings, floors |
Exit signs, doors, and stairs are always detected regardless of responder role.
23 classes organized by operational function:
| Category | Classes |
|---|---|
| Egress & Access | door, window, stairs, elevator, ramp, exit_sign, railing |
| Fire Suppression | fire_extinguisher, standpipe, fire_hose_cabinet, sprinkler |
| Fire Alarm | fire_alarm_panel, fire_alarm_pull |
| Utility Control | electrical_panel, gas_shutoff, water_shutoff |
| Medical | aed |
| Structural | wall, floor, ceiling, column |
| Obstacles | furniture |
Each class includes metadata: ISO class name, superclass, relevance category (CRITICAL, EGRESS, CONTROL, CONTEXT, OBSTACLE), and responder priority level (CRITICAL, HIGH, MEDIUM, LOW).
Building
└── Floor
├── floor_surface (CellSpaceBoundary)
├── wall_surface (CellSpaceBoundary)
├── ceiling_surface (CellSpaceBoundary)
├── {area}_instance_1 (door)
├── {area}_instance_2 (fire_extinguisher)
├── {area}_instance_3 (electrical_panel)
└── ...
Each feature node stores: semantic class, ISO metadata, 3D oriented bounding box (center, extent, rotation matrix), detection confidence, and responder priority.
Size: Compression of approximately four orders of magnitude—e.g., 39.3 GB geometry database reduced to 2.2 MB scene graph.
{
"coord": np.float32 (N, 3), # XYZ in meters
"color": np.float32 (N, 3), # RGB normalized [0, 1]
"normal": np.float32 (N, 3), # Surface normals (estimated)
"segment": np.int64 (N,), # Semantic class ID
"instance": np.int64 (N,), # Instance ID
}Compatible with Pointcept training pipelines (Point Transformer, SparseUNet, etc.).
| File | Description |
|---|---|
geometry_database.h5 |
Compressed point cloud storage with per-instance datasets |
instance_mapping.json |
Maps instance string IDs to integer IDs and class names |
processing_checkpoint.json |
Tracks processed images for resume capability |
model_detection_log.csv |
Per-detection logging with confidence scores and status |
# Process an area
python run.py /path/to/area_1
# Process with Pointcept export
python run.py /path/to/area_1 --export-pointcept
# Custom output directory
python run.py /path/to/area_1 --output-dir ./my_results
# Adjust confidence threshold
python run.py /path/to/area_1 --conf-threshold 0.4
# Downsample output point cloud
python run.py /path/to/area_1 --export-pointcept --voxel-size 0.02Processing automatically checkpoints progress. If interrupted, simply re-run the same command to resume:
# Automatically resumes from last checkpoint
python run.py /path/to/area_1
# Verify checkpoint integrity (dry run)
python run.py /path/to/area_1 --verify-checkpoint --dry-run
# Verify and fix checkpoint, then reprocess missing
python run.py /path/to/area_1 --reprocess-missing| Option | Default | Description |
|---|---|---|
area_path |
(required) | Path to area directory containing data/rgb/ |
--output-dir |
output_results |
Output directory for all results |
--weights-dir |
model_weights |
Directory containing SAM3 model weights |
--export-pointcept |
off | Export to Pointcept .pth format after processing |
--conf-threshold |
0.3 | Minimum confidence score for detections |
--voxel-size |
none | Voxel size (meters) for downsampling; if unset, no downsampling |
--verify-checkpoint |
off | Verify checkpoint against actual outputs |
--dry-run |
off | With --verify-checkpoint, only report (don't modify) |
--reprocess-missing |
off | Verify checkpoint and reprocess any missing images |
Stanford 2D-3D-Semantics format:
area_1/
└── data/
├── rgb/
│ └── {frame_id}_domain_rgb.png
└── global_xyz/
└── {frame_id}_domain_global_xyz.exr
RGB images provide visual input to SAM3. EXR files encode world-space XYZ coordinates per pixel.
output_results/
├── model_detection_log.csv # Detection log across all areas
├── pointcept/
│ ├── class_mapping.json # Class names, IDs, responder groups
│ └── {area_name}.pth # Pointcept training data (if exported)
└── {area_name}/
├── results/ # Annotated RGB images with detections
│ └── {frame_id}.png
├── geometry_database.h5 # Point cloud data by instance
├── {area_name}_scene_graph.graphml # Scene graph output
├── instance_mapping.json # Instance ID mappings
└── processing_checkpoint.json # Resume checkpoint
| Parameter | Default | Description |
|---|---|---|
| Confidence threshold | 0.3 | Minimum score to accept a detection |
| Mask threshold | 0.5 | Threshold for binary mask generation |
| IoU threshold | 0.5 | IoU threshold for duplicate removal (NMS) |
| Parameter | Default | Description |
|---|---|---|
| Merge distance | 0.5 m | Max centroid distance to merge as same instance |
| Min points | 10 | Minimum points required to create valid instance |
| Parameter | Default | Description |
|---|---|---|
| GC interval | 25 frames | Run garbage collection every N frames |
| H5 flush interval | 50 frames | Flush H5 database every N frames |
| Checkpoint interval | 10 frames | Save checkpoint every N processed images |
- Python 3.11+
- PyTorch with CUDA support
- NVIDIA GPU recommended (CPU processing supported but slow)
pip install -r requirements.txtCore dependencies:
- transformers (Hugging Face) — SAM3 model
- Open3D — Point cloud processing
- OpenEXR — Depth data reading
- NetworkX — Scene graph storage
- h5py — Geometry database
- rich — Progress display
SAM3 is a gated model on Hugging Face. To download it:
- Create a Hugging Face account at https://huggingface.co/join
- Accept the license at https://huggingface.co/facebook/sam3
- Get your access token at https://huggingface.co/settings/tokens
For local installation, log in once:
huggingface-cli loginFor Docker, pass your token as an environment variable (see Docker section below).
If local weights exist in ./model_weights/sam3/, they will be used instead of downloading.
Build and run with GPU support:
# Build the image
docker build -t insight .
# Show help
docker run --rm insight
# Process an area (requires HF_TOKEN for gated model)
docker run --rm --gpus all \
-e HF_TOKEN=your_huggingface_token \
-v /path/to/dataset:/datasets \
-v /path/to/output:/usr/src/app/output_results \
insight /datasets/area_1 --export-pointcept
# With cached model weights (avoids re-downloading)
docker run --rm --gpus all \
-v /path/to/dataset:/datasets \
-v /path/to/weights:/usr/src/app/model_weights \
-v /path/to/output:/usr/src/app/output_results \
insight /datasets/area_1 --export-pointcept
# Alternative: Mount local HuggingFace cache (if already logged in)
docker run --rm --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /path/to/dataset:/datasets \
-v /path/to/output:/usr/src/app/output_results \
insight /datasets/area_1 --export-pointceptVolume mounts:
| Mount Point | Purpose |
|---|---|
/datasets |
Input data (area directories) |
/usr/src/app/model_weights |
SAM3 weights cache |
/usr/src/app/output_results |
Processing outputs |
/root/.cache/huggingface |
HuggingFace cache (optional) |
Note: Requires NVIDIA Container Toolkit for GPU support.
Pre-incident planning. Scene graphs load instantly on mobile devices, enabling review of egress paths, standpipe locations, and utility shutoffs before arrival.
Training data generation. Programmatic labeling bootstraps datasets for 3D semantic segmentation. Generated labels can be human-verified rather than created from scratch.
Bandwidth-constrained operations. Scene graphs compress tens of gigabytes of source geometry to single-digit megabytes—approximately four orders of magnitude—making building intelligence transmittable over portable mesh networks.
Interoperability research. ISO-aligned taxonomy provides a testbed for cross-agency data sharing.
INSIGHT addresses capability gaps identified by NIST Public Safety Communications Research (PSCR) Division's Location-Based Services portfolio, which focuses on indoor mapping, tracking, and navigation for the public safety community.
- Stanford 2D-3D-Semantics Dataset
- Pointcept
- IndoorGML (ISO 19164) | OGC IndoorGML
- NIST PSCR Location-Based Services
- Point Cloud City Open3D-ML — Prior work on 3D ML evaluation for public safety
MIT
@software{insight_2026,
title={INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer},
author={Dimopoulos, Alexander Nikitas},
year={2026},
url={https://github.com/alexdimopoulos/insight-sam3}
}
@article{carion_sam3_2025,
title={SAM 3: Segment Anything with Concepts},
author={Carion, Nicolas and Gustafson, Laura and Hu, Yuan-Ting and Debnath, Shoubhik and Hu, Ronghang and Suris, Didac and Ryali, Chaitanya and Alwala, Kalyan Vasudev and Khedr, Haitham and Huang, Andrew and Lei, Jie and Ma, Tengyu and Guo, Baishan and Kalla, Arpit and Marks, Markus and Greer, Joseph and Wang, Meng and Sun, Peize and R{\"a}dle, Roman and Afouras, Triantafyllos and Mavroudi, Effrosyni and Xu, Katherine and Wu, Tsung-Han and Zhou, Yu and Momeni, Liliane and Hazra, Rishi and Ding, Shuangrui and Vaze, Sagar and Porcher, Francois and Li, Feng and Li, Siyuan and Kamath, Aishwarya and Cheng, Ho Kei and Doll{\'a}r, Piotr and Ravi, Nikhila and Saenko, Kate and Zhang, Pengchuan and Feichtenhofer, Christoph},
journal={arXiv preprint arXiv:2511.16719},
year={2025}
}
@article{armeni_joint_2017,
title={Joint 2D-3D-Semantic Data for Indoor Scene Understanding},
author={Armeni, Iro and Sax, Sasha and Zamir, Amir R. and Savarese, Silvio},
journal={arXiv preprint arXiv:1702.01105},
year={2017}
}
