Author manuscript; available in PMC: 2025 Sep 3.
Published in final edited form as: Dev Built Environ. 2024 Feb;17:100376. doi: 10.1016/j.dibe.2024.100376

An AI-based approach to create spatial inventory of safety-related architectural features for school buildings

Chong Di 1, Jie Gong 1,*
PMCID: PMC12404342  NIHMSID: NIHMS2100201  PMID: 40904423

Abstract

An understanding of the spatial information for safety-related architectural features inside a school building is crucial for effective responses to emergency situations such as active shooter or fire incidents. However, efficiently creating a high-quality spatial inventory for a wide variety of safety-related assets inside a large school building is a challenge for building owners. This study addresses this challenge by developing an AI-based approach for the rapid generation of spatial inventories of safety-related architectural features within school buildings from point cloud data. A method for creating cross-modality data from laser scans is proposed that not only addresses the limitations of individual 2D and 3D datasets but also enables the transfer of the segmentation and generalization performance of the state-of-the-art Segment Anything model to point cloud segmentation. Additionally, our proposed method demonstrates great potential for addressing the challenge of extracting glass façade frames from point cloud datasets.

Keywords: School buildings, Safety-related architectural features, Spatial inventory, Object detection, Point cloud segmentation, Glass façade extraction

1. Introduction

Concerns regarding the safety and security of school buildings have intensified due to the escalating numbers of casualties and property damage resulting from both natural disasters and man-made incidents, including structure fires and active shooter events across various educational institutions in recent years (CHDS, 1970; Campbell, 2020; Balamurugan et al., 2019). Given the vast and intricate nature of school buildings that serve as daily hubs for a significant number of occupants, it is insufficient to rely solely on security personnel for building emergency preparedness. As a crucial complement, the safety- and security-related architectural features embedded in school environments play a pivotal role in enhancing overall safety and security, profoundly influencing the behavior of occupants (Abdirad et al., 2020). For example, occupants can activate alarm switches during emergencies, and the ensuing blare of alarm horns serves as a warning for others to initiate evacuation procedures. Beyond such interactive safety devices, features like metal detectors and surveillance cameras fall within the security-related category, exhibiting deterrent effects that may reduce the likelihood of malicious attacks (Zhu et al., 2020a). However, the influence of architectural features on human behavior can be complex, especially when the scale and design of a school building, particularly one with extensive glass facades, affect indoor visibility. While such designs facilitate easier wayfinding during evacuations, they concurrently raise the risk of occupants being exposed to potential attackers (Natapov et al., 2022; Zhu et al., 2020b). Compared to general building assets, a thorough comprehension of the safety- and security-related architectural features specific to school buildings is therefore paramount. Developing a comprehensive inventory of these assets, elucidating details such as spatial locations, categories, and functions, serves as a foundational step that contributes significantly to studies within the realm of school safety and security.

Previous studies have diligently strived to survey and document safety-related assets in school buildings, uncovering numerous challenges within this research domain. School buildings are typically large and intricately designed environments, which makes it exceedingly time-consuming and labor-intensive for conventional surveying methods to create an inventory that comprehensively covers their safety assets (Dare et al.; Feng et al., 2021; Kubota et al., 2021; Kinateder et al., 2018). In contrast, recent surveying approaches hold significant promise for enhancing the accuracy and efficiency of building asset documentation tasks by integrating modern sensing technologies such as photogrammetry and laser scanning (Masciotta et al., 2023; Kurkela et al., 2022; Mortezapoor et al., 2022). These technologies are often combined with AI-based computer algorithms to detect (Jocher et al., 2022; Zhou et al., 2018; Krizhevsky et al., 2012; Simonyan et al., 2014; Chen et al., 2019) and segment target objects from vast datasets (Long et al., 2015; Chen et al., 2014, 2021; Noh et al., 2015; Qi et al., 2017; Wang et al., 2019a). However, the diverse nature of safety architectural features introduces heterogeneity in dimensions, styles, and physical characteristics, posing a substantial technical challenge that constrains the sensing and extraction performance of most surveying approaches (Di Filippo et al., 2018). For instance, laser scanning technology faces limitations in capturing reflective and transparent objects, adding complexity to the design of laser scan-based surveying frameworks aimed at extracting high-quality information from this architectural feature category (Tibebu et al., 2021). Furthermore, despite safety- and security-related architectural features being common in most public structures, such as school buildings, they exhibit a lower likelihood of human interaction compared to object categories related to people’s daily activities (e.g., furniture, vehicles), owing to the relatively rare occurrence of emergencies. Consequently, safety- and security-related school building assets have not received adequate research attention, leading to a scarcity of publicly available datasets, both in quantity and comprehensiveness, within this research domain (Liu et al., 2023; Zhang et al., 2023a; Huang et al., 2023).

While many as-built modeling projects using 2D imagery or laser scan data would ideally have included safety-related architectural features, the above challenges remain insufficiently addressed. In this study, we bridge these research gaps by proposing an AI-based framework tailored to extract safety- and security-related architectural features from cross-modality data, acquired with greater efficiency and comprehensiveness, in order to create high-quality inventories of safety- and security-related assets for school buildings. Our work contributes in several significant ways. First, we performed a systematic literature review and present a taxonomy of safety-related architectural features within school buildings and educational facilities. Second, we introduce a data projection approach that employs a generic camera model to establish a bi-directional pixel-to-point projection relationship for our cross-modality (2D-3D) datasets with minimal loss of accuracy. Third, our framework extends the segmentation and generalization performance of the state-of-the-art image segmentation framework Segment Anything (SAM) to point cloud data segmentation, illustrating the promising accuracy, scalability, generalizability, and practicality of our framework for semantic object extraction from 3D datasets. Lastly, our approach incorporates a similar workflow for extracting frames and estimating the spatial and geometric information of transparent architectural features without vastly increasing the complexity of our framework design; extracting glass architectural features from laser scan data is a well-known challenge.

2. Related works

The existing literature provides valuable insights into the taxonomy of architectural features within the research domain of indoor safety and security (Schwartz et al., 2016). While these features may exhibit significant variations in style and appearance, they fall into several categories based on their functionality. The first category comprises navigational architectural features that influence building occupants’ navigational behaviors during normal or evacuation situations. For example, doors generally connect different architectural spaces, and their capacities can influence the occupants’ navigation efficiency (Daamen et al., 2012a, 2012b). Signage and glass facades also fall into this category, as these two features substantially influence indoor visibility for wayfinding and exit options during emergencies (Natapov et al., 2022; Kubota et al., 2021; Ibrahim et al., 2022). It is noteworthy that glass facades, while enhancing visibility, also have the negative effect of increasing the risk of exposure during active shooter incidents (Kellom et al., 2018). The second category comprises entry control equipment, which is intended to limit school building access to authorized users and can sometimes be controlled remotely (Green, 2005; Chipley et al., 2012). Features in this category include electronic door locks, barricades, posted signs, radio frequency identification (RFID) cards, and biometric access control systems. Statistics reveal the prevalence of this architectural feature category: 88 percent of public schools and 80 percent of private schools already used controlled access to school buildings by 2011 (Robers et al., 2014). The third category involves video surveillance equipment, such as security cameras deployed in vulnerable or high-risk areas on school building sites to identify perpetrators (Aker, 2008). The fourth category is school-site alarms, aimed at mitigating situations once a threat to school safety has occurred (Green, 2005). Examples include fire alarms, passive infrared intruder alarms with body heat detection, scream alarms, and silent alarms (Schneider, 2001). The last category encompasses metal detectors and X-ray machines that are employed to inspect a person’s body or bags for weapons (Green, 2005). This architectural feature category is not prevalent in most school buildings due to the high daily costs associated with equipment and labor (Jackson et al.).

Most school buildings are large as-built environments, and therefore creating a comprehensive and high-quality spatial inventory of the safety-related architectural features inside a school building using traditional approaches is a costly task. With recent advancements in computer vision algorithms, there has been extensive exploration of object detection approaches for automated surveying tasks. Object detection frameworks can be divided into two categories, namely two-stage and one-stage 2D detectors (Xiao et al., 2020). The two-stage object detection framework, also known as the region-based framework, separates the tasks of object localization and object classification (Liu et al., 2020). Examples of two-stage object detection frameworks include RCNN (Girshick et al., 2014), SPP-Net (He et al., 2015), Fast RCNN (Girshick, 2015), Faster R-CNN (Ren et al., 2015) and FPN (Lin et al., 2017). In contrast, one-stage object detection frameworks, also called region-proposal-free frameworks, are intended to locate and categorize objects simultaneously using deep convolutional neural networks (DCNNs) without partitioning the task into two stages. Notable examples of one-stage object detection frameworks include DetectorNet (Szegedy et al., 2013), Overfeat (Sermanet et al., 2013), YOLO (Redmon et al., 2018), and SSD (Liu et al., 2016). It is worth noting that early YOLO models, including YOLOv1 (Redmon et al., 2016) and YOLOv2 (Redmon et al., 2017), often stack multiple convolution layers with fully connected layers, which results in fixed-sized feature maps. Such network structures are limited in multi-scale object detection, especially for small targets (Liu et al., 2022). This limitation was addressed by adopting the image pyramid strategy in later YOLO models (Singh et al., 2018), such as YOLOv3 (Redmon et al., 2018), YOLOv4 (Bochkovskiy et al., 2020), and YOLOv5 (Jocher et al., 2022). Furthermore, beyond the one- and two-stage categories of object detection approaches, a final category comprises backbone networks. These networks capture hierarchical features from the input image, which are then used by subsequent layers to make predictions. Backbone networks include AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan et al., 2014), ZFNet (Zeiler et al., 2014), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017).

3D object detection, a counterpart to image-based methods, extracts properties such as bounding box size and location and is commonly employed in traffic data monitoring (Mao et al., 2023). In contrast, point cloud segmentation approaches aim to perform a pointwise classification of the dataset and therefore allow for a fine-grained understanding. There are two categories of point cloud segmentation approaches (Yang et al., 2023): direct methods, including PointNet (Qi et al., 2017), DGCNN (Wang et al., 2019a), RGAM (Chen et al., 2021), and ResPointNet++ (Yin et al., 2021), operate directly on the point cloud data without relying on intermediate representations. While direct methods can be computationally efficient, annotating point clouds to acquire the training datasets is generally a time-consuming and labor-intensive task. On the other hand, indirect methods classify point clouds in two separate ways. In the first approach, studies segment point cloud data by projecting segmented 2D images back to the point cloud. A key limitation of this approach lies in the sensitivity of the point cloud segmentation results to the alignment quality between the 2D and 3D datasets, given that point cloud and image data are typically collected through distinct systems (Pellis et al., 2022; Zhang et al., 2018; Wang et al., 2019b; Mascaro et al., 2021). Conversely, studies employing the second approach do not rely on point cloud data input. Instead, they perform segmentation directly on a series of multi-view images, which can be fused to generate annotated point clouds (Murtiyoso et al., 2021, 2022; Genova et al., 2021).

AI-based methods are extensively implemented to efficiently and accurately create digital representations of as-built environments. Indoor asset documentation and facility management stand out as focal points for implementing AI-based feature extraction. Fang et al. (2022) demonstrated a MobileNet1.0-based image classification method applied to commercial surveying data, augmented with an online image training dataset, to optimize and automate a series of portfolio-based strategic asset management service processes, including condition surveying, data validation, standardization, and integration. Another approach, presented by Wei and Akinci (Wei et al., 2019), introduces a vision- and learning-based framework that employs a shared convolutional neural network to perform simultaneous localization and mapping (SLAM), aiding facility management in associating building components with their information repository in a digital twin framework. Wu et al. (2021) proposed an object detection algorithm and further measured the dimensions of the detected objects with RGB-D camera data. Balamurugan et al. (2019) and Kostoeva et al. (2019) proposed an online “one-shot learning” framework incorporating the Neural Turing Machine (NTM) architecture with augmented memory capacity and a human-in-the-loop to detect and localize safety-, security-, and energy-related building assets from smartphone images.

Moreover, glass façades are a unique category of architectural features and pose great challenges for extracting their attributes. Various approaches have been proposed to detect and localize glass objects. For example, intensity-based methods leverage the unique optical properties of glass, such as its transparency and reflectivity, to differentiate glass objects from the surrounding background (Zhao et al., 2020; Foster et al., 2013; Jiang et al., 2017; Wang et al., 2017). However, most intensity-based methods suffer from low accuracy when the distance or incidence angle of the laser beam increases. Another glass façade detection approach is termed frame-based methods, which focus on capturing the non-reflective boundary of the glass so that these objects can be captured in laser-scanned data. While glass façade frame boundaries can be detected and mapped through the collaboration of a team of robots (Kim et al., 2016), they can also be identified as the boundaries of hollow areas in flat building facades (Wang et al., 2011). Tibebu et al. (2021) introduced a frame-less point cloud-based glass surface detection approach that relies on extensive laboratory tests to capture the underlying variations in the ranging measurements between laser footprints passing through the glass objects and those that do not.

3. Method

Our overall methodology comprises two distinct pipelines (Fig. 1), both employing the cross-modality solution for feature extraction. However, these pipelines slightly differ in how they project their cross-modality datasets due to the distinct physical characteristics of their target architectural features, such as transparent versus non-transparent architectural features. To provide a comprehensive understanding of our methodology, the remainder of this section is structured as follows: Section 3.1 introduces the detailed process of projecting our cross-modality datasets for both pipelines. Section 3.2 gives a thorough description of the AI-based module integrated into our framework and how this method collaborates with our cross-modality datasets to extract building architectural features with respect to glass and non-glass objects. Lastly, Section 3.3 presents the evaluation metrics and methods pertinent to the outcomes derived from the two architectural feature extraction workflows.

Fig. 1. The study framework.

3.1. Cross-modality dataset creation

Similar to previous studies that fused images and point clouds for 3D object extraction (Pellis et al., 2022; Zhang et al., 2018; Wang et al., 2019b; Mascaro et al., 2021), our proposed framework is also aimed at creating a cross-modality representation of our laser-scanned dataset. The primary difference lies in the fact that previous studies often collected image data and laser-scanned data through separate systems, usually sourced from different manufacturers; the quality of their fused datasets was consequently tied to the errors introduced during system alignment procedures. In contrast, our study utilized an integrated system, the FARO Focus 350X static laser scanner, which collects a dense point cloud colorized by a high-resolution panoramic image taken from the same optic center (approximately 12 million points and a 32-megapixel panorama per scan). With the known location of the scanner’s optic center in each scan after point cloud registration, we can therefore synthesize high-quality cross-modality datasets through our data projection procedures.

Our preliminary experiment underscored the inherent difficulty of automatically detecting transparent glass objects in 2D images owing to various challenges, such as the heterogeneity of their dimensions, the complexity of the scenes behind them, and variations in lighting conditions. Consequently, we opted to divide our framework into two separate yet slightly different pipelines: a semi-automated pipeline involving human-in-the-loop intervention to extract the frames of glass architectural features, and a fully automated pipeline designed to detect and extract non-glass safety-related architectural features. Within the first pipeline, which focuses on glass frame extraction, cross-modality datasets were generated by projecting the point cloud into synthesized panoramic images. This choice was motivated by two primary considerations. First, a synthesized panoramic image encompasses all the scenes captured in the point cloud, thereby minimizing the extent of manual annotation required for glass features. Second, the image areas occupied by glass architectural features can vary significantly across the captured scenes, particularly when these features are in proximity to the scan location; creating cross-modal panoramas effectively preserves the entirety of the glass surface area and ensures the continuity of the glass objects’ frames.

To synthesize a cross-modal panorama for the glass frame extraction pipeline, we aimed to convert the world coordinates of the original 3D point cloud to an equirectangular projection. Since the world coordinates vary significantly in distance with respect to the scanner’s optic center, the point cloud coordinates were first normalized and represented by longitude and latitude (Fig. 2a). For any point p with 3D world coordinates (x, y, z), its longitude and latitude were calculated by

$$r=\sqrt{x^2+y^2+z^2},\qquad \theta=\arccos\frac{z}{r},\ 0\le\theta\le\pi,\qquad \varphi=\arctan\frac{y}{x},\ 0\le\varphi\le 2\pi \tag{1}$$

where r denotes the Euclidean distance from a point p(x, y, z) in the point cloud to the optic center O of the scanner, while θ and φ denote the polar angle (latitude) and azimuthal angle (longitude) of point p, respectively. The coordinates of point p in equirectangular coordinates are then

$$u(\varphi)=W_{pano}\times\frac{\varphi}{2\pi},\qquad v(\theta)=H_{pano}\times\frac{\theta}{\pi} \tag{2}$$

where $W_{pano}$ and $H_{pano}$ denote the pre-defined maximum width and maximum height of the projected panorama, while $u(\varphi)$ and $v(\theta)$ denote the horizontal and vertical coordinates of point p in the projected panorama, respectively (Fig. 2b). After all 3D points were converted to 2D points in this step, the final synthetic panorama was obtained by interpolating the RGB values of the projected image points using the nearest neighbor interpolation method. As a result, a bi-directional 2D-to-3D projection relationship was created: given a 3D point in the point cloud, its corresponding image pixel is known; conversely, given an image pixel, its 3D information can be accurately approximated by finding the point whose projected image coordinates are closest to that pixel.
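The following sketch illustrates Equations (1) and (2) in code. It is a minimal, illustrative implementation rather than the authors' released code: the panorama size, array names, and the omission of the nearest-neighbor gap filling are assumptions.

```python
import numpy as np

def project_to_panorama(points_xyz, colors_rgb, pano_w=8192, pano_h=4096):
    """Project Nx3 scanner-centered points onto an equirectangular panorama.

    Returns the panorama and each point's (row, col) pixel so the bi-directional
    pixel-to-point lookup described in the text can be stored.
    """
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                          # Eq. (1): distance to optic center
    theta = np.arccos(np.clip(z / r, -1.0, 1.0))             # polar angle in [0, pi]
    phi = np.mod(np.arctan2(y, x), 2 * np.pi)                # azimuth wrapped to [0, 2*pi)

    u = (pano_w * phi / (2 * np.pi)).astype(int) % pano_w    # Eq. (2): horizontal pixel
    v = np.clip((pano_h * theta / np.pi).astype(int), 0, pano_h - 1)  # vertical pixel

    pano = np.zeros((pano_h, pano_w, 3), dtype=np.uint8)
    pano[v, u] = colors_rgb                                  # nearest point colors each pixel
    return pano, np.stack([v, u], axis=1)                    # per-point pixel back-reference
```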

Fig. 2. Point cloud data projection with respect to different pipelines in our proposed study. a. Projecting point cloud to panorama by converting their original coordinates to equirectangular coordinates. b. Resulting synthetic panorama. c. Projecting point cloud data to 2D perspective images using an ideal camera model (the original point cloud was converted to spherical coordinates for better illustration). d. Point cloud projected to 2D image coordinates overlaying a synthetic perspective image. e. The point cloud in each scan was projected into 8 perspective images by rotating the virtual camera.

In the second pipeline, which focuses on non-glass safety-related architectural feature extraction, we synthesized cross-modal perspective images by projecting the point cloud using a generic pinhole camera model (Xia et al., 2022, 2024). For a point p with 3D world coordinates (x, y, z), its homogeneous image coordinates $[u, v, 1]^T$ were computed by

$$Z_c\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=\begin{bmatrix}K_{3\times3} & 0\end{bmatrix}\begin{bmatrix}R_{3\times3} & T_{3\times1}\\ 0 & 1\end{bmatrix}\begin{bmatrix}x\\ y\\ z\\ 1\end{bmatrix} \tag{3}$$

where $K_{3\times3}$ denotes the camera’s 3×3 intrinsic matrix, while $R_{3\times3}$ and $T_{3\times1}$ denote the 3×3 rotation matrix and 3×1 translation vector that together constitute the extrinsic parameters of the virtual camera. $Z_c$ denotes the homogeneous scale factor and can be dropped after normalization. More specifically, the camera parameter matrices can be expanded as

$$K_{3\times3}=\begin{bmatrix}f_x & 0 & p_x\\ 0 & f_y & p_y\\ 0 & 0 & 1\end{bmatrix},\qquad \begin{bmatrix}R_{3\times3} & T_{3\times1}\end{bmatrix}=\begin{bmatrix}r_{11} & r_{12} & r_{13} & t_1\\ r_{21} & r_{22} & r_{23} & t_2\\ r_{31} & r_{32} & r_{33} & t_3\end{bmatrix} \tag{4}$$

where the intrinsic parameters $f_x$, $f_y$, $p_x$, and $p_y$ in matrix $K_{3\times3}$ denote the focal lengths and the coordinates of the image principal point, all expressed in pixels. It is worth noting that the theoretical pinhole camera model does not introduce lens distortion to the resulting perspective image throughout the data projection procedure. In Equations (3) and (4), the intrinsic and extrinsic parameters of the virtual camera model are known quantities. Hence, for a point p with 3D world coordinates (x, y, z), its projected coordinates (if they exist) in the synthesized perspective image can be calculated from pre-specified parameters, including the intrinsic parameters, orientation, and location of the virtual camera model (Fig. 2c). In the perspective projection process, we simulated a virtual camera with a focal length of 3.15 μm and an image plane with a resolution of 3024 × 4032 pixels. For the extrinsic camera parameters, we kept the virtual camera model aligned with the scanner’s optic center to minimize gaps in the resulting perspective images. Additionally, we experimentally found 45° to be an optimal horizontal rotation increment for the virtual camera, ensuring some overlap between synthesized perspective images. As in the cross-modal panorama synthesis of the first pipeline, a nearest neighbor interpolation method was employed to fill the gaps between projected points in the resulting images (Fig. 2d and e). Likewise, the bi-directional 2D-to-3D projection relationship established between the 3D points and the synthetic images allowed us to extract the image pixels given 3D points and vice versa.
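A minimal sketch of Equations (3) and (4) is shown below, again as an illustration rather than the authors' implementation. The 3024 × 4032 image plane and the 45° horizontal rotation between the eight views per scan follow the text; the pixel focal length, the camera axis convention, and all names are assumptions.

```python
import numpy as np

W, H = 4032, 3024                                   # assumed (width, height) ordering of the image plane
fx = fy = 2000.0                                    # placeholder focal length in pixels
K = np.array([[fx, 0.0, W / 2],
              [0.0, fy, H / 2],
              [0.0, 0.0, 1.0]])

# Base rotation: camera +z (viewing direction) along world +x, image y pointing downward.
R_BASE = np.array([[0.0, -1.0,  0.0],
                   [0.0,  0.0, -1.0],
                   [1.0,  0.0,  0.0]])

def yaw(deg):
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def project_perspective(points_xyz, R, t=np.zeros(3)):
    """Apply extrinsics then intrinsics and drop the homogeneous scale Z_c (Eq. 3).

    Returns pixel coordinates plus the indices of the projected points, so the
    bi-directional pixel-to-point relationship can be stored per synthetic view.
    """
    cam = (R @ points_xyz.T).T + t                  # world -> camera frame
    idx = np.where(cam[:, 2] > 0)[0]                # keep points in front of the camera
    uvw = (K @ cam[idx].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                   # pixel coordinates (u, v)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv[inside], idx[inside]

# Eight synthetic views per scan: rotate the virtual camera horizontally by 45 degrees each time.
views = [R_BASE @ yaw(45 * i) for i in range(8)]
```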

3.2. Safety-related building asset extraction

Segment Anything (SAM) stands out as a notable zero-shot image segmentation framework that was trained on an extensive dataset encompassing over 1 billion masks on 11 million images (Kirillov et al., 2023). SAM’s significance extends beyond accuracy competitive with most fully supervised methods, as it demonstrates remarkable zero-shot and generalization performance across various segmentation tasks. In this study, we employed SAM as the primary solution to first segment the target architectural features from our synthesized 2D images and subsequently reference back their 3D information from the cross-modality datasets.

3.2.1. Glass object extraction

As outlined in Section 3.1, the human-in-the-loop intervention in the glass architectural feature extraction pipeline requires manually providing prompts for features in each synthetic panorama. Given the limited number of synthetic panoramas (one per static scan) and SAM’s rapid zero-shot performance, the amount of manual annotation required by this procedure is reasonably constrained. Supplied with image prompts, SAM effectively predicts masks for the glass architectural features. While SAM generally produces accurate object masks in 2D images, there are instances where it may overlook fine structures or generate small disconnected components (Kirillov et al., 2023). To capture the complete shape and extract the frame of potentially under-segmented glass objects, we computed a convex hull (α = 0) that encloses all mask pixels to approximate the entire 2D geometry of the object. Considering that SAM-predicted masks of a glass object may represent the scene behind it, we additionally created a buffer along the boundary of the mask’s convex hull to encompass the image pixels representing the actual boundary of the glass frame (Fig. 3a).
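The ring of candidate frame pixels can be obtained, for example, as sketched below. This is an assumed helper rather than the authors' code; the buffer width and the use of OpenCV morphology are illustrative choices.

```python
import numpy as np
import cv2

def frame_candidate_pixels(mask, buffer_px=15):
    """mask: HxW boolean SAM mask of one glass object.

    Returns a boolean ring straddling the convex hull of the mask, i.e., the buffer
    expected to contain the actual glass frame pixels (Fig. 3a).
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.int32)
    hull = cv2.convexHull(pts)                         # convex hull of all mask pixels

    hull_img = np.zeros(mask.shape, dtype=np.uint8)
    cv2.fillConvexPoly(hull_img, hull, 1)              # filled convex hull

    kernel = np.ones((2 * buffer_px + 1, 2 * buffer_px + 1), np.uint8)
    dilated = cv2.dilate(hull_img, kernel)             # hull expanded outward
    eroded = cv2.erode(hull_img, kernel)               # hull shrunk inward
    return (dilated - eroded).astype(bool)             # buffer along the hull boundary
```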

Fig. 3. Frame-based glass architectural feature extraction pipeline. a. Points near the true glass facade frames are extracted within the buffer created along the convex hull of the glass façade mask predicted by Segment Anything. b. The extracted 3D frame points. c. Random points are selected as the initial step of the point cloud skeletonization. d. Converged points representing glass frame skeletons. e. Glass surface reconstructed from the frame skeletons.

While the buffer along the mask of a glass object contains glass frame pixels, it also inevitably captures non-frame pixels, which generally belong to the scene behind the glass surface. Hence, we separate these two classes of image pixels by examining the depth information (the distance between points and the scanner’s optic center) of their corresponding 3D points. We employ the DBSCAN method to cluster the points within the glass façade mask buffer based on their depth information and discard the clusters with larger depth values. To minimize the number of frame points discarded due to occlusion, the mask buffer is divided into multiple segments and the clustering is performed segment-wise. The resulting 3D glass frame points are obtained by aggregating the points identified in each clustering operation (Fig. 3b).
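A minimal sketch of this segment-wise depth clustering is given below. It is an assumed implementation: the DBSCAN parameters, the angular segmentation of the buffer, and the rule of keeping only the nearest cluster are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def keep_frame_points(points_xyz, scanner_origin, n_segments=12, eps=0.05, min_samples=10):
    """points_xyz: Nx3 points back-referenced from the buffer pixels of one glass mask."""
    depth = np.linalg.norm(points_xyz - scanner_origin, axis=1)
    # Split the buffer points into angular segments as seen from the scanner (illustrative split).
    angles = np.arctan2(points_xyz[:, 1] - scanner_origin[1],
                        points_xyz[:, 0] - scanner_origin[0])
    seg_ids = np.digitize(angles, np.linspace(angles.min(), angles.max(), n_segments))

    kept = []
    for s in np.unique(seg_ids):
        sel = np.where(seg_ids == s)[0]
        if len(sel) < min_samples:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(depth[sel, None])
        valid = labels >= 0
        if not valid.any():
            continue
        # Keep the cluster closest to the scanner; farther clusters are the scene behind the glass.
        nearest = min(set(labels[valid]), key=lambda l: depth[sel][labels == l].mean())
        kept.append(sel[labels == nearest])
    return points_xyz[np.concatenate(kept)] if kept else points_xyz[:0]
```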

While our proposed workflow can separate most frame and non-frame points, the resulting frame point cloud may still be noisy owing to occlusion and to non-frame objects that share connections with the glass frames. More importantly, the sheer quantity of glass frame points does not improve the reconstruction and is often more than sufficient for recovering the glass surface information (e.g., surface area). We therefore leveraged the mean-shift algorithm as a denoising method that also skeletonizes the glass frame point cloud to facilitate the subsequent glass surface reconstruction (Li et al., 2017). The mean-shift algorithm is a mode-seeking algorithm that converges to the local density maxima of a neighborhood after a certain number of iterations. The algorithm is initialized with a pre-specified number of random seed points (fewer than the total number of points in the cloud) (Fig. 3c), and the updated position of seed i is formulated by

$$x_i^{k+1}=\frac{\sum_{j\in J} q_j^k\,\alpha_{ij}^k}{\sum_{j\in J}\alpha_{ij}^k}+\lambda\,\frac{\sum_{i'\in I\setminus\{i\}}\left(x_i^k-x_{i'}^k\right)\beta_{ii'}^k}{\sum_{i'\in I\setminus\{i\}}\beta_{ii'}^k} \tag{3}$$

where the first term is the classical mean-shift update, whereas the second term is a regularization term that prevents further accumulation once points have contracted onto their local center positions. $\lambda=\mu\sigma(x_i^k)$ is a balancing constant between the two terms. In addition, $x_i^k$ denotes the position of seed i at the current iteration and $x_i^{k+1}$ denotes its new position after one iteration, while $q_j^k$ and $x_{i'}^k$ are the original points and the other sample points, respectively, in the local neighborhood of $x_i^k$ at the current iteration.

$$\alpha_{ij}=\frac{\theta\!\left(\|x_i-q_j\|\right)}{\|x_i-q_j\|},\qquad \beta_{ii'}=\frac{\theta\!\left(\|x_i-x_{i'}\|\right)}{\|x_i-x_{i'}\|^2} \tag{4}$$

$$\theta\!\left(\|x_i-q_j\|\right)=e^{-\|x_i-q_j\|^2/R^2} \tag{5}$$

where $\theta(\cdot)=e^{-(\cdot)^2/R^2}$ is a Gaussian smoothing weight with a set neighborhood radius R, and $\|\cdot\|$ denotes the Euclidean distance between two points. In the glass frame skeletonization experiment, we start with approximately 200 random seed points and constrain the number of iterations to fewer than 5 (Fig. 3d), since the number of glass frame points was much lower than in the scenarios of previous studies (Li et al., 2017; Yang et al., 2019a). Following the well-documented implementation case in Huang et al. (2013), we empirically selected μ = 0.35 in our study. Once the point cloud skeleton of the glass frames had been extracted, the glass surface was reconstructed using the Delaunay triangulation method (Fig. 3e).
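The sketch below illustrates the regularized mean-shift contraction defined by the three equations above, following the L1-medial-skeleton formulation they cite. The seed count, iteration cap, and μ follow the text; the neighborhood radius R, the constant local σ, and all names are assumptions, and this is not the authors' implementation.

```python
import numpy as np

def skeletonize(frame_pts, n_seeds=200, n_iters=5, R=0.3, mu=0.35, rng=np.random.default_rng(0)):
    """Contract glass-frame points onto a skeleton via regularized mean-shift."""
    Q = frame_pts                                                  # original frame points q_j
    X = Q[rng.choice(len(Q), size=min(n_seeds, len(Q)), replace=False)].copy()  # seed points x_i

    theta = lambda d: np.exp(-(d ** 2) / (R ** 2))                 # Gaussian weight, Eq. (5)
    for _ in range(n_iters):
        X_new = X.copy()
        for i in range(len(X)):
            dq = np.linalg.norm(Q - X[i], axis=1) + 1e-9
            alpha = theta(dq) / dq                                 # attraction weights, Eq. (4)
            dx = np.linalg.norm(X - X[i], axis=1)
            dx[i] = np.inf                                         # exclude the seed itself
            beta = theta(dx) / (dx ** 2 + 1e-9)                    # repulsion weights, Eq. (4)
            lam = mu * 1.0                                         # placeholder for mu * sigma(x_i)
            attract = (alpha[:, None] * Q).sum(0) / (alpha.sum() + 1e-12)
            repulse = (beta[:, None] * (X[i] - X)).sum(0) / (beta.sum() + 1e-12)
            X_new[i] = attract + lam * repulse                     # contraction update
        X = X_new
    return X                                                       # skeleton points (Fig. 3d)
```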

3.2.2. Non-glass building feature extraction

In our pursuit to identify a broad spectrum of indoor object classes encompassing diverse styles and dimensions within safety- and security-related architectural features, the choice of an object detection method with robust generalization and detection capabilities is crucial. Hence, we employed the YOLOv5 model for the object detection task, leveraging its notable detection speed, strengthened multi-scale detection ability (Liu et al., 2022; Kim et al., 2022), and ease of implementation. Moreover, YOLOv5 integrates seamlessly with SAM: SAM can directly generate masks for target objects within YOLO-detected bounding boxes, providing zero-shot predictions that imbue the segmented masks with the semantic information of the YOLO detections (Fig. 4). Importantly, the workflow bridging the 2D detection and segmentation processes can operate in real time. To capture the 3D attributes of our target building assets, we extracted the 2D image pixels within SAM-predicted masks and referenced them back to the 3D points in the point cloud. Recognizing that the extracted asset might contain noisy points due to small gaps between assets and other building structures such as walls, floors, and ceilings, we applied a statistical outlier removal method, which identifies outliers within a localized neighborhood of points using a pre-defined distance threshold. Additionally, we observed that building assets with embedded glass, particularly doors, were common; these glass elements allow a significant number of laser footprints to land on potentially distant background scenes, making the extraction results susceptible to large localization errors. Consequently, we identified and extracted points penetrating glass objects using the procedure described in Section 3.2.1 and subsequently removed them from the point cloud.
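A minimal sketch of this detection-to-segmentation-to-3D workflow is shown below. It is assumed glue code rather than the authors' release: the checkpoint paths, the pixel-to-point lookup format, and the outlier-removal parameters are illustrative.

```python
import numpy as np
import torch
import open3d as o3d
from segment_anything import sam_model_registry, SamPredictor

yolo = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")   # fine-tuned YOLOv5 weights
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def extract_assets(image_rgb, pixel_to_point, cloud_xyz):
    """pixel_to_point: HxW array of point indices (-1 where no point projects to the pixel)."""
    predictor.set_image(image_rgb)
    assets = []
    for *box, conf, cls in yolo(image_rgb).xyxy[0].cpu().numpy():        # one detection per row
        masks, _, _ = predictor.predict(box=np.array(box), multimask_output=False)
        idx = np.unique(pixel_to_point[masks[0]])                         # mask pixels -> point indices
        idx = idx[idx >= 0]
        if len(idx) == 0:
            continue
        pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(cloud_xyz[idx]))
        pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
        assets.append((int(cls), np.asarray(pcd.points)))                 # class id + cleaned 3D points
    return assets
```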

Fig. 4. Feature extraction for non-glass building assets. Non-glass objects were detected by our fine-tuned YOLO model and their masks were inferred by SAM in 2D images. The 3D spatial locations of these objects were computed by referencing back to the point cloud from which the images were projected.

3.3. Performance metrics

In this section, we present the metrics and methods used to evaluate the extraction results for glass and non-glass architectural features. As spatial location information is a common attribute of both feature categories, we employed the same metrics and methods to evaluate the localization results (Wei et al., 2019). Given the extracted point set $P_o$ for an object o, the spatial location of this asset was quantified using the 3D geometric center (centroid) of all points in $P_o$. Likewise, the centroid of its ground truth point set $G_o$ was calculated with the same equation

$$\hat{c}_o=\frac{\sum_i p_i(x_i,y_i,z_i)}{m},\ p_i\in P_o,\ 0<i\le m;\qquad c_o=\frac{\sum_j p_j(x_j,y_j,z_j)}{n},\ p_j\in G_o,\ 0<j\le n \tag{6}$$

where $\hat{c}_o$ and $c_o$ denote the estimated and ground truth 3D centroids of object o, respectively, while $p_i$ and $p_j$ denote individual points in the extracted point set $P_o$ (with m points) and the ground truth point set $G_o$ (with n points), respectively. Following Equation (6), the localization error $E_{loc}(o)$ for object o was quantified using the Euclidean distance between the estimated and ground truth centroids

$$E_{loc}(o)=\left\|\hat{c}_o-c_o\right\| \tag{7}$$

Beyond localization, the evaluation methods and metrics differ between glass and non-glass architectural features because of their different extraction pipelines. Since glass objects primarily affect the visibility of the as-built environment, their surface information, including location, surface area, and orientation, which can be computed from the extracted frame points, is more informative than the frame itself. While the localization results of glass objects were evaluated using Equation (7), the surface area, another property reflecting the dimension of a glass object, was compared between the estimated and the ground truth reconstructed glass surfaces. The surface area difference was computed as

$$E_{surf\_diff}=\left|\hat{S}-S\right| \tag{8}$$

where $\hat{S}$ and $S$ denote the surface areas aggregated over all triangles in the reconstructed meshes of the extracted and ground truth glass objects, respectively. Furthermore, the angular error was evaluated using the acute angle between the two vectorized third principal components of the estimated and ground truth glass frame points (Zhao et al., 2020). The angular error $E_{ang\_diff}$ was formulated as

$$E_{ang\_diff}=\mathrm{abs}\!\left(\frac{\pi}{2}-\cos^{-1}\frac{\hat{v}_3\cdot v_3}{\|\hat{v}_3\|\,\|v_3\|}\right) \tag{9}$$

where $\hat{v}_3$ and $v_3$ denote the third principal components of the estimated and ground truth glass frame points, respectively.
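A minimal sketch of the glass evaluation metrics in Equations (6)-(9) follows. It is an assumed helper, not the authors' code; the principal components are taken from an SVD of the centered frame points, and Equation (9) is coded exactly as written above.

```python
import numpy as np

def third_principal_component(pts):
    """Direction of least variance of an Nx3 point set (the third principal component)."""
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[2]

def glass_metrics(est_pts, gt_pts, est_area, gt_area):
    e_loc = np.linalg.norm(est_pts.mean(0) - gt_pts.mean(0))            # Eqs. (6)-(7)
    e_surf = abs(est_area - gt_area)                                    # Eq. (8)
    v_est, v_gt = third_principal_component(est_pts), third_principal_component(gt_pts)
    cosang = np.clip(np.dot(v_est, v_gt) /
                     (np.linalg.norm(v_est) * np.linalg.norm(v_gt)), -1.0, 1.0)
    e_ang = abs(np.pi / 2 - np.arccos(cosang))                          # Eq. (9), as written in the text
    return e_loc, e_surf, e_ang
```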

For the automated non-glass architectural feature extraction pipeline, which encompasses an object detection process followed by an object segmentation process, we employed the mean detection accuracy (mAcc) to evaluate the object detection performance of our framework. More specifically, the mean detection accuracy represents the percentage of detected features over the total count of ground truth instances within the same object class. This metric was computed for each non-glass architectural feature class and can be aggregated to reflect either the overall detection performance or the detection performance within a specific school building testbed. Thus, for a specific architectural feature class f, the overall mean detection accuracy $mAcc_{overall}$ and the mean detection accuracy for a specific school building b can be formulated as

$$mAcc_{overall}=\frac{\sum_b d_b^f}{\sum_b g_b^f},\qquad mAcc_b^f=\frac{d_b^f}{g_b^f} \tag{10}$$

where the indices b and f denote the school building and the architectural feature class, respectively, while $d_b^f$ and $g_b^f$ denote the counts of detected and ground truth objects of architectural feature class f in school building b, respectively.

Furthermore, for the non-glass extraction pipeline, the 3D segmentation result was evaluated using the mean Intersection over Union (mIoU) (Pellis et al., 2022). This metric was computed for each individual non-glass object. More specifically, for an extracted object o with extracted point set $P_o$ and ground truth point set $G_o$ (Equation (6)), the intersection over union mIoU(o) is formulated as

$$mIoU(o)=\frac{\left|P_o\cap G_o\right|}{\left|P_o\cup G_o\right|} \tag{11}$$
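The sketch below illustrates Equations (10) and (11) as simple helpers operating on point index sets and per-building detection counts. It is an assumed implementation for illustration only.

```python
def point_iou(extracted_idx, ground_truth_idx):
    """Both arguments are 1-D arrays of point indices into the same registered point cloud."""
    p, g = set(extracted_idx.tolist()), set(ground_truth_idx.tolist())
    return len(p & g) / len(p | g) if (p | g) else 0.0               # Eq. (11)

def mean_detection_accuracy(detected_counts, ground_truth_counts):
    """Dicts keyed by building, holding counts for one feature class; returns overall mAcc (Eq. 10)."""
    return sum(detected_counts.values()) / sum(ground_truth_counts.values())
```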

4. Results

4.1. Data collection

In this study, our testbeds include four school buildings on the Rutgers University campus: Rutgers Richard Weeks Hall of Engineering (RWH), Rutgers Science and Engineering Resource Center (SEC), Rutgers Staged Lab (RUSL), and Rutgers Zimmerli Art Museum (RUZM). Drawing on our fieldwork experience and information from the school website, we summarize our findings in Fig. 5. First, the RWH building, a four-story educational facility completed in 2018, stands out with a modern architectural design characterized by extensive glass and modern architectural features such as smart ID card readers and remote door lock mechanisms. Second, RUZM, a one-story building constructed decades ago, has undergone frequent renovations compared to the other school buildings; despite its age, RUZM exhibits a blend of classic and modern architectural features. Lastly, the SEC (two-story) and RUSL (one-story) buildings are typical examples of school structures with classic architectural designs.

Fig. 5. Our field images illustrating the heterogeneity of the safety architectural features across various object classes and different school buildings.

Our field survey primarily involved static laser scans, with our FARO Focus 350X laser scanner operating in two modes: static and SWIFT. In static mode, the scanner functioned similarly to other static laser scanners, taking approximately 409 s per scan and generating over 44 million laser footprints and a 32-megapixel panoramic image. In SWIFT mode, each scan was completed in 30 s with slightly lower resolution for both the point cloud and the panorama (over 6 million laser footprints and a 20-megapixel panorama per scan). To assess the scan efficiency and data quality of these two modes, we surveyed the RWH, RUSL, and RUZM school buildings using static scan mode and the SEC building using SWIFT scan mode. The FARO SWIFT system significantly reduced the overall duration of our laser scanning activity (Table 1). Importantly, we found that despite the rapid scan speed leading to lower resolution, the point cloud data in this mode remained sufficient for our feature extraction procedures. In addition to laser scanning, we captured smartphone photos within the school building testbeds for the training data augmentation procedure.

Table 1.

Number of scans and laser mapping task durations.

Building/Level Scan mode Number of scans Total time used

RWH lvl4 FARO static 18 2 h
RWH lvl3 FARO static 33 3.7 h
SEC lvl2 FARO SWIFT 44 44 min
SEC lvl1 FARO SWIFT 34 34 min
RU Staged Lab FARO static 29 3.2 h
RU Zimmerli Museum FARO static 118 13.1 h

Most of our targeted safety-related architectural feature classes were identified from previous studies (Balamurugan et al., 2019; Dare et al.; Kostoeva et al., 2019). These classes encompass door, door exit push, smoke detector, exit sign, fire alarm, smart ID reader, security camera (CCTV), emergency light, and fire extinguisher; some are common across classic and modern school buildings while others are only present in modern school buildings. After an exhaustive search, we also found that our target object classes are very limited in publicly available datasets; we therefore prepared a customized dataset for model training purposes.

4.2. Training data augmentation

Through our fieldwork and ground truth annotation experiences, we discovered substantial heterogeneity in safety-related architectural features, spanning styles, dimensions, and occurrences across different feature types and school buildings. A notable example is the abundance of “door” objects compared to other architectural features. In contrast, architectural features like “fire extinguisher” and “id lock” (smart ID card reader) were found to be rare in our school building testbeds (Fig. 6). To enhance the object detection model’s performance, we conducted an exhaustive search for online images for each identified safety-related architectural feature class. While this operation increased the overall volume of the training data, significant discrepancies in training samples persisted across different school buildings and certain architectural feature classes.

Fig. 6. Statistics of the training data augmentation by data source (in color) and architectural feature class (x-axis). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

For training and validation, we employed the following strategy: since both the RWH and SEC school buildings had two floor levels surveyed, images from one level constituted the training dataset and images from the other served as the validation dataset. As RUSL and RUZM are single-story structures, we randomly allocated 80% of each building's images for training and 20% for validation. Furthermore, all images provided to the YOLOv5 model were further divided into subsets for model training (80%) and testing (20%). In summary, the YOLOv5 model was trained with a total of 3166 annotated architectural features across 2086 images.
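The split strategy can be expressed, for example, as in the sketch below. The directory layout, file extensions, and the assignment of particular floor levels to training versus validation are assumptions (the validation levels shown here simply mirror the levels reported in the evaluation tables); this is not the authors' script.

```python
import random
from pathlib import Path

random.seed(0)
train, val = [], []

# Multi-story testbeds: hold out one surveyed level entirely for validation.
train += list(Path("images/RWH_lvl4").glob("*.jpg"))
val   += list(Path("images/RWH_lvl3").glob("*.jpg"))
train += list(Path("images/SEC_lvl1").glob("*.jpg"))
val   += list(Path("images/SEC_lvl2").glob("*.jpg"))

# Single-story testbeds: random 80/20 split per building.
for building in ("RUSL", "RUZM"):
    imgs = sorted(Path(f"images/{building}").glob("*.jpg"))
    random.shuffle(imgs)
    cut = int(0.8 * len(imgs))
    train += imgs[:cut]
    val += imgs[cut:]
```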

4.3. Frame-based glass architectural feature extraction

We confirmed the presence of glass objects in all testbeds and noted that glass architectural features are less common in the SEC, RUSL, and RUZM school buildings than in RWH. Consequently, the results of glass frame extraction derive primarily from the RWH building. Qualitatively, the skeletonized glass frames exhibit rounded corners and may have some missing points. In a quantitative analysis of three glass frames extracted from a panoramic image, the results indicate an average localization difference of 0.115 m (min 0.084 m; max 0.137 m), an angular difference of 5.61° (min 2.49°; max 7.26°), and a surface area difference of 0.264 m2 (min 0.241 m2; max 0.276 m2) against the ground truth (Fig. 7).

Fig. 7. Qualitative and quantitative evaluation for glass facade extraction results. Masks (light blue shades) of glass facades were inferred by SAM using bounding boxes (yellow rectangles). The 3D statistical graphs illustrate the skeletonized glass frame points for both extracted results (blue dots) and ground truth (red dots). (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

In addition to the specific examples of glass frame extraction, a total of 49 frames were extracted in the RWH testbed. These frames exhibited a median localization error of 0.1 m, a median angular error of 2.3°, and a median surface area error of 0.26 m2. After excluding outliers, the overall means of the localization error, angular error, and surface area error were 0.09 m (std = ±0.07 m), 3.14° (std = ±3.02°), and 0.27 m2 (std = ±0.26 m2), respectively, closely aligning with their respective medians (Fig. 8).

Fig. 8. Quantitative statistics for glass facade extraction results.

Given that our dataset was derived from static scans, each with a known location of the scanner’s optic center, we conducted a t-test to explore whether the scanner’s position influenced the glass frame extraction results, using two metrics: the observation distance and the orientation between the scanner and the ground truth. The observation distance is represented by the 3D Euclidean distance (d) between the scanner’s optic center and the centroid of the glass frame, while θ denotes the 2D angle formed between the scanner’s observation orientation and the glass surface orientation in the xy-plane (top view), since the scanner was maintained at a constant height throughout the scanning activities (Fig. 9).

Fig. 9. A top view illustration for the glass facade position and orientation with respect to the scanner.

The t-test results indicate significant correlations (p-value < 0.05) between the accuracy of the glass frame extraction and both the scanner’s observation distance and its observation angle. Specifically, the scanner’s observation angle is negatively correlated with the localization error (t-statistic = −11.107), the angular error (t-statistic = −8.921), and the surface area error (t-statistic = −6.750). In contrast, the scanner’s observation distance shows a positive correlation with the angular error (t-statistic = 2.093) and the surface area error (t-statistic = 2.147), but a negative correlation with the localization error (t-statistic = −8.844) (Table 2). A sketch of this analysis follows Table 2.

Table 2.

T-test results between the error metrics from extracted glass facades with respect to the scanner’s position and orientation.

Error Metrics Observation Distance (d) Observation Angle (θ)
t-statistic p-value t-statistic p-value

Localization Error −8.844 <0.001 *** −11.107 <0.001 ***
Angular Error 2.093 0.039 * −8.921 <0.001 ***
Surface Area Error 2.147 0.034 * −6.750 <0.001 ***

Significance Level: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05.
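As referenced above, the per-facade analysis behind Table 2 can be reproduced, for example, with a simple linear regression whose slope t-statistic and p-value quantify the relationship between each error metric and the scanner's observation distance or angle. This is an assumed analysis sketch, not the authors' statistical code.

```python
from scipy import stats

def slope_t_test(predictor, error_metric):
    """predictor: per-facade observation distance d or angle theta; error_metric: per-facade error."""
    res = stats.linregress(predictor, error_metric)
    t_stat = res.slope / res.stderr        # t-statistic of the regression slope
    return t_stat, res.pvalue              # p-value tests the null hypothesis of zero slope
```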

4.4. Non-glass architectural feature extraction

Our object detection results illustrate significant variations across different architectural classes and among various school building testbeds. The statistical analysis, combined with our data preparation insights (Fig. 6), highlights that certain classes like “door” and “exit sign” are the most prevalent architectural features across all school building testbeds. Specifically, the mAcc values for these two feature classes (“door”: 89.51%; “exit sign”: 87.30%) notably surpass the median mAcc (64.78%). In the SEC building, the detection rates for the “door” (100%) and “exit sign” (96.43%) object classes stand out significantly higher than in other testbeds. Following SEC, the RWH building displays the second-highest occurrences and detection rates for “door” (90.91%) and “exit sign” (86.67%), respectively. The “fire alarm” architectural feature class attains the highest overall detection rate (97.37%), with almost all individual “fire alarms” being successfully detected in our validation dataset. In addition to the aforementioned object classes, other categories exhibiting consistently high detection rates include “emergency light” (80.00%) and “cctv” (75.00%) despite the absence of these two architectural feature classes in the validation datasets of two school buildings. In stark contrast, several classes show lower detection rates, such as “door exit push” (54.55%), “smoke detector” (41.67%), “fire alarm switch” (37.50%), “fire extinguisher” (36.84%), and “id card reader” (25%). After summarizing the total number of all the detected and ground truth architectural features, our model achieves an overall detection rate of 77.35% (see Table 3).

Table 3.

Evaluation results for object detection by school building and object class.

mAcc (%) (Detected/Total) SEC lvl2 RWH lvl3 RUSL RUZM Overall by Object Class

Fire extinguisher 0.00 (0/3) 33.33 (2/6) 0.00 (0/4) 66.67 (4/6) 36.84
CCTV Na 33.33 (1/3) Na 84.62 (11/13) 75.00
ID card reader Na 25.00 (2/8) Na Na 25.00
Door exit push 80.00 (4/5) 100.00 (6/6) Na 18.18 (2/11) 54.55
Door 100.00 (71/71) 90.91 (40/44) 76.47 (13/17) 70.00 (21/30) 89.51
Exit sign 96.43 (27/28) 86.67 (13/15) 100.00 (4/4) 68.75 (11/16) 87.30
Fire alarm 100.00 (4/4) 100.00 (17/17) 66.67 (2/3) 100.00 (14/14) 97.37
Fire alarm switch 66.67 (2/3) 66.67 (2/3) 100.00 (1/1) 11.11 (1/9) 37.50
Emergency light 71.43 (5/7) Na Na 83.33 (15/18) 80.00
Smoke detector 0.00 (0/2) 54.55 (6/11) 42.86 (3/7) 0.00 (0/4) 41.67

Overall by School Building 91.87 78.76 63.89 65.29 77.35

“Na” values indicate the object class was not present in validation datasets of the specific school building.

In the qualitative evaluation of our non-glass object extraction pipeline (Fig. 10), SAM exhibited impressive performance once an architectural feature was detected: it accurately predicted masks for the objects inside the bounding boxes, demonstrating remarkable closeness to the ground truth masks. While some under-segmented areas were observed, SAM consistently inferred object masks that closely resembled their ground truth counterparts. Moreover, by referencing back to the 3D points in the point cloud, our cross-modality dataset facilitated the retrieval of points highly consistent with the SAM-predicted image masks.

Fig. 10. Qualitative evaluation for the non-glass object extraction procedure.

In the quantitative assessment of our non-glass object segmentation procedure (Table 4), our framework demonstrates consistently competitive segmentation performance across all architectural feature classes (min mIoU = 68.5%; max mIoU = 92.3%), despite some variation in mIoU scores. Summarizing the segmentation performance by school building reveals mIoU values consistently close to the overall mIoU. However, certain architectural feature classes exhibit relatively lower mIoU scores, notably “emergency light” (65%, std = ±25%) and “door exit push” (65%, std = ±32%); the higher standard deviations in mIoU for these classes indicate greater variation in segmentation performance within them. Furthermore, comparing mIoU values across feature classes reveals that object classes with relatively low mIoU are generally characterized by small sizes and proximity to ceilings (e.g., “exit sign,” “cctv”).

Table 4.

Point cloud segmentation evaluation results by school building and object class.

mIoU (%) SEC lvl2 RWH lvl3 RUSL RUZM Overall by Object Class

Fire extinguisher Na 93% (±3%) Na 87% (±9%) 90.0%
CCTV Na 84% (±5%) Na 65% (±16%) 74.5%
ID card reader Na 85% (±1%) Na Na 85.0%
Door exit push 84% (±5%) 72% (±19%) Na 56% (±32%) 70.7%
Door 94% (±6%) 90% (±14%) 87% (±13%) 94% (±4%) 91.3%
Exit sign 78% (±14%) 73% (±14%) 89% (±5%) 86% (±7%) 81.5%
Fire alarm 86% (±7%) 88% (±7%) 81% (±10%) 81% (±9%) 84.0%
Fire alarm switch 92% (±1%) 90% (±1%) 90% (~) 97% (~) 92.3%
Emergency light 72% (±13%) Na Na 65% (±25%) 68.5%
Smoke detector Na 81% (±8%) 76% (±12%) Na 78.5%

Overall by School Building 84.3% 84.0% 84.6% 78.9% 83.0%

The “~” sign indicates unavailable standard deviation because only one object was detected in the specific object class within the testing datasets.

Our qualitative evaluation results highlight that even a small glass surface embedded in an object, such as a door, introduces a substantial amount of noise into the extracted target object, resulting in notable localization errors. The presence of this noisy data makes it difficult for certain statistical noise reduction methods, such as statistical outlier removal, to effectively denoise the extracted objects. In contrast, identifying and segmenting the glass surfaces and removing the noisy background-scene points leads to a significant reduction in localization errors in our proposed framework (Fig. 11).

Fig. 11. Quantitative evaluation of the extraction results for glass-embedded architectural features.

In our quantitative evaluation of the extraction of architectural features’ spatial information (Table 5), the analysis encompasses all feature classes and extends to specific sub-categories of “door” objects, in which glass surfaces may be embedded. The statistical examination reveals minimal and closely clustered localization errors for “doors with no glass surface” (0.034 m) and for “doors” from which the embedded glass surfaces have been removed (0.019 m) using our pipelines. In contrast, retaining the background-scene noise without removal results in significantly larger localization errors, exceeding 0.46 m within the “door” object class. This statistical assessment aligns consistently with the qualitative analysis presented in Fig. 11. Moreover, among all architectural feature classes, the “emergency light” (0.052 m) and “door exit push” (0.055 m) feature classes display relatively higher localization errors, showing a high degree of consistency with the segmentation results within their respective feature classes. Summarizing the localization errors across school building testbeds, we observe remarkable consistency in these errors. Lastly, our architectural feature extraction framework achieves an overall spatial error of 0.022 m in localizing our target objects in the point cloud.

Table 5.

Quantitative evaluation of the localization results for non-glass building assets by school building and object class.

Localization Error (m) SEC lvl2 RWH lvl3 RUSL RUZM Overall by Object Class

Fire extinguisher Na 0.004 (±0.002) Na 0.010 (±0.011) 0.007
CCTV Na 0.031 (±0.025) Na 0.031 (±0.042) 0.031
ID card reader Na 0.006 (±0.004) Na Na 0.006
Door exit push 0.023 (±0.011) 0.025 (±0.019) Na 0.118 (±0.088) 0.055
Door
Embedded glass removed 0.026 (±0.032) 0.031 (±0.024) 0.008 (±0.004) 0.009 (±0.007) 0.019
Not glass-embedded 0.032 (±0.049) 0.075 (±0.151) 0.023 (±0.013) 0.006 (±0.005) 0.034
Without glass removal 0.222 (±0.474) 1.424 (±1.878) 0.199 (±0.261) 0.012 (±0.001) 0.464
Exit sign 0.025 (±0.031) 0.099 (±0.169) 0.005 (±0.004) 0.004 (±0.002) 0.033
Fire alarm 0.008 (±0.005) 0.015 (±0.017) 0.012 (±0.007) 0.004 (±0.003) 0.010
Fire alarm switch 0.002 (±0.001) 0.004 (±0.002) 0.001 (~) 0.001 (~) 0.002
Emergency light 0.078 (±0.065) Na Na 0.025 (±0.028) 0.052
Smoke detector Na 0.037 (±0.043) 0.006 (±0.002) Na 0.022

Overall by School Building 0.027 0.028 0.006 0.025 0.022

The “~” sign indicates unavailable standard deviation because only one object was detected in the specific object class within the testing dataset.

5. Discussion

In contrast to prior frame-based techniques for window extraction, which rely on point cloud normal computations to identify outer building facades and void areas (Wang et al., 2011; Pu et al., 2009), the characteristics of glass facades within indoor environments, typically situated between floors and ceilings, present a more intricate scenario. Our frame-based glass facade extraction method identifies frame points (though not always the actual glass frames) within SAM-predicted masks in 2D images, eliminating the need to calculate frame point normals. Compared with two earlier studies on frameless glass extraction, our approach shows comparable results in glass surface localization but outperforms them in estimating glass orientation (Tibebu et al., 2021; Zhao et al., 2020) (Table 6). Utilizing a static scanner for each scan introduces instances where the inner glass frame (the actual frame connected to the glass) might be obscured by the outer frame relative to the scanner’s position, potentially introducing inconsistencies. Section 4.3 delves into our analysis, revealing significant correlations between these errors and the scanner’s observation distance and angle. For instance, extracted glass surfaces farther from the scanner tend to have lower localization error, yet angular error increases with smaller scanner observation angles. The reasons behind these correlations remain unclear and merit further investigation in our future work.

Table 6.

Glass facade localization results compared with previous studies.

Approach Localization Error (m) Angular Error (°)

Tibebu et al. (Tibebu et al., 2021) 0.0500 Na
Zhao et al. (Zhao et al., 2020) 0.0429 5.31
Ours 0.09 3.14

For non-glass building assets, we discern significant heterogeneity in dimensions, styles, and occurrences across object classes, and we hypothesize that this diversity introduces notable inconsistencies into our non-glass object detection procedure. However, in most cases, the prevalence of certain architectural features within as-built environments is dictated by their functions. For example, “door” and “exit sign” architectural features are ubiquitous in educational facilities, though exceptions may arise based on the specific functions of the school buildings; in our case, RUSL is a laboratory school building with fewer classrooms, while RUZM, being a museum, is divided into multiple halls to house exhibits. Conversely, safety-related architectural features like “fire extinguisher” and “alarm switch” serve straightforward emergency response functions. Our field experience revealed the absence of “emergency light” features in the RWH testbed, which instead relies on an uninterruptible power supply (UPS) system for energy management during regular operations and for emergency illumination (Mišák et al., 2011; Salata et al., 2015). The significantly lower occurrence of some features, and the rarity of interaction with them, is a documented challenge (Kostoeva et al., 2019) that makes it difficult to create a large and robust training dataset for automatic object detection. While we aimed to enrich the training datasets with additional online images, inconsistencies persisted in our detection results. Our study achieved an overall detection accuracy of 73.35%, comparable to previous studies in the same research domain (Chen et al., 2021; Zhang et al., 2022, 2023b). Future enhancements in our object detection results can be expected through an improved training dataset as we extend our approach to larger-scale implementations in school buildings. Importantly, our framework’s object detection process relies solely on 2D images and operates independently of the segmentation process. This characteristic enhances scalability and efficiency in terms of training data acquisition and annotation compared to 3D datasets, showcasing the flexibility and future potential of our methodology.

When comparing our work with previous studies that adopt an indirect point cloud segmentation approach (Table 7), these approaches generally fall into two categories. The first category performs object detection and segmentation on a series of 2D images and then uses multi-view 3D reconstruction to propagate the 2D semantic labels to 3D results (Murtiyoso et al., 2021, 2022; Genova et al., 2021). While these approaches may yield satisfactory 2D results, the reconstruction process often becomes a bottleneck for achieving high-quality 3D label propagation (mIoU < 70%). The second category collects 2D images and 3D point cloud data with separate systems and requires additional procedures to align the data across modalities (Pellis et al., 2022; Zhang et al., 2018; Wang et al., 2019b; Mascaro et al., 2021); the accuracy of the 2D-to-3D data or system alignment then significantly impacts the precision of the 3D segmentation outcomes. Our framework falls into this second category, but in comparison with similar studies, it achieves superior 3D data segmentation performance (mIoU = 83%). In our data projection process, the general camera model not only establishes bi-directional 2D-pixel-to-3D-point projections but also does so with minimal loss of accuracy. Consequently, the resulting cross-modality dataset allows us to transfer SAM's state-of-the-art 2D segmentation and generalization performance seamlessly to the 3D point cloud. As a result, our proposed approach has demonstrated remarkable generalizability and applicability in 3D object extraction across all our school building testbeds.
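
To illustrate the kind of bi-directional pixel-to-point correspondence described above, the sketch below projects scan points into a virtual pinhole camera and copies per-pixel mask labels (e.g., from SAM) back onto the 3D points through that correspondence. The pinhole intrinsics and the absence of occlusion handling are deliberate simplifications of our general camera model, included only to convey the idea.

```python
import numpy as np

def project_points(points, K, R, t, width, height):
    """Project (N, 3) world points into a virtual camera.
    Returns integer pixel coordinates and a mask of points landing in the image."""
    cam = (R @ points.T + t.reshape(3, 1)).T        # world -> camera frame
    px = np.full((len(points), 2), -1, dtype=int)
    in_front = cam[:, 2] > 1e-6
    uv = (K @ cam[in_front].T).T                    # perspective projection
    px[in_front] = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)
    valid = in_front & (px[:, 0] >= 0) & (px[:, 0] < width) \
                     & (px[:, 1] >= 0) & (px[:, 1] < height)
    return px, valid

def propagate_labels(points, seg_mask, K, R, t):
    """Copy 2D per-pixel labels (e.g., SAM masks) onto 3D points."""
    h, w = seg_mask.shape
    px, valid = project_points(points, K, R, t, w, h)
    labels = np.full(len(points), -1, dtype=int)    # -1 = unlabeled / out of view
    labels[valid] = seg_mask[px[valid, 1], px[valid, 0]]
    return labels
```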

Table 7.

Comparison of detection and segmentation results among frameworks implementing indirect point cloud segmentation.

Approach | Architecture | Modality of Datasets | 2D-to-3D Label Propagation Method | Object Detection mAcc (%) | Segmentation mIoU (%)

Murtiyoso et al. (Murtiyoso et al., 2021) | DeepLabv3 + dense matching | Image | Point cloud generated from multi-view image fusion | 79.81 | 58.9
Murtiyoso et al. (Murtiyoso et al., 2022) | DeepLabv3 + dense matching | Image | Point cloud generated from multi-view image fusion | 83.67 | 67.3
Genova et al. (Genova et al., 2021) | 2D3DNet + pseudo 3D | Image | Point cloud generated from multi-view image fusion | N/A | 59.5
Pellis et al. (Pellis et al., 2022) | DeepLabv3 | Image, Point cloud | Camera view and point cloud data alignment | 88.67 | 72.0
Zhang R. et al. (Zhang et al., 2018) | VGG16 | Image, Point cloud | Camera view and point cloud data alignment | 74.98 | 64.2
Wang et al. (Wang et al., 2019b) | R-CNN + Diffuser | Image, Point cloud | Camera view and point cloud data alignment | 79.85 | 71.9
Mascaro et al. (Mascaro et al., 2021) | Oracle + Diffuser | Image, Point cloud | Camera view and point cloud data alignment | N/A | 61.2
Ours | YOLOv5 + SAM | Image, Point cloud | Camera view and point cloud data alignment | 73.35 | 83.0

Our cross-modality dataset serves as a valuable supplement that overcomes a key limitation of laser scanning, namely the difficulty of capturing transparent objects, and thereby helps reduce localization errors associated with transparent architectural features or components. Consequently, our approach yields extraction results that enable a more efficient and accurate quantification of the spatial locations of safety-related architectural features. In comparison to a study utilizing a moving robotic platform, which reported a mean localization error of 0.31 m for facility assets (Wei et al., 2019), our research shows similar surveying efficiency with a significantly lower overall localization error of 0.04 m. However, we acknowledge that certain building asset classes exhibit lower localization accuracy. Some "fire extinguisher" units incorporate reflective materials, introducing variance and irregularity in the spatial distribution of laser footprints for this feature category, and "exit sign" features made of frameless transparent materials pose challenges for accurate spatial localization with our frame-based glass extraction approach. Addressing these challenges for reflective and frameless transparent objects will be a focal point of our future research. Furthermore, our glass frame extraction approach holds promise for room space segmentation methods that require creating watertight spaces or voxelizing indoor free spaces within unstructured point cloud data (Yang et al., 2019b; Ambruş et al., 2017; Jung et al., 2017), particularly for large glass facades within modern indoor spaces.
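
For completeness, a minimal sketch of how the localization error quoted above could be computed is given below, assuming each extracted feature has already been matched to its ground-truth counterpart; the matching step itself (e.g., per-class nearest-neighbor association) is omitted, and the coordinates are made-up toy values.

```python
import numpy as np

def mean_localization_error(extracted, ground_truth):
    """Mean Euclidean distance (m) between matched feature centroids.
    Both arrays are (N, 3), expressed in the same coordinate frame, with
    row i of each referring to the same physical feature."""
    return float(np.mean(np.linalg.norm(extracted - ground_truth, axis=1)))

# Toy example (coordinates in metres):
ext = np.array([[1.02, 3.98, 1.51], [7.11, 2.05, 1.48]])
gt = np.array([[1.00, 4.00, 1.50], [7.05, 2.00, 1.50]])
print(f'mean localization error: {mean_localization_error(ext, gt):.3f} m')
```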

6. Conclusion

In this study, we devised an AI-assisted methodology for the swift generation of spatial inventories of safety-related architectural features within school buildings from laser scan data. First, we introduced a theoretical virtual camera model as the 3D-to-2D data projection method, establishing precisely aligned cross-modality datasets from high-definition laser scan data. This projected dataset not only overcame limitations inherent in individual 2D and 3D datasets but also facilitated the transfer of segmentation and generalization performance from the state-of-the-art Segment Anything model to our point cloud segmentation process. We further extended the methodology to tackle the challenge of detecting and localizing glass objects within buildings from laser scan data. Our results demonstrate promising consistency between the extracted target objects and the ground truth. The approach holds potential for building owners to create comprehensive spatial inventories of safety-related architectural features, supporting simulation studies aimed at enhancing facility safety during extreme events.

However, the identified limitations warrant attention in future work. First, the overall performance of the proposed framework is constrained by the object detection process, which may be influenced by factors such as the limited number of training samples and the high heterogeneity among architectural feature classes. Our primary focus for future work is therefore to extend the application of our approach to more school buildings; doing so will increase the training sample size and help train a more robust object detection model. Second, because static laser scans were the primary means of data collection in our work, the collected data inevitably contain gaps caused by occlusions, and some architectural features may fall within these gaps. In addition, differences in observation distances and angles across static scans may introduce inconsistencies into the detection and segmentation processes. Given that both the object detection and segmentation components of our framework have real-time capabilities, deploying the current framework on a moving platform and extending it to a real-time application holds promise for minimizing these inconsistencies and warrants further investigation in our future endeavors.

Acknowledgement

This material is based upon work supported by the U.S. Department of Homeland Security under Grant Award 22STESE00001-03-02. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.

Footnotes

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Chong Di: Writing – review & editing, Writing – original draft, Validation, Methodology, Investigation, Formal analysis. Jie Gong: Writing – review & editing, Writing – original draft, Supervision, Project administration, Methodology, Investigation, Funding acquisition, Conceptualization.

Data availability

Data will be made available on request.

References

  1. Abdirad H, Dossick CS, 2020. Rebaselining asset data for existing facilities and infrastructure. J. Comput. Civ. Eng. 34 (1), 05019004. [Google Scholar]
  2. Aker J, 2008. The Best Defense. Comprehensive school security, Buildings. [Google Scholar]
  3. Ambruş R, Claici S, Wendt A, 2017. Automatic room segmentation from unstructured 3-D data of indoor environments. IEEE Rob. Autom. Lett. 2 (2), 749–756. [Google Scholar]
  4. Balamurugan A, Zakhor A, 2019. Online learning for indoor asset detection. In: 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE. [Google Scholar]
  5. Bochkovskiy A, Wang C-Y, Liao H-YM, 2020. Yolov4: Optimal Speed and Accuracy of Object Detection arXiv preprint arXiv:2004.10934.
  6. Campbell R, 2020. NFPA’s “Structure Fires in Schools”. Available from: https://www.nfpa.org/News-and-Research/Data-research-and-tools/Building-and-Life-Safety/Structure-fires-in-schools.
  7. CHDS, 1970. Shooting Incidents at K-12 Schools. Available from: https://www.chds.us/sssc/data-map/.
  8. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL, 2014. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected Crfs arXiv preprint arXiv:1412.7062. [DOI] [PubMed]
  9. Chen Y, Liu S, Shen X, Jia J, 2019. Fast point r-cnn. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. [Google Scholar]
  10. Chen X-T, Li Y, Fan J-H, Wang R, 2021. RGAM: a novel network architecture for 3D point cloud semantic segmentation in indoor scenes. Inf. Sci. 571, 87–103. [Google Scholar]
  11. Chipley M, Lyon W, Smilowitz R, Williams P, Arnold C, Blewett W, Hazen L, Krimgold F, 2012. Primer to Design Safe School Projects in Case of Terrorist Attacks and School Shootings. Buildings and Infrastructure Protection Series. FEMA-428/BIPS-07/January 2012, 2. US Department of Homeland Security. [Google Scholar]
  12. Daamen W, Hoogendoorn S, 2012a. Emergency door capacity: influence of door width, population composition and stress level. Fire Technol. 48, 55–71. [Google Scholar]
  13. Daamen W, Hoogendoorn S, 2012b. Calibration of pedestrian simulation model for emergency doors by pedestrian type. Transport. Res. Rec. 2316 (1), 69–75. [Google Scholar]
  14. Dare AA, Olusegun OA, Joy-Berachah G, Oluwatobi AD, Taiwo TE, Oluwakayode OS, and Adejinle AJ, Assessment of Active Fire Prevention Strategies in Bells University Senate Building.
  15. Di Filippo A, Sánchez-Aparicio LJ, Barba S, Martín-Jiménez JA, Mora R, González Aguilera D, 2018. Use of a wearable mobile laser system in seamless indoor 3D mapping of a complex historical site. Rem. Sens. 10 (12), 1897. [Google Scholar]
  16. Fang Z, Tan T, Yan J, Lu Q, Pitt M, Hanna S, 2022. Automated portfolio-based strategic asset management based on deep neural image classification. Autom. ConStruct. 142, 104481. [Google Scholar]
  17. Feng Y, Duives DC, Hoogendoorn SP, 2021. Using virtual reality to study pedestrian exit choice behaviour during evacuations. Saf. Sci. 137, 105158. [Google Scholar]
  18. Foster P, Sun Z, Park JJ, Kuipers B, 2013. VisAGGE: visible angle grid for glass environments. In: 2013 IEEE International Conference on Robotics and Automation. IEEE. [Google Scholar]
  19. Genova K, Yin X, Kundu A, Pantofaru C, Cole F, Sud A, Brewington B, Shucker B, Funkhouser T, 2021. Learning 3D semantic segmentation with only 2D image supervision. In: 2021 International Conference on 3D Vision (3DV). IEEE. [Google Scholar]
  20. Girshick R, 2015. Fast r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. [Google Scholar]
  21. Girshick R, Donahue J, Darrell T, Malik J, 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  22. Green MW, 2005. The Appropriate and Effective Use of Security Technologies in US Schools: a Guide for Schools and Law Enforcement Agencies. Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA. [Google Scholar]
  23. He K, Zhang X, Ren S, Sun J, 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), 1904–1916. [DOI] [PubMed] [Google Scholar]
  24. He K, Zhang X, Ren S, Sun J, 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  25. Huang H, Wu S, Cohen-Or D, Gong M, Zhang H, Li G, Chen B, 2013. L1-medial skeleton of point cloud. ACM Trans. Graph. 32 (4), 65:1–65:8. [Google Scholar]
  26. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  27. Huang X, Zhang Y, Ma J, Tian W, Feng R, Zhang Y, Li Y, Guo Y, Zhang L, 2023. Tag2text: Guiding Vision-Language Model via Image Tagging arXiv preprint arXiv:2303.05657.
  28. Ibrahim AM, Kamaruddin MA, Wahid AN, 2022. AN immersive augmented reality system to study the efficiency of dynamic exit signage. IIUM Engineering Journal 23 (1), 200–221. [Google Scholar]
  29. Jackson BA, Diliberti MK, Moore P, Schwartz HL. Teachers' Views on School Safety. [Google Scholar]
  30. Jiang J, Miyagusuku R, Yamashita A, Asama H, 2017. Glass confidence maps building based on neural networks using laser range-finders for mobile robots. In: 2017 IEEE/SICE International Symposium on System Integration (SII). IEEE. [Google Scholar]
  31. Jocher G, Chaurasia A, Stoken A, Borovec J, Kwon Y, Michael K, Fang J, Yifu Z, Wong C, Montes D, 2022. ultralytics/yolov5: V7. 0-yolov5 Sota Realtime Instance Segmentation. Zenodo. [Google Scholar]
  32. Jung J, Stachniss C, Kim C, 2017. Automatic room segmentation of 3D laser data using morphological processing. ISPRS Int. J. Geo-Inf. 6 (7), 206. [Google Scholar]
  33. Kellom K, Nubani L, 2018. One step ahead of active shooters: are our university buildings ready? Buildings 8 (12), 173. [Google Scholar]
  34. Kim J, Chung W, 2016. Localization of a mobile robot using a laser range finder in a glass-walled environment. IEEE Trans. Ind. Electron. 63 (6), 3616–3627. [Google Scholar]
  35. Kim J-H, Kim N, Park YW, Won CS, 2022. Object detection and classification based on YOLO-V5 with improved maritime dataset. J. Mar. Sci. Eng. 10 (3), 377. [Google Scholar]
  36. Kinateder M, Comunale B, Warren WH, 2018. Exit choice in an emergency evacuation scenario is influenced by exit familiarity and neighbor behavior. Saf. Sci. 106, 170–175. [Google Scholar]
  37. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y, 2023. Segment Anything arXiv preprint arXiv: 2304.02643.
  38. Kostoeva R, Upadhyay RK, Sapar Y, Zakhor A, 2019. Indoor 3D interactive asset detection using a smartphone. Int. Arch. Photogram. Rem. Sens. Spatial Inf. 42, 811–817. [Google Scholar]
  39. Krizhevsky A, Sutskever I, Hinton GE, 2012. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25. [Google Scholar]
  40. Kubota J, Sano T, Ronchi E, 2021. Assessing the compliance with the direction indicated by emergency evacuation signage. Saf. Sci. 138, 105210. [Google Scholar]
  41. Kurkela M, Maksimainen M, Julin A, Virtanen J-P, Männistö I, Vaaja MT, Hyyppä H, 2022. Applying photogrammetry to reconstruct 3D luminance point clouds of indoor environments. Architect. Eng. Des. Manag. 18 (1), 56–72. [Google Scholar]
  42. Li L, Li D, Xing X, Yang F, Rong W, Zhu H, 2017. Extraction of road intersections from GPS traces based on the dominant orientations of roads. ISPRS Int. J. Geo-Inf. 6 (12), 403. [Google Scholar]
  43. Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S, 2017. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  44. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg AC, 2016. SSD: single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer. [Google Scholar]
  45. Liu L, Ouyang W, Wang X, Fieguth P, Chen J, Liu X, Pietikäinen M, 2020. Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128, 261–318. [Google Scholar]
  46. Liu H, Sun F, Gu J, Deng L, 2022. Sf-yolov5: a lightweight small object detection algorithm based on improved feature fusion mode. Sensors 22 (15), 5817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Liu S, Zeng Z, Ren T, Li F, Zhang H, Yang J, Li C, Yang J, Su H, Zhu J, 2023. Grounding Dino: Marrying Dino with Grounded Pre-training for Open-Set Object Detection arXiv preprint arXiv:2303.05499.
  48. Long J, Shelhamer E, Darrell T, 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [DOI] [PubMed] [Google Scholar]
  49. Mao J, Shi S, Wang X, Li H, 2023. 3D object detection for autonomous driving: a comprehensive survey. Int. J. Comput. Vis. 1–55. [Google Scholar]
  50. Mascaro R, Teixeira L, Chli M, 2021. Diffuser: multi-view 2d-to-3d label diffusion for semantic scene segmentation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE. [Google Scholar]
  51. Masciotta MG, Sanchez-Aparicio LJ, Oliveira DV, Gonzalez-Aguilera D, 2023. Integration of laser scanning technologies and 360° photography for the digital documentation and management of cultural heritage buildings. Int. J. Architect. Herit. 17 (1), 56–75. [Google Scholar]
  52. Mišák S, Šnobl J, Dostál F, 2011. Power Solutions for Emergency Lighting of Tunnels, Underpasses and Ecoduct.
  53. Mortezapoor S, Schönauer C, Rüggeberg J, Kaufmann H, 2022. Photogrammabot: an autonomous ROS-based mobile photography robot for precise 3D reconstruction and mapping of large indoor spaces for mixed reality. In: 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE. [Google Scholar]
  54. Murtiyoso A, Lhenry C, Landes T, Grussenmeyer P, Alby E, 2021. Semantic segmentation for building façade 3D point cloud from 2D orthophoto images using transfer learning. Int. Arch. Photogram. Rem. Sens. Spatial Inf. Sci. 43, 201–206. [Google Scholar]
  55. Murtiyoso A, Pellis E, Grussenmeyer P, Landes T, Masiero A, 2022. Towards semantic photogrammetry: generating semantically rich point clouds from architectural close-range photogrammetry. Sensors 22 (3), 966. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Natapov A, Parush A, Laufer L, Fisher-Gewirtzman D, 2022. Architectural features and indoor evacuation wayfinding: the starting point matters. Saf. Sci. 145, 105483. [Google Scholar]
  57. Noh H, Hong S, Han B, 2015. Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. [Google Scholar]
  58. Pellis E, Murtiyoso A, Masiero A, Tucci G, Betti M, Grussenmeyer P, 2022. 2D to 3D Label propagation for the semantic segmentation of Heritage building point clouds. In: XXIV ISPRS Congress “Imaging Today, Foreseeing Tomorrow”, Commission II 2022 Edition, 6–11 June 2022, Nice, France. ISPRS. [Google Scholar]
  59. Pu S, Vosselman G, 2009. Knowledge based reconstruction of building models from terrestrial laser scanning data. ISPRS J. Photogrammetry Remote Sens. 64 (6), 575–584. [Google Scholar]
  60. Qi CR, Su H, Mo K, Guibas LJ, 2017. PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  61. Redmon J, Divvala S, Girshick R, Farhadi A, 2016. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  62. Redmon J, Farhadi A, 2017. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  63. Redmon J, Farhadi A, 2018. Yolov3: an Incremental Improvement arXiv preprint arXiv:1804.02767.
  64. Ren S, He K, Girshick R, Sun J, 2015. Faster r-cnn: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28. [DOI] [PubMed] [Google Scholar]
  65. Robers S, Kemp J, Rathbun A, Morgan RE, 2014. Indicators of School Crime and Safety: 2013. NCES 2014–042/NCJ 243299. National Center for Education Statistics. [Google Scholar]
  66. Salata F, Golasi I, Emiliano B, de Lieto Vollaro E, Pagliaro F, Nardecchia F, Gugliermetti F, DE Lieto Vollaro A, 2015. Energy saving and economic return on investments for safety and emergency lighting in road tunnels. In: Atti del Congresso Nazionale 15° CIRIAF-Impronta Ambientale e Sviluppo Sostenibile. Morlacchi Editore. [Google Scholar]
  67. Schneider T, 2001. Newer Technologies for School Security.
  68. Schwartz HL, Ramchand R, Barnes-Proby D, Grant S, Jackson BA, Leuschner K, Matsuda M, Saunders JM, 2016. The Role of Technology in Improving K-12 School Safety. Rand Corporation. [Google Scholar]
  69. Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y, 2013. Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks arXiv preprint arXiv:1312.6229.
  70. Simonyan K, Zisserman A, 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition arXiv preprint arXiv:1409.1556.
  71. Singh B, Davis LS, 2018. An analysis of scale invariance in object detection snip. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  72. Szegedy C, Toshev A, Erhan D, 2013. Deep neural networks for object detection. Adv. Neural Inf. Process. Syst. 26. [Google Scholar]
  73. Tibebu H, Roche J, De Silva V, Kondoz A, 2021. Lidar-based glass detection for improved occupancy grid mapping. Sensors 21 (7), 2263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Wang R, Bach J, Ferrie FP, 2011. Window detection from mobile LiDAR data. In: 2011 IEEE Workshop on Applications of Computer Vision (WACV). IEEE. [Google Scholar]
  75. Wang X, Wang J, 2017. Detecting glass in simultaneous localisation and mapping. Robot. Autonom. Syst. 88, 97–103. [Google Scholar]
  76. Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM, 2019a. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 38 (5), 1–12. [Google Scholar]
  77. Wang BH, Chao W-L, Wang Y, Hariharan B, Weinberger KQ, Campbell M, 2019b. LDLS: 3-D object segmentation through label diffusion from 2-D images. IEEE Rob. Autom. Lett. 4 (3), 2902–2909. [Google Scholar]
  78. Wei Y, Akinci B, 2019. A vision and learning-based indoor localization and semantic mapping framework for facility operations and management. Autom. ConStruct. 107, 102915. [Google Scholar]
  79. Wu W, Ma L, Wang B, Zhang Z, 2021. Research on semantic object measurement algorithm based on object detection. In: International Conference in Communications, Signal Processing, and Systems. Springer. [Google Scholar]
  80. Xia J, Gong J, 2022. Precise indoor localization with 3D facility scan data. Comput. Aided Civ. Infrastruct. Eng. 37 (10), 1243–1259. [Google Scholar]
  81. Xia J, Gong J, 2024. Computer vision based first floor elevation estimation from mobile LiDAR data. Autom. ConStruct. 159, 105258. [Google Scholar]
  82. Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X, 2020. A review of object detection based on deep learning. Multimed. Tool. Appl. 79, 23729–23791. [Google Scholar]
  83. Yang F, Zhou G, Su F, Zuo X, Tang L, Liang Y, Zhu H, Li L, 2019a. Automatic indoor reconstruction from point clouds in multi-room environments with curved walls. Sensors 19 (17), 3798. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Yang F, Li L, Su F, Li D, Zhu H, Ying S, Zuo X, Tang L, 2019b. Semantic decomposition and recognition of indoor spaces with structural constraints for 3D indoor modelling. Autom. ConStruct. 106, 102913. [Google Scholar]
  85. Yang S, Hou M, Li S, 2023. Three-dimensional point cloud semantic segmentation for cultural heritage: a comprehensive review. Rem. Sens. 15 (3), 548. [Google Scholar]
  86. Yin C, Wang B, Gan VJ, Wang M, Cheng JC, 2021. Automated semantic segmentation of industrial point clouds using ResPointNet++. Autom. ConStruct. 130, 103874. [Google Scholar]
  87. Zeiler MD, Fergus R, 2014. Visualizing and understanding convolutional networks. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I 13. Springer. [Google Scholar]
  88. Zhang R, Li G, Li M, Wang L, 2018. Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning. ISPRS J. Photogrammetry Remote Sens. 143, 85–96. [Google Scholar]
  89. Zhang M, Kadam P, Liu S, Kuo CCJ, 2022. GSIP: green semantic segmentation of large-scale indoor point clouds. Pattern Recogn. Lett. 164, 9–15. [Google Scholar]
  90. Zhang Y, Huang X, Ma J, Li Z, Luo Z, Xie Y, Qin Y, Luo T, Li Y, Liu S, 2023a. Recognize Anything: A Strong Image Tagging Model arXiv preprint arXiv: 2306.03514.
  91. Zhang WJ, Su SZ, Hong QQ, Wang BZ, Sun L, 2023b. Long short-distance topology modelling of 3D point cloud segmentation with a graph convolution neural network. IET Comput. Vis. 17 (3), 251–264. [Google Scholar]
  92. Zhao X, Yang Z, Schwertfeger S, 2020. Mapping with reflection-detection and utilization of reflection in 3d lidar scans. In: 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR). IEEE. [Google Scholar]
  93. Zhou Y, Tuzel O, 2018. Voxelnet: end-to-end learning for point cloud based 3d object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  94. Zhu R, Lucas GM, Becerik-Gerber B, Southers EG, 2020a. Building preparedness in response to active shooter incidents: results of focus group interviews. Int. J. Disaster Risk Reduc. 48, 101617. [Google Scholar]
  95. Zhu R, Lin J, Becerik-Gerber B, Li N, 2020b. Influence of architectural visual access on emergency wayfinding: a cross-cultural study in China, United Kingdom and United States. Fire Saf. J. 113, 102963. [Google Scholar]
