ZURU Tech is on a mission to change the way we build, from town houses and hospitals to office towers, schools, apartment blocks, and more. Dreamcatcher is a user-friendly platform developed by ZURU that allows users with any level of experience to collaborate in the building design and construction process. With the simple click of a button, an entire building can be ordered, manufactured and delivered to the construction site for assembly.
ZURU collaborated with the AWS Generative AI Innovation Center and AWS Professional Services to implement a more accurate text-to-floor plan generator using generative AI. With it, users can describe the building they want to design using natural language. For example, instead of designing the foundation, walls, and key aspects of a building from scratch, a user could enter, “Create a house with three bedrooms, two bathrooms, and an outdoor space for entertainment.” The solution would generate a unique floor plan within the 3D design space, allowing users with a non-technical understanding of architecture and construction to create a well-designed house.
In this post, we show you why a solution using a large language model (LLM) was chosen. We explore how model selection, prompt engineering, and fine-tuning can be used to improve results. And we explain how the team used an evaluation framework to iterate quickly, with key services such as Amazon Bedrock and Amazon SageMaker.
Understanding the challenge
The foundation for generating a house within Dreamcatcher’s 3D building system is to first confirm we can generate a 2D floor plan based on the user’s prompt. The ZURU team found that generating 2D floor plans, such as the one in the following image, using different machine learning (ML) techniques requires success across two key criteria.
First, the model must understand rooms, the purpose of each room, and their orientation to one another within a two-dimensional vector system. This can also be described as how well the model adheres to the features described in a user’s prompt. Second, there is a mathematical component: making sure rooms adhere to criteria such as specific dimensions and floor space. To be certain that they were on the right track and to allow for fast R&D iteration cycles, the ZURU team created a novel evaluation framework that measures the output of different models by scoring accuracy across these two key metrics.
The ZURU team initially looked at using generative adversarial networks (GANs) for floor plan generation, but experimentation with a GPT-2 LLM had positive results based on the test framework. This reinforced the idea that an LLM-based approach could provide the required accuracy for a text-to-floor plan generator.
Improving the results
To improve on the results of the GPT-2 model, we worked together and defined two further experiments. The first was a prompt engineering approach: using Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock, the team was able to evaluate the impact of a leading proprietary model with contextual examples included in the prompts. The second approach focused on fine-tuning Llama 3 8B variants to evaluate the improvement in accuracy when the model weights are directly adjusted using high-quality examples.
Dataset preparation and analysis
To create the initial dataset, floor plans from thousands of houses were gathered from publicly available sources and reviewed by a team of in-house architects. To streamline the review process, the ZURU team built a custom application with a simple yes/no decision mechanism similar to those found in popular social matching applications, allowing architects to quickly approve plans compatible with the ZURU building system or reject those with disqualifying features. This intuitive approach significantly accelerated ZURU’s evaluation process while maintaining clear decision criteria for each floor plan.
To further enhance this dataset, we began with careful dataset preparation, filtering out low-quality data (30%) by evaluating the metric score of the ground truth dataset. With this filtering mechanism, data points not achieving 100% accuracy on instruction adherence were removed from the training dataset. This data preparation technique helped to improve the efficiency and quality of the fine-tuning and prompt engineering by more than 20%.
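As an illustration, the following is a minimal sketch of this kind of quality filter, assuming each record already carries an instruction-adherence score computed by the evaluation framework (the file layout and field names are hypothetical):

```python
import json

def filter_training_data(input_path: str, output_path: str) -> None:
    """Keep only floor plan records whose ground-truth metric shows
    perfect instruction adherence (hypothetical field names)."""
    with open(input_path) as f:
        records = [json.loads(line) for line in f]

    # Drop records that do not fully satisfy the prompt they are paired with.
    kept = [r for r in records if r.get("instruction_adherence", 0.0) == 1.0]

    with open(output_path, "w") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")

    print(f"Kept {len(kept)} of {len(records)} records "
          f"({100 * (1 - len(kept) / len(records)):.1f}% filtered out)")
```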
During our exploratory data analysis, we found that the dataset contained prompts that could match multiple floor plans, as well as floor plans that could match multiple prompts. By moving all related prompt and floor plan combinations to the same data split (either training, validation, or testing), we were able to prevent data leakage and promote robust evaluation.
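A minimal sketch of this kind of group-aware split is shown below; it treats prompts and floor plans as nodes in a bipartite graph, finds connected components with a simple union-find, and assigns whole components to a single split (the record structure and split ratios are illustrative assumptions):

```python
import random
from collections import defaultdict

def find(parent, x):
    """Union-find root lookup with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_aware_split(pairs, ratios=(0.8, 0.1, 0.1), seed=42):
    """pairs: list of (prompt_id, plan_id). Returns a dict of split name ->
    list of pair indices, keeping every pair that shares a prompt or a
    floor plan in the same split to avoid leakage."""
    parent = {}
    for prompt_id, plan_id in pairs:
        for node in (("p", prompt_id), ("f", plan_id)):
            parent.setdefault(node, node)
        a, b = find(parent, ("p", prompt_id)), find(parent, ("f", plan_id))
        parent[a] = b  # union the prompt and plan nodes

    components = defaultdict(list)
    for i, (prompt_id, _) in enumerate(pairs):
        components[find(parent, ("p", prompt_id))].append(i)

    groups = list(components.values())
    random.Random(seed).shuffle(groups)

    # Split by component (not by pair) so related combinations stay together.
    cuts = [int(len(groups) * ratios[0]), int(len(groups) * (ratios[0] + ratios[1]))]
    splits = {"train": [], "val": [], "test": []}
    for g_idx, group in enumerate(groups):
        name = "train" if g_idx < cuts[0] else "val" if g_idx < cuts[1] else "test"
        splits[name].extend(group)
    return splits
```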
Prompt engineering approach
As part of our approach, we implemented dynamic example matching for few-shot prompting, which differs from traditional static sampling methods. Combining this with prompt decomposition, we were able to increase the overall accuracy of the generated floor plan content.
With a dynamic few-shot prompting methodology, we retrieve the most relevant examples from a high-quality dataset at run time, based on the details of the input prompt, and provide them as part of the prompt to the generative AI model.
The dynamic few-shot prompting approach is further enhanced by prompt decomposition, where we break down complex tasks into smaller, more manageable components to achieve better results from language models. By decomposing queries, each component can be optimized for its specific purpose. We found that combining these methods resulted in improved relevancy in example selection and lower latency in retrieving the example data, leading to better performance and higher quality results.
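As an illustrative sketch (the helper name and record fields are hypothetical), dynamic few-shot prompting amounts to formatting the retrieved examples into the generation prompt at run time:

```python
def build_floor_plan_prompt(user_request: str, examples: list[dict]) -> str:
    """Assemble a few-shot prompt from dynamically retrieved examples.
    `examples` is assumed to hold dicts with 'prompt' and 'floor_plan' keys."""
    shots = "\n\n".join(
        f"Request: {ex['prompt']}\nFloor plan:\n{ex['floor_plan']}" for ex in examples
    )
    return (
        "You generate 2D floor plans from natural language requests.\n\n"
        f"Here are similar examples:\n\n{shots}\n\n"
        f"Request: {user_request}\nFloor plan:"
    )
```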
Prompt engineering architecture
The workflow and architecture implemented for prototyping, shown in the following figure, demonstrate a systematic approach to AI model optimization. When a user query such as “Build me a house with three bedrooms and two bathrooms” is entered, the workflow follows these steps:
We use prompt decomposition to execute three smaller tasks that retrieve highly relevant examples matching the features of the house the user has requested
We use the relevant examples and inject them into the prompt to perform dynamic few-shot prompting to generate a floor plan
We use the reflection technique to ask the generative AI model to self-reflect and assess whether the generated content adheres to our requirements
Deep dive on workflow and architecture
The first step in our workflow is to understand the unique features of the house, which we can use as search criteria to find the most relevant examples in the subsequent steps. For this step, we use Amazon Bedrock, which provides a serverless API-driven endpoint for inference. From the wide range of generative AI models offered by Amazon Bedrock, we choose Mistral 7B, which provides the right balance between cost, latency, and accuracy required for this small decomposed step.
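A minimal sketch of this decomposition call using the Amazon Bedrock Converse API might look like the following; the model ID and prompt wording are illustrative, not ZURU’s production prompts:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def extract_house_features(user_query: str) -> str:
    """Ask a small, fast model to pull out searchable features
    (room types, counts, special requirements) from the user's request."""
    response = bedrock.converse(
        modelId="mistral.mistral-7b-instruct-v0:2",
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Extract the key features of the requested house as a short "
                "comma-separated list (room types, counts, special spaces).\n\n"
                f"Request: {user_query}"
            )}],
        }],
        inferenceConfig={"maxTokens": 200, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```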
The second step is to search for the most relevant examples using the unique features we found. We use Amazon Bedrock Knowledge Bases backed by Amazon OpenSearch Serverless as a vector database to implement metadata filtering and hybrid search to retrieve the most relevant record identifiers. Amazon Simple Storage Service (Amazon S3) is used to store the dataset, and Amazon Bedrock Knowledge Bases provides a managed solution for vectorizing and indexing the metadata into the vector database.
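The search step can be expressed roughly as follows with the Knowledge Bases Retrieve API; the knowledge base ID, metadata keys, and filter are placeholders, not the project’s actual schema:

```python
kb_client = boto3.client("bedrock-agent-runtime")

def search_similar_floor_plans(features_text: str, bedrooms: int, top_k: int = 5):
    """Hybrid (semantic + keyword) search over indexed floor plan metadata,
    filtered here to plans with the requested number of bedrooms."""
    response = kb_client.retrieve(
        knowledgeBaseId="KB_ID_PLACEHOLDER",
        retrievalQuery={"text": features_text},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "overrideSearchType": "HYBRID",
                "filter": {"equals": {"key": "bedrooms", "value": bedrooms}},
            }
        },
    )
    # Return the record identifiers stored alongside each indexed document
    # (the "record_id" metadata field is an assumption for this sketch).
    return [r["metadata"]["record_id"] for r in response["retrievalResults"]]
```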
In the third step, we retrieve the actual floor plan data by record identifier using Amazon DynamoDB. By splitting the search and retrieval of floor plan examples into two steps, we were able to use purpose-built services: Amazon OpenSearch Serverless for low-latency search and DynamoDB for low-latency key-value data retrieval, leading to optimized performance.
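The retrieval step reduces to simple key-value lookups, sketched below with an assumed table name and partition key:

```python
# Hypothetical table and key names for illustration only.
table = boto3.resource("dynamodb").Table("floor-plan-examples")

def get_floor_plans(record_ids: list[str]) -> list[dict]:
    """Fetch the full floor plan documents by their record identifiers."""
    items = []
    for record_id in record_ids:
        response = table.get_item(Key={"record_id": record_id})
        if "Item" in response:
            items.append(response["Item"])
    return items
```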
After retrieving the most relevant examples for the user’s prompt, in step four we use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock, a model with leading benchmarks in deep reasoning and mathematics, to generate the new floor plan.
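Reusing the Bedrock client and the prompt helper from the earlier sketches, the generation call could look roughly like this (model ID and inference settings are illustrative):

```python
def generate_floor_plan(user_query: str, examples: list[dict]) -> str:
    """Generate a candidate floor plan, grounding the model with the
    dynamically retrieved examples via build_floor_plan_prompt."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": build_floor_plan_prompt(user_query, examples)}],
        }],
        inferenceConfig={"maxTokens": 4000, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```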
Finally, in step five, we implement reflection. We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock again and pass the original prompt, instructions, examples, and newly generated floor plan back with a final instruction for the model to reflect on its generated floor plan, double-check it, and correct any mistakes.
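A hedged sketch of this reflection pass follows, again reusing the Bedrock client from above; the prompt wording is an assumption, not the production instruction set:

```python
def reflect_on_floor_plan(user_query: str, examples: list[dict], draft_plan: str) -> str:
    """Self-check pass: the model reviews its draft against the original
    request and the retrieved examples, then returns a corrected plan."""
    shots = "\n\n".join(ex["floor_plan"] for ex in examples)
    reflection_prompt = (
        f"Original request: {user_query}\n\n"
        f"Reference examples:\n{shots}\n\n"
        f"Draft floor plan:\n{draft_plan}\n\n"
        "Verify that the draft satisfies every requested feature and that room "
        "dimensions, positions, and orientations are consistent. Correct any "
        "mistakes and return only the final floor plan."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": reflection_prompt}]}],
        inferenceConfig={"maxTokens": 4000, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```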
Fine-tuning approach
We explored two methods for optimizing LLMs for automated floor plan generation: full parameter fine-tuning and Low-Rank Adaptation (LoRA)-based fine-tuning. Full fine-tuning adjusts all LLM parameters, which requires significant memory and training time. In contrast, LoRA tunes only a small subset of parameters, reducing memory requirements and training time.
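To make the difference concrete, here is a minimal LoRA setup using the Hugging Face peft library: low-rank adapters are attached to the attention projections while the base weights stay frozen. The hyperparameters are illustrative, not the values used in the project:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
)

# Attach low-rank adapters to the attention projections only; the frozen
# base weights are what keep memory use and training time low.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```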
Workflow and architecture
We implemented our workflow containing data processing, fine-tuning, and inference and testing steps, shown in the following figure, all within a SageMaker JupyterLab notebook provisioned with an ml.p4d.24xlarge instance, giving us access to NVIDIA A100 GPUs. Because we used a Jupyter notebook and ran all parts of our workflow interactively, we were able to iterate quickly and debug our experiments while maturing the training and testing scripts.
Deep dive on fine-tuning workflow
One key insight from our experiments was the critical importance of dataset quality and diversity. Beyond our initial dataset preparation, when fine-tuning a model, we found that carefully selecting training samples with larger diversity helped the model learn more robust representations. Additionally, although larger batch sizes generally improved performance (within memory constraints), we had to carefully balance this against computational resources (320 GB of GPU memory in an ml.p4d.24xlarge instance) and training time (ideally within 1–2 days).
We conducted several iterations to optimize performance, experimenting with various approaches including initial few-sample quick instruction fine-tuning, larger dataset fine-tuning, fine-tuning with early stopping, comparing Llama 3.1 8B and Llama 3 8B models, and varying instruction length in fine-tuning samples. Through these iterations, we found that full fine-tuning of the Llama 3.1 8B model using a curated dataset of 200,000 samples produced the best results.
The training process for full fine-tuning of Llama 3.1 8B with BF16 and a micro-batch size of three involved eight epochs with 30,000 steps, taking 25 hours to complete. In contrast, the LoRA approach showed significant computational efficiency, requiring only 2 hours of training time and producing an 89 MB checkpoint.
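For orientation, this kind of run could be expressed with Hugging Face TrainingArguments roughly as follows. Only the precision, micro-batch size, and epoch count come from the run described above; every other value is an illustrative assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama31-8b-floorplan-full-ft",
    bf16=True,                       # BF16 mixed precision, as in the run above
    per_device_train_batch_size=3,   # micro-batch size of three
    num_train_epochs=8,
    learning_rate=1e-5,              # illustrative value
    gradient_checkpointing=True,     # common choice to fit an 8B model in memory
    logging_steps=50,
    save_strategy="epoch",
)
```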
Evaluation framework
The testing framework implements an efficient evaluation methodology that optimizes resource utilization and time while maintaining statistical validity. Key components, illustrated in the sketch after this list, include:
A prompt deduplication system that identifies and consolidates duplicate instructions in the test dataset, reducing computational overhead and enabling faster iteration cycles for model improvement
A distribution-based performance assessment that filters unique test cases, promotes representative sampling through statistical analysis, and projects results across the full dataset
A metric-based evaluation that implements scoring across key criteria, enabling comparative analysis against both the baseline GPT-2 model and other approaches
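The following sketch shows how the deduplication and projection components listed above can fit together; the scoring function and record fields are hypothetical stand-ins for ZURU’s internal metrics:

```python
from collections import Counter

def evaluate(test_records, generate_fn, score_fn):
    """Score each unique prompt once, then project the results back over the
    full test set according to how often each prompt occurs."""
    counts = Counter(r["prompt"] for r in test_records)

    weighted_sum, total = 0.0, sum(counts.values())
    for prompt, count in counts.items():
        # Pick one reference record per unique prompt (deduplication).
        reference = next(r for r in test_records if r["prompt"] == prompt)
        generated = generate_fn(prompt)
        score = score_fn(generated, reference)   # e.g. instruction adherence in [0, 1]
        weighted_sum += score * count            # project across duplicate prompts

    return weighted_sum / total
```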
Results and business impact
To understand how well each approach in our experiment performed, we used the evaluation framework and compared several key metrics. For the purposes of this post, we focus on two of them. The first reflects how well the model was able to follow the user’s instructions and reproduce the requested features of the house. The second looks at how well the features of the house adhered to the instructions in terms of mathematical correctness, positioning, and orientation. The following image shows these results in a graph.
We found that the prompt engineering approach with Anthropic’s Claude 3.5 Sonnet, as well as the full fine-tuning approach with Llama 3.1 8B, increased instruction adherence quality over the baseline GPT-2 model by 109%, showing that, depending on a team’s skill sets, both approaches can be used to improve how well an LLM understands and follows instructions when generating content such as floor plans.
When looking at mathematical correctness, our prompt engineering approach wasn’t able to create significant improvements over the baseline, but full fine-tuning was a clear winner, with a 54% increase over the baseline GPT-2 results.
The LoRA-based tuning approach achieved lower performance, scoring 20% less on instruction adherence and 50% less on mathematical correctness compared to full fine-tuning, demonstrating the tradeoffs that can be made between time, cost, and hardware requirements on one hand and model accuracy on the other.
Conclusion
ZURU Tech has set its vision on fundamentally transforming the way we design and construct buildings. In this post, we highlighted the approach to building and improving a text-to-floor plan generator based on LLMs to create a highly usable and streamlined workflow within a 3D-modeling system. We dived into advanced concepts of prompt engineering using Amazon Bedrock and detailed approaches to fine-tuning LLMs using Amazon SageMaker, showing the different tradeoffs you can make to significantly improve the accuracy of the generated content.
To learn more about the Generative AI Innovation Center program, get in touch with your account team.
About the Authors
Federico Di Mattia is the team leader and Product Owner of ZURU AI at ZURU Tech in Modena, Italy. With a focus on AI-driven innovation, he leads the development of Generative AI solutions that enhance business processes and drive ZURU’s growth.
Niro Amerasinghe is a Senior Solutions Architect based out of Auckland, New Zealand. With experience in architecture, product development, and engineering, he helps customers use Amazon Web Services (AWS) to grow their businesses.
Haofei Feng is a Senior Cloud Architect at AWS with over 18 years of expertise in DevOps, IT Infrastructure, Data Analytics, and AI. He specializes in guiding organizations through cloud transformation and generative AI initiatives, designing scalable and secure GenAI solutions on AWS. Based in Sydney, Australia, when not architecting solutions for clients, he cherishes time with his family and Border Collies.
Sheldon Liu is an applied scientist, ANZ Tech Lead at the AWS Generative AI Innovation Center. He partners with enterprise customers across diverse industries to develop and implement innovative generative AI solutions, accelerating their AI adoption journey while driving significant business outcomes.
Xuefeng Liu leads a science team at the AWS Generative AI Innovation Center in the Asia Pacific regions. His team partners with AWS customers on generative AI projects, with the goal of accelerating customers’ adoption of generative AI.
Simone Bartoli is a Machine Learning Software Engineer at ZURU Tech, in Modena, Italy. With a background in computer vision, machine learning, and full-stack web development, Simone specializes in creating innovative solutions that leverage cutting-edge technologies to enhance business processes and drive growth.
Marco Venturelli is a Senior Machine Learning Engineer at ZURU Tech in Modena, Italy. With a background in computer vision and AI, he leverages his experience to innovate with generative AI, enriching the Dreamcatcher software with smart features.
Stefano Pellegrini is a Generative AI Software Engineer at ZURU Tech in Italy. Specializing in GAN and diffusion-based image generation, he creates tailored image-generation solutions for various departments across ZURU.
Enrico Petrucci is a Machine Learning Software Engineer at ZURU Tech, based in Modena, Italy. With a strong background in machine learning and NLP tasks, he currently focuses on leveraging Generative AI and Large Language Models to develop innovative agentic systems that provide tailored solutions for specific business cases.