Advanced AI News
VentureBeat AI

How S&P is using deep web scraping, ensemble learning and Snowflake architecture to collect 5X more data on SMEs

By Advanced AI Bot | June 2, 2025 | 6 min read

The investing world has a significant problem when it comes to data about small and medium-sized enterprises (SMEs). This has nothing to do with data quality or accuracy — it’s the lack of any data at all. 

Assessing SME creditworthiness has been notoriously challenging because small enterprise financial data is not public, and therefore very difficult to access.

S&P Global Market Intelligence, a division of S&P Global and a foremost provider of credit ratings and benchmarks, claims to have solved this longstanding problem. The company’s technical team built RiskGauge, an AI-powered platform that crawls otherwise elusive data from over 200 million websites, processes it through numerous algorithms and generates risk scores. 

Built on Snowflake architecture, the platform has increased S&P’s coverage of SMEs by 5X. 

“Our objective was expansion and efficiency,” explained Moody Hadi, head of new product development for risk solutions at S&P Global. “The project has improved the accuracy and coverage of the data, benefiting clients.”

RiskGauge’s underlying architecture

Counterparty credit management essentially assesses a company’s creditworthiness and risk based on several factors, including financials, probability of default and risk appetite. S&P Global Market Intelligence provides these insights to institutional investors, banks, insurance companies, wealth managers and others. 

“Large and financial corporate entities lend to suppliers, but they need to know how much to lend, how frequently to monitor them, what the duration of the loan would be,” Hadi explained. “They rely on third parties to come up with a trustworthy credit score.” 

But there has long been a gap in SME coverage. Hadi pointed out that, while large public companies like IBM, Microsoft, Amazon, Google and the rest are required to disclose their quarterly financials, SMEs don’t have that obligation, thus limiting financial transparency. From an investor perspective, consider that there are about 10 million SMEs in the U.S., compared to roughly 60,000 public companies. 

S&P Global Market Intelligence claims it now has all of those covered: Previously, the firm only had data on about 2 million, but RiskGauge expanded that to 10 million.  

The platform, which went into production in January, is based on a system built by Hadi’s team that pulls firmographic data from unstructured web content, combines it with anonymized third-party datasets, and applies machine learning (ML) and advanced algorithms to generate credit scores. 

The company uses Snowflake to mine company pages and process them into firmographic drivers (market segmenters) that are then fed into RiskGauge.

The platform’s data pipeline consists of:

  • Crawlers/web scrapers
  • A pre-processing layer
  • Miners
  • Curators
  • RiskGauge scoring

Specifically, Hadi’s team uses Snowflake’s data warehouse and Snowpark Container Services for the middle steps of the pipeline: pre-processing, mining and curation.
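
As a rough illustration of how those five stages fit together, here is a minimal Python sketch. Every function is a hypothetical stub standing in for S&P’s actual components, which run on Snowflake’s data warehouse and Snowpark Container Services rather than in plain Python.

```python
# Hypothetical sketch of the five RiskGauge pipeline stages named above.
# All functions are illustrative stubs, not S&P's implementation.

def crawl(domain: str) -> list[str]:
    """Crawlers/web scrapers: fetch raw pages for one company domain."""
    return [f"<html>page for {domain}</html>"]

def preprocess(page: str) -> str:
    """Pre-processing layer: strip markup so only text remains."""
    return page.replace("<html>", "").replace("</html>", "")

def mine(pages: list[str]) -> dict:
    """Miners: pull firmographic fields out of the cleaned text."""
    return {"name": pages[0][:40], "sector": "unknown"}

def curate(fields: dict) -> dict:
    """Curators: validate fields and drop empty values."""
    return {k: v for k, v in fields.items() if v}

def score(record: dict) -> int:
    """RiskGauge scoring: a placeholder on the article's 1-100 scale."""
    return 50  # the real score blends financial, business and market risk

def run_pipeline(domains: list[str]) -> dict[str, int]:
    results = {}
    for domain in domains:
        pages = [preprocess(p) for p in crawl(domain)]
        results[domain] = score(curate(mine(pages)))
    return results
```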

At the end of this process, SMEs are scored based on a combination of financial, business and market risk; 1 being the highest, 100 the lowest. Investors also receive reports on RiskGauge detailing financials, firmographics, business credit reports, historical performance and key developments. They can also compare companies to their peers. 

How S&P is collecting valuable company data

Hadi explained that RiskGauge employs a multi-layer scraping process that pulls various details from a company’s web domain, such as basic ‘contact us’ and landing pages, as well as news-related information. The miners go down several URL layers to scrape relevant data.

“As you can imagine, a person can’t do this,” said Hadi. “It is going to be very time-consuming for a human, especially when you’re dealing with 200 million web pages.” That crawl, he noted, yields several terabytes of website information.
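
A hedged sketch of what that multi-layer crawl might look like, assuming the requests and beautifulsoup4 libraries; the depth limit and same-domain filter are illustrative choices the article does not specify.

```python
# Illustrative depth-limited crawler: start at a company's landing page and
# follow links a few URL layers down, staying on the same domain.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_domain(root_url: str, max_depth: int = 2) -> dict[str, str]:
    domain = urlparse(root_url).netloc
    seen, frontier, pages = {root_url}, [(root_url, 0)], {}
    while frontier:
        url, depth = frontier.pop(0)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages rather than aborting the crawl
        pages[url] = html
        if depth >= max_depth:
            continue
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if urlparse(nxt).netloc == domain and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return pages
```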

After data is collected, the next step is to run algorithms that remove anything that isn’t text; Hadi noted that the system is not interested in JavaScript or even HTML tags. Data is cleaned so it becomes human-readable, not code. Then, it’s loaded into Snowflake and several data miners are run against the pages.
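
That cleaning step might look roughly like this, again assuming BeautifulSoup; the article does not name the parsing library S&P uses.

```python
# Keep only human-readable text: drop scripts, styles and all HTML tags.
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for node in soup(["script", "style", "noscript"]):
        node.decompose()  # remove JavaScript and CSS blocks entirely
    # get_text() flattens the remaining tags into plain prose
    return " ".join(soup.get_text(separator=" ").split())
```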

Ensemble algorithms are critical to the prediction process; these types of algorithms combine predictions from several individual models (base models or ‘weak learners’ that are essentially a little better than random guessing) to validate company information such as name, business description, sector, location, and operational activity. The system also factors in any polarity in sentiment around announcements disclosed on the site. 

“After we crawl a site, the algorithms hit different components of the pages pulled, and they vote and come back with a recommendation,” Hadi explained. “There is no human in the loop in this process, the algorithms are basically competing with each other. That helps with the efficiency to increase our coverage.” 
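
The voting Hadi describes can be sketched as a simple plurality vote over weak extractors. The three toy extractors below are hypothetical stand-ins for S&P’s base models, which the article does not detail.

```python
# Majority-vote ensemble: each weak learner proposes a value for a field
# (here, company name) and the most common answer wins, with no human review.
from collections import Counter
from typing import Callable

def vote(extractors: list[Callable[[str], str]], page_text: str) -> str:
    candidates = [fn(page_text) for fn in extractors]
    value, _ = Counter(candidates).most_common(1)[0]
    return value

# Toy extractors, each only a little better than guessing on its own:
guess_from_title = lambda t: t.split(" - ")[0].strip()
guess_first_words = lambda t: " ".join(t.split()[:2])
guess_upper_span = lambda t: next((w for w in t.split() if w.isupper()), t.split()[0])

print(vote([guess_from_title, guess_first_words, guess_upper_span],
           "ACME Corp - Industrial Fasteners"))  # -> "ACME Corp"
```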

Following that initial load, the system monitors site activity, automatically running weekly scans. It doesn’t update information every week, though; it does so only when it detects a change, Hadi added. When performing subsequent scans, a hash key tracks the landing page from the previous crawl, and the system generates another key; if they are identical, no changes were made, and no action is required. However, if the hash keys don’t match, the system is triggered to update the company’s information.
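
A minimal sketch of that comparison, assuming a SHA-256 digest over the landing-page HTML; the article says only that “a hash key” is used.

```python
# Change detection: identical hashes mean nothing changed, so no action
# is required; a mismatch triggers a fresh scrape of the site.
import hashlib

def page_hash(html: str) -> str:
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_update(previous_hash: str, current_html: str) -> bool:
    return page_hash(current_html) != previous_hash
```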

This continuous scraping is important to ensure the system remains as up-to-date as possible. “If they’re updating the site often, that tells us they’re alive, right?” Hadi noted.

Challenges with processing speed, giant datasets, unclean websites

There were challenges to overcome when building out the system, of course, particularly due to the sheer size of datasets and the need for quick processing. Hadi’s team had to make trade-offs to balance accuracy and speed. 

“We kept optimizing different algorithms to run faster,” he explained. “And tweaking; some algorithms we had were really good, had high accuracy, high precision, high recall, but they were computationally too costly.” 

Websites do not always conform to standard formats, requiring flexible scraping methods.

“You hear a lot about designing websites with an exercise like this, because when we originally started, we thought, ‘Hey, every website should conform to a sitemap or XML,’” said Hadi. “And guess what? Nobody follows that.”

They didn’t want to hard code or incorporate robotic process automation (RPA) into the system because sites vary so widely, Hadi said, and they knew the most important information they needed was in the text. This led to the creation of a system that only pulls necessary components of a site, then cleanses it for the actual text and discards code and any JavaScript or TypeScript.

As Hadi noted, “the biggest challenges were around performance and tuning and the fact that websites by design are not clean.” 


Source link
