    How S&P is using deep web scraping, ensemble learning and Snowflake architecture to collect 5X more data on SMEs

    By PineapplesUpdate | June 3, 2025 | 6 Mins Read



    When it comes to data about small and medium-sized enterprises (SMEs), the investment world has a significant problem. It has nothing to do with data quality or accuracy; it is the absence of any data at all.

    Assessing SME credit has been notoriously challenging because small-enterprise financial data is not public, and is therefore very difficult to access.

    S&P Global Market Intelligence, a division of S&P Global and a leading provider of credit ratings and benchmarks, claims to have solved this long-standing problem. The company's technical team built RiskGauge, an AI-powered platform that crawls otherwise elusive data from more than 200 million websites, processes it through several algorithms and generates risk scores.

    Built on Snowflake's architecture, the platform has increased S&P's SME coverage 5X.

    “Our aim was expansion and efficiency,” explained Moody Hadi, S&P Global's head of new product development for Risk Solutions. “The project has improved the accuracy and coverage of the data, benefiting clients.”

    RiskGauge's underlying architecture

    Counterparty credit management essentially assesses a company's creditworthiness and risk based on several factors, including financials, probability of default and risk appetite. S&P Global Market Intelligence provides these insights to institutional investors, banks, insurance companies, wealth managers and others.

    “Large financial corporate institutions lend to suppliers, but they need to know how much to lend, how often to monitor them, and what the duration of the loan will be,” Hadi explained. “They rely on a third party to come up with a trusted credit score.”

    But there has long been a gap in SME coverage. Hadi noted that, while large public companies like IBM, Microsoft, Amazon and Google are required to disclose their quarterly financials, SMEs have no such obligation, which limits financial transparency. From an investor's perspective, consider that there are roughly 10 million SMEs in the US, compared with about 60,000 public companies.

    S&P Global Market Intelligence claims it now covers all of them: previously, the firm had data on only about 2 million, but RiskGauge has expanded that to all 10 million.

    The platform, which went into production in January, is based on a system built by Hadi's team that pulls firmographic data from unstructured web content, combines it with anonymized third-party datasets, and applies machine learning (ML) and advanced algorithms to generate credit scores.

    The company uses Snowflake to mine company pages and process them into firmographic drivers (market segmenters) that are then fed into RiskGauge.

    The data pipeline of the platform includes:

    • Crawlers/web scrapers
    • A pre-processing layer
    • Miners
    • Curators
    • Risk scoring
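The five pipeline stages above can be sketched as a simple chain of functions. This is purely illustrative; every function name and record field below is an assumption for the sketch, not S&P's actual code.

```python
# Hypothetical sketch of the five RiskGauge pipeline stages described above.
# All names and fields are illustrative assumptions.

def crawl(domain: str) -> list[str]:
    """Stage 1: crawlers/web scrapers fetch raw pages for a domain."""
    return [f"<html>raw page from {domain}</html>"]

def preprocess(pages: list[str]) -> list[str]:
    """Stage 2: the pre-processing layer strips markup down to plain text."""
    return [p.replace("<html>", "").replace("</html>", "") for p in pages]

def mine(texts: list[str]) -> dict:
    """Stage 3: miners extract firmographic fields from the cleaned text."""
    return {"name": "Acme Ltd", "sector": "manufacturing", "raw": texts}

def curate(record: dict) -> dict:
    """Stage 4: curators validate the mined record and drop working data."""
    return {k: v for k, v in record.items() if k != "raw"}

def score(record: dict) -> int:
    """Stage 5: risk scoring maps the curated record to a 1-100 score."""
    return 42  # placeholder score

def run_pipeline(domain: str) -> int:
    return score(curate(mine(preprocess(crawl(domain)))))
```

Chaining the stages this way mirrors the order in the list above: raw pages flow in at the crawler and a single risk score comes out the other end.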

    Notably, Hadi's team uses Snowflake's data warehouse and Snowpark Container Services across the pre-processing, mining and curation stages.

    At the end of this process, SMEs are scored based on a combination of financial, business and market risk, with 1 being the highest risk and 100 the lowest. Investors also receive RiskGauge reports covering financials, firmographics, business credit reports, historical performance and key developments. They can also compare companies against their peers.
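As a rough illustration of how component risks might map onto the 1-to-100 scale (1 riskiest, 100 safest), the sketch below blends three normalized risk inputs. The equal weighting and [0, 1] normalization are assumptions; S&P's actual scoring model is not public.

```python
def risk_score(financial: float, business: float, market: float) -> int:
    """Blend three risk components (each normalized to [0, 1], where 1 is
    riskiest) into a 1-100 score, 1 = highest risk, 100 = lowest.
    Equal weighting is an illustrative assumption."""
    blended = (financial + business + market) / 3
    # Map blended risk 1.0 -> score 1 and blended risk 0.0 -> score 100.
    return max(1, min(100, round(100 - 99 * blended)))
```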

    How S&P is collecting valuable company data

    Hadi explained that RiskGauge employs a multi-layer scraping process that pulls various details from a company's web domain, such as the basic 'contact us' and landing pages and news-related information. The miners go several URL layers deep to scrape relevant data.
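A multi-layer scrape of this kind is essentially a bounded breadth-first walk of a site's link graph. The sketch below substitutes an in-memory link map for real HTTP fetching and link extraction; the function and graph are illustrative assumptions, not S&P's crawler.

```python
from collections import deque

def crawl_layers(start: str, links: dict[str, list[str]],
                 max_depth: int = 2) -> list[str]:
    """Breadth-first walk of a site's link graph, descending a limited
    number of URL layers, as the multi-layer scrape above describes.
    `links` stands in for live fetching plus link extraction."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # stop descending past the configured layer limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Toy site: landing page links to a contact page and a news section.
site = {"/": ["/contact", "/news"], "/news": ["/news/a"]}
```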

    “As you can imagine, a person cannot do that,” said Hadi. “It would be enormously time-consuming for a human, especially when you are dealing with 200 million web pages.” That, he said, results in many terabytes of website information.

    After the data is collected, the next step is to run algorithms that remove anything that is not text; Hadi noted that the system is not interested in JavaScript or even HTML tags. The data is cleaned so that it becomes human-readable, not code. Then it is loaded into Snowflake, and several data miners are run against the pages.
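Stripping a page down to human-readable text while discarding tags and script bodies can be done with Python's standard-library HTML parser. This is a rough stand-in for the cleanup step described above, not S&P's implementation:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps human-readable text; discards tags and script/style bodies."""
    def __init__(self):
        super().__init__()
        self._skip = 0          # depth inside <script>/<style> blocks
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def clean_page(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```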

    Ensemble algorithms are integral to the prediction process; these combine several individual models (base models, or 'weak learners', that are individually only slightly better than random guessing) to validate a company's information, such as its name, business description, sector, location and operational activity. The system also factors in any polarity in sentiment around announcements disclosed on the site.
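A minimal sketch of this kind of ensemble validation, assuming each weak learner proposes a value per field and a simple majority vote decides. The field names and learners here are hypothetical:

```python
from collections import Counter

def ensemble_vote(extractors, page_text: str) -> dict:
    """Each 'weak learner' in `extractors` proposes a value for every
    field; the majority vote per field wins. Illustrative only."""
    proposals = [fn(page_text) for fn in extractors]
    result = {}
    for field in proposals[0]:
        votes = Counter(p[field] for p in proposals)
        result[field] = votes.most_common(1)[0][0]
    return result

# Three hypothetical weak learners voting on a company's sector.
learners = [
    lambda text: {"sector": "retail"},
    lambda text: {"sector": "retail"},
    lambda text: {"sector": "services"},
]
```

As the article notes, no human sits in this loop: the models effectively compete, and agreement between them is what raises confidence in a field.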

    “After we crawl a site, the algorithms hit the various components of the pages that were pulled, and they vote and come back with a recommendation,” Hadi explained. “There is no human in the loop in this process; the algorithms are basically competing with each other. That helps increase our coverage.”

    After that initial load, the system monitors site activity, automatically running weekly scans. It does not update information every week, though; it does so only when it detects a change, Hadi said. During subsequent scans, a hash key tracks the landing page from the previous crawl, and the system generates a new key; if they are identical, no changes were made and no action is required. If the hash keys do not match, however, the system is triggered to update the company's information.
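The hash-key comparison can be sketched with a content fingerprint. SHA-256 is an assumption here; the article only says a hash key from the previous crawl is compared with a freshly generated one:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Hash key for a crawled page. The choice of SHA-256 is an
    illustrative assumption, not confirmed by the article."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_update(previous_key: str, current_html: str) -> bool:
    """Equal keys -> page unchanged, skip; different keys -> refresh."""
    return page_fingerprint(current_html) != previous_key
```

Comparing fixed-size fingerprints instead of whole pages keeps the weekly scan cheap: only changed sites pay the cost of a full re-mine.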

    This continuous scraping is important to ensure the system remains as up-to-date as possible. “If they're updating the site often, that tells us they're alive, right?” Hadi said.

    Challenges with processing speed, huge datasets and unclean websites

    When building the system, the challenges stemmed mainly from the sheer size of the dataset and the need for fast processing. Hadi's team had to make trade-offs to balance accuracy and speed.

    “We kept optimizing different algorithms to run faster,” he explained. “And tweaking: some of the algorithms we had were really good, with high accuracy, high precision, high recall, but they were computationally far too expensive.”

    Websites do not always correspond to standard formats, requiring flexible scraping methods.

    “You hear a lot about designing for websites through an exercise like this, because when we originally started, we thought, 'Hey, every website should conform to a sitemap or XML,'” said Hadi. “And guess what? Nobody follows that.”

    They did not want to hard-code anything or build robotic process automation (RPA) into the system, Hadi said, because sites vary so widely, and they knew the most important information they needed was in the text. This led to a system that pulls only the essential components of a site, then cleans them down to the actual text, discarding the code and any JavaScript or TypeScript.

    As Hadi put it, “The biggest challenges were around performance and tuning, and the fact that websites, by design, are not clean.”

