Webscraping Norwegian Company Information Data Using Langchain
Our thesis professor gave us the task of adding information columns to an existing Norwegian company database. Originally, my thesis partner and I tried filling in the information (revenue, current CEO, number of employees, etc.) by hand: we would look up the company name in the Brønnøysundregistrene, scroll through the page, copy-paste the values, then add them to the Excel file. This repetitive process showed clear potential for automation and webscraping, which is why I decided to automate it to save time, especially since we would later be analyzing 300 companies.
Usually, I would webscrape using Selenium, but this semester I took a course named “Transforming Business with AI: The Power of Large Language Models”, where we learned how to use large language models for data analysis, LangChain, and tool use. With the skills I learned from this course, I built a webscraper using LangChain, Tavily, image processing, and LLMs, and then saved the data to a CSV file.
Process
Company Information Webscraping
Base Model
I want the data to be stored in a dataframe, which means the result should be structured data. This is important to know because, when we ask an LLM for a response, it tends to give a conversational answer such as “The net revenue for company A is 50000 NOK”. However, we do not want that: we just need the net revenue value in the dataframe (“50000 NOK”) without the frills. Therefore, we need to constrain the model's response through a structured response.
class GeneralInfo(BaseModel):
    domain: str = Field(description="internettadresse av selskapet")
    founding_date: str = Field(description="Stiftelsesdato av selskapet")
    org_form: str = Field(description="organisasjonsform av selskapet")
    initial_address: str = Field(description="Forretningsadresse av selskapet")
    initial_municipality: str = Field(description="kommune og land av selskapet")
    industry_code: str = Field(description="næringskoder av selskapet")
    purpose: str = Field(description="Vedtektsfestet formål av selskapet")
    activity: str = Field(description="Aktivitet av selskapet")
    share_capital: str = Field(description="aksjekapital av selskapet")
    share_no: str = Field(description="antall aksjer av selskapet")
    ceo: str = Field(description="daglig leder av selskapet")
    board_chair: str = Field(description="styrets leder av selskapet")
    subsidiaries: str = Field(description="under enheter av selskapet")
Here, I used a Pydantic model for the output so that the data is type-safe, correct, and structured. I defined 13 fields that we want our LLM to extract. These are all company-specific information such as the founding date, initial address, and CEO.
Now, we initialize the Azure LLM using the AzureChatOpenAI class from langchain_openai. I set the temperature to 0 so that we get straightforward and factual responses.
client = AzureChatOpenAI(
    api_key=os.getenv("azure_key"),
    azure_endpoint="https://ban443-1.openai.azure.com/",
    api_version="2025-03-01-preview",
    azure_deployment="Group01",
    temperature=0
)
To extract company information, we planned to scrape data from the Brønnøysundregistrene, which is the national registry for Norwegian company information. To extract this information, we used the Tavily API. Tavily is a specialized search engine that performs real-time web search and extracts website content in a format designed for AI applications.
tavily_client = TavilyClient(api_key=os.getenv("tavily_key"))
We wanted to divide the important actions into reusable functions, which is why we made “search_comp”, “get_company_info”, and “get_company_info_each_row”.
def search_comp(org_num: str):
    result = tavily_client.extract(
        urls=f"virksomhet.brreg.no/nb/oppslag/enheter/{org_num}",
        extract_depth="advanced",
    )
    return str(result)
“search_comp” retrieves the specific webpage for an organization or company in the Brønnøysundregistrene. The Tavily Extract API returns the webpage content from the provided URL. Here we used the pattern “virksomhet.brreg.no/nb/oppslag/enheter/{org_num}” because that is how the URL is structured for each company, and we set the extract_depth to “advanced” so that we retrieve more data, including tables and embedded content.
Once we have the website information, we want the LLM to analyze and extract the information that we need from our Base Model. First, we need to design our prompt.
extract_prompt = PromptTemplate(
    input_variables=["org", "search_results"],
    template="You are a research assistant for Norwegian companies. "
             "You will be given search results. "
             "Extract only the requested information. If the result is numbers, put the full value and show all the digits and currency if needed. Always use the latest and most up-to-date data. "
             "If nothing reliable is found, return 'NOT_FOUND'.\n"
             "Org No: {org}\n"
             "Search Results:\n{search_results}\n\n"
             "Give the company information."
)
extract_chain = extract_prompt | client.with_structured_output(GeneralInfo)
In our prompt, we instruct the LLM to extract the company information from the search results. This data is then stored as a structured output based on our Base Model. This way, we remove the unnecessary filler words of a standard response so that we are able to store the data in our dataframe.
def get_company_info(org_num: str):
    """calls llm to extract info"""
    # calls the webpage
    searches = [
        search_comp(org_num),
    ]
    # calls llm to put data in structured form
    parsed = extract_chain.invoke({
        "org": org_num,
        "search_results": searches,
    })
    return parsed.model_dump()
We combine the Tavily search and the LLM prompt in “get_company_info()”. First, we scrape the webpage using “search_comp”. Then, we pass the search results and the organization number into the LLM prompt using the “invoke” function. Once we get the results, we return the output of the LLM call as a Python dictionary using the “model_dump()” function.
def get_company_info_each_row(row):
    out = get_company_info(row["org_number"])
    return out
Now that we have all our functions combined together in the “get_company_info()” function, we apply this function to each row and organization number in our dataset in “get_company_info_each_row()”.
info_df = df_comp[["org_number", "company_name"]]
extracted_df = info_df.apply(get_company_info_each_row, axis=1, result_type="expand")
info_df = pd.concat([info_df, extracted_df], axis=1)
To put it all together, we apply “get_company_info_each_row()” to each row in our original dataframe and then generate a new dataframe named “info_df” that contains the original columns plus the extracted fields.
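To illustrate how apply with result_type="expand" spreads the per-row dictionaries into new columns, here is a small self-contained sketch; the “fake_info” function and the data in it are made-up stand-ins, not the real scraper:

```python
import pandas as pd

df = pd.DataFrame({"org_number": ["123", "456"]})

# hypothetical stand-in for get_company_info_each_row: returns a dict per row
def fake_info(row):
    return {"ceo": f"CEO-{row['org_number']}", "domain": "example.no"}

# result_type="expand" turns the dict keys into dataframe columns
extracted = df.apply(fake_info, axis=1, result_type="expand")
out = pd.concat([df, extracted], axis=1)
print(list(out.columns))  # ['org_number', 'ceo', 'domain']
```

Because the dict keys become column names, the concat step simply glues the new columns next to the identifying columns.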
Annual Report Scraping
The annual reports in the Brønnøysundregistrene are downloadable PDFs on the same webpage as the company information. These PDFs have no preview, and the text is not extractable after downloading because the pages are images. Therefore, for the annual report scraping we had to retrieve the PDF file, convert it into images, and then feed the page images to the LLM.
class AnnualReport(BaseModel):
    sum_inntekter: str = Field(description="sum inntekter av selskapet det siste året")
    sum_kostnader: str = Field(description="sum kostnader av selskapet det siste året")
For the annual reports, we just wanted to extract the total revenue and total costs so we made a Pydantic Base Model that has this structure.
def fetch_clean_pdf_bytes(org_number: str, year: str):
    url = f"https://virksomhet.brreg.no/nb/oppslag/enheter/{org_number}"
    headers = {
        "Accept": "text/x-component",
        "Content-Type": "text/plain;charset=UTF-8",
        "Next-Action": "7f38d2766eb41fb21692e4487f7a0cee2cf12de1e1",
        "Origin": "https://virksomhet.brreg.no",
        "Referer": f"https://virksomhet.brreg.no/nb/oppslag/enheter/{org_number}",
        "User-Agent": "Mozilla/5.0",
    }
    payload = f'["{org_number}","{year}"]'
    response = requests.post(url, headers=headers, data=payload, timeout=60)
    response.raise_for_status()

    raw = response.content
    # keep only the bytes between the PDF header and the last EOF marker
    start = raw.find(b"%PDF-")
    if start == -1:
        return "NOT FOUND"
    pdf_bytes = raw[start:]
    end = pdf_bytes.rfind(b"%%EOF")
    if end == -1:
        return "NOT FOUND"
    return pdf_bytes[: end + len(b"%%EOF")]
Then, we make an HTTP request to retrieve the contents of the PDF. The retrieved data is in the form of bytes. We also check whether the retrieved bytes actually contain a PDF, which signals that an annual report exists. If no PDF is found, we simply return “NOT FOUND”.
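The header/EOF slicing logic can be demonstrated on a fabricated response; the framing bytes below are made up for illustration and real responses will differ:

```python
def slice_pdf(raw: bytes):
    # cut from the first PDF header marker to the last EOF marker
    start = raw.find(b"%PDF-")
    if start == -1:
        return "NOT FOUND"
    pdf_bytes = raw[start:]
    end = pdf_bytes.rfind(b"%%EOF")
    if end == -1:
        return "NOT FOUND"
    return pdf_bytes[: end + len(b"%%EOF")]

fake_response = b'0:["stream-framing"]\n%PDF-1.7 ...pages... %%EOF\ntrailing-noise'
print(slice_pdf(fake_response))   # b'%PDF-1.7 ...pages... %%EOF'
print(slice_pdf(b"no pdf here"))  # NOT FOUND
```

The same two markers bracket every well-formed PDF, which is why a plain find/rfind is enough to strip the server's wrapper bytes.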
def pdf_bytes_to_image_content_blocks(pdf_bytes, dpi: int = 200):
    if pdf_bytes == "NOT FOUND":
        return "NOT FOUND"
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    blocks = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        image_bytes = pix.tobytes("png")
        image_b64 = base64.b64encode(image_bytes).decode("utf-8")
        blocks.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{image_b64}"
            }
        })
    return blocks
Since the PDF is not text extractable, we convert each page into a PNG image in the “pdf_bytes_to_image_content_blocks()” function, which returns a list of base64-encoded image content blocks ready to be sent to the LLM.
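The shape of each content block can be sketched without PyMuPDF; the helper name and the sample bytes below are hypothetical, but the dict structure mirrors the one built above:

```python
import base64

# hypothetical helper mirroring one iteration of the loop above
def png_to_content_block(image_bytes: bytes) -> dict:
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
    }

# the 8-byte PNG file signature, standing in for a real rendered page
block = png_to_content_block(b"\x89PNG\r\n\x1a\n")
print(block["image_url"]["url"])  # data:image/png;base64,iVBORw0KGgo=
```

Embedding the image as a base64 data URL means no file ever has to touch disk; the whole page travels inside the chat message.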
Now that we have our extracted annual report images, we want to send them to the LLM for processing.
def get_annual_info(org_num: str, year: str):
    """calls llm to extract annual info"""
    # extract the report from the website; we get bytes here as output
    report = fetch_clean_pdf_bytes(org_number=org_num, year=year)
    if report == "NOT FOUND":
        return "NOT FOUND"
    report_img = pdf_bytes_to_image_content_blocks(pdf_bytes=report)
    structured_annual_llm = client.with_structured_output(AnnualReport)
    # calls llm to return information in structured form
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": (
                    "You are a research assistant for Norwegian companies. "
                    "These images are pages from a Norwegian annual report. "
                    "Extract the annual information numbers. "
                    "Prefer the latest reporting year shown in the document. "
                    "Keep financial values exactly as shown, including separators, signs, and units if present. "
                    "If a field cannot be found reliably, return null.\n\n"
                    f"Org No: {org_num}\n"
                    f"Requested year: {year}"
                )
            },
            *report_img
        ]
    )
    parsed = structured_annual_llm.invoke([message])
    return parsed.model_dump()
Here, we piece everything together by fetching the PDF, converting it to images, and then sending them to the LLM as a HumanMessage. We used HumanMessage here because we are attaching images. We then instruct the LLM to extract the annual information numbers, and we constrain it to produce the result only in the structured format provided by our Pydantic Base Model.
def get_annual_info_each_row(row, year):
    out = get_annual_info(row["org_number"], year)
    return out
Then, we make a “get_annual_info_each_row()” function so that we can apply the “get_annual_info()” function to each row in the dataframe.
annual_info_df = df_comp[["org_number", "company_name"]]
annual_extracted_df = annual_info_df.apply(get_annual_info_each_row, year="2024", axis=1, result_type="expand")
annual_info_df = pd.concat([annual_info_df, annual_extracted_df], axis=1)
Then, we apply our “get_annual_info_each_row()” function to each row in our dataframe.
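Finally, the introduction mentioned saving the results to a CSV file. A minimal sketch of that last step with pandas' to_csv is shown below; the dataframe contents here are made-up placeholders standing in for the real info_df:

```python
import pandas as pd

# placeholder dataframe standing in for the scraped info_df
info_df = pd.DataFrame({
    "org_number": ["123456789"],
    "company_name": ["Example AS"],
    "ceo": ["Kari Nordmann"],
})

# index=False keeps pandas' row index out of the file
info_df.to_csv("company_info.csv", index=False)
```

The same one-liner works for annual_info_df, giving one CSV per extraction pass that can be merged back into the thesis dataset.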