News Clusters

Clustering news articles made simple.

You can use this service to build news aggregator sites similar to https://daily.mk/.

The process has two steps. First, you upload news items via the API, each containing a title, content, category and images. Then you request a clustering over a specific subset of the uploaded items (for example, the latest N items). Once the clustering finishes (usually within seconds), the service invokes a URL that you specify in the request to signal completion and deliver the clustered data.
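
The callback step described above could be received with a small HTTP endpoint. The following is only a minimal sketch: it assumes the service sends a POST request with a JSON body, which is not confirmed here, and the names decode_callback, ClusteringFinishedHandler and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_callback(body: bytes):
    """Decode the callback body, assumed here to be UTF-8 encoded JSON."""
    return json.loads(body.decode("utf-8"))


class ClusteringFinishedHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver for the callback URL given in the clustering request."""

    def do_POST(self):
        # Assumption: the clustered data arrives in the request body.
        length = int(self.headers.get("Content-Length", 0))
        data = decode_callback(self.rfile.read(length))
        # Replace this with whatever your site does with the clustered data.
        print("clustering finished, payload type:", type(data).__name__)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Listen for callbacks until interrupted."""
    HTTPServer(("", port), ClusteringFinishedHandler).serve_forever()
```

Calling serve() starts the listener; the URL under which it is publicly reachable is what you would pass as the callback value in the clustering request.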

Uploading data

We start by adding some articles to the database. A sample file (https://data.world/crawlfeeds/cnbc-news-dataset) can be uploaded with the code below.

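import json

import requests

api_key = "....."

# Load the sample articles from disk.
with open("sample.json", "rb") as f:
    items = json.load(f)

# Upload the articles one by one and print the id assigned to each.
# The authorization header mirrors the clustering request (assumed to be required here too).
for item in items:
    response = requests.post(
        "https://api.news-clusters.com/v1/article",
        json=item,
        headers={"authorization": api_key},
    ).json()
    print(response["_id"])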
   
This results in 593 items being added to the database. If the id field is missing from the input document, it is generated as a hash of the URL.
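
If you prefer to control the ids yourself, you can fill them in before uploading. The sketch below is illustrative only: the hash function the service uses is not stated, so SHA-256 is an assumed choice, and the helper names article_id and with_id as well as the "url" field name are hypothetical.

```python
import hashlib


def article_id(url: str) -> str:
    """Derive a stable id from an article URL (SHA-256 is an assumed choice)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def with_id(item: dict) -> dict:
    """Return a copy of the item with an id filled in when it is missing."""
    if item.get("id") is not None:
        return dict(item)
    return {**item, "id": article_id(item["url"])}
```

Because the id depends only on the URL, uploading the same article twice yields the same id, which makes duplicates easy to detect on the client side.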

Clustering data

The uploaded items can be clustered with the following code:

import requests
api_key = "....."

params = {"latest":3000,"thresh":0.97,"ngrams":2, "type":"request","callback":"https://mysite.com/api/clustering_finished"}
response = requests.post("https://api.news-clusters.com/v1/clustering",json=params,headers={"authorization":api_key}).json()
print(response["jobid"])
#5d30659d5273d2e55793f3bbda4ada7a042b9c3713d34222e1f159fa24c5747374e99c6a60e5d3d9cae07383ec347684fbf40a9371036a5ca5ca94c8e8c5768e

With the sample data, the clustering created 132 clusters of similar items, based on n-grams of length up to 2 and a threshold of 0.97. You can adjust these values to get the best results.
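
To build some intuition for the ngrams and thresh parameters, here is a rough local sketch of threshold-based grouping over word n-grams. It is not the service's algorithm, which is not documented here: the use of word n-grams, cosine distance, single-link grouping, and the reading of thresh as a distance cutoff are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, max_n=2):
    """Count all word n-grams of length 1..max_n in the text."""
    words = text.lower().split()
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return grams


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def cluster(titles, thresh=0.97, max_n=2):
    """Group titles whose pairwise distance is below thresh (single link)."""
    vectors = [word_ngrams(t, max_n) for t in titles]
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent links to the group's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if cosine_distance(vectors[i], vectors[j]) < thresh:
                parent[find(j)] = find(i)

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())
```

Under these assumptions, a lower thresh makes the grouping stricter and produces more, smaller clusters, while a larger max_n lets shared phrases count in addition to shared words.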

When rendered the clusters look like this:
Cluster 07e6d0aa316d5adf257d7f02e349df0c 
     The Materials Of A Trade
     Fast Funds: Final Call
     Halftime Report: Watch Goldman Action Closely Into Close
     All Eyes on Europe
     EU, Iran to Impact Crude Oil Prices, Experts Say
     In The Red For '07
     Money In Motion: Euro Vs. Dollar
     Trade School: Think Global, Trade Local
     Investors on pace to plow a record amount of money into corporate bonds in March
     Update: Meaningful Breakout In Market?
     Dumb Money: Hedge Funds Can't Even Beat Bond Funds
     Your First Move For Friday, Oct. 28
     Pick a Direction and Go With It!
     Sellers Worried about Europe, or Playing Range Game?
     Hot Food Stock Nears All Time High
     Top Inflation Trade: Buy The Dollar?
     Halftime: Get Into Goldman Ahead Of Spin-Off?
     Web Extra: When To Sell A Winner
     The Word On Buyout Speculation, Consolidation In Steel...
     Halftime Pt. 2—Four Strong Fundamentals Plays
     All-time record options bets on volatility spook Wall Street over leverage risk
     Hurricane Season: Protect Portfolio From Unpredictable
     Pops And Drops: Harley Davidson, Mosaic
     Halftime Report: Technical Break Could Signal S&P 950
     Your First Move For Thursday September 29th
     Top Investor Whitney Tilson’s Latest Real Estate Plays
     Bonds Regain Favor After Fannie & Freddie Takeover
     January Barometer
     Terranova "Gasoline Prices Bottomed"
Cluster 3de783016d69852118ae1d832ec6abb0
     Stocks making the biggest moves midday: Activision, Snap, Ford & more
     Stocks End Slightly Higher; Earnings Little Help
     Early movers: AMZN, CLX, AN, HSY, F, AMGN & more
     Oppenheimer says 3Q revenue hurt by economic news
     Stocks making the biggest moves premarket: WBA, BB, MO, ACN & more
     Stocks making the biggest moves after hours: Nike, Broadcom, HD Supply and more
     Stocks making the biggest moves premarket: Ford, GM, Uber, Gap, Costco, Ulta, Dell & more
     Earnings Roundup: July 19
     Stocks making the biggest moves midday: Disney, Gap, Take-Two & more
     Nike earnings leap past forecasts; shares soar
     Early Movers: WMT, KSS, PG, DWA, HAS & more
     After-hours buzz: AIG, CBS, King Digital & more
     After-hours movers: Disney, Aeropostale & more
     Three major pharmaceutical companies just reported earnings — here's how they did
     Google Earnings Fall Short of Expectations
     Morgan Stanley earnings, revenue top expectations
     IBM Earnings Exceed Forecasts; Revenue Is Light
     Cost Cuts Help Blockbuster First-Quarter Profit
     Lowe's sales top Street expectations, boosted by hurricane-related purchases
     Stocks waver on a big earnings day
Cluster 88b8aded4568d89628b67f854f774f25
     Starting a Business: The Romance vs. the Reality
     Discouraged CEOs Set Bar Low for Obama's Job Speech
     Want to start a business? Here's what you need to know
     Surge in Products Being Recalled May Be Numbing Consumers
     Another Hurdle for the Jobless: Credit Inquiries
     One-on-One with Jack Welch
     The gadgets and software that could help us return to the office
     Health-care maze remains for undocumented immigrants
     CNBC Transcript: Berkshire Hathaway CEO Warren Buffett on CNBC’s “Squawk Box” Today
     CEO Blog: Our Greatest New Threat
     CNBC Program Changes for Saturday, 11/15 & Sunday, 11/16
     The 139th Westminster Kennel Club Show (Opening Night) Will Air Live on CNBC on February 16
     Some voters are scared the coronavirus will stop them from casting a ballot
     "Bussiness Nation" Will Air On April 18th
     Wanting Work, but Stuck in Part-Time Purgatory
     CNBC Exclusive: CNBC’s Steve Liesman Interviews Treasury Secretary Jack Lew from CNBC Institutional Investor Delivering Alpha Conference in NYC Today
     The US jet fighter that can do it all—maybe
     Most important 2020 election misinformation threat is not coming from overseas: Facebook former security chief Alex Stamos
     Women speak of pervasive harassment in DC lobbying culture
Cluster 8a0914d813ff5a605df683efe646bc45
     Dow closes 380 points lower, snaps longest monthly win streak since 1959
     Dow soars 168 points to record close on earnings beats by Caterpillar, 3M
     GLOBAL MARKETS-Euro rises on Spain speculation, stocks fall
     Market Insider/Tuesday Look Ahead
     5 things to know before the stock market opens Thursday
     Week Ahead: Traders look for clarity from China, Fed
     Street to take second look at Fed
     Hawkish ECB Official Pushes Euro Up vs. Dollar
     Dow posts 6-day losing streak as media stocks plunge; jobs in focus
     US dollar weakens as Fed measure weighs
     Wednesday Look Ahead: Watch the Financials, Euro
     Stocks Edge Up, Shaking Off Economic News
     S&P 500 closes at new record as chipmakers get a boost from US-China trade truce
     Euro rises vs. yen, dollar after data eases ECB concern
     FOREX-Dollar firm vs yen and euro before U.S. jobs data
     Stocks Pare Losses After Housing-Induced Slide
     Dow Closes Up After 700-Point Swing
     Dollar Down but Pares Losses

This is an example of the top cluster when queried:
{
        "max_date": "2019-03-27T16:51:47", 
        "documents": [
                17, 
                22, 
                34, 
                37, 
                49, 
                97, 
                102, 
                114, 
                126, 
                139, 
                160, 
                201, 
                214, 
                254, 
                287, 
                297, 
                319, 
                359, 
                360, 
                369, 
                375, 
                423, 
                466, 
                482, 
                516, 
                539, 
                545, 
                571, 
                593
        ], 
        "_id": "07e6d0aa316d5adf257d7f02e349df0c", 
        "span_hours": 104176.19944444444, 
        "min_date": "2007-05-09T00:39:49", 
        "size": 29, 
        "categories": [
                [
                        "cnbc", 
                        29
                ], 
                [
                        "articles", 
                        29
                ], 
                [
                        "cnbc tv", 
                        16
                ], 
                [
                        "fast money", 
                        15
                ], 
                [
                        "source:tagname:cnbc us source", 
                        10
                ], 
                [
                        ".....", 
                        "x"
                ]
        ]
}
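
The derived fields in this record can be checked locally. The sketch below assumes that span_hours is the time between min_date and max_date expressed in hours, which matches the values shown above; the helper name span_hours_between is hypothetical.

```python
from datetime import datetime


def span_hours_between(min_date: str, max_date: str) -> float:
    """Return the time between two ISO 8601 timestamps in hours."""
    start = datetime.fromisoformat(min_date)
    end = datetime.fromisoformat(max_date)
    return (end - start).total_seconds() / 3600


# Values copied from the cluster record above.
record = {
    "min_date": "2007-05-09T00:39:49",
    "max_date": "2019-03-27T16:51:47",
    "span_hours": 104176.19944444444,
}

hours = span_hours_between(record["min_date"], record["max_date"])
print(round(hours, 2))
```

In the same record, size equals the number of entries in documents (29), and the categories list holds [name, count] pairs ordered by descending count.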

Extracting data from web pages

The API also provides automated extraction of an article from a web page.

import requests
api_key = "....."

params = {"url":"https://somesite.com/...article/134134"}
response = requests.get("https://api.news-clusters.com/v1/scrape",json=params,headers={"authorization":api_key}).json()
print(response["title"],response["text"],response["date"],response["image"],response["videos"])


The API and the clustering both run in a serverless AWS Lambda environment. For more information, check out our Swagger documentation (coming soon). Contact us at clusters@is.mk to receive a free trial API key and access to the management platform.
© Itea Solutions 2007-2024