This is interesting, but it seems like it is tracking stories with similar headlines, and that's not always how news propagates. Frequently a blogger will read an interview, select a quote from it and write a new headline around the quote they cherry-picked. It used to be common practice to link the original source, but that doesn't always happen.
I have long thought that search engines, news aggregators and social media companies have a journalistic responsibility to favor the original/primary source of every story, but things have not worked out that way. If you can manage to truly develop something like this it would be a valuable tool for rewarding the work of reporting over SEO.
Anyway, please consider that headlines and time stamps do not tell the entire story when it comes to sourcing.
> I have long thought that search engines, news aggregators and social media companies have a journalistic responsibility to favor the original/primary source of every story
This is complicated somewhat by the few that take an already-circulating story and then add their own actual research rather than just rewording and opining.
Go hunt down the lineage of the “AI water use” articles floating around.
It’s all circular.
I don’t know how one is supposed to trust any of the media at this point. Especially “reputable” ones that are just as guilty of circular nonsense as anything else.
If you don’t follow the media, you are uninformed. If you follow it, you are misinformed.
The idea is pretty cool, but it doesn't work super well.
1. I imagine most major news outlets don't have RSS feeds these days.
2. A lot of stuff originates from news agencies, so they don't spread from website to website, but radiate out from the agency.
3. Most of the included sources are pretty small. To draw meaningful conclusions we would need information like popularity, political leaning, nation of origin, etc.
4. The similarity check doesn't appear to do translation. So when news spreads from one country to another we lose the thread.
Also, not all information spreads through public channels, and might not even be/become publicly known. But that doesn't mean news refraction based on textual similarity isn't worthwhile to pursue, as it can reveal a lot about the self-organising principles by which the media operate.
>the similarity check doesn't appear to do translation
This surprises me. The system is based on embeddings. AFAIK embeddings cluster the same concept in different languages in roughly the same place? Maybe it depends on the model (or maybe it's not exact and the clustering cutoff loses it).
Some time ago, I wrote a scientific article in which I applied and modified the SIR model of disease spread to the spread of fake news. I simulated the whole thing in a Watts-Strogatz graph. It would be interesting to see whether the theory and formula are applicable to the real world.
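For readers who haven't seen this kind of simulation, here is a rough sketch of a plain, unmodified SIR process on a Watts-Strogatz small-world graph. It is not the model from the article mentioned above: the function names, the discrete time steps and the parameters `beta` (chance a spreader passes the story to a neighbour per step) and `gamma` (chance a spreader stops spreading per step) are all assumptions made for illustration.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Ring of n nodes, each linked to its k nearest neighbours (k even);
    every lattice edge is rewired to a random node with probability p."""
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            neighbours[i].add((i + j) % n)
            neighbours[(i + j) % n].add(i)
    for i in range(n):
        for j in range(1, k // 2 + 1):
            old = (i + j) % n
            if rng.random() < p and old in neighbours[i]:
                # Avoid self-loops and duplicate edges when picking a new endpoint.
                candidates = [v for v in range(n) if v != i and v not in neighbours[i]]
                if candidates:
                    new = rng.choice(candidates)
                    neighbours[i].discard(old)
                    neighbours[old].discard(i)
                    neighbours[i].add(new)
                    neighbours[new].add(i)
    return neighbours


def simulate_sir(neighbours, beta, gamma, seeds, steps, rng):
    """Discrete-time SIR. S = has not seen the story, I = actively spreading,
    R = has stopped spreading. Returns the (S, I, R) counts at the start of each step."""
    state = {v: "S" for v in neighbours}
    for s in seeds:
        state[s] = "I"
    history = []
    for _ in range(steps):
        values = list(state.values())
        history.append((values.count("S"), values.count("I"), values.count("R")))
        new_state = dict(state)
        for v, st in state.items():
            if st != "I":
                continue
            for u in neighbours[v]:
                if state[u] == "S" and rng.random() < beta:
                    new_state[u] = "I"
            if rng.random() < gamma:
                new_state[v] = "R"
        state = new_state
    return history
```

Calling `simulate_sir(watts_strogatz(200, 6, 0.1, rng), 0.2, 0.1, [0], 50, rng)` with `rng = random.Random(1)` gives a per-step curve that could, in principle, be compared against the observed number of outlets carrying a story over time.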
Interesting project - it’s rare to see news-flow tracking done in real time at this scale.
One thing you may want to stress-test is how stable the clustering remains when stories evolve semantically over a few hours. Embeddings tend to drift as outlets rewrite or localize a piece, and HNSW can sometimes over-merge when the centroid shifts.
A trick that helped in a similar system I built was doing a second-pass “temporal coherence” check: if two articles are close in embedding space but far apart in publish time or share no common entities, keep them in adjacent clusters rather than forcing a merge. It reduced false positives significantly.
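A minimal sketch of what such a second-pass check could look like, assuming each article carries a publish time, an embedding and a set of extracted entities. The `Article` fields, the 0.85 similarity cutoff and the 12-hour window are hypothetical values picked for illustration, not anything taken from the project being discussed.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Article:
    title: str
    published: datetime
    embedding: list
    entities: set = field(default_factory=set)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_merge(a, b, min_similarity=0.85, max_gap=timedelta(hours=12)):
    """Second pass: embedding closeness alone is not enough to merge.
    Veto the merge if the articles are far apart in time or share no entities."""
    if cosine(a.embedding, b.embedding) < min_similarity:
        return False
    too_far_apart = abs(a.published - b.published) > max_gap
    no_shared_entities = not (a.entities & b.entities)
    return not (too_far_apart or no_shared_entities)
```

Pairs that pass the similarity test but fail this check would stay in separate, neighbouring clusters instead of being merged.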
Also curious how you handle deduping syndicated content - AP/Reuters can dominate the embedding space unless you weight publisher identity or canonical URLs.
Overall, really nice work. The propagation timeline is especially useful.
Thanks for your comment, unfortunately it seems that your comments are primarily LLM-generated (for people looking for evidence, the first comments of this user should provide enough evidence, although they’re getting better by fine tuning the prompt). As HN is primarily a place for humans, please do not do this here. Thanks.
The style of the account comments and “about” definitely give off LLM vibes, but it’s not a particularly active account so I feel not a true bot. It’s also possible the account owner just runs their own comment through an LLM before posting it. I do that for most business emails I send these days but they are still reflecting my own thoughts and details.
Without evaluating it thoroughly, and judging just from the description: I really hope this ends up open-sourced. It would be a great help to many well-intentioned parties.
Yep. I have some suspicions on how the information travels lately ( it is kinda both ways depending on the 'type' of news ), but it would absolutely be of general interest.
Cool idea. Given that it transferred ~29 MB when loading, is it safe to assume that the actual page is doing some of the processing? Is the front-end just doing the HNSW, or is it doing the mapping of stories or headlines into vectors, or am I totally off base?
Front-end downstream of clicking on a card doesn't seem to work correctly on every reload... but it works sometimes.
It’s performing really slowly right now. Is it possible to tell if the virality of a news article is organic or manufactured? Organic is when it is produced by a reporting organization, but can you see a direct lineage to the re-spreaders?
Cool idea! On mobile (Chromium on Android) I was confused at first because nothing happened when I tapped any of the stories – until I realized I can zoom out and the info about how the story propagated is at the end of the page.
This seems like it could have an additional use case of labeling each news source left, right, center, neutral/factual and tracking how or if each one releases an article.
Cool idea! What I liked the most was the breakdown into categories like “breaking” and “trending” plus the number of sources.
The view showing the flow with a play animation was a nice concept, but I couldn’t see much value in it. I wonder if you could show more aggregate stats that reveal connections between these different flows; maybe they follow a pattern, like ad-based campaigns or common ownership of the publishing domains, which would explain things. Expanding on this idea, you could even try to set up different scores and metrics based on major groups and on sponsored content versus organic spread.
Curious how you sourced the feeds? It seems to have a bias towards India/Sri Lanka/Iran/Indonesia/Turkey etc., i.e. not the traditional Western-centric reporting. I'm always interested in trying to get a more balanced news diet, so anything you could share around that would be interesting. Most out-of-the-box news tools seem to automatically lean West.
“Traditional western reporting” is traditionally a western thing. That’s only 15% of the global population - so if anything it seems biased towards that.
Really cool project — I like how clearly the flow of information is visualized.
It’s interesting to see how fast certain stories propagate across networks.
Curious: have you noticed patterns in which types of sources tend to spark the fastest spread?
Tried this on iPhone - the category tabs (Sports, World News, Business) get cut off on the right and there's no horizontal scroll indicator, so I didn't realise there were more options at first. The story cards also aren't using the full screen width, leaving wasted space on both sides.
Cool concept though - the source count and "+N" spread metrics give a quick sense of which stories have legs.
I think you will need to filter out wire services like AP and Reuters, as I'm seeing stories that are mostly republished wire stories on random websites.
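One way to spot republished wire copy, sketched here under some loud assumptions: articles arrive as `(url, published_timestamp, text)` tuples, near-verbatim copies are detected with word shingles and Jaccard overlap, and the 5-word shingle size and 0.8 threshold are arbitrary. None of this is known to be how the site actually works.

```python
import re


def shingles(text, size=5):
    """Set of overlapping word n-grams, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def collapse_republished(articles, threshold=0.8):
    """Group near-verbatim copies together, earliest URL first in each group.
    Each later article is compared against the earliest copy of every group."""
    groups = []  # list of (shingles of earliest copy, [urls])
    for url, _published, text in sorted(articles, key=lambda a: a[1]):
        current = shingles(text)
        for representative, urls in groups:
            if jaccard(current, representative) >= threshold:
                urls.append(url)
                break
        else:
            groups.append((current, [url]))
    return [urls for _, urls in groups]
```

Each group could then be shown as a single story credited to its earliest URL, with the republishing sites listed underneath rather than counted as independent sources.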
Just tried it, and clicking on the stories doesn't seem to do anything. Console shows "TypeError: can't access property "time", flowData[Math.min(...)] is undefined"
I feel like there's a huge, necessary civil virtue to this sort of understanding-the-news project.
Thanks for sharing some details. It's cool that HNSW is useful for near-realtime usage. For some reason I had categorized it in my head as having very, very high insertion cost, needing to rebuild worlds to work, but that's not at all a well-founded belief; very cool that it's usable here.
I really hope we see some open source work of this variety. Trying to understand news or even social media is something the world seems so unprepared for. Different subject, sort of, but watching Internet Observatory be dismantled by the current political administration, by disinformation grifters, was a woeful loss of one of the few mirrors that humanity had to understand itself with, to see how we networked.
Anyway, please consider that headlines and time stamps do not tell the entire story when it comes to sourcing.
For example: Your website offers this story (https://hotspotatl.com/6587626/dr-jackie-married-to-medicine...) as first to publish. But right in the text it cites another website BOSSIP as the source of the interview.
Also: there doesn't appear to be a way to link results from your website.
e.g. the recent Mark Kelly story, I went through many articles trying to find a link to the actual video of what he said. couldn’t find it
headlines with “[person said X]” tend to be bullshit
https://en.wikipedia.org/wiki/Sinclair_Broadcast_Group
https://www.youtube.com/watch?v=GvtNyOzGogc
I’m not aware of any that don’t. RSS is alive and well.
Cool website. As others note if this could tie in deep sources like FB, X, Reddit, etc...it would be almost "chain of evidence" canonical.
A view where websites/sources were associated with geo data (possibly involving a globe or map) would be very cool, too.
Afaict, it is the usual topic trending over time, or maybe it is showing direct syndication?
Computing actual derivation flow would be neato, esp precisely at scale vs just the usual embeddings
I’ve been curious how much news starts from social media. So many news stories today are “someone said x on twitter”.
We have been (low-key) working on something similar (more from an academic point of view) for the past few years:
This is the introductory article (open access): "Comparison of news commonality and churn in international news outlets with TARO" https://dl.acm.org/doi/abs/10.1145/3603163.3609062
(Allow me a moment of pride for the student leading this project: the paper won the Ted Nelson Award at ACM Hypertext 2023.)
Where'd you find all those RSS feeds? Have you done anything else with RSS feeds? :)
Also agree with the others this definitely needs interactive graphs!
For any given clip, short or excerpt, find the most complete, unedited version that it was taken from.
FYI layout sometimes breaks like so:
https://i.imgur.com/FXeqB9R.png
Ubuntu 24.04, Firefox 145.0.1 (64-bit)
- https://newscord.org/latest
- https://www.instagram.com/newscord_org
The following headlines look more like spam than factual breaking news:
> Opinion: Operation Holiday serves a critical need in our communities
> Dhru Fusion WooCommerce Integration Plugin
> Powering the Future of Wellness Through Premium Food Supplement Ingredients
That isn't even remotely important, so it's really unreliable.