
#duckdb


Drop #669 (2025-06-23): Monday Morning (Barely) Grab Bag

Rube Goldberg X-traction Pipeline; fplot; Color Everything in CSS

Something for (hopefully) everyone as we start off this brutally hot (in many parts of the northern hemisphere) terminal week of June.

Stay safe out there.


TL;DR

(This is an LLM/GPT-generated summary of today’s Drop using Ollama + Qwen 3 and a custom prompt.)

  • A Rube Goldberg-inspired data pipeline is created to archive X posts into a DuckDB database, using XCancel, Inoreader, and a DuckDB script for automation (https://en.wikipedia.org/wiki/Rube_Goldberg)
  • The {fplot} R package automates the creation of distribution plots by detecting data types and selecting appropriate visualizations, with options for global relabeling of variables (https://lrberge.github.io/fplot/)
  • The CSS-Tricks article “Color Everything in CSS” provides an in-depth look at color spaces, models, and gamuts in modern web development, offering a comprehensive guide to advanced CSS color techniques (https://css-tricks.com/color-everything-in-css/)

Rube Goldberg X-traction Pipeline

I don’t see many mentions of Rube Goldberg in pop-culture settings anymore, which is a shame, since I used to enjoy poring over his contraptions in my younger days. Perhaps the reason for the lack of mentions is that many data pipelines have much in common with those complex, over-“engineered” contraptions.

Case in point for a recent “need” of mine: I wanted a way to store posts from users on X into a DuckDB database, for archival and research purposes. I already use XCancel’s ability to generate an RSS feed for an account/search, which I yank into Inoreader for the archival part (the section header shows the XCancel-generated RSS feed for the White House’s other, even more MAGA, propaganda account).

Inoreader’s API is…not great. It can most certainly be machinated (I have an R package with the function I need in it), but I really wanted a solution that let me just use DuckDB for all the work.

Then I remembered: if you put feeds into an Inoreader folder, you can turn that folder into a JSON feed that gets updated every ~30 minutes or so. The one shown here is for a series of feeds related to what’s going on in the Middle East right now.

With that JSON URL in hand, it’s as basic as:

#!/usr/bin/env bash

# for cache busting
epoch=$(date +%s)

duckdb articles.ddb <<EOQ
LOAD json;
INSTALL shellfs FROM community;
LOAD shellfs;

CREATE TABLE IF NOT EXISTS broadcast_feed_items (
  url VARCHAR PRIMARY KEY,
  title VARCHAR,
  content_html VARCHAR,
  date_published VARCHAR,
  tags VARCHAR[],
  authors JSON
);

-- this is where the update magic happens
INSERT OR IGNORE INTO broadcast_feed_items
FROM read_json('curl -s https://www.inoreader.com/stream/user/##########/tag/broadcast/view/json?since=${epoch} | jq .items[] |')
SELECT url, title, content_html, date_published, tags, authors;

-- thinned-out JSON content for the viewing app
COPY (
  FROM
    broadcast_feed_items
  SELECT
    content_html, -- "title" is useless for the most part since this is an X post
    date_published AS "timestamp",
    regexp_replace(authors.name, '"', '', 'g') AS handle
) TO 'posts.json' (FORMAT JSON, ARRAY true);
EOQ

There are other ways to unnest the data than using jq and the shellfs DuckDB extension, but the more RG the better (for this post)!
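For completeness, one such jq-free route is to let DuckDB fetch and unnest the feed itself. A minimal sketch, assuming the httpfs extension and the same placeholder feed URL as above (depending on what read_json infers, some columns may need explicit casts):

LOAD json;
INSTALL httpfs;
LOAD httpfs;

-- read the JSON Feed document over HTTPS, explode its "items" array
-- into one row per post, then project the same columns as before
INSERT OR IGNORE INTO broadcast_feed_items
SELECT
  item.url,
  item.title,
  item.content_html,
  item.date_published,
  item.tags,
  to_json(item.authors) AS authors -- read_json infers a struct; store it as JSON
FROM (
  SELECT unnest(items) AS item
  FROM read_json('https://www.inoreader.com/stream/user/##########/tag/broadcast/view/json')
);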

So the final path is:

X -> XCancel -> XCancel RSS -> Inoreader -> Inoreader JSON -> jq -> DuckDB

with virtually no code (save for the snippet, above).

I’ve got this running as a systemd timer/service every 30 minutes.
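In case it’s handy, the unit pair for something like that can be tiny. A sketch with hypothetical names and paths:

# /etc/systemd/system/feed-pull.service  (hypothetical name/path)
[Unit]
Description=Pull the Inoreader JSON feed into DuckDB

[Service]
Type=oneshot
ExecStart=/usr/local/bin/feed-pull.sh

# /etc/systemd/system/feed-pull.timer
[Unit]
Description=Run feed-pull every 30 minutes

[Timer]
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target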

Later this week (when I’m done hand-coding it—yes, sans-Claude), I’ll have a Lit-based vanilla HTML/CSS/JS viewer app in one of the Drops.

fplot

(This is an #RStats section, so def move along if that is not your cuppa.)

My daily git-stalking led me to this gem of an R package.

{fplot} (GH) is designed to automate and simplify the visualization of data distributions (something I have to do every. single. day.). Its core mission is to let folks quickly generate meaningful and aesthetically pleasing distribution plots, regardless of the underlying data type (continuous, categorical, or skewed), by making spiffy choices about the appropriate graphical representation for each variable.

Functions in the package detect the nature of your data (e.g., categorical vs. continuous, skewed or not) and automatically select the most suitable plot type. For example, it will not use the same visualization for a categorical variable as it would for a continuous one, and it adapts further if the data is heavily skewed.

Ergonomics are pretty dope, since you only need a single line of code to generate a plot, with the package handling the details of layout and type selection. This is particularly useful for exploratory data analysis or for folks who want quick, visually appealing graphics without extensive customization.

Tools are provided to globally relabel variable names for all plots. This is managed via the setFplot_dict() function, which lets us map cryptic, gosh-awful, or overly technical variable names to more readable labels that will appear in all subsequent plots.

Example usage:

setFplot_dict(c(
  Origin = "Exporting Country",
  Destination = "Importing Country",
  Euros = "Exports Value in €",
  jnl_top_25p = "Pub. in Top 25% journal",
  jnl_top_5p = "Publications in Top 5% journal",
  journal = "Journal",
  institution = "U.S. Institution",
  Petal.Length = "Petal Length"
))

The typical workflow with fplot is straightforward:

  1. Load your data.
  2. Optionally set global variable labels using setFplot_dict().
  3. Call the fplot function on your variable(s) of interest.
  4. The package automatically determines the best plot type and layout for your data.

The same function call can yield different types of plots depending on the data provided, streamlining the process of distributional analysis and visualization.
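As a gut-check, here is roughly what that looks like end to end. This is a minimal sketch assuming {fplot}’s plot_distr() with a formula-plus-data interface; the exact call shapes are my read of the docs, so check the package reference before copying:

library(fplot)

# relabel once, globally (as above), then plot with one line;
# fplot decides how to draw the distribution
plot_distr(~ Petal.Length, iris)

# the same call with a moderator: the distribution split by a categorical variable
plot_distr(Petal.Length ~ Species, iris)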

A gallery of examples and a more detailed walk-through are available on the package’s website.

Color Everything in CSS

The CSS-Tricks article “Color Everything in CSS” offers a comprehensive, up-to-date exploration of how color works in CSS, moving beyond just the basics of color and background-color to cover the deeper technical landscape of color on the web. The article introduces essential concepts like color spaces, color models, and color gamuts, which are foundational for understanding how colors are represented, manipulated, and rendered in browsers today.

We’ve covered many of these individual topics before, but this is a well-crafted, all-in-one piece that does such a good job, I do not wish to steal any of its thunder. Head on over to level up your CSS skills.

FIN

Remember, you can follow and interact with the full text of The Daily Drop’s free posts on:

  • 🐘 Mastodon via @dailydrop.hrbrmstr.dev@dailydrop.hrbrmstr.dev
  • 🦋 Bluesky via https://bsky.app/profile/dailydrop.hrbrmstr.dev.web.brid.gy

☮️

oops, #til how to use #duckdb to query a CSV, generate date ranges, use window functions to backfill data, and use pivot functions to make data that you can easily graph in a spreadsheet.

based upon:
- average solar radiation distribution over the year for my area
- my actual kWh production and usage for the last month (which #homeassistant gives as data-change events, not hourly or daily reporting)
- the kWh I've spent on AC, which I expect to increase over the summer

I'm operating at 85% capacity 🙌
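Not the actual query from that #til, but a hedged sketch of the pieces it names (CSV read, a generated date spine, window-function backfill, then PIVOT); file and column names here are made up:

-- hypothetical readings.csv: ts TIMESTAMP, kwh DOUBLE (sparse change events)
CREATE OR REPLACE TEMP TABLE filled AS
WITH spine AS (
  -- one row per hour for the month; generate_series includes both endpoints
  SELECT generate_series AS hour
  FROM generate_series(TIMESTAMP '2025-05-23', TIMESTAMP '2025-06-23', INTERVAL '1' HOUR)
),
obs AS (
  -- collapse change events to at most one reading per hour
  SELECT date_trunc('hour', ts) AS hour, max(kwh) AS kwh
  FROM read_csv('readings.csv')
  GROUP BY 1
)
SELECT
  spine.hour::DATE AS day,
  extract('hour' FROM spine.hour) AS hr,
  -- backfill gaps: carry the last observed value forward
  last_value(obs.kwh IGNORE NULLS) OVER (ORDER BY spine.hour) AS kwh
FROM spine
LEFT JOIN obs USING (hour);

-- one row per day, one column per hour: drops straight into a spreadsheet chart
PIVOT filled ON hr USING sum(kwh) GROUP BY day;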

Reading those comments on HN regarding #DuckDB and ease of use with geospatial data, I am reminded of a #QGIS feature not a lot of people seem to be aware of:

Virtual layers allow you to query any supported dataset in QGIS using SpatiaLite SQL syntax.

So if you are working with geospatial data, you may already have QGIS installed and don't even need DuckDB to run spatial SQL without further DB setup.
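As a sketch (layer names invented; each loaded layer shows up as a table, with its geometry in a geometry column):

-- count points per polygon across two loaded layers, in plain SpatiaLite SQL
SELECT d.name, count(*) AS n_trees
FROM districts AS d
JOIN trees AS t
  ON ST_Intersects(d.geometry, t.geometry)
GROUP BY d.name;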

news.ycombinator.com/item?id=4

DuckDB is probably the most important geospatial software of the last decade | Hacker News

howdy, #hachyderm!

over the last week or so, we've been preparing to move hachy's #DNS zones from #AWS route 53 to bunny DNS.

since this could be a pretty scary thing -- going from one geo-DNS provider to another -- we want to make sure *before* we move that records are resolving in a reasonable way across the globe.

to help us do this, we've started a small, lightweight tool that we can deploy to a provider like bunny's magic containers to quickly get DNS resolution info from multiple geographic regions. we then write this data to a backend S3 bucket, at which point we can use a tool like #duckdb to analyze the results and find records we need to tweak to improve performance. all *before* we make the change.

then, after we've flipped the switch and while DNS is propagating -- :blobfoxscared: -- we can watch in real-time as different servers begin flipping over to the new provider.

we named the tool hachyboop and it's available publicly --> github.com/hachyderm/hachyboop

please keep in mind that it's early in the booper's life, and there's a lot we can do, including cleaning up my hacky code. :blobfoxlaughsweat:

attached is an example of a quick run across 17 regions for a few minutes. the data is spread across multiple files but duckdb makes it quite easy for us to query everything like it's one table.
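that last bit is just DuckDB's multi-file reading: point a reader at a glob and the whole tree of per-region files queries like a single table. a sketch (hachyboop's real output layout and format may differ):

-- every region's output as one logical table; the filename column
-- records which file (and so which region/run) each row came from
SELECT *
FROM read_json_auto('results/**/*.json', filename = true)
LIMIT 10;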