Konachan.com is one of the well-known anime/game/CG imageboards, devoted mostly to wallpapers and landscape images.
It has a good selection of images and a strong community that scores images adequately.
Its verbose tagging system is very close to the danbooru/safebooru standards.
Konachan also offers a wide variety of picture composition and quality - from almost-empty or
background-dominated wallpapers to clear art, from schematic line art to pure full-color digital.
That is why Konachan is a good source for investigating non-photographic images and their metadata
and for building tools to auto-classify all of that (or simply to make your eyes happy).
This release covers posts from the start up to ID 310.100 (04.07.2020) and contains:
- 142.756 images in “sample” quality, or the “original” files where those were smaller than a sample
  there were two sampling policies:
  - max width = 2000 px up to ID=91817 (30.12.2010), when a large share of images were sampled
  - max width = 1500 px since then, when almost all images are sampled
  because most samples have a luxurious (99-100%) JPEG quality, I recompressed them (with ImageMagick mogrify) to 92%
- full JSON metadata for 246.421 posts, i.e. all posts except those that could not be grabbed (deleted posts, etc.)
  - with a simple Python script showing how to grab it
  - with a pretty-printed example to illustrate structure and content
- additional TSV (tab-separated text) metadata
  - key parameters of the 143.921 images initially grabbed, including some calculated stats
    ~ derived from JSON
    ~ computed with ImageMagick over the above-mentioned “samples”
  - tag-to-post relations (2.736.167) as a separate table
    ~ non-ASCII symbols and symbols unsuitable for file names are replaced or suppressed
    ~ used for file renaming wherever possible
- some database (Oracle SQL), shell (Windows BAT) and Python scripts
  - data structure definitions
  - key processing steps in the database, some query examples
  - tools for computing
  - not completely “ready to use”, but the key “building blocks”
  - a more detailed readme for DATA
- BONUS: a usage example that explores the “mogrify effect” of changing JPEG quality from 99-100% to 92%
  - mogrify parameters chosen based on research
  - 99.759 images affected, size reduced from 92 to 35 GBytes (not bad, is it?)
  - only a few specific images, e.g. ID=120404, got visible artifacts
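
The recompression step mentioned in the list above can be sketched roughly as follows. This is only an illustration, not the script shipped with the release: it assumes ImageMagick's `mogrify` is on the PATH (ImageMagick 7 installs may need `magick mogrify` instead), it passes nothing but `-quality 92`, whereas the release's real parameters are in its own scripts, and the helper names `build_mogrify_cmd`, `recompress` and `saved_fraction` are made up for this example.

```python
import subprocess
from pathlib import Path


def build_mogrify_cmd(paths, quality=92):
    """Return the argument list for an in-place JPEG recompression."""
    # mogrify overwrites its input files, so work on copies if the
    # untouched samples still matter.
    return ["mogrify", "-quality", str(quality), *map(str, paths)]


def recompress(folder, quality=92):
    """Recompress every .jpg directly inside `folder`; return the file count."""
    jpgs = sorted(Path(folder).glob("*.jpg"))
    if jpgs:
        subprocess.run(build_mogrify_cmd(jpgs, quality), check=True)
    return len(jpgs)


def saved_fraction(before_gb, after_gb):
    """Share of disk space saved, e.g. for the 92 -> 35 GBytes figure above."""
    return 1 - after_gb / before_gb
```

With the sizes quoted above, `saved_fraction(92, 35)` comes out at roughly 0.62, i.e. about 62% of the space is saved.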
The release includes “samples” only for posts with “good enough” images, i.e. those worth getting the originals for:
- file_ext in ('jpg','png')
- greatest(image_height,image_width)>=1200 and least(image_height,image_width)>=1000 -- not too small
  and image_height * image_width>=1310720 -- (1280*1024)
  and image_width / image_height between 0.4 and 2.1 -- not too disproportional
- rating in ('s','q'), in separate folders/zips
  - some (272) explicit-like samples excluded from ‘questionable’
- grabbed files are renamed to contain ID - up_to_3_copyrights ~ up_to_5_characters (up_to_2_artists)
  - tags are concatenated via “+”, spaces are replaced with underscores
  - maximum file name length is 220 symbols; character tags may be truncated if too long
  - this enables searching and sampling at the file-system level (with XCOPY, UNZIP, etc.)
- some gentle deduplication was done
  - some (893) images were thrown out, preferring the ‘s’ rating and newer posts
  - only where there was no visible artistic difference, though maybe technical ones
  - a lot of (~2000?) similar images are left
- no filter is applied by score and/or tags
  - the initial idea was to include only “the best of” and to exclude “banned tags”
  - the border of “acceptable quality” turned out to be fuzzy
  - user score vs tags vs metadata is left as a field for research
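
As a rough Python counterpart to the selection conditions and the deduplication preference listed above, something like the sketch below could be used. It assumes each post is a dict keyed by the same column names as in the conditions (`file_ext`, `image_width`, `image_height`, `rating`) plus an `id`; the function names are hypothetical, and neither the manual exclusion of the 272 explicit-like samples nor the way duplicates were detected is modelled.

```python
def is_good_enough(post):
    """Mirror the format, size, proportion and rating conditions listed above."""
    w, h = post["image_width"], post["image_height"]
    return (
        post["file_ext"] in ("jpg", "png")
        and max(w, h) >= 1200            # greatest(...) >= 1200
        and min(w, h) >= 1000            # least(...) >= 1000, so h is never 0 below
        and w * h >= 1310720             # 1280 * 1024
        and 0.4 <= w / h <= 2.1          # not too disproportional
        and post["rating"] in ("s", "q")
    )


def pick_keeper(duplicates):
    """Among posts already judged duplicates, prefer rating 's', then the newest ID."""
    return max(duplicates, key=lambda p: (p["rating"] == "s", p["id"]))
```

For example, a 1920x1200 `jpg` rated `s` passes the filter, while a 3000x1000 image fails on proportions.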
Images are archived in groups of 10.000 IDs: NNxxxx.[Safe/Questionable][Files/Samples], NN=00…30
I recommend using FastStone MaxView to browse images inside the zips.
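
To make the file naming and archive grouping described above more concrete, here is a minimal sketch. Several details are assumptions rather than facts from the release: that the separators are literally " - ", " ~ " and parentheses, that the 220-symbol limit excludes the file extension, that only the character part is cut when a name is too long, and that NN is simply the ID divided by 10.000. The function names and the sample tags are invented for illustration.

```python
MAX_NAME_LEN = 220  # limit from the text; assumed here to exclude the extension


def join_tags(tags, limit):
    """Join at most `limit` tags with '+', replacing spaces with underscores."""
    return "+".join(tag.replace(" ", "_") for tag in tags[:limit])


def build_file_name(post_id, copyrights, characters, artists):
    """Build 'ID - copyrights ~ characters (artists)' within MAX_NAME_LEN."""
    head = f"{post_id} - {join_tags(copyrights, 3)} ~ "
    tail = f" ({join_tags(artists, 2)})"
    chars = join_tags(characters, 5)
    # Only the character part is cut when the name would be too long.
    room = max(MAX_NAME_LEN - len(head) - len(tail), 0)
    return head + chars[:room] + tail


def archive_stem(post_id, rating, kind="Samples"):
    """Map a post to its 'NNxxxx.[Safe/Questionable][Files/Samples]' archive."""
    folder = {"s": "Safe", "q": "Questionable"}[rating]
    return f"{post_id // 10000:02d}xxxx.{folder}{kind}"


# Made-up tags, only to show the shape of the result.
print(build_file_name(123456, ["some series"],
                      ["first character", "second character"], ["some artist"]))
print(archive_stem(91817, "s"))
```

Because the ID comes first and tags are joined with predictable separators, a plain wildcard search over file names is enough to pull out all files for a given copyright or character.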
HERE is a release for yande.re created the same way.
THERE ARE some rips on Nyaa tracker for Safebooru and Zerochan. No nipples there.