mirror of
https://github.com/LCTT/TranslateProject.git
synced 2024-12-29 21:41:00 +08:00
7c2fa80d7c
sources/tech/20191205 Challenge- find Twitter memes with suffix arrays.md
221 lines
13 KiB
Markdown
221 lines
13 KiB
Markdown
[#]: collector: (lujun9972)
|
||
[#]: translator: ( )
|
||
[#]: reviewer: ( )
|
||
[#]: publisher: ( )
|
||
[#]: url: ( )
|
||
[#]: subject: (Challenge: find Twitter memes with suffix arrays)
|
||
[#]: via: (https://jvns.ca/blog/twitter-memes-challenge/)
|
||
[#]: author: (Julia Evans https://jvns.ca/)
|
||
|
||
Challenge: find Twitter memes with suffix arrays
|
||
======
|
||
|
||
This challenge is a mix of data analysis and using fun algorithms! It’s the second challenge in a a short series of programming challenge I’m writing with Julian. (the [first one was to write a tiny fun window manager][1])
|
||
|
||
Twitter has a lot of memes. For example, if you search Twitter for [Flight attendant: is there a doctor on this flight?][2], you’ll find a bunch of tweets making jokes like this:
|
||
|
||
```
|
||
Flight Attendant: is there a doctor on board?
|
||
Parent: *nudging* That should've been you
|
||
Me: Not now, this is serious
|
||
Parent: Not asking for a hacker to help, are they?
|
||
Me: AAAAAAAA\x00\xd0X?\xfc\x7fBBBBj\x0bX\x99Rfh-p\x89\xe1Rjhh/bash/bin\x89\xe3RQS\x89\xe1\xcd\x80
|
||
Parent:~#
|
||
```
|
||
|
||
or if you search [as a kpop fan][3] there are thousands of these:
|
||
|
||
```
|
||
me as a kpop fan
|
||
|
||
- kpop fan age: 10 years
|
||
- first group ever stan: super junior
|
||
- current ult groups: iKON, X1, Day6
|
||
- number of albums: >20
|
||
- concerts attended: 6
|
||
- lightsticks owned: 2
|
||
```
|
||
|
||
So! Suppose you have a million tweets from the last 2 days. How do you find the jokes / quizzes / memes people are playing with on Twitter?
|
||
|
||
### Challenge: find the twitter memes in 1 million tweets
|
||
|
||
This is a pretty open ended challenge and you can do it any way you want. Here’s a [SQLite database with 1.2 million tweets][4], collected from the [twitter streaming api][5] over 2 days. It’s 250MB (70MB compressed), it only has English tweets. It excludes retweets and many tweets that are generated by bots.
|
||
|
||
The challenge: find at least 5 Twitter memes using that dataset.
|
||
|
||
### memes as common substrings
|
||
|
||
The idea here is that memes are substrings like “me as a kpop fan” that many different people are using. The tricky thing is that you don’t really know how long those substrings will be, and maybe you’re interested in phrases of different lengths.
|
||
|
||
You can probably do this challenge without using anything fancy (with a hashmap of phrases or something) but I think it’s a nice opportunity to play with a fun data structure: suffix arrays! So let’s talk about what those are.
|
||
|
||
### suffix arrays: sort all suffixes
|
||
|
||
Suffix arrays sort all suffixes of a string. For example, here’s the suffix array for “plantain” which has the suffixes plantain, lantain, antain, ntain, tain, ain, in, n.
|
||
|
||
```
|
||
ain
|
||
antain
|
||
in
|
||
lantain
|
||
n
|
||
ntain
|
||
plantain
|
||
tain
|
||
```
|
||
|
||
Representing this as a list of strings would be very inefficient (quadratic space), so instead we replace each suffix with the index of its first character in the original string – `[5,2,6,1,7,3,0,4]`.
|
||
|
||
```
|
||
5 (ain)
|
||
2 (antain)
|
||
6 (in)
|
||
1 (lantain)
|
||
7 (n)
|
||
3 (ntain)
|
||
0 (plantain)
|
||
4 (tain)
|
||
```
|
||
|
||
Here’s a real example of what a suffix array of 1 million tweets concatenated looks like. This is an excerpt from the middle of the suffix array, with some of the suffixes that start with `A little`.
|
||
|
||
```
|
||
...
|
||
A little distracted for a bit ...what do i do w my life hon.........
|
||
A little exercise I did this afternoon. #comics #art #clip.........
|
||
A little extra Christmas Cash on me! Good Luck to everyone!.........
|
||
A little girl in Savannah, Ga., appears to be the 38th huma.........
|
||
A little heavy on the smut t… https://t.co/nvoxE7SNjTI wa.........
|
||
A little in state battle tonight. #nova vs #penn. two very .........
|
||
A little kiss...” one more time I’m going to vomit. #TT.........
|
||
A little late catching up on last nights @GoodDoctorABC. On.........
|
||
A little less bling never hurt anyone! Next project...🎄 .........
|
||
A little more intensity to augment their talent and a coupl.........
|
||
A little more time, because I have never lived really - Os.........
|
||
A little mor… https://t.co/kcq3zf9jgeWe love MX ❤️<F0><9F><A7>.........
|
||
A little over 50k! Can We Guess How Much Is In Your Account.........
|
||
A little ray of joy & light in the midst of these very .........
|
||
A little refreshment… https://t.co/HgX8PmYwPIThank you @L.........
|
||
A little respect goes a long way. .........
|
||
A little salt in d country's troubled legal system“Grant................
|
||
A little snow & people lose all common senseromantic st...............
|
||
A little sun for the soul @realfreewebcams https://t.co/3CB...............
|
||
A little sunkissed moment for y’all. ...............
|
||
....
|
||
```
|
||
|
||
Again, this is actually represented by a bunch of integer indexes into a concatenated string of all the tweets, like `[18238223, 1921812, ...]` so it’s a LOT more memory efficient than actually repeating all those strings.
|
||
|
||
### suffix arrays let you find common substrings!
|
||
|
||
So what does this have to do with Twitter memes? Well, we can basically
|
||
|
||
1. concatenate all tweets into a big string
|
||
2. make a suffix array of that string
|
||
3. iterate through the suffix array and notice when you see a lot of repeated substrings, like here:
|
||
|
||
|
||
|
||
```
|
||
me as a kpop fan ✨kpop fan age: 15 y/o ✨first group ever stan: blackpink ✨current ult groups: btxt ✨number of albu… https://t.co/24diHX9sLm
|
||
me as a kpop fan ⭐k-pop fan age: 12 y/o ⭐first group ever stan: bts ⭐current ult gps: bts and txt ⭐number of albu… https://t.co/8R95roQXoE
|
||
me as a kpop fan ⭐k-pop fan age: 14 y/o ⭐first group ever stan: girls generation ⭐current ult gp: txt ⭐number of a… https://t.co/010hLuJscF
|
||
me as a kpop fan ⭐k-pop fan age: 14-16 y/o ⭐first group ever stan: bts ⭐current ult gps: bts txt ⭐number of albums… https://t.co/0fDcxZGRrh
|
||
me as a kpop fan ⭐k-pop fan age: 15 y/o ⭐first group ever stan: blackpink ⭐current ult gps: txt ⭐number of albums… https://t.co/d8zZL83TvV
|
||
me as a kpop fan 🌸 k-pop fan age: 12 years old 🌸 first group ever stan: bts 🌸 current ult gps: bts & wanna one 🌸 n… https://t.co/22R1nJpwNX
|
||
me as a kpop fan 🌸k-pop fan age: 10 🌸first group ever stan: 2pm 🌸current ult gps: skz,got7,itzy,twice, 🌸number of… https://t.co/mAluaP2yxH
|
||
me as a kpop fan 🌸k-pop fan age: 11 yo 🌸first group ever stan: beast 🌸current ult gps: ateez 🌸number of albums: 1… https://t.co/qxtFHG9HDg
|
||
me as a kpop fan 🌸k-pop fan age: 11 🌸first group ever stan: bts 🌸current ult gps: bts and ateez 🌸number of albums:… https://t.co/mKXlkrBBtC
|
||
me as a kpop fan 🌸k-pop fan age: 13 (now im 19) 🌸first group ever stan: snsd 🌸current ult gps: nct day6 aoa mamam… https://t.co/8XyQ3r5hwz
|
||
me as a kpop fan 🌸k-pop fan age: 13 years 🌸first group ever stan: 2pm,suju,bigbang 🌸current ult gps: bts,tbz,ateez… https://t.co/Zs1nQQz6Lt
|
||
me as a kpop fan 🌸k-pop fan age: 14 (2005) 🌸first group ever stan: super junior 🌸current ult gps: exo, gfriend, rv… https://t.co/vgmhe2vFMY
|
||
me as a kpop fan 🌸k-pop fan age: 14 y/o 🌸first group ever stan: nct dream 🌸current ult gps: svt and,,*insert stan… https://t.co/I38Ui69PvL
|
||
me as a kpop fan 🌸k-pop fan age: 15 y/o 🌸first group ever stan: 5sos 🌸current ult gps: bts and 5sos also some ggs… https://t.co/61ZmRkzmdl
|
||
me as a kpop fan 🌸k-pop fan age: 15 y/o 🌸first group ever stan: bts 🌸current ult gps: SVT, GOT7, Day6 🌸number of… https://t.co/16SWb3mSPg
|
||
me as a kpop fan 🌸k-pop fan age: 18 🌸first group ever stan: suju & soshi 🌸current ult gps: snsd & izone 🌸number of… https://t.co/SmSBFqJnGk
|
||
me as a kpop fan 🌸k-pop fan age: 19 y/o marupok 🌸first group ever stan: APINK 🌸current ult gps: SEVENTEEN 🌸number… https://t.co/StYjxr6uq9
|
||
me as a kpop fan 🌸k-pop fan age: 19 🌸first group ever stan: SuJu 🌸current ult gps: SuJu, SF9, SKZ, VIXX, ONEUS, NO… https://t.co/2o2DulCY5b
|
||
```
|
||
|
||
### suffix arrays also enable really fast search
|
||
|
||
As an aside, the reason I got interested in suffix arrays in the first place was actually not for finding Twitter memes at all but for search.
|
||
|
||
I’ve spent a lot of time using Nelson Elhage’s [livegrep][6] at work to search code. It creates a suffix array using the divsufsort library. He has a blog post [Regular Expression Search with Suffix Arrays][7] where he talks about some of the implementation details.
|
||
|
||
The reason suffix arrays work for fast search is basically that if you’re looking for the string `A little`, you can do a binary search over the suffix array to find every instance of `A little` in your dataset. Binary searches are extremely fast so every search is guaranteed to run very quickly (in less than a microsecond I believe). What livegrep does is more complicated than that because it does a regular expression search, but that’s the idea to start.
|
||
|
||
There’s another blog post [How to use suffix arrays to combat common limitations of full-text search][8] applying suffix arrays to searching through a patent database. In that example, like with code search, the patent officers want to search patents for exact strings.
|
||
|
||
### How do you make a suffix array?
|
||
|
||
You can use an existing suffix array library, for example [index/suffixarray in Go][9], which is what I used, or [divsufsort][10]. There are [Python bindings for divsufsort][11].
|
||
|
||
If you’re more excited about the data structures/algorithms aspect of suffix arrays you can also implement a suffix array-building algorithm yourself! I did not do this but you can see an implementation of [qsufsort here in Go][12]. That implementation links to a paper. There are lots of algorithms for constructing suffix arrays –`sais` and `divsufsort` are a couple of others.
|
||
|
||
### 5 or so hours, 100 lines of Go
|
||
|
||
As always with these challenges, I did this one to make sure that it’s both doable in a reasonable amount of time and fun for at least one person (me).
|
||
|
||
I did this one in about 5 hours and 100 lines of Go using the suffixarray implementation in the Go standard library, with a bit of bash shell scripting to postprocess the results. This is a messy data analysis challenge – as an example of a messy thing, Spotify released their end-of-2019 results while I was building the dataset and so there are a lot of tweets generated by the Spotify app.
|
||
|
||
My results ended up looking something like this:
|
||
|
||
```
|
||
5 an Aries and that’s why I gotta
|
||
5 an Aries and that’s why I am so
|
||
5 an Aquarius and that’s why I
|
||
5 AM SO PROUD OF YOU
|
||
5 am not a fan of
|
||
5 am I the only one who
|
||
5 am going to have to
|
||
```
|
||
|
||
Then I sifted through them pretty manually to find the Twitter memes.
|
||
|
||
### suffix arrays are used in bioinformatics
|
||
|
||
This “find twitter memes using suffix arrays” approach is a silly thing but it does have some relationship to reality – DNA sequences are basically really long strings, and biologists need to find patterns in them, and they sometimes use suffix arrays to do it.
|
||
|
||
I looked up [packages in Debian that use libdivsufsort][13] and I found [infernal][14]:
|
||
|
||
> Infernal (“INFERence of RNA ALignment”) is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs). A CM is like a sequence profile, but it scores a combination of sequence consensus and RNA secondary structure consensus, so in many cases, it is more capable of identifying RNA homologs that conserve their secondary structure more than their primary sequence.
|
||
|
||
### email me the twitter memes you find if you do this!
|
||
|
||
If you do this exercise, I’d love it if you emailed me ([[email protected]][15]) with the twitter memes you found and/or your code! I found about 8 but I’m sure there are more.
|
||
|
||
I’ll publish any solutions I get (unless you don’t want me to publish your solution – just let me know!).
|
||
|
||
Thanks to Julian for discussing suffix arrays and suffix trees and trigram indexes with me at length, and to Kamal who had the idea of using suffix arrays to find Twitter memes.
|
||
|
||
--------------------------------------------------------------------------------
|
||
|
||
via: https://jvns.ca/blog/twitter-memes-challenge/
|
||
|
||
作者:[Julia Evans][a]
|
||
选题:[lujun9972][b]
|
||
译者:[译者ID](https://github.com/译者ID)
|
||
校对:[校对者ID](https://github.com/校对者ID)
|
||
|
||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||
|
||
[a]: https://jvns.ca/
|
||
[b]: https://github.com/lujun9972
|
||
[1]: https://jvns.ca/blog/2019/11/25/challenge--make-a-bouncy-window-manager/
|
||
[2]: https://twitter.com/search?q=%22attendant%3A%20is%20there%20a%20doctor%20on%20%22&src=typed_query&f=live
|
||
[3]: https://twitter.com/search?q=%22as%20a%20kpop%20fan%22&src=typed_query
|
||
[4]: https://www.dropbox.com/s/5xnaf1mbdr1424p/twitter.db.gz?dl=0
|
||
[5]: https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data
|
||
[6]: https://livegrep.com/search/linux
|
||
[7]: https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/
|
||
[8]: https://blog.griddynamics.com/using-suffix-arrays-to-fix-limitations-of-full-text-search/
|
||
[9]: https://golang.org/pkg/index/suffixarray/
|
||
[10]: https://github.com/y-256/libdivsufsort
|
||
[11]: https://github.com/debatem1/pydivsufsort
|
||
[12]: https://github.com/golang/go/blob/release-branch.go1.12/src/index/suffixarray/qsufsort.go
|
||
[13]: https://codesearch.debian.net/search?q=divsufsort
|
||
[14]: http://eddylab.org/infernal/
|
||
[15]: https://jvns.ca/cdn-cgi/l/email-protection
|