We collected 10k hours of neuro-language data in our basement
Posted by nee1r 1 day ago
Comments
Comment by n7ck 1 day ago
Since I joined, we've gone from <1k hours to >10k hours, and I've been really excited by how much our setup has changed. I've been implementing lots of improvements across the data pipeline and the operations side. Now that we train lots of models on the data, the model results also inform how we collect data (e.g. we care a lot less about noise now that we have more data).
We're definitely still improving the whole system, but at this point, we've learned a lot that I wish someone had told us when we started, so we thought we'd share it in case any of you are doing human data collection. We're all also very curious to get any feedback from the community!
Comment by internet_points 1 day ago
Comment by n7ck 1 day ago
Comment by SubiculumCode 1 day ago
Comment by paparicio 23 hours ago
I have dreamed many times about the same story, but with Apple or Epic Games. But they have millions of human beings testing their products FOR FREE all over the world, hahahaha
Comment by xg15 1 day ago
But it feels eerie to read a detailed story of how they built and improved their setup and what obstacles they encountered, complete with photos, without any mention of who is doing the things we are reading about. There is no mention of the staff or even the founders on the whole website.
I had a hard time judging how large this project even is. The homebuilt booths and trial-and-error workflow sound like a three-person garage startup, but the bookings schedule suggests a larger team.
(At least there is an author line on that blog post. Had to google the names to get some background on this company)
You should consider an "about us" page :)
Comment by rio-popper 1 day ago
Comment by xg15 1 day ago
Comment by in-silico 1 day ago
Though, I suppose if the model had LLM-like context where it kept track of brain data and speech/typing from earlier in the conversation, then it could perform in-context learning to adapt to the user.
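Roughly what I'm imagining, as a toy sketch (names, shapes, and the interleaving scheme here are all made up by me, not anything from the post):

    # Hypothetical: interleave earlier (brain, text) pairs from this
    # conversation with the current window, so a sequence model can
    # adapt to the user in-context.
    import numpy as np

    def build_context(history, current_eeg, max_pairs=8):
        """history: list of (eeg_features, text) from earlier in the session;
        current_eeg: features for the window we want decoded next."""
        tokens = []
        for eeg, text in history[-max_pairs:]:
            tokens.append(("EEG", eeg))
            tokens.append(("TEXT", text))
        tokens.append(("EEG", current_eeg))  # model predicts the next TEXT
        return tokens

    # toy usage: two earlier pairs, then the window to decode
    history = [(np.random.randn(64), "hello"), (np.random.randn(64), "world")]
    context = build_context(history, np.random.randn(64))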
Comment by clemvonstengel 1 day ago
We only got any generalization to new users after we had >500 individuals in the dataset, fwiw. There are some interesting MRI studies finding a similar thing: once you have enough individuals in the dataset, you start seeing generalization.
Comment by titzer 1 day ago
Comment by NoraCodes 1 day ago
Comment by clemvonstengel 1 day ago
Comment by ricudis 1 day ago
Comment by asgraham 1 day ago
Have you played at all with thought-to-voice? Intuitively I’d think EEG readout would be more reliable for spoken rather than typed words, especially if you’re not controlling for keyboard fluency.
Comment by clemvonstengel 1 day ago
It does generalize between typed and spoken, i.e. it does much better on spoken decoding if we've also trained on the typing data, which is what we were hoping to see.
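To be clear about the shape of that setup, here's a hedged toy sketch (not our actual code; the architecture, dimensions, and names are invented for illustration):

    import torch
    import torch.nn as nn

    class SharedEEGDecoder(nn.Module):
        """One shared EEG encoder feeding separate typed/spoken heads,
        so typing data can transfer to speech decoding (and vice versa)."""
        def __init__(self, n_channels=64, d_model=256, vocab_size=8000):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_channels, d_model), nn.GELU(),
                nn.Linear(d_model, d_model),
            )
            self.typed_head = nn.Linear(d_model, vocab_size)
            self.spoken_head = nn.Linear(d_model, vocab_size)

        def forward(self, eeg, mode):
            h = self.encoder(eeg)  # shared representation across modes
            head = self.typed_head if mode == "typed" else self.spoken_head
            return head(h)

    model = SharedEEGDecoder()
    logits = model(torch.randn(2, 100, 64), mode="spoken")  # (batch, time, vocab)

The shared encoder is what lets the typed hours help the spoken decoding.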
Comment by Terretta 1 day ago
Both of these modes are incredibly slow thinking. Consciously shifting from thinking in concepts to thinking in words is like slamming on the brakes for a school zone on an autobahn.
I've gathered most people think in words they can "hear in their head", most people can "picture a red triangle" and literally see one, and so on. Many folks who are multi-lingual say they think in a language, or dream in that language, and know which one it is.
Meanwhile, some people think less verbally or less visually, perhaps not verbally or visually at all, with no language (words) involved.
A blog post shared here last month discussed a person trying to access this conceptual mode, which he thinks is like "shower thoughts" or physicists solving things in their heads while staring into space, except "under executive function". He described most of his thoughts as words he can hear in his head, with these concepts more like vectors. I agree with that characterization.
I'm curious what % of folks you've scanned may be in this non-word mode, or if the text and voice requirement forces everyone into words.
Comment by clemvonstengel 1 day ago
One thing that's particularly exciting here is that the model often gets the high-level idea correct, without getting any words correct (as in some of the examples above), which suggests that it is picking up the idea rather than the particular words.
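One way to quantify that gap, as a sketch (not our actual eval; assumes the sentence-transformers library and its all-MiniLM-L6-v2 model):

    # Score a decoded sentence two ways -- exact word overlap vs.
    # embedding similarity -- to separate "got the words" from "got the idea".
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def word_overlap(pred, ref):
        p, r = set(pred.lower().split()), set(ref.lower().split())
        return len(p & r) / max(len(r), 1)

    def idea_similarity(pred, ref):
        a, b = embedder.encode([pred, ref])
        return float(util.cos_sim(a, b))

    ref = "the room seemed colder"
    pred = "there was a breeze, even a gentle gust"
    print(word_overlap(pred, ref))     # zero: no shared content words
    print(idea_similarity(pred, ref))  # higher when the idea is related

A decode that scores low on the first metric but high on the second is the "right idea, wrong words" case.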
Comment by Terretta 1 day ago
Are you pursuing an idea of how to help people like this author* access this mode that some of us are always in unless kicked out of it by the need for words?
Very needed right now — the opposite of the YouTube-ization of idea transfer.
It doesn't seem clear this is accessible without other changes in wiring? The inability to "picture" things as visuals seems to swap out for "conceptualizing" things in -- well, I don't have words for this.
An attempt from that essay:
This is not what Hadamard is talking about when he describes the wordless thought of the mathematicians and researchers he has surveyed. Instead, what they seem to be doing is something similar to this subconscious, parallelized search, except they do it in a “tensely” focused way.
The impression I get is that Hadamard loads a question into his mind (either in a non-verbal way, or by reading a mathematical problem that has been written by himself or someone else), and then he holds the problem effortfully centered in his mind. Effortfully, but wordlessly, and without clear visualizations. Describing the mental image that filled his mind while working on a problem concerning infinite series for his thesis, Hadamard writes that his mind was occupied by an image of a ribbon which was thicker in certain places (corresponding to possibly important terms). He also saw something that looked like equations, but as if seen from a distance, without glasses on: he was unable to make out what they said.
I’m not sure what is going on here.
* https://www.henrikkarlsson.xyz/p/wordless-thought
A couple of this author's speculations aren't how I'd say it works when this is one's default mode, but most are in the neighborhood. Of what I've read by people who think the way the author does (which seems to be most people), he comes the closest.
Comment by asgraham 1 day ago
Comment by n7ck 1 day ago
Comment by ag8 1 day ago
Comment by n7ck 1 day ago
That said, the way to 10-20x data collection would be to open a couple of other data collection centers outside SF, in high-population cities. Right now, there's a big advantage in having the data collection totally in-house: it's much easier to debug and improve while we're this small. But now that we've mostly worked out the process, it should be very straightforward to replicate the entire ops/data pipeline across 3-4 parallel data collection centers.
Comment by nullbyte808 1 day ago
Comment by Gormisdomai 1 day ago
“the room seemed colder” -> “there was a breeze, even a gentle gust”
Comment by CobrastanJorji 1 day ago
Comment by rio-popper 1 day ago
Comment by jcims 1 day ago
Very interesting!
Comment by ninapanickssery 1 day ago
Comment by accrual 1 day ago
* A ceiling-based pulley system could help take the physical load off the users and may allow for increased sensor density. Some large/public VR setups do this.
* I'm sure you considered it, but a double-conversion UPS might reduce the noise floor of your sensors and could potentially support multiple booths. Expensive though, and it's already mentioned that data quantity > quality at this stage. Maybe a future fine-tuning step could leverage this.
Cool write-up and I hope to see more in the future!
Comment by rio-popper 1 day ago
Comment by nullbyte808 1 day ago
Comment by n7ck 1 day ago
Comment by paparicio 23 hours ago
What you are trying to do is BIG, I love it. And I hope you get to more than 1M in a few months!
Keep pushing team!!!
Comment by richardfeynman 1 day ago
A couple of questions: What's the relationship between the number of hours of neurodata you collect and the quality of your predictions? Does it help to get less data from more people, or more data from fewer people?
Comment by n7ck 1 day ago
Comment by richardfeynman 1 day ago
For a given amount of data, is it better to have more people with less data per person or fewer people with more data per person?
Comment by clemvonstengel 1 day ago
For a given amount of data, whether you want more or less data per person really depends on what you're trying to do. The thing we want is zero-shot performance, that is, for it to decode well on people who have zero hours in the train set. So for that, we want less data per person. If instead we wanted it to do as well as possible on one individual, then we'd want way more data from that one person. (So, e.g., when we first make it into a product, we'll probably finetune on each user for a while.)
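Concretely, "zero-shot" here means the split is by person, not by hour. A minimal sketch of that framing (field names made up, not our real schema):

    import random

    def split_by_subject(sessions, n_heldout=50, seed=0):
        """sessions: list of dicts like {"subject_id": ..., "eeg": ..., "text": ...};
        held-out people contribute zero hours to training."""
        subjects = sorted({s["subject_id"] for s in sessions})
        random.Random(seed).shuffle(subjects)
        heldout = set(subjects[:n_heldout])
        train = [s for s in sessions if s["subject_id"] not in heldout]
        test = [s for s in sessions if s["subject_id"] in heldout]
        return train, test

With a per-person split, test scores measure decoding on brains the model has never seen, which is the product setting before any per-user finetuning.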
Comment by richardfeynman 1 day ago
I wonder if there will be medical applications for this tech, for example identifying people with brain or neurological disorders based on how different their "neural imaging" looks from normal.
Comment by devanshp 1 day ago
Comment by rio-popper 1 day ago
If you mean the text quality scoring system, then when we added that, it improved the amount of text we got per hour of neural data by 30-35%. (That includes the fact that we filter which participants we have return based on their text quality scores.)
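The gate itself is simple; a hypothetical sketch of the idea (threshold and field names invented here for illustration, not our real numbers):

    def filter_sessions(sessions, min_words_per_hour=400):
        """Keep a session (and mark the participant for re-invite) only if
        their text yield clears the bar. All keys and the threshold are
        illustrative."""
        kept, returning = [], set()
        for s in sessions:
            score = s["words"] / max(s["hours"], 1e-9)
            if score >= min_words_per_hour:
                kept.append(s)
                returning.add(s["participant_id"])
        return kept, returning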
Comment by mishajw 1 day ago
Comment by n7ck 1 day ago
Comment by ArjunPanicksser 1 day ago
Comment by n7ck 1 day ago
We tried google/facebook/instagram ads, and we tried paying for some video placements. Basically none of the explicit advertisement worked at all and it wasn't worth the money. Though for what it's worth, none of us are experts in advertising, so we might have been going about it wrong -- we didn't put loads of effort into iterating once we realized it wasn't working.
Comment by wiwillia 1 day ago
Comment by rajlego 1 day ago
Comment by rio-popper 1 day ago
Comment by whatshisface 1 day ago
Comment by clemvonstengel 1 day ago
Comment by whatshisface 1 day ago
Comment by estitesc 1 day ago
Comment by g413n 1 day ago
Comment by rio-popper 1 day ago
Comment by moffkalast 1 day ago
Those predictions sound good enough to get you CIA funding.
Comment by dang 1 day ago
[see https://news.ycombinator.com/item?id=45988611 for explanation]
Comment by ninapanickssery 1 day ago
Comment by ClaireBookworm 1 day ago
Comment by cpeterson42 1 day ago