LoGeR – 3D reconstruction from extremely long videos (DeepMind, UC Berkeley)

Posted by helloplanets 22 hours ago


Comments

Comment by tmilard 20 hours ago

Very interesting paper. I can see Street View using it to perfect the 3D analysis of the photo/video they capture with their Google car. What a wonderful time we are living in! Specifically in video-to-3D reconstruction: every month, a new brick is put in place. Super.

Comment by overfeed 3 hours ago

> I can see Street View using it to perfect the 3D analysis of the photo/video they capture with their Google car.

Waymo recently announced[1] a World Model that does exactly this: using footage from a single-camera dashcam, it can predict/simulate multiple inputs that would have been sensed by a Waymo vehicle on the same travel path (e.g. multiple camera angles, a lidar point cloud, etc.). On top of this, the model can be prompted to customize the scenario (adding an elephant or a tornado were the examples given).

1. https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...

Comment by wumms 18 hours ago

Street View cars added Velodyne LiDAR around 2017 [0][1], but it's optional. I found no data on what percentage of captures are LiDAR vs. image-only.

[0] https://arstechnica.com/gadgets/2017/09/googles-street-view-...

[1] https://en.wikipedia.org/wiki/Google_Street_View

Comment by IshKebab 20 hours ago

Very cool. Doesn't seem like they've actually released the code:

> This is a reimplementation of LoGeR; complete code and models will be released upon approval.

I don't understand why it's a reimplementation either?

I would guess it's "research" code anyway so not really usable unless you are an expert.

Comment by priowise 12 hours ago

Very interesting direction. One thing I'm curious about with extremely long videos is how you handle drift over time. Do you periodically re-anchor the reconstruction or rely purely on accumulated frame consistency?

Comment by quadrature 11 hours ago

In a traditional SLAM pipeline you do periodically fix drift by detecting when you've revisited an area you've already mapped (loop closure); this lets you align your submaps so they are globally consistent.

In the areas you have visited previously, you have two estimates of your position: one from your frame-to-frame estimates, and another from the map you built of the area the first time. You can then solve an optimization problem to bring those two estimates closer together.

To find out whether you've already visited an area, you store descriptions of the locations in a DB and search through them. The paper says they use a compressed representation of the "maps" and use test-time training to optimize the global consistency between their submaps.
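The drift-correction optimization described above can be sketched as a tiny pose-graph least-squares problem. Here's a toy 1D numpy example (illustrative only, not the paper's method): five poses around a loop, odometry that overestimates each step, and one heavily weighted loop-closure constraint saying we ended where we started.

```python
import numpy as np

# Toy 1D pose graph: poses x0..x4, true loop length 0.
n = 5
odom = np.full(n - 1, 1.1)   # drifted frame-to-frame step estimates

rows, rhs = [], []

# Anchor the first pose at the origin.
r = np.zeros(n); r[0] = 1.0
rows.append(r); rhs.append(0.0)

# Odometry constraints: x[i+1] - x[i] = odom[i]
for i in range(n - 1):
    r = np.zeros(n); r[i] = -1.0; r[i + 1] = 1.0
    rows.append(r); rhs.append(odom[i])

# Loop-closure constraint, weighted heavily: x4 - x0 = 0
w = 10.0
r = np.zeros(n); r[0] = -w; r[-1] = w
rows.append(r); rhs.append(0.0)

A = np.asarray(rows)
b = np.asarray(rhs)
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# The accumulated drift gets spread across the whole trajectory
# instead of piling up at the final pose.
print(x)
```

Real SLAM back ends (g2o, GTSAM, Ceres) solve the same kind of problem over SE(3) poses with robust losses, but the shape of it is the same: odometry terms pull poses apart, loop closures pull them back together.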

Comment by _fw 19 hours ago

This is like something straight out of Cyberpunk 2077 - the braindance investigation scenes.

Comment by Karliss 17 hours ago

More like the opposite. Point cloud data captured by various means has existed for a long time, with the raw data visualized more or less just like this. SciFi movies/games use the look of raw visualization as something futuristic and computer-tech-ish, just like wireframe on a black background, although that one is getting partially downgraded to retro-scifi status since drawing a 3D wireframe isn't hard anymore. It started when any 3D computer graphics, even basic wireframe, was futuristic and not every movie could afford it, with some of them faking it by analog means.

Any good scifi author takes inspiration from real-world technology and extrapolates based on it, often before the technology is widely recognized by the general population. Once something reaches the state of a consumer product, beyond just researchers and trained professionals, the visuals tend to get more polished and you lose some of the raw, purely functional, engineering style.

Comment by realberkeaslan 17 hours ago

It reminds me of that as well.

Comment by raphaelmolly8 11 hours ago

[dead]

Comment by msuniverse2026 20 hours ago

Truly don't understand what is happening in the heads of these researchers. Can't they see how the main use of this is going to be mass surveillance?

Comment by KeplerBoy 20 hours ago

This seems to be much more robotics / autonomous vehicle focused? I don't quite see what mass-surveillance angle you get from this that you don't already get from cheap ubiquitous cameras, basic computer vision, and networking (aka Flock).

Comment by haritha-j 20 hours ago

I think you've made the erroneous assumption that the researchers care. I work in 3D reconstruction and I've not really seen too many people care about the actual use case, and indeed have had some friends join defence.

Comment by endymion-light 18 hours ago

I mean, I think if you want to perform mass surveillance, you can do it far cheaper and more efficiently via facial recognition, mobile-phone surveillance, and a variety of other methods.

If you want reconstruction and training of robotic movement, this is far more appropriate. I believe we're going to see robots being able to "dream" in terms of analysing historical video information on spaces and improving movement and navigation.

So not mass surveillance, but probably there's a future of mass subjugation using robot enforcement.

Comment by KaiserPro 14 hours ago

This bit isn't that surveillance-y.

Relocalisation is the bit that's surveillance-y. But it's also crucial for accurate visual-only navigation.

Comment by imtringued 19 hours ago

I'm not sure what you mean. The input video feed already constitutes "surveillance". You'd need cameras everywhere and if you have a camera, you can also just use regular models like China already does.

Comment by Dead_Lemon 20 hours ago

What is the actual objective of this: is it solving an existing problem, or is it a solution in search of one? It seems like a lot of energy to replicate a lidar mapping system. It's not like you can expect accurate dimensions from this approximate guesswork, even before the expected hallucinations add to the inaccuracy.

Comment by alpine01 18 hours ago

3D reconstruction of old spaces which no longer exist seems like a clear use case to me. There's loads of old videos of driving down a street in the 80s, or neighborhoods in cities which got replaced.

I can imagine future iterations of this which bring together other stills of the same space at that time to augment the dataset. Then perhaps another pass to fill in gaps with likely missing content based on probability, or on data from, say, the same street 10 years later.

It won't be 100% real, but I think it'd be very cool to be able to have a google-street view style experience of areas before google street view existed.

Comment by phrotoma 17 hours ago

> it'd be very cool to be able to have a google-street view style experience of areas before google street view existed.

Now do Kowloon Walled City.

Comment by voidUpdate 19 hours ago

Video cameras are much cheaper and easier to use than LIDAR, like anyone can just pull out their phone, take a video and send it to this algorithm to get a reasonable point cloud of the environment. Sure, if you want an exact model of an environment and you have the time and money, LIDAR would give better results, but this is about doing more with less

Comment by washadjeffmad 15 hours ago

We use drones with RGB cameras for photogrammetry to reconstruct 3D environments with gaussian splatting, which is a manual process and often requires making multiple trips for additional capture to fill in gaps. Because it's for perceptual use and doesn't require high accuracy, automating with a single-take video would be useful.

Comment by KaiserPro 14 hours ago

One of the key issues of "machine perception" is the inability of machines using standard image sensors to re-create the world accurately.

Lidars are great, and getting smaller, but they still eat a lot of power. (The Quest 3 had a lidar on the front (well, structured light) and it was mostly not used.)

For machines to understand the 3d world, first they need to extract geometry, then isolate those geometries into objects. This method is _a_ way to do that, the first step, extracting 3d points.

The problem with this model is that the points are not actually that well aligned frame to frame. This is why it looks a bit blurry. I assume this is to avoid running out of memory, as you're not quite sure about which points are relevant and need to be kept in memory.

Once you have those points, you need to replace them with simplified geometry, so that you can work out intersections and junk.
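The "replace points with simplified geometry" step is commonly done with something like RANSAC plane fitting (this is a general technique, not anything from the paper). A toy numpy sketch: a noisy floor plane plus scattered outliers, with repeated random 3-point samples voting for the plane with the most inliers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy point cloud: a noisy floor plane z ~ 0, plus random outliers.
floor = rng.uniform(-1, 1, size=(200, 3))
floor[:, 2] = rng.normal(0.0, 0.01, 200)
outliers = rng.uniform(-1, 1, size=(50, 3))
points = np.vstack([floor, outliers])

def ransac_plane(pts, iters=200, thresh=0.05):
    """Fit a plane n.x + d = 0 by RANSAC; return (normal, d, inlier mask)."""
    best_n, best_d = None, None
    best_mask = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue  # degenerate (nearly collinear) sample
        n /= norm
        d = -n @ sample[0]
        mask = np.abs(pts @ n + d) < thresh  # distance-to-plane test
        if mask.sum() > best_mask.sum():
            best_n, best_d, best_mask = n, d, mask
    return best_n, best_d, best_mask

normal, d, inliers = ransac_plane(points)
print(inliers.sum())   # most of the 200 floor points
```

Point cloud libraries ship this directly (e.g. Open3D's `segment_plane`); repeating the fit on the leftover points peels off walls, floors, and other dominant surfaces one plane at a time.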

Comment by ekjhgkejhgk 15 hours ago

The actual objective is learning about these systems. It's called research.

Comment by flipbrad 19 hours ago

N00b question from me, perhaps, but how easy is it to mount and run Lidar on aerial drones?

Comment by petargyurov 19 hours ago

It's easy but it's not cheap. Well, price is relative but capturing video is certainly cheaper.

Also, I am not sure how heavy LIDAR units are, but remember that the heavier the payload, the more the flight time is reduced. Some drones can only carry a single payload, so if you also want to capture (high-res) video/images you need to fly again.

It all depends on the use-case.

Comment by Daub 19 hours ago

The most available lidar is the one on your iPhone, but the results are orders of magnitude less detailed than those derived from photogrammetry. However, an advantage is that lidar is not confused by reflections.

Comment by taneq 17 hours ago

Huh? LIDAR absolutely is confused by reflections. Not always the reflections you can see (because often it’s using IR wavelengths) but nonetheless, reflections.

Comment by _diyar 16 hours ago

You can reconstruct accurate dimensions if you have IMU data.
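The usual reasoning behind this: monocular reconstruction is only defined up to scale, but an accelerometer reads in m/s², so double-integrating it over a short window gives a metric displacement; comparing that with the (unitless) visual displacement over the same window recovers the scale factor. A noise-free numpy toy (an illustrative sketch, not from the paper):

```python
import numpy as np

dt = 0.01
t = np.arange(0.0, 1.0, dt)
accel = np.full_like(t, 0.5)           # constant 0.5 m/s^2, noise-free toy data

vel = np.cumsum(accel) * dt            # integrate once  -> velocity (m/s)
disp_metric = np.cumsum(vel) * dt      # integrate twice -> displacement (~0.25 m)

disp_visual = 5.0                      # same motion in arbitrary SLAM units
scale = disp_metric[-1] / disp_visual  # metres per SLAM unit
print(scale)
```

Real visual-inertial systems (VINS-Mono, ORB-SLAM3, etc.) estimate this jointly with IMU biases and gravity direction, since raw double integration drifts badly within seconds; but the scale observability comes from exactly this comparison.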