Monday, March 19, 2012

locating set of points close to one another (within a threshold)

Hi all

I have a large data set of points situated in 3d space. I have a simple
primary key and an x, y and z value.

What I would like is an efficient method for finding the group of
points within a threshold.

So far I have tested the following; however, it is very slow.

-----
select *
from locations a full outer join locations b
on a.ID < b.ID and a.X-b.X<2 and a.Y-b.Y<2 and a.Z-b.Z<2
where a.ID is not null and b.ID is not null
-----

If anyone knows of a more efficient method to arrive at this result, it
would be most appreciated.

Thanks in advance
Bevan

(bevanward@.gmail.com) writes:
> I have a large data set of points situated in 3d space. I have a simple
> primary key and an x, y and z value.

Why is (x, y, z) not the primary key? Or, put differently: what does it
mean if you have two points with the same coordinates?

If nothing else, the ID column serves to make the rows wider, which
means fewer rows per page, which means worse performance.

> What I would like is an efficient method for finding the group of
> points within a threshold.
> So far I have tested the following however it is very slow.
> -----
> select *
> from locations a full outer join locations b
> on a.ID < b.ID and a.X-b.X<2 and a.Y-b.Y<2 and a.Z-b.Z<2
> where a.ID is not null and b.ID is not null
> -----

Not that I think that it matters much for execution time, but FULL JOIN
seems flat wrong here, and with the conditions in the WHERE clause you
make it an inner join anyway. I would write the query as:

select *
from locations a
cross join locations b
WHERE a.X < b.X + 2 and a.Y < b.Y + 2 and a.Z < b.Z + 2

It's important to have one coordinate alone on one side of the
operator; that increases the chance of an index being used. You do have
an index on (x, y, z), don't you?
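For example, something along these lines, with a made-up index name:

create index xyz_ix on locations (X, Y, Z)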

Still I am far from certain that SQL Server will use an index. It would be
interesting to see the query plan for this query.

This is quite a difficult problem for a relational engine, and it could
be a case where a cursor is best. Add a clustered index on (x, y, z).
Iterate over the points ordered by X. If the current point and a
previous point are less than 2 from each other, save the combination
into a temp table. The tricky part is that if you are at point N, it
may be proximate to more than one point you've already passed over, so
you cannot just keep the data in plain variables. You could use a table
variable, but I think that it's better to use the table itself. This is
where it is important that you have a clustered index on X.

Normally cursors are a poor solution, but the point here is that you
would only need to make one iteration over the table.
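In outline, such a single pass might look like this rough and untested
sketch; the temp table, the cursor, and the float coordinate types are
placeholders of mine, with the 2-unit threshold taken from the original
query:

declare @id int, @x float, @y float, @z float

create table #pairs (id1 int not null, id2 int not null)

declare c cursor fast_forward for
   select ID, X, Y, Z from locations order by X

open c
fetch next from c into @id, @x, @y, @z
while @@fetch_status = 0
begin
   -- Only points whose X lies within 2 below the current X can
   -- qualify, and the clustered index on (X, Y, Z) turns this
   -- into a range seek rather than a scan.
   insert #pairs (id1, id2)
   select b.ID, @id
   from   locations b
   where  b.X between @x - 2 and @x
     and  abs(b.Y - @y) < 2
     and  abs(b.Z - @z) < 2
     and  (b.X < @x or b.ID < @id)  -- skip self; no duplicate pairs at equal X

   fetch next from c into @id, @x, @y, @z
end

close c
deallocate c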

Faster than a cursor though, would be to get all data to a client
program (or a .Net stored procedure if you are on SQL 2005), since
traditional languages have better structures for this sort of problem.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Erland Sommarskog (esquel@.sommarskog.se) writes:
> This is quite a difficult problem for a relational engine, and it could
> be a case where a cursor is the best.

I should stress that this is just a wild idea. It may very well prove
that a cursor solution is horribly slow - as cursor solutions often are.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Erland, I really appreciate the effort you've put into this message; it
has helped a lot.

The reason there is a key as well as the location data is that it is
the number the samples are collected and tracked against. There can be
points with the same coordinates in real life; however, there is a
problem where a coordinate location is confirmed with a more precise
method and is given a new name. Ideally it would not be possible to
enter this into the database, but since the planning phase is not
mandatory it has to have a merge character... I hope in the future the
workflow can develop to stop these types of things happening.

I wasn't able to get the query to use the index (I do have a clustered
index on x, y, z) with either the cross join restricted in the WHERE
clause or the full outer join restricted on the null keys.

I've had to give up on it for now but have printed your message and
will try to get something effective working - I will test a cursor
later today and see how it will run.

Unfortunately this database is still on 2000; however, once it moves to
2005 I can see, as you state, that this problem can be solved in a more
conventional manner.

Thanks again for your effort and I hope to be able to post back a
method that runs more efficiently in the near future.

Cheers
Bevan

Why would you have a NULL identifier? I guess that two things can have
the same location, otherwise (x,y,z) would be a key. And should your
query look like this, since you want to go in both directions on an
axis? The OUTER JOIN makes no sense.

SELECT A.node_id, B.node_id
FROM Locations AS A, Locations AS b
WHERE A.node_id < B.node_id
AND ABS(A.x-B.x) <= 2
AND ABS(A.y-B.y) <= 2
AND ABS(A.z-B.z) <= 2

I did something like this in 2-D for neighborhoods based on quadrants.
If a node was in one quadrant, I assumed that the nine adjacent quads
centered on its quad would have the nearest neighbor, so I only did a
distance formula on points in those quads. It meant carrying an extra
pair of quad co-ordinates, but it saved searching all the nodes -- 9
quads versus ~15,000 quads.

(bevanward@.gmail.com) writes:
> I have a large data set of points situated in 3d space. I have a simple
> primary key and an x, y and z value.
> What I would like is an efficient method for finding the group of
> points within a threshold.
> So far I have tested the following however it is very slow.
> -----
> select *
> from locations a full outer join locations b
> on a.ID < b.ID and a.X-b.X<2 and a.Y-b.Y<2 and a.Z-b.Z<2
> where a.ID is not null and b.ID is not null
> -----

As indicated by Celko's posting there is most probably an error here:
you need abs(a.X - b.X) etc, else you will get far too many points.

Mathematically it would also make more sense to use

sqrt(square(a.X - b.X) + square(a.Y - b.Y) + square(a.Z - b.Z)) < cutoff

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

I haven't implemented this on a large scale, but I'll pass on
what I think is a good approach. I suggested it in a different
context in the thread found here:
http://groups.google.com/groups/sea...=exemplar+skass
Variations on this idea are used in GIS and other such systems.
Note that my comments aren't specific to your exact question,
but they should apply reasonably well.

Very briefly, the idea is to maintain a set of points *not*
in your set that are evenly scattered through space - the
lattice points of a coarse grid. These will serve as neighborhood
markers, and with each point, you will store its NeighborhoodID.
You can think of these as neighborhood post offices, for example.
The coarseness of the grid will be chosen depending on what kinds
of questions you ask, but typically the size of the neighborhoods
reflects the meaning of "close".

Points can only be close to each other if they live in neighborhoods
that are close to each other, or equivalently, if their neighborhood
post offices are close to each other.

For example, if your 3D coordinates range from 0 to 1000
each, you might slice up space every 100 units and use these
as lattice points: (i*100+50, j*100+50, k*100+50) for 0 <= i,j,k < 10.
So you have 1000 lattice points in this case. No point in
space is more than 50*SQRT(3) ~ 86.6 units from one of these 1000
points, so this will work if the "closeness" you need is on this
order ("close" should mean same or adjacent neighborhood).

No matter how many points you have here, you have
only 1K neighborhoods. If you had 10M points, you
now have 10K points/neighborhood on average.

Now the part that reduces the cost of querying: For many queries,
you can first search or look up pairs of post offices, not points,
that meet your requirement of close. (This could be "adjacent
pairs" in most cases, and post office adjacencies could be
maintained permanently). There is one Mega-pair of post offices
in this example, but 100 Tera-pairs of points
(in nanoseconds, this is 1 millisecond vs. 1 day).
Even if you still have to look at all pairs of points in
adjacent neighborhoods, you have made progress, but for many
problems, you might be able to narrow down the search further.
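In SQL, the bare bones of this might look as follows - a rough sketch
of mine, untested, where the cell columns, the 100-unit cell size, and
the index name are all made up for illustration:

-- Tag each point with its neighborhood (grid cell) once, then index it.
-- Plain int columns sidestep SQL 2000's restriction on indexing
-- float-based computed columns.
ALTER TABLE locations ADD CellX int, CellY int, CellZ int

UPDATE locations
SET    CellX = CAST(X / 100 AS int),
       CellY = CAST(Y / 100 AS int),
       CellZ = CAST(Z / 100 AS int)

CREATE INDEX ix_cells ON locations (CellX, CellY, CellZ)

-- Close points must sit in the same or an adjacent cell, so the exact
-- per-axis test runs on only a small fraction of all pairs.
SELECT a.ID, b.ID
FROM   locations a
JOIN   locations b
  ON   b.CellX BETWEEN a.CellX - 1 AND a.CellX + 1
 AND   b.CellY BETWEEN a.CellY - 1 AND a.CellY + 1
 AND   b.CellZ BETWEEN a.CellZ - 1 AND a.CellZ + 1
 AND   a.ID < b.ID
WHERE  ABS(a.X - b.X) <= 2
  AND  ABS(a.Y - b.Y) <= 2
  AND  ABS(a.Z - b.Z) <= 2

The pruning is valid whenever "close" is no larger than the cell size;
for the 2-unit threshold in this thread you would use a much finer
grid, and triggers could keep the cell columns current as points
change.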

You might maintain various values in your tables: for points,
the distance to the neighborhood post office and the post
office ID, or the same for all adjacent neighborhood post
offices. For post offices, you might maintain the number of
points in the neighborhood. You might use a fixed rectangular
lattice with a canonical way of enumerating neighbors, so you
can pack things into arrays, or you might allow the lattice to
be dynamic, depending on where the points are. You might maintain
all inter-postoffice distances or just nearby ones. And you
might use triggers a lot if there are many updates and
inserts, but if the data is relatively static, you might
maintain larger and well-indexed collections of
relevant calculations. Depending on how real-time the
system is, you might treat parts of it like a warehousing
problem. And on and on (there are many books on subjects like
"spatial databases" and geodatabases -- I haven't read whole books
on the topic).

This is only a sketch, and it is not trivial (but not impossible)
to implement, but maybe it will help. The alternative is
probably to look for GIS-type third-party products that can
connect to SQL data. I'm sure there are some out there, but
they may not be cheap.

The shorter answer is that there is probably no single fast
query to do most things like this if none of this framework is
in place. At best, maybe you can generate a crude lattice on
the fly with derived tables from Numbers tables - but that is
basically putting some of this framework in place.

Steve Kass
Drew University

bevanward@.gmail.com wrote:

> Hi all
> I have a large data set of points situated in 3d space. I have a simple
> primary key and an x, y and z value.
> What I would like is an efficient method for finding the group of
> points within a threshold.
> So far I have tested the following however it is very slow.
> -----
> select *
> from locations a full outer join locations b
> on a.ID < b.ID and a.X-b.X<2 and a.Y-b.Y<2 and a.Z-b.Z<2
> where a.ID is not null and b.ID is not null
> -----
> If anyone knows of a more efficient method to arrive at this results it
> would be most appreciated.
> Thanks in advance
> Bevan

Another MVP colleague pointed me to this link:

http://research.microsoft.com/resea...MSR-TR-2005-122

The link is for SQL 2005, but the algorithm as such might be applicable.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

Square both sides to avoid conversions to FLOAT that SQRT() will give
you.

POWER(A.x - B.x, 2) + POWER(A.y - B.y, 2) + POWER(A.z - B.z, 2)
< POWER(cutoff, 2)
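Folding that into the earlier pair query gives something like this
sketch; @cutoff is a placeholder of mine, and the per-axis ABS() tests
are logically redundant with the squared-distance test but cost little
and can discard most pairs cheaply:

DECLARE @cutoff FLOAT
SET @cutoff = 2

SELECT A.node_id, B.node_id
  FROM Locations AS A, Locations AS B
 WHERE A.node_id < B.node_id
   AND ABS(A.x - B.x) <= @cutoff   -- cheap per-axis box tests
   AND ABS(A.y - B.y) <= @cutoff
   AND ABS(A.z - B.z) <= @cutoff
   AND POWER(A.x - B.x, 2) + POWER(A.y - B.y, 2) + POWER(A.z - B.z, 2)
       <= POWER(@cutoff, 2)        -- exact Euclidean test, no SQRT()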
