Data integration – on the edge

Snap to it

One of the challenges of integrating data across boundaries is edge matching, or in this case, line matching. Our data sharing group took the first step toward solving a seamless roads dataset by agreeing on “snap points”, where each agency would snap their roads at the boundary line with the next. Theoretically, if two people snapped the lines in their individual datasets to the same point, the data would be lined up and ready for integration. In the image, the bright green points are the snap points. At this scale, everything looks nice and happy.

snap points
Cross-jurisdictional roads data with snap points

 

unsnapped line
Unsnapped line near snap point

Zooming in, problems emerge. How to address this without manually examining every point in an eight-county area of over 8500 square miles? We know that looks are deceiving at large scales in ArcMap: on screen the line(s) might appear snapped when not, and vice versa.

Goal

The goal was provide an automated tool to check each snap point to make sure the nearby line endpoints are coincident. Hopefully, this tool can be run once by the person integrating the data. If problems are found, the source agency can be notified and encouraged to fix their data. Once fixed, the data should (again in theory) stay fixed.

I tried many things

Topology? I don’t know how to set up a topology rule which states that every point must be covered by at least two endpoints (not just one).

Intersect? This had promise – every point should be intersected by two lines. But wait, what if a line goes through instead of snapping at the endpoint? Errors I could see on the screen continued to be missed, causing me to mistrust the results. After many, many iterations of Select by Location, I cast my net a little further. I found disjoint.

Continue reading “Data integration – on the edge”

Concatenate parsed fields

I have a feature class with parsed address fields and I need a single concatenated field. Some of the addresses have directionals like “N” or “S”, some have units, and some don’t. A simple expression will litter the field with extra spaces.

table of parsed addresses
Parsed addresses

Here is a short Python snippet that uses string formatting & the Update Cursor to populate a field with only the relevant data for each record.

# concatenate parsed address fields without extra spaces
import arcpy
featureClass = "c:\\myfakepath\\data.gdb\\fc"
# create an expression to find records where the STREETNO field is not blank
# this indicates an address to be concatenated
expression = "STREETNO IS NOT NULL"
# create a list of the fields to be concatenated
# (the last field in the list is the field to be concatenated in to)
fields = ["STREETNO","STREETDIR","STREETNAME","STREETSUF","STREETUNIT","STREETADDR"]
# create an update cursor limited to the fields in the fields list, and by the expression
try:
with arcpy.da.UpdateCursor(featureClass,fields,expression) as cursor:
for row in cursor:
if not row[1]: ## if streetdir is null
exp = "{0} {1}".format(str(row[0]),row[2]) ## streetno + streetname
else: ## streetdir is not null
exp = "{0} {1} {2}".format(row[0],row[1],row[2])## streetno + streetdir + streetname
if row[3]: ## if streetsuf is not null
exp += " {0}".format(row[3]) ## add streetsuf to exp
if row[4]: ## if streetunit is not null
exp += " {0}".format(row[4]) ## add streetunit to exp
row[5] = exp ## set streetaddr field = exp
cursor.updateRow(row)
except Exception as e:
error = "Failed: concat fields\n {0} {1}".format(arcpy.GetMessages(), str(e))
print(error)

Fields returned by the cursor are accessed by index in the order they are listed in the list, so row[0] refers to the value in the STREETNO field, row[1] is the STREETDIR field, and so on.

A series of conditionals tests whether the “extra” fields have content, and successively builds the final concatenation expression using string formatting. Here I am using numbers for placeholders in the string formatting substitution, but it is not strictly necessary.

Line 25 sets the appropriate field (which has an index of 5) to this expression, and then line 26 calls “updateRow” to make the change.

table of concatenated addresses
Concatenated addresses

ArcGIS Desktop documentation:

Checking for unique values in a feature class field

Unique Values

As part of a larger set of scripts, I needed to check for unique values in a feature class field. There are a number of manual ways to do this in ArcGIS Desktop, but here is a snippet to create a list of non-unique values and print them to the console. You could also write them to a file, or incorporate the snippet into a fix for the issue.

The basic concept is a “list within a list”. The Search Cursor is the master “list” that gets iterated over, and then values get added either to a unique value list or a duplicate value list (if they already exist in the unique value list). To speed up the analysis, I set up the Search Cursor parameters so that it only uses a single field. I don’t need any geometry or other fields in order to analyze the unique values in one field.

String formatting gets used in a couple of places here, substituting the {0} in an output string with the value in parentheses after str.format(). The “str” in the string methods represents the string you are formatting or otherwise altering. Another string method, str.join(), is used to print the list values, in this case separated by a new line: “\n”.join(list). If you wanted commas, you could use “,”.join(list).

import arcpy
# input feature class and field name to check for duplicates
inputFC = r"c:\mygeodatabase.gdb\myfeatureclass"
myField = "myField"
# create new, empty lists for duplicate values and unique values
dupeList = []
uniqueList = []
# create a search cursor with only a single field in the second parameter
# - the one you are analyzing.
# this is faster than the default "*" which returns all fields.
# iterate through all the values
with arcpy.da.SearchCursor(inputFC, [myField]) as cursor:
for row in cursor:
# get value in your field for this row of the ursor
value = row[0]
# check to see if the value already exists in the unique value list
if value in uniqueList:
# if it exists, append it to the list of duplicates
dupeList.append(value)
else:
# if it doesn't exist, append it to the list of unique values
uniqueList.append(value)
# if the duplicate values list has any values, print the # of them and then list them
if len(dupeList) > 0:
# uses string formatting to print the number of duplicates: the "length" of the list
# cast from an integer to a string for printing
print("Duplicate IDs found ({0})\n".format(str(len(dupeList))))
# prints each item in the duplicate value list separated by \n (new line)
print("\n".join(dupeList))
else:
print("No duplicates found in {0} values.".format(str(len(uniqueList))))

Python documentation:

ArcGIS Desktop documentation:

Truncating feature class fields

Failed on input OID…

Ever had an append or other operation fail because you are trying to squeeze a longer field into a shorter field?

append failure

While it’s neat that you can now change text field lengths in ArcCatalog, you can only increase them, not decrease them. When I was working on standardizing schemas with another agency, we went through a long process to create a new field, calculate the right number of characters into it, delete the old field, create a new field with the old name and the new length, calculate the values from the “holding” field, then delete the “holding” field.

Here is a snippet (which can also be modified to be a script tool) which takes the feature class, the name of the field to truncate, and a new length, and does this process for you. It doesn’t have messaging or error checking, so you’d have to add those in.

import arcpy
# get parameters
fc = "c:/geodatabase.gdb/featureclass" ## feature class
field = "field" ## name of field to truncate
newLength = 100 ## new length: whatever is needed
# create new name, length, expression to truncate existing data
newField = field + "tmp"
newLengthSlice = newLength - 1
# the expression slices up to the new field length
# uses string formatting to insert field name and new length
exp = "!{0}![0:{1}]".format(field, newLengthSlice)
# get list of field names using list comprehension
fieldNames = [f.name for f in arcpy.ListFields(fc)]
# if the "new" field already exists, delete it
if newField in fieldNames:
arcpy.DeleteField_management(fc, newField)
# add new field specifying target length
arcpy.AddField_management(fc,newField,"TEXT","","",newLength)
# calculate the new field using the expression which brings in only
# the left-most characters up to the character limit you specified (the new length)
arcpy.CalculateField_management(fc,newField,exp,"PYTHON_9.3")
# delete the old, too-big field
arcpy.DeleteField_management(fc,field)
# add a new field, named the same as the old field, with the new length
arcpy.AddField_management(fc,field,"TEXT","","",newLength)
# create second expression to populate the new/old field
exp2 = "!{0}!".format(newField)
# calculate field using second expression
arcpy.CalculateField_management(fc,field,exp2,"PYTHON_9.3")
# delete the field you used to make the transfer
arcpy.DeleteField_management(fc,newField)

ArcGIS Desktop documentation:

 

Working with feature class fields

A few snippets for working with feature class fields.

The basic method of accessing fields in arcpy is arcpy.ListFields().

import arcpy
featureClass = "c:/path/Geodatabase.gdb/fc"
# Get a list of field objects from a feature class
fcFields = arcpy.ListFields(featureClass)
# Print the field names, types and lengths using string formatting
for field in fcFields:
print("{0}: type = {1} length = {2}".format(field.name, field.type, field.length))
# Get a list of field names from a feature class using list comprehension
fieldNames = [field.name for field in arcpy.ListFields(featureClass)]

What is returned from arcpy.ListFields() is not field names, but a list of field objects which have accessible properties, including field name, type and length. These are printed above using string formatting. Getting a Python list of field names using list comprehension is straight out of the ArcGIS documentation. There are tons of great code samples there when you start sifting through it for specific tasks. I have learned to name the returned list of fields something like fieldList, and the returned list of names by naming it fieldNames so I can remember which data type I’m using.

Restricting the returned fields or field names

The following snippet illustrates ways you can restrict the returned fields: it’s always a good idea to limit the processing to what you actually need for the task at hand. For example, you can use a wildcard for field names in the middle parameter, or a field type in the third parameter. Another tidbit I found buried in a post by Drew Flater on an ArcGIS blog is how to prevent the return of required fields, such as OID or Shape, which can complicate things when using different data source types. Using list comprehension, he returns only those field names where the property “required” is false.

fc = "c:/path/geodatabase.gdb/featureclass"
# restrict returned fields by wildcard
fieldList = arcpy.ListFields(fc,"OWN*")
# restrict returned fields by field type
fieldList = arcpy.ListFields(fc,"","String")
# exclude required fields such as Shape and OID
# (https://blogs.esri.com/esri/arcgis/2012/08/30/field-mapping-and-python-scripting)
fieldNames = [field.name for field in arcpy.ListFields(fc) if not field.required]
Using dictionaries to compare fields

I have used Python dictionaries to compare the field characteristics of two feature classes. For example, I tested for fields of the same name but different lengths before appending. When the fields are automatically mapped by having the same name, the append operation will fail if the source string field is longer than the destination string field.

In the snippet below, two dictionaries are created using the field names and lengths as the key:value pairs. This means we can access the fields by name. Iterating through the destination feature class dictionary, the script looks for key matches (in this case, field names) in the source dictionary. Then, it can access the length of the field in each dictionary by referencing the key (field name) and compare the two integers.

import arcpy
srcFC = "c:/path/Geodatabase.gdb/srcFc"
destFC = "c:/path/Geodatabase.gdb/destFc"
# get field lists from each feature class, restricting to "String" feature types
srcFields = arcpy.ListFields(srcFC,"","String")
destFields = arcpy.ListFields(destFC,"","String")
# create empty dictionaries
srcDict = {}
destDict = {}
# populate dictionaries with key value pairs of field name: field length
# so that the properties can be accessed by field name
for srcField in srcFields:
srcDict[srcField.name] = [srcField.length]
for destField in destFields:
destDict[destField.name] = [destField.length]
# iterate through destination dictionary looking
# for matching fields in source
# access field lengths in dictionary by field name
# compare field lengths
for fieldName, fieldLen in destDict.items():
if fieldName in srcDict:
srcLen = srcDict[fieldName][0]
destLen = destDict[fieldName][0]
if destLen < srcLen:
print("ERROR String field length mismatch")
print(fieldName + " field is too long")
else:
print(fieldName + " all is good")