pyqalloy.curation package

Submodules

pyqalloy.curation.analysis module

class pyqalloy.curation.analysis.AllDataAnalyzer(database='ULTERA_internal', collection='CURATED_Dec2022', name=None, collectionManualOverride=None)[source]

Bases: Analyzer

Class to analyze datapoints in the context of the entire database. It relies primarily on clustering analysis to identify outliers and anomalies in a few different ways.

Parameters:
  • database (str) – Name of the database to use. Defaults to ‘ULTERA_internal’.

  • collection (str) – Name of the collection to use. Defaults to ‘CURATED_Dec2022’.

  • name (Optional[str]) – Name of the researcher to limit the search to. Defaults to None.

  • collectionManualOverride (Optional[Collection]) – If specified, the collectionManualOverride is used instead of the database and collection arguments. It expects a pymongo.collection.Collection object, however, it is not type-checked (only hinted) to allow for more flexibility, including instances of [MontyDB](https://github.com/davidlatwe/MontyDB) Collection class or [Mongomock](https://github.com/mongomock/mongomock) Collection class. Defaults to None and has no effect in that case.

Properties:
  • allComps: List of all unique compositions in the database. It is automatically updated when the class is initialized.

  • els: Set of all unique elements in the database. It is automatically updated when the class is initialized and it is used to determine common ordering of elements across methods.

  • outliers: List of outliers in the database identified by the last used method (e.g. DBSCAN).

findOutlierDataSources(filterByName=False)[source]

Finds the data sources for the outliers identified by DBSCAN. If filterByName is True, only data sources with the same name as the current analyzer name setting will be printed. Otherwise, all data sources will be printed.

Parameters:

filterByName (bool) – If True, only data sources with the same name as the current analyzer name setting will be printed. Defaults to False.

Return type:

list

Returns:

List of dictionaries containing the data sources for the outliers.

getDBSCAN(eps=0.3, min_samples=2, p=1)[source]

Performs DBSCAN clustering on the list of compositions in self.allComps. The DBSCAN clustering is stored in the ‘dbscanCluster’ key of each dictionary in self.allComps. The DBSCAN clustering is also returned as a numpy array along with the number of outliers identified.

Parameters:
  • eps (float) – Epsilon parameter for the DBSCAN clustering. Defaults to 0.3. This parameter controls the maximum distance between two points for them to be considered neighbors. With all other parameters at their default values, the value of 0.3 corresponds to a 30% atomic fraction difference, obtained by summing the absolute differences over all element fractions. For example, for Fe0.5Ni0.5, compositions up to Fe0.35Ni0.65 or Fe0.65Ni0.35 would be considered neighbors (0.15 + 0.15 = 0.3).

  • min_samples (int) – Minimum number of samples parameter for the DBSCAN clustering. Defaults to 2. This parameter controls the minimum number of neighbors required for a point to be considered a core point. A point that does not meet this threshold and is not within eps of any core point is considered an outlier and assigned to the -1 cluster.

  • p (int) – p parameter for the DBSCAN clustering. Defaults to 1. This is the parameter that controls the metric used to calculate the distance between two points. The default value of 1 corresponds to the Manhattan distance with consequences described above for eps. The value of 2 would correspond to the Euclidean distance.

Return type:

Tuple[ndarray, int]

Returns:

Numpy array of the DBSCAN clustering and the number of outliers identified.
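
The eps example above can be checked by hand. The following is a minimal sketch in plain Python, independent of pyqalloy; the `minkowski` helper and the [Fe, Ni] vector ordering are assumptions made for illustration only.

```python
def minkowski(a, b, p=1):
    """Minkowski distance between two equal-length vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Composition vectors in the (assumed) element order [Fe, Ni].
fe50ni50 = [0.50, 0.50]
fe35ni65 = [0.35, 0.65]

d_manhattan = minkowski(fe50ni50, fe35ni65, p=1)  # about 0.30
d_euclidean = minkowski(fe50ni50, fe35ni65, p=2)  # about 0.21
```

The same pair of alloys is closer under p=2 than under p=1, so a given eps admits a wider range of neighbors when the Euclidean metric is used.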

getDBSCANautoEpsilon(outlierTargetN=10)[source]

Performs DBSCAN clustering using getDBSCAN() with a range of epsilon values until the desired minimum number of outliers is found. This lets the user obtain as many outliers as they can investigate, regardless of the number of alloys in the dataset. The DBSCAN clustering is stored in the ‘dbscanCluster’ key of each dictionary in self.allComps. The DBSCAN clustering is also returned as a numpy array along with the number of outliers identified.

Parameters:

outlierTargetN (int) – Minimum number of outliers to be identified. Defaults to 10.

Return type:

Tuple[ndarray, int]

Returns:

Numpy array of the DBSCAN clustering and the number of outliers identified.
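
The idea of sweeping epsilon until enough outliers appear can be sketched as follows. This is not the pyqalloy implementation: the range, step, and direction of the sweep are not documented here, so the `eps_values` sequence is an assumption, and a simplified outlier rule replaces a full DBSCAN run. The rule assumes min_samples=2 with the point itself counted (the scikit-learn convention), in which case a point is an outlier exactly when no other point lies within eps.

```python
def l1(a, b):
    """Manhattan (p=1) distance between two composition vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def count_isolated(points, eps):
    """Count points that have no other point within eps."""
    n = 0
    for i, a in enumerate(points):
        if not any(l1(a, b) <= eps for j, b in enumerate(points) if j != i):
            n += 1
    return n


def sweep_eps(points, target_n,
              eps_values=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Try decreasing eps values (hypothetical schedule) until at least
    target_n outliers are found; return the eps used and the outlier count.
    If the target is never reached, the last eps tried is returned."""
    for eps in eps_values:
        n = count_isolated(points, eps)
        if n >= target_n:
            return eps, n
    return eps, n


# Four binary alloys: two close together and two progressively farther away.
points = [[0.50, 0.50], [0.48, 0.52], [0.125, 0.875], [0.975, 0.025]]
eps, n_outliers = sweep_eps(points, target_n=2)  # eps is 0.7, n_outliers is 2
```

Asking for fewer outliers stops the sweep earlier: with target_n=1 the same data stops at eps 0.9.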

getTSNE(perplexity=2, init='pca')[source]

Performs TSNE embedding on the list of compositions in self.allComps. The TSNE embedding is stored in the ‘compVec_TSNE2D’ key of each dictionary in self.allComps. The TSNE embedding is also returned as a numpy array.

Parameters:
  • perplexity (int) – Perplexity parameter for the TSNE embedding. Defaults to 2. This parameter controls the number of alloys that are expected to be close to each other in the embedding. The value of 2 is chosen for visualizing outlier detection because the database is very sparse and populated by chains of neighboring alloys, so many alloys are expected to have no more than one neighbor. For more general use, a value of 5-10 is recommended. The value of 30, often used in the literature, is not recommended for HEA datasets.

  • init (str) – Initialization method for the TSNE embedding. Defaults to ‘pca’. The default value is recommended.

Return type:

ndarray

Returns:

Numpy array of the TSNE embedding.

showClustersDBSCAN()[source]

Plots the TSNE embedding of the compositions in self.allComps colored by the DBSCAN clustering. The plot is interactive and allows for hovering over the points to see the formula of the alloy as well as the DBSCAN cluster number.

Returns:

None

showOutliersDBSCAN()[source]

Plots the TSNE embedding of the compositions in self.allComps colored by the DBSCAN clustering. The plot is interactive and allows for hovering over the points to see the formula of the alloy as well as the DBSCAN cluster number. Outliers are colored in red.

Return type:

None

Returns:

None

showTSNE()[source]

Plots the TSNE embedding of the compositions in self.allComps. The plot is interactive and allows for hovering over the points to see the formula of the alloy.

Returns:

None

updateAllComps(printOut=False, printOutMinimal=True)[source]

Identifies a list of all unique compositions in the database, updates the self.els property, and then converts the list of compositions into a list of dictionaries with the formula and a vector representation of the composition in the order of self.els. The vector representation is used for full-dimensional clustering analysis. Some other methods like TSNE embedding will update these dictionaries with additional keys.

Parameters:
  • printOut (bool) – If True, prints out the list of all unique compositions. Defaults to False.

  • printOutMinimal (bool) – If True, prints out the number of unique compositions and the list of unique elements. Defaults to True.

Return type:

list

Returns:

List of dictionaries with the formula and a vector representation of the composition in the order of self.els.
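
The conversion described above, from compositions to vectors in a shared element order, can be sketched as follows. This is a hedged illustration, not the pyqalloy code: the input format (dictionaries of element fractions), the alphabetical ordering of elements, and the `formula` and `compVec` key names are all assumptions.

```python
def to_vectors(compositions):
    """Turn {element: fraction} dicts into records with a common-order vector.

    Returns the ordered element list and one record per composition.
    """
    # Collect every element seen and fix one ordering for all vectors.
    els = sorted({el for comp in compositions for el in comp})
    records = []
    for comp in compositions:
        formula = " ".join(f"{el}{comp[el]:g}" for el in els if el in comp)
        records.append({
            "formula": formula,                            # hypothetical key
            "compVec": [comp.get(el, 0.0) for el in els],  # hypothetical key
        })
    return els, records


els, records = to_vectors([{"Fe": 0.5, "Ni": 0.5}, {"Cr": 0.2, "Fe": 0.8}])
# els is ["Cr", "Fe", "Ni"]; the first vector is [0.0, 0.5, 0.5]
```

Elements absent from an alloy get a zero entry, so every vector has the same length and the same element order.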

updateOutliersList()[source]

Updates the list of outliers in self.outliers. This list is used by the showOutliersDBSCAN() method.

Return type:

None

Returns:

None

class pyqalloy.curation.analysis.Analyzer(database, collection, collectionManualOverride=None)[source]

Bases: object

Base class for all analyzers. Initializes a connection to the database and collection. Also contains some helper functions for data analysis, such as getting a list of all unique DOIs in the collection.

Parameters:
  • database (str) – Name of the database to connect to.

  • collection (str) – Name of the collection to connect to.

  • collectionManualOverride (Optional[Collection]) – If specified, the collectionManualOverride is used instead of the database and collection arguments. It expects a pymongo.collection.Collection object, however, it is not type-checked (only hinted) to allow for more flexibility, including instances of [MontyDB](https://github.com/davidlatwe/MontyDB) Collection class or [Mongomock](https://github.com/mongomock/mongomock) Collection class. Defaults to None and has no effect in that case.

Note

The credentials for the database are stored in the credentials.json file in the pyqalloy package. These access credentials are not included in the public repository.

get_allDOIs()[source]

Returns a list of all unique DOIs in the collection. This is useful for iterating over all publications in the collection. If the collectionManualOverride is left as None, the function uses the MongoDB aggregation pipeline to perform the operation efficiently on the server side. If the collectionManualOverride is specified, the find method is used instead, which is less efficient, but works with other database objects, such as [MontyDB](https://github.com/davidlatwe/MontyDB).

Return type:

List[str]
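
The two strategies described above can be contrasted in a short sketch. The field path `reference.doi` is an assumption (the actual document schema is not given here), the pipeline is shown only as a data structure, and the client-side function is a stand-in that works on any iterable of documents, such as the result of a find call.

```python
# Hypothetical server-side pipeline: grouping by the DOI field returns each
# distinct value once. The "$reference.doi" field path is an assumption.
DOI_PIPELINE = [{"$group": {"_id": "$reference.doi"}}]


def unique_dois_client_side(documents, field_path=("reference", "doi")):
    """Collect distinct DOI strings by iterating over documents in Python."""
    seen = set()
    for doc in documents:
        value = doc
        # Walk the nested field path; missing keys yield None.
        for key in field_path:
            value = value.get(key) if isinstance(value, dict) else None
        if isinstance(value, str):
            seen.add(value)
    return sorted(seen)


docs = [
    {"reference": {"doi": "10.1000/a"}},
    {"reference": {"doi": "10.1000/b"}},
    {"reference": {"doi": "10.1000/a"}},  # duplicate DOI
    {"formula": "FeNi"},                  # document without a DOI
]
dois = unique_dois_client_side(docs)  # ["10.1000/a", "10.1000/b"]
```

The client-side version transfers every document before deduplicating, which is why it is the slower of the two paths.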

class pyqalloy.curation.analysis.SingleCompositionAnalyzer(name=None, database='ULTERA_internal', collection='CURATED_Dec2022', collectionManualOverride=None)[source]

Bases: Analyzer

Class to analyze a single composition in the context of abnormal data detection.

Parameters:
  • name (Optional[str]) – Name of the researcher to limit the search to. Defaults to None.

  • database (str) – Name of the database to use. Defaults to ‘ULTERA_internal’.

  • collection (str) – Name of the collection to use. Defaults to ‘CURATED_Dec2022’.

  • collectionManualOverride (Optional[Collection]) – If specified, the collectionManualOverride is used instead of the database and collection arguments. It expects a pymongo.collection.Collection object, however, it is not type-checked (only hinted) to allow for more flexibility, including instances of [MontyDB](https://github.com/davidlatwe/MontyDB) Collection class or [Mongomock](https://github.com/mongomock/mongomock) Collection class. Defaults to None and has no effect in that case.

scanCompositionsAround100(lowerBound=80, uncertainty=0.21, upperBound=120, queryLimit=10000, resultLimit=1000, printOnFly=False)[source]

Scans the database for compositions around 100% but not exactly 100% as defined by the lower and upper bounds. Results are stored in self.printOuts and can be printed out or written to a file using self.writeResultsToFile().

Parameters:
  • lowerBound (float) – Lower bound for the sum of composition to be considered around 100%. Expressed as percentage. Defaults to 80 meaning 80%.

  • upperBound (float) – Upper bound for the sum of composition to be considered around 100%. Expressed as percentage. Defaults to 120 meaning 120%.

  • uncertainty (float) – Allowed deviation from 100% for the sum of composition. Expressed as percentage. Defaults to 0.21 meaning 0.21%.

  • queryLimit (int) – Maximum number of documents to query for from the database collection. If the limit is higher than the number of documents in the collection, all documents will be queried. Defaults to 10000.

  • resultLimit (int) – Maximum number of results to investigate across all runs of the function. For example, if the same SingleCompositionAnalyzer object calls this function three times with resultLimit values of 10, 20, and 30, self.printOuts will contain 30 results in total. Calling it again with the same resultLimit value has no effect on the Analyzer object. Defaults to 1000.

  • printOnFly (bool) – If True, prints the results out into console on the fly as they are found. Defaults to False.

Return type:

None
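
How the three numeric parameters interact can be sketched as a single predicate. This is an illustration under stated assumptions, not the pyqalloy implementation: the function name is hypothetical and the bounds are treated as inclusive, which is not specified above.

```python
def is_near_but_not_100(percentages, lowerBound=80, upperBound=120,
                        uncertainty=0.21):
    """Flag a composition whose sum is close to 100% but not within tolerance.

    percentages: element amounts expressed in percent.
    """
    total = sum(percentages)
    in_window = lowerBound <= total <= upperBound  # inclusive: an assumption
    off_target = abs(total - 100) > uncertainty
    return in_window and off_target


flag_rounded = is_near_but_not_100([33.3, 33.3, 33.3])     # sum 99.9: False
flag_short = is_near_but_not_100([20, 20, 20, 20, 15])     # sum 95: True
flag_far = is_near_but_not_100([50, 10])                   # sum 60: False
```

A composition summing to 99.9% is accepted because it falls within the 0.21% tolerance, while one summing to 60% is ignored because it lies outside the scanned window altogether.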

writeResultsToFile(fileName)[source]

Writes the results to a file. The file is created if it does not exist, otherwise it is overwritten.

Parameters:

fileName (str) – Name of the file to write the results to.

Return type:

None

class pyqalloy.curation.analysis.SingleDOIAnalyzer(doi=None, name=None, database='ULTERA_internal', collection='CURATED_Dec2022', collectionManualOverride=None)[source]

Bases: Analyzer

Extends the Analyzer class. It is used to assess the data coming from a single publication based on the DOI string.

Parameters:
  • doi (Optional[str]) – DOI string of the publication to analyze. Defaults to None.

  • name (Optional[str]) – Name of the researcher who uploaded the data. This setting allows limiting the analysis to a person who was responsible for the upload. Defaults to None.

  • database (str) – Name of the database to connect to. Defaults to ‘ULTERA_internal’.

  • collection (str) – Name of the collection to connect to. Defaults to ‘CURATED_Dec2022’.

  • collectionManualOverride (Optional[Collection]) – If specified, the collectionManualOverride is used instead of the database and collection arguments. It expects a pymongo.collection.Collection object, however, it is not type-checked (only hinted) to allow for more flexibility, including instances of [MontyDB](https://github.com/davidlatwe/MontyDB) Collection class or [Mongomock](https://github.com/mongomock/mongomock) Collection class. Defaults to None and has no effect in that case.

analyze_compVecs_2DPCA(minDistance=0.001, showFigure=True)[source]

Performs a 2D PCA on the composition vectors. The results are stored in the self.compVecs_2DPCA variable. The minimum range in both dimensions is stored in the self.compVecs_2DPCA_minRangeInDim variable. The results are plotted using plotly. The figure is stored in the self.fig variable.

Parameters:
  • minDistance (float) – Minimum distance between two points in the 2D PCA space in any dimension to be considered as non-linear. Defaults to 0.001.

  • showFigure (bool) – If True, the figure is displayed. Defaults to True.

Return type:

Union[str, BytesIO]

Returns:

A string if the specified researcher is not present in the group reporting the data from the publication, a string if a linear trend is detected, or the figure as a BytesIO object if the name is matched and non-linear trends are detected.
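
One way to read the minDistance criterion is sketched below. It assumes that a linear trend is reported when the spread of the 2D PCA coordinates along either axis falls below minDistance; this interpretation and both function names are assumptions, and the PCA step itself is omitted (the coordinates are given directly).

```python
def min_range_in_dim(points_2d):
    """Smallest extent of the points along either of the two axes."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    return min(max(xs) - min(xs), max(ys) - min(ys))


def looks_linear(points_2d, minDistance=0.001):
    """True when the points collapse onto a line along one axis."""
    return min_range_in_dim(points_2d) < minDistance


on_a_line = [(-0.3, 0.0), (0.0, 0.0), (0.3, 0.0)]
spread_out = [(-0.3, 0.1), (0.0, -0.2), (0.3, 0.1)]

linear_a = looks_linear(on_a_line)    # True: no spread along the second axis
linear_b = looks_linear(spread_out)   # False: both axes carry spread
```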

analyze_nnDistances()[source]

Calculates the nearest neighbor distances for all unique composition vectors in the publication. The distances are calculated using the L1 metric and the k-d tree algorithm.

Return type:

None
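
For a small number of compositions, the nearest neighbor distances can be computed by brute force, as in the sketch below. It uses the same L1 metric but deliberately swaps the k-d tree for a simple pairwise loop, so it illustrates the quantity being computed rather than the method used by the package. The function name is hypothetical.

```python
def nn_distances_l1(vectors):
    """For each vector, the L1 distance to its nearest other vector.

    Brute force, O(n^2); returns None for a vector with no neighbors.
    """
    result = []
    for i, a in enumerate(vectors):
        others = (
            sum(abs(x - y) for x, y in zip(a, b))
            for j, b in enumerate(vectors)
            if j != i
        )
        result.append(min(others, default=None))
    return result


vectors = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
distances = nn_distances_l1(vectors)  # [0.5, 0.5, 1.0]
```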

getCompVecs()[source]

Returns a list of composition vectors for all unique formulas in the publication. The composition vectors are normalized to sum to 1.0.

Return type:

List[List[float]]

Returns:

List of composition vectors in order determined by the database read.
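
The normalization mentioned above amounts to dividing each entry by the vector sum. A minimal sketch, with a hypothetical helper name and no claim about how the package handles empty or all-zero vectors:

```python
def normalize(vec):
    """Scale a composition vector so that its entries sum to 1.0."""
    total = sum(vec)
    if total == 0:
        raise ValueError("cannot normalize an all-zero composition vector")
    return [v / total for v in vec]


normalized = normalize([20, 30, 50])  # [0.2, 0.3, 0.5]
```

Normalizing first means that compositions reported in percent and in fractions end up on the same scale before distances are compared.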

get_compVecs_2DPCA()[source]

Performs a 2D PCA on the composition vectors. The results are stored in the self.compVecs_2DPCA variable. The minimum range in both dimensions is stored in the self.compVecs_2DPCA_minRangeInDim variable.

Returns:

List of 2D PCA coordinates for all composition vectors.

print_nnDistances(minSamples=2, printOut=True)[source]

Prints the nearest neighbor distances for all unique composition vectors in the publication. The distances are calculated using the L1 metric and the k-d tree algorithm. The distances are normalized to the maximum distance in the publication. The output is persisted in the self.printLog variable.

Parameters:
  • minSamples (int) – Minimum number of samples required to print the results. Defaults to 2.

  • printOut (bool) – If True, the results are printed to the console. Defaults to True.

Return type:

None

resetVariables()[source]

Resets all variables to their default values. This is useful when switching between different publications without having to reinitialize the class and connect to the database again.

Return type:

None

setDOI(doi)[source]

Sets the DOI of the publication to analyze. Resets all variables to their default values.

Return type:

None

setName(name)[source]

Sets the name of the researcher to whom the analysis is limited.

Return type:

None

writeManyPlots(toPlotList, workbookPath)[source]

Writes the plots to the specified report Excel workbook.

Parameters:
  • toPlotList (list) – List of plots to write. Each element of the list can be either a BytesIO object containing the plot or a string containing the text to write if no plot is available because of a linear trend in the data or because the specified researcher is not present in the group reporting the data.

  • workbookPath (str) – Path to the report Excel workbook. Must be a .xlsx file and must not be open at the time of writing.

Return type:

None

writePlot(workbookPath, skipLines)[source]

Writes the plot to the specified report Excel workbook.

Parameters:
  • workbookPath (str) – Path to the report Excel workbook. Must be a .xlsx file and must not be open at the time of writing.

  • skipLines (int) – Number of lines to skip before writing the plot. It is critical to skip lines to avoid overwriting existing data in the workbook.

Return type:

None

Module contents